CN112102074A - Grading card modeling method - Google Patents

Info

Publication number
CN112102074A
Authority
CN
China
Prior art keywords
variables
variable
derivative
scoring
logistic regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011099338.9A
Other languages
Chinese (zh)
Other versions
CN112102074B (en)
Inventor
黄又钢
许洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qianhai Hongxi Intelligent Technology Co ltd
Original Assignee
Shenzhen Qianhai Hongxi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai Hongxi Intelligent Technology Co ltd
Priority to CN202011099338.9A
Publication of CN112102074A
Application granted
Publication of CN112102074B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a scoring card modeling method comprising the steps of determining variables, screening the variables, performing logistic regression on the intermediate derivative variables, and verifying the model. In the variable determination process, Recode variables are introduced in addition to WOE variables, and variables are then constructed and screened with regard to both stability and accuracy. The method introduces variable clustering analysis based on factor analysis and the PCA algorithm: independent variables are clustered according to their principal components and several variables are selected from each cluster, retaining as much explanatory power and dimensional coverage as possible. With this method, modeling of the scoring model requires human intervention only in specific steps; the semi-automatic approach shortens the modeling time from about one month to three days and overcomes the long modeling time of the prior art.

Description

Grading card modeling method
Technical Field
The invention relates to the field of credit risk control management, in particular to a scoring card modeling method for borrowers.
Background
The traditional scoring card construction process in the credit risk control field is very complex. It generally comprises data exploration, WOE calculation, variable screening, correlation analysis, binning adjustment, model parameter tuning, model evaluation, scoring card conversion, model stability verification and similar steps. The traditional modeling process mainly has the following problems:
1. The modeling time is long: because the above process is very complicated, building a scoring card usually takes about one month from data exploration to completion of the stability test, which greatly slows the response of a scoring-card-based risk control system to market changes. The most time-consuming and labor-intensive steps are variable screening, binning adjustment and model parameter tuning, which often require repeated iterative testing.
2. The model effect is poor: traditional models rely on only two criteria in variable screening:
a) selecting independent variables that are highly correlated with the dependent variable, as measured by IV, KS, Gini values and the like;
b) removing independent variables that are highly correlated with one another, in order to reduce collinearity.
Under these screening conditions, the number of model-entering variables is reduced excessively before the logistic regression model is trained, which impairs the precision of the model.
3. The error rate is high: building a traditional scoring card manually requires a great deal of data analysis and data preparation in each step, which greatly increases the possibility of errors.
Disclosure of Invention
The invention aims to provide a scoring card modeling method that solves the technical problems of the prior art: the scoring card modeling process is time-consuming and cannot keep up with external market changes, the resulting model performs poorly, and the error rate is high.
The invention provides a scoring card modeling method which comprises the following steps:
step S1, determining variables and screening variables: converting variables into derivative variables, and removing part of the derivative variables from all the derivative variables through a screening algorithm to obtain screened intermediate derivative variables with high interpretability and low collinearity;
step S2, logistic regression on the intermediate derivative variables: first, backward logistic regression is performed on the intermediate derivative variables, successively removing intermediate derivative variables that are invalid or strongly collinear; forward back-supplementing is then performed on the eliminated intermediate derivative variables, attempting to add them back one by one so that the model effect is optimal; after the backward logistic regression and forward back-supplementing processes are finished, the final model-entering variables and their weights are determined, which determines the scoring model;
step S3, model verification: verifying the scoring model using a sample verification set and judging the rationality of the verification result; when the verification result is not reasonable, returning to step S2.
With the scoring card modeling method of the invention, human intervention is needed only in specific steps. This semi-automatic approach shortens the modeling time from about one month to three days and overcomes the long modeling time of the prior art. In addition, the logistic regression step applies two iterative regression processes to the intermediate derivative variables in the sample training set, backward logistic regression and forward back-supplementing, which re-check and remove the intermediate derivative variables with high collinearity and, after multiple iterations, make the determined weights as reasonable as possible. As a result, even if some variables are missing in practical application, the output of the scoring model is not affected, scoring is more accurate and faster, and downtime is prevented. Finally, apart from the steps requiring manual intervention, all other steps of the scoring model are packaged in a standardized way and run automatically by the system, which reduces the possibility of errors from manual modeling.
Drawings
FIG. 1 is an overall flow chart of a scoring card modeling method of the present invention;
FIG. 2 is a flowchart of step S1 of the present invention;
FIG. 3 is a flowchart of step S13 of the present invention;
FIG. 4 is a flowchart of another embodiment of the scoring card modeling method of the present invention.
Detailed Description
The invention is further described below with reference to the embodiments and the drawings of the specification.
Referring to FIG. 1, the invention discloses a scoring card modeling method. The method is mainly used by banks and financial institutions to evaluate a customer's consumption and credit repayment capacity, and by credit institutions to evaluate risk.
The scoring card modeling method specifically comprises the following steps:
step S1, determining variables and screening variables: converting variables into derivative variables, and removing part of the derivative variables through a screening algorithm to obtain screened intermediate derivative variables with high interpretability and low collinearity.
In this step, data exploration and correction are required first: statistics are computed on the sample training data, the types and distributions of the variables in it are determined, and wrongly typed variables are manually identified and corrected.
The subjects of the sample training data are mainly the individuals or enterprises that the bank serves, and the sample training data includes data describing them, such as: age, gender, credit rating, loan amount, repayment duration, marital status, nature of employer, income, loan channel, property held in the person's name, etc.
After the bank has determined the sample training data, the variables must be checked and corrected, and abnormal variable formats or values must be modified, before modeling begins, so that the data is fit for the subsequent modeling steps.
Variable determination is then required: the variables are converted into derivative variables, and part of the derivative variables are eliminated through a screening algorithm to obtain screened intermediate derivative variables with high interpretability and low collinearity.
Variable determination means that the variables most relevant to the score are selected, through a screening algorithm, from all the variables the bank provides for each subject, with as little collinearity among them as possible. This ensures that the output of the final scoring model is as accurate and stable as possible and is little affected by other external variables.
Referring to FIG. 2, the method of determining variables and converting them into derivative variables specifically includes:
Step S11: performing a weight of evidence (WOE) calculation and a recoding (Recode) calculation on the variables provided by the bank, respectively, to obtain two groups of derivative variables.
The WOE calculation analyzes all the variables and indicates which of them have a large influence on the scoring result. The Recode calculation handles variables that have missing, abnormal or special values in part of the samples: a Recode function modifies the variable and converts the special values into values of a continuous variable, so that, after the WOE and Recode calculations, none of the variables contains missing or abnormal values. Compared with the traditional approach of performing only the WOE calculation, more variables are retained. This makes the construction of the scoring model more complex, but the added complexity is within an acceptable range, and, most importantly, introducing the Recode calculation improves the accuracy and stability of the resulting scoring model.
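The two derivations in step S11 can be illustrated with a minimal sketch. It is not the patented implementation: the sign convention (log of good share over bad share), the additive smoothing, and the function names `woe_by_bin` and `recode` are assumptions made for illustration, since the text does not fix them.

```python
import math


def woe_by_bin(bins, smoothing=0.5):
    """Weight of evidence per bin.

    bins maps a bin label to (good_count, bad_count).
    Assumed convention: WOE = ln(share of goods / share of bads).
    The smoothing term (an assumption) avoids log(0) for empty cells.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    k = len(bins)
    woe = {}
    for label, (good, bad) in bins.items():
        good_share = (good + smoothing) / (total_good + smoothing * k)
        bad_share = (bad + smoothing) / (total_bad + smoothing * k)
        woe[label] = math.log(good_share / bad_share)
    return woe


def recode(values, special_values, fill):
    """Replace missing (None) or special codes with a numeric fill value,
    so the variable can be used as a continuous variable."""
    return [fill if (v is None or v in special_values) else v for v in values]


# Usage: a two-bin variable and an income column with a special code -999.
income_woe = woe_by_bin({"low": (80, 20), "high": (20, 80)}, smoothing=0.0)
income_recoded = recode([3000, None, -999, 5200], {-999}, fill=0)
```

In this toy data the "low" bin holds most of the goods, so its WOE is positive and the "high" bin receives the opposite value.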
Step S12: performing correlation analysis and factor-analysis-based variable clustering analysis on the two groups of derivative variables, and removing the derivative variables with high collinearity.
Collinearity means that one variable influences the scoring result in a way that is similar or identical to the way another variable does; a scoring model built on two such variables has poor stability. When many collinear variables remain in the finished model, the generality of the scoring model deteriorates rapidly and the model may even fail to meet actual requirements. Collinearity among the variables should therefore be kept as low as possible during model construction, and the scoring model should be described and fitted from multiple dimensions, so that it is more stable.
As shown in FIG. 3, the correlation and variable clustering analysis further includes the following steps:
first, step S121: determining several candidate cluster counts based on the number of derivative variables;
subsequently, step S122: clustering the derivative variables for each candidate cluster count, based on a factor analysis algorithm and a principal component analysis (PCA) algorithm;
after that, step S123: evaluating how well the clustering result under each candidate cluster count explains the whole sample training set of derivative variables, and selecting the clustering with the greatest explanatory power as the clustering result;
further, step S124: selecting several optimal derivative variables from each cluster of the final clustering, preferring the WOE variable when a cluster contains both a WOE variable and a Recode variable;
in this step, the selected optimal variables include:
the derivative variable with the smallest determination ratio (coefficient of determination ratio) in each cluster;
the derivative variable with the highest Kolmogorov-Smirnov (KS) test value in each cluster;
when a cluster contains several derivative variables with a coefficient of determination less than 0.3, the derivative variable with the highest KS test value among them.
Finally, step S125: pooling the derivative variables selected from every cluster and, if a WOE variable and a Recode variable derived from the same original variable are both present, preferring the WOE variable, thereby finally screening out the intermediate derivative variables.
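The selection rules of steps S124 and S125 can be sketched as follows. This is only an illustration under stated assumptions: the clustering itself is taken as already done, the per-variable quantities are precomputed, and the "determination ratio" and the 0.3 threshold are assumed to refer to the same quantity, stored under the hypothetical key `ratio`. The function names and dictionary keys are likewise hypothetical.

```python
def select_from_cluster(cluster, ratio_threshold=0.3):
    """Pick candidate variables from one cluster (step S124, simplified).

    Each variable is a dict with the hypothetical keys
    name, source, kind ("WOE" or "Recode"), ratio and ks.
    """
    picks = [
        min(cluster, key=lambda v: v["ratio"]),  # smallest determination ratio
        max(cluster, key=lambda v: v["ks"]),     # highest KS value
    ]
    low_ratio = [v for v in cluster if v["ratio"] < ratio_threshold]
    if len(low_ratio) > 1:
        # Several variables under the threshold: also keep the best KS among them.
        picks.append(max(low_ratio, key=lambda v: v["ks"]))
    unique = []
    for v in picks:
        if v["name"] not in [u["name"] for u in unique]:
            unique.append(v)
    return unique


def prefer_woe(selected):
    """Step S125, simplified: drop a Recode variable when the WOE variable
    derived from the same original variable was also selected."""
    woe_sources = {v["source"] for v in selected if v["kind"] == "WOE"}
    return [v for v in selected
            if v["kind"] == "WOE" or v["source"] not in woe_sources]


# Usage with one toy cluster of three derivative variables.
cluster = [
    {"name": "age_woe", "source": "age", "kind": "WOE", "ratio": 0.10, "ks": 0.20},
    {"name": "age_rec", "source": "age", "kind": "Recode", "ratio": 0.25, "ks": 0.35},
    {"name": "income_woe", "source": "income", "kind": "WOE", "ratio": 0.60, "ks": 0.40},
]
picked = select_from_cluster(cluster)
final_names = [v["name"] for v in prefer_woe(picked)]
```

The sketch applies the WOE preference only when the selections are pooled; the preference inside each cluster described in step S124 would be handled the same way.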
After the above operations are performed on the variables in the sample training data, the intermediate derivative variables are obtained. They are entered into the subsequent logistic regression model and serve as the basic variables from which the variable weights are determined.
Step S2, logistic regression on the intermediate derivative variables: first, backward logistic regression is performed on the intermediate derivative variables, successively removing intermediate derivative variables that are invalid or strongly collinear; forward back-supplementing is then performed on the eliminated intermediate derivative variables, attempting to add them back one by one so that the model effect is optimal; after the backward logistic regression and forward back-supplementing processes are finished, the final model-entering variables and their weights are determined, which determines the scoring model.
Before step S2 is executed, a weight-direction determination step is preferably added so that the logistic regression model is obtained more quickly: the weight direction of each intermediate derivative variable is determined such that the result calculated from the intermediate derivative variables and their weights conforms to the scoring trend of the sample training set.
Specifically, the backward logistic regression is: each time, one variable meeting a first condition is removed from the intermediate derivative variables, and a logistic regression iteration is run on the remaining intermediate derivative variables, until no intermediate derivative variable meets the first condition.
The first condition includes:
the weight of a WOE derivative variable is negative, or the weight direction of a Recode derivative variable is wrong; or
the p-value of the derivative variable is too large, i.e. its weight is not statistically significant under the Wald test; or
the variance inflation factor (VIF) of the derivative variable's weight is too large.
In the embodiment, the weights of the variables calculated by WOE are expected to be all positive. A negative weight indicates that one or more variables are highly collinear, and the variable with the negative weight is one of them, so it should be eliminated.
An excessive p-value means that the estimated weight is statistically unreliable, i.e. under the Wald test it cannot be distinguished from zero. To ensure the accuracy of the scoring model, the variable is removed directly and the logistic regression iteration is run again.
An excessive VIF means that the variance of the variable's estimated weight is strongly inflated by its correlation with the other variables, i.e. the weight estimate is too dispersed to be reliable. The variable should then be eliminated and the weights recalculated by logistic regression.
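The backward elimination loop can be sketched as below. The model fit is abstracted behind a hypothetical `fit` callback that returns, for each variable, whether its weight direction is correct, its p-value and its VIF. The thresholds `p_max` and `vif_max` and the rule of removing the offender with the largest p-value first are assumptions; the text only says "too large" and does not specify a removal order.

```python
def violates_first_condition(stat, p_max=0.05, vif_max=10.0):
    """First condition: wrong weight direction, or p-value too large,
    or VIF too large. Thresholds are illustrative assumptions."""
    return (not stat["direction_ok"]
            or stat["p_value"] > p_max
            or stat["vif"] > vif_max)


def backward_eliminate(variables, fit, violates=violates_first_condition):
    """Remove one offending variable per iteration, refitting each time."""
    kept = list(variables)
    removed = []
    while kept:
        stats = fit(kept)  # refit the logistic regression on the remaining variables
        offenders = [name for name in kept if violates(stats[name])]
        if not offenders:
            break
        # Assumed tie-break: drop the offender with the largest p-value.
        worst = max(offenders, key=lambda name: stats[name]["p_value"])
        kept.remove(worst)
        removed.append(worst)
    return kept, removed


def demo_fit(names):
    """Stand-in for a real fit: 'b' is collinear with 'a', 'c' is insignificant."""
    stats = {n: {"direction_ok": True, "p_value": 0.01, "vif": 1.0} for n in names}
    if "a" in names and "b" in names:
        stats["b"]["vif"] = 25.0
    if "c" in names:
        stats["c"]["p_value"] = 0.40
    return stats


kept, removed = backward_eliminate(["a", "b", "c", "d"], demo_fit)
```

In a real pipeline `fit` would wrap an actual logistic regression and VIF computation; the stub only makes the control flow visible.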
The method in step S2 further includes forward back-supplementing: the intermediate derivative variables removed during the backward logistic regression are added back, one at a time, to the set of model-entering variables that the backward logistic regression retained, and a logistic regression iteration is run on the enlarged set. If the result satisfies a second condition, the added-back variable is retained; this continues until every intermediate derivative variable removed during the backward logistic regression has been tested.
The second condition includes:
with the added-back intermediate derivative variable, the weight coefficients are still correct; and
the p-value of the added-back intermediate derivative variable is of reasonable magnitude, i.e. its weight is statistically significant under the Wald test; and
with the added-back intermediate derivative variable, the variance inflation factors (VIF) of the weights of all the intermediate derivative variables are within a reasonable range.
In the embodiment of the present invention, after a removed intermediate derivative variable is added back to the set of intermediate derivative variables, it is considered a valid variable if the second condition holds: it does not harm the stability and accuracy of the scoring model, and adding it can further improve them.
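The forward back-supplementing pass can be sketched in the same style. As before, `fit` is a hypothetical callback standing in for a real logistic regression fit, and the thresholds are assumptions rather than values given in the text.

```python
def satisfies_second_condition(stats, added, p_max=0.05, vif_max=10.0):
    """Second condition: all weight directions still correct, the added
    variable significant, and every VIF within range (assumed thresholds)."""
    return (all(s["direction_ok"] for s in stats.values())
            and stats[added]["p_value"] <= p_max
            and all(s["vif"] <= vif_max for s in stats.values()))


def forward_back_supplement(kept, removed, fit):
    """Try to add each previously removed variable back, one at a time."""
    final = list(kept)
    for name in removed:
        trial = final + [name]
        stats = fit(trial)  # refit with the candidate added back
        if satisfies_second_condition(stats, name):
            final = trial   # the candidate is valid, keep it
    return final


def demo_fit(names):
    """Stand-in fit: 'c' is insignificant only while 'b' is present,
    and 'b' is collinear with 'a'."""
    stats = {n: {"direction_ok": True, "p_value": 0.01, "vif": 1.0} for n in names}
    if "a" in names and "b" in names:
        stats["b"]["vif"] = 25.0
    if "b" in names and "c" in names:
        stats["c"]["p_value"] = 0.40
    return stats


final_vars = forward_back_supplement(["a", "d"], ["c", "b"], demo_fit)
```

Here "c" was removed only because of its interaction with "b"; once "b" is gone it passes the second condition and is restored, which is the kind of recovery the forward pass is meant to provide.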
Step S3, model verification: verifying the scoring model using a sample verification set and judging the rationality of the verification result; when the verification result is not reasonable, returning to step S2. An unreasonable verification result is taken to indicate a problem with the variable determination in step S2, for example that some influential variables were deleted or that some of the determined variables are still highly collinear. In this case the most fundamental remedy is to return to step S2, complete the variable determination and perform the subsequent steps again.
Specifically, the method for verifying the model in step S3 includes: inputting the sample training set and the sample verification set separately into the scoring model formed by the intermediate derivative variable logistic regression, calculating the Kolmogorov-Smirnov (KS) test value of each set through the scoring model, and verifying the rationality of the test values.
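A KS value for a scored sample can be computed as the largest gap between the cumulative score distributions of bad and good cases, as in the sketch below. The rationality check `ks_is_reasonable` and its thresholds are purely hypothetical; the text does not define what counts as a reasonable result.

```python
def ks_statistic(scores, labels):
    """Maximum gap between the cumulative distributions of scores for
    bad cases (label 1) and good cases (label 0)."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all cases tied at the same score before measuring the gap.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks


def ks_is_reasonable(train_ks, valid_ks, min_ks=0.2, max_gap=0.05):
    """Hypothetical check: both KS values are high enough and close together."""
    return min(train_ks, valid_ks) >= min_ks and abs(train_ks - valid_ks) <= max_gap


# Usage on a perfectly separated toy sample.
train_ks = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

A large gap between the training and verification KS values would suggest overfitting, which is one plausible reading of an "unreasonable" verification result.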
Referring to FIG. 4, in some other embodiments of the present invention, the scoring card modeling method further includes:
step S4, score conversion and adjustment of conversion parameters: converting the verification result into a score, judging the reasonableness of the score, and manually correcting the conversion parameters used in the conversion, so as to output the final score.
The method for converting scores and adjusting conversion parameters in step S4 includes: centering and standardizing the result calculated by the scoring model and mapping it into an interval; during conversion, the conversion parameters are manually fine-tuned according to the sampling conditions and to the sample training data that was removed before the scoring model calculation, so that the final score is better suited to practical application.
For example, the value calculated by the scoring model lies between 0 and 1, and it is converted into a range better suited to human analysis and observation, for example 1 to 1000. The special samples removed earlier and the sampling conditions of the sample must be taken into account during conversion, so that the conversion matches actual requirements as closely as possible.
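One way to center, standardize and map model outputs into the 1 to 1000 range is sketched below. The `center` and `spread` parameters stand in for the manually tuned conversion parameters; their values, the clipping at the range limits and the function name `to_score` are assumptions, not details from the text.

```python
import statistics


def to_score(raw, center=500.0, spread=150.0, low=1.0, high=1000.0):
    """Center and standardize raw model outputs, then map them to [low, high].

    center and spread are illustrative conversion parameters; a negative
    spread would reverse the direction of the score.
    """
    mean = statistics.fmean(raw)
    std = statistics.pstdev(raw)
    scores = []
    for x in raw:
        z = 0.0 if std == 0 else (x - mean) / std  # standardized output
        scores.append(min(high, max(low, center + spread * z)))
    return scores


# Usage: three model outputs between 0 and 1.
scores = to_score([0.2, 0.4, 0.6])
```

Adjusting `center` and `spread` shifts and stretches the score distribution without changing the ranking of the samples, which is what makes manual fine-tuning of these parameters safe.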
Step S5, verifying the stability of the model: applying the scoring model on a preliminary basis, verifying its stability, and fine-tuning the model when appropriate.
At this point the scoring model has been built and enters the application stage, in which it is applied directly at financial institutions, banks and the like to analyze the consumption and credit repayment capacity of customers. Only when deviations, or a few large deviations, appear do some of the variables and weights of the scoring model need to be modified through later intervention.
Further, after being run, the scoring model can automatically provide verification and test reports, which include at least one of the following: the KS value, lift and per-bin statistics of the sample training set and/or the verification data set; a comparison test between the sample training set and the verification data set; score distribution and stability verification on the sample training set; and variable stability verification of the scoring model.
With the scoring card modeling method of the invention, human intervention is needed only in specific steps. This semi-automatic approach shortens the modeling time from about one month to three days and overcomes the long modeling time of the prior art. In addition, the logistic regression step applies two iterative regression processes to the intermediate derivative variables in the sample training data, backward logistic regression and forward back-supplementing, which re-check and remove the variables with high collinearity and, after multiple iterations, make the determined weights as reasonable as possible. As a result, even if some variables are missing in practical application, the output of the scoring model is not affected, and scoring is more accurate. Finally, apart from the steps requiring manual intervention, all other steps of the scoring model are packaged in a standardized way and run automatically by the system, which reduces the possibility of errors from manual modeling.
The method of the invention also has the following beneficial effects:
The dimension coverage is wider:
a) The method introduces both WOE (weight of evidence) variables and Recode variables (variables in which null values, special values and abnormal values have been processed), and constructs and screens variables with regard to both stability and accuracy.
b) The algorithm introduces clustering analysis based on factor analysis and the PCA (principal component analysis) algorithm: independent variables are clustered according to their principal components and several variables are selected from each cluster, retaining as much explanatory power and dimensional coverage as possible.
The fitting accuracy is high:
c) Because the algorithm introduces both WOE and Recode variables and screens variables by cluster analysis to widen its coverage, it generally enters more variables into the logistic regression than the traditional modeling process does, and its accuracy is correspondingly higher.
d) On top of the traditional backward logistic regression algorithm, the algorithm adds a forward back-supplementing process: previously eliminated variables are added back to the logistic regression iteration one by one and their validity is judged again, so that valid variables are retained to the greatest extent and model precision improves.
The modeling time is short: the algorithm is divided into several main modules, each of which can run independently. The inventors integrated the best method and process for each function into a module, which removes the repeated debugging of individual functions typical of manual modeling. The user needs to intervene with manual judgment in only a few steps, which greatly shortens the modeling time.
Errors are less likely: because most functions are automated and little manual intervention is needed, the modeling time is shortened and the possibility of errors in manual data analysis is reduced.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit its scope of protection. Although the present invention is described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims (10)

1. A scoring card modeling method is characterized by comprising the following steps:
step S1, determining variables and screening variables: converting variables into derivative variables, and removing part of the derivative variables from all the derivative variables through a screening algorithm to obtain screened intermediate derivative variables with high interpretability and low collinearity;
step S2, logistic regression on the intermediate derivative variables: first, backward logistic regression is performed on the intermediate derivative variables, successively removing intermediate derivative variables that are invalid or strongly collinear; forward back-supplementing is then performed on the eliminated intermediate derivative variables, attempting to add them back one by one so that the model effect is optimal; after the backward logistic regression and forward back-supplementing processes are finished, the final model-entering variables and their weights are determined, which determines the scoring model;
step S3, model verification: verifying the scoring model using a sample verification set and judging the rationality of the verification result; when the verification result is not reasonable, returning to step S2.
2. The scoring card modeling method of claim 1, further comprising:
step S4, score conversion and adjustment of conversion parameters: converting the verification result into a score, judging the reasonableness of the score, and manually correcting the conversion parameters used in the conversion, so as to output the final score;
step S5, verifying the stability of the model: applying the scoring model on a preliminary basis, verifying its stability, and fine-tuning the model when appropriate.
3. The scoring card modeling method of claim 1, wherein determining variables and converting the variables into derivative variables in step S1 comprises:
step S11: performing a weight of evidence (WOE) calculation and a recoding (Recode) calculation on the variables, respectively, to obtain two groups of derivative variables;
step S12: performing correlation analysis and factor-analysis-based variable clustering analysis on the two groups of derivative variables, and removing the derivative variables with high collinearity;
step S13: manually fine-tuning the binning of the determined derivative variables and their values, and obtaining the intermediate derivative variables from the binning results.
4. A scoring card modeling method as claimed in claim 3, wherein said correlation and variable clustering analysis method in step S12 comprises the steps of:
step S121: determining a number of alternative clusters based on the number of the derived variables;
step S122: clustering the derived variables according to each alternative clustering number, based on a Factor Analysis (FA) algorithm and a Principal Component Analysis (PCA) algorithm;
step S123: evaluating, for each alternative clustering number, how well the clustering result explains the whole sample training set of the derived variables, and selecting the clustering with the greatest explanatory power as the clustering result;
step S124: selecting a plurality of optimal derivative variables from each cluster of the final clustering, and, when a cluster contains both a WOE variable and a Recode variable, preferentially selecting the WOE variable;
step S125: summarizing the derived variables selected from each cluster and, if a WOE variable and a Recode variable derived from the same variable are both present, preferentially selecting the WOE variable, thereby finally screening out the intermediate derived variables.
5. The method as claimed in claim 4, wherein the plurality of derived variables selected in step S124 includes:
(i) the derivative variable with the smallest determination ratio (coefficient of determination) in each cluster;
(ii) the derivative variable with the highest Kolmogorov-Smirnov test value in each cluster;
(iii) when a cluster contains a plurality of derived variables whose determination ratio (coefficient of determination) is less than 0.3, the derived variable with the highest Kolmogorov-Smirnov test value among them.
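The three selection rules of claim 5 can be sketched as a small function applied to one cluster. This is an illustration only: it assumes each candidate already carries a precomputed determination ratio and KS value, and the function name `pick_from_cluster`, the tuple layout, and the variable names are hypothetical. Only the 0.3 cutoff comes from the claim.

```python
def pick_from_cluster(candidates, ratio_cutoff=0.3):
    """Select variables from one cluster.

    candidates: list of (name, determination_ratio, ks) tuples.
    Returns the selected names in rule order, without duplicates.
    """
    picks = []
    # Rule (i): smallest determination ratio in the cluster.
    picks.append(min(candidates, key=lambda c: c[1])[0])
    # Rule (ii): highest KS value in the cluster.
    picks.append(max(candidates, key=lambda c: c[2])[0])
    # Rule (iii): among several variables below the cutoff, highest KS.
    below = [c for c in candidates if c[1] < ratio_cutoff]
    if len(below) > 1:
        picks.append(max(below, key=lambda c: c[2])[0])
    # Drop repeats while keeping the order in which rules fired.
    return list(dict.fromkeys(picks))

# Hypothetical cluster of three derived variables.
cluster = [
    ("woe_age", 0.12, 0.21),
    ("rec_age", 0.25, 0.30),
    ("woe_income", 0.45, 0.35),
]
selected = pick_from_cluster(cluster)
```

The WOE-over-Recode preference of steps S124 and S125 would be applied to this output as a separate pass and is not shown here.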
6. A scoring card modeling method as claimed in any one of claims 1 to 5, wherein said method of intermediate variable logistic regression in step S2 includes:
determining the weight direction: determining a weight direction for each intermediate derivative variable such that the result calculated from the intermediate derivative variables and their weights conforms to the scoring trend of the sample training set.
7. The scoring card modeling method as recited in claim 6, wherein the method of logistic regression of the intermediate variables in the step S2 includes:
backward logistic regression: in each round, removing from the intermediate derivative variables one variable that meets a first condition, and performing a logistic regression iteration on the remaining intermediate derivative variables after each removal, until none of the intermediate derivative variables meets the first condition;
the first condition includes:
the weight of a WOE derivative variable is negative, or the weight direction of a Recode derivative variable is wrong; or
the p-value of the derivative variable is too large, i.e., the variable weight lies outside the Wald confidence interval; or
the variance inflation factor (VIF) of the derivative variable is too large.
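The backward elimination loop of claim 7 can be sketched with the model fitting left as an injected function. This is a control-flow illustration under stated assumptions: `fit` is a hypothetical callable that refits the logistic regression on the given variables and returns, per variable, a `wrong_sign` flag, a p-value `p`, and a `vif`. The thresholds `p_max` and `vif_max` are illustrative, since the claim only says "too large", and removing the offender with the largest p-value first is likewise an assumption.

```python
def backward_eliminate(variables, fit, p_max=0.05, vif_max=10.0):
    """Remove one offending variable per round until none remains.

    fit(vars) -> {name: {"wrong_sign": bool, "p": float, "vif": float}}
    Returns (kept, removed), with removed in order of elimination.
    """
    kept = list(variables)
    removed = []
    while kept:
        stats = fit(kept)  # refit on the current variable set
        offenders = [
            v for v in kept
            if stats[v]["wrong_sign"]      # wrong weight direction
            or stats[v]["p"] > p_max       # p-value too large
            or stats[v]["vif"] > vif_max   # collinearity too strong
        ]
        if not offenders:
            break
        # Assumed tie-break: drop the offender with the largest p-value.
        worst = max(offenders, key=lambda v: stats[v]["p"])
        kept.remove(worst)
        removed.append(worst)
    return kept, removed
```

Refitting after every single removal matters because p-values and VIFs of the remaining variables change once a collinear partner is gone.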
8. The scoring card modeling method of claim 6, wherein the method of logistic regression of the intermediate variables in step S2 further comprises:
forward add-back: adding the intermediate derivative variables removed in the backward logistic regression, one at a time, to the full set of model-entry variables retained by the backward logistic regression, performing a logistic regression iteration on the resulting set of intermediate derivative variables, and judging whether the result meets a second condition; if so, the added-back variable is retained as a model-entry variable; this is repeated until every intermediate derivative variable removed in the backward logistic regression has been tested;
the second condition includes:
the weight coefficients remain correct after the intermediate derivative variable is added back; and
the p-value of the intermediate derivative variable added back is of reasonable magnitude and lies within the Wald confidence interval; and
after the intermediate derivative variable is added back, the variance inflation factor (VIF) of all the intermediate derivative variables is within a reasonable range.
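The forward add-back of claim 8 can be sketched in the same style. As before, `fit` is a hypothetical callable that refits the logistic regression and returns a `wrong_sign` flag, a p-value `p`, and a `vif` for each variable; the numeric thresholds are assumptions, since the claim only requires "reasonable" values.

```python
def forward_add_back(kept, removed, fit, p_max=0.05, vif_max=10.0):
    """Try to restore each removed variable, one at a time.

    A candidate is kept only if, after refitting with it included:
      - no variable in the model has a wrong weight direction,
      - the candidate's own p-value is acceptable,
      - every variable's VIF stays within range.
    """
    model = list(kept)
    for cand in removed:
        trial = model + [cand]
        stats = fit(trial)  # refit with the candidate included
        signs_ok = all(not stats[v]["wrong_sign"] for v in trial)
        p_ok = stats[cand]["p"] <= p_max
        vif_ok = all(stats[v]["vif"] <= vif_max for v in trial)
        if signs_ok and p_ok and vif_ok:
            model = trial  # candidate re-enters the model
    return model
```

A variable that was dropped for collinearity can pass here when the variable it was collinear with has itself been removed, which is the situation the add-back step is meant to catch.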
9. The scoring card modeling method as recited in claim 2, wherein the model verification of step S3 comprises: inputting the sample training set and the sample verification set separately into the scoring model formed by the intermediate derivative variable logistic regression, calculating the Kolmogorov-Smirnov (KS) test value of each set through the scoring model, and verifying the reasonableness of the test values.
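The KS test value used in this verification can be sketched as the maximum gap between the cumulative score distributions of bad and good samples. The function name `ks_statistic`, the label encoding (1 = bad, 0 = good), and the sample data are assumptions made for illustration.

```python
def ks_statistic(scores, labels):
    """Maximum gap between the cumulative distributions of bads and goods."""
    n_bad = sum(1 for y in labels if y == 1)
    n_good = len(labels) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels))
    cum_bad = cum_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        # Consume all samples sharing the same score before comparing,
        # so that ties do not create an artificial gap.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return ks

# Hypothetical scores: the training set separates the classes perfectly,
# the verification set does not separate them at all.
ks_train = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
ks_valid = ks_statistic([0.5, 0.5], [0, 1])
```

One plausible reading of "verifying the reasonableness" is to check that the training and verification KS values are both adequate and close to each other; the claim does not fix a tolerance.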
10. The method for modeling a score card as claimed in claim 1, wherein the score conversion and conversion-parameter adjustment of step S4 comprises: centering and standardizing the result calculated by the scoring model, and converting it into an interval; during conversion, the conversion parameters are fine-tuned according to the sampling conditions and the portion of the sample training data that was manually removed in the scoring model calculation, so that the final score is better suited to practical application.
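The centering, standardizing, and interval conversion of step S4 can be sketched as a linear rescaling of the model output. The claim gives no numbers, so `center`, `scale`, `lo`, and `hi` below are hypothetical conversion parameters of the kind that would be adjusted manually, and the direction (higher model output gives a higher score) is also an assumption.

```python
import statistics

def to_scores(raw, center=600.0, scale=50.0, lo=300.0, hi=850.0):
    """Center and standardize raw model outputs, then map to [lo, hi]."""
    mu = statistics.fmean(raw)
    sigma = statistics.pstdev(raw)
    if sigma == 0:
        raise ValueError("raw outputs have zero variance")
    scores = []
    for x in raw:
        z = (x - mu) / sigma          # centered and standardized value
        s = center + scale * z        # linear conversion to score units
        scores.append(min(hi, max(lo, s)))  # clip into the target interval
    return scores

# Hypothetical raw model outputs (for example, log-odds).
scores = to_scores([-1.0, 0.0, 1.0])
```

Changing `center` or `scale` shifts or stretches every score without altering the rank order, which is why such parameters can be tuned after the model itself is fixed.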
CN202011099338.9A 2020-10-14 2020-10-14 Score card modeling method Active CN112102074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011099338.9A CN112102074B (en) 2020-10-14 2020-10-14 Score card modeling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011099338.9A CN112102074B (en) 2020-10-14 2020-10-14 Score card modeling method

Publications (2)

Publication Number Publication Date
CN112102074A (en) 2020-12-18
CN112102074B (en) 2024-01-30

Family

ID=73782743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011099338.9A Active CN112102074B (en) 2020-10-14 2020-10-14 Score card modeling method

Country Status (1)

Country Link
CN (1) CN112102074B (en)

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BR8202956A (en) * 1981-05-22 1983-05-03 Data General Corp DIGITAL COMPUTER SYSTEM
CN103440410A (en) * 2013-08-15 2013-12-11 广东电网公司 Main variable individual defect probability forecasting method
CN104699717A (en) * 2013-12-10 2015-06-10 ***股份有限公司 Data mining method
WO2015085916A1 (en) * 2013-12-10 2015-06-18 ***股份有限公司 Data mining method
CN106548350A (en) * 2016-11-17 2017-03-29 腾讯科技(深圳)有限公司 A kind of data processing method and server
CN106600455A (en) * 2016-11-25 2017-04-26 国网河南省电力公司电力科学研究院 Electric charge sensitivity assessment method based on logistic regression
CN107301467A (en) * 2017-04-11 2017-10-27 程在舒 Chinese Future population number predicted method
CN108416495A (en) * 2018-01-30 2018-08-17 杭州排列科技有限公司 Scorecard method for establishing model based on machine learning and device
CN110197426A (en) * 2018-04-16 2019-09-03 腾讯科技(深圳)有限公司 A kind of method for building up of credit scoring model, device and readable storage medium storing program for executing
CN108898479A (en) * 2018-06-28 2018-11-27 中国农业银行股份有限公司 The construction method and device of Credit Evaluation Model
CN109191282A (en) * 2018-08-23 2019-01-11 北京玖富普惠信息技术有限公司 Methods of marking and system are monitored in a kind of loan of Behavior-based control model
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning
CN109598095A (en) * 2019-01-07 2019-04-09 平安科技(深圳)有限公司 Method for building up, device, computer equipment and the storage medium of scorecard model
CN109858566A (en) * 2019-03-01 2019-06-07 成都新希望金融信息有限公司 A method of it being added to the scorecard of mould dimension based on multilayered model building
CN110322142A (en) * 2019-07-01 2019-10-11 百维金科(上海)信息科技有限公司 A kind of big data air control model and inline system configuration technology
CN110956273A (en) * 2019-11-07 2020-04-03 中信银行股份有限公司 Credit scoring method and system integrating multiple machine learning models
CN111178675A (en) * 2019-12-05 2020-05-19 佰聆数据股份有限公司 LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment
CN111311128A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Consumption financial credit scoring card development method based on third-party data
CN111311400A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Modeling method and system of grading card model based on GBDT algorithm
CN111583031A (en) * 2020-05-15 2020-08-25 上海海事大学 Application scoring card model building method based on ensemble learning
CN111738819A (en) * 2020-06-15 2020-10-02 中国建设银行股份有限公司 Method, device and equipment for screening characterization data

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
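MICHAEL OLUSEGUN AKINWANDE et al.: "Variance inflation factor: as a condition for the inclusion of suppressor variable(s) in regression analysis", OPEN JOURNAL OF STATISTICS *
RUIZ S. et al.: "Credit Scoring in Microfinance Using Non-traditional Data", PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2017), pages 447 - 458 *
刘伟江; 魏海; 运天鹤: "Research on a customer credit evaluation model based on convolutional neural networks", 数据分析与知识发现, no. 06, pages 80 - 90 *
汪政元; 伍业锋: "Empirical analysis of corporate bond credit risk based on a contribution-degree random forest model", 经济数学, no. 03, pages 33 - 40 *
纪守领 et al.: "A survey of machine learning model interpretability methods, applications, and security research", 计算机研究与发展 *
耿俊成; 张小斐; 袁少光; 万迪明: "Research and implementation of a power customer outage sensitivity scorecard based on a logistic regression model", 电力需求侧管理, no. 03 *
陈战勇: "A perfect match: research on an online lending credit scorecard model based on machine learning", 武汉金融, pages 42 - 50 *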

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112989606A (en) * 2021-03-16 2021-06-18 上海哥瑞利软件股份有限公司 Data algorithm model checking method, system and computer storage medium
CN113572753A (en) * 2021-07-16 2021-10-29 北京淇瑀信息科技有限公司 User equipment authentication method and device based on Newton's cooling law
CN113572753B (en) * 2021-07-16 2023-03-14 北京淇瑀信息科技有限公司 User equipment authentication method and device based on Newton's cooling law
CN114298532A (en) * 2021-12-27 2022-04-08 智慧芽信息科技(苏州)有限公司 Scoring card model generation method, using method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN112102074B (en) 2024-01-30

Similar Documents

Publication Publication Date Title
CN112102074B (en) Score card modeling method
CN107193876B (en) Missing data filling method based on nearest neighbor KNN algorithm
Xia et al. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending
Du Jardin Dynamics of firm financial evolution and bankruptcy prediction
CN109034194B (en) Transaction fraud behavior deep detection method based on feature differentiation
CN110543616B (en) SMT solder paste printing volume prediction method based on industrial big data
CN107832581A (en) Trend prediction method and device
CN108898480A (en) Loan grade assessment system and method for credit extension loan
WO2024036709A1 (en) Anomalous data detection method and apparatus
CN110059126B (en) LKJ abnormal value data-based complex correlation network analysis method and system
KR101851367B1 (en) Method for evaluating credit rating, and apparatus and computer-readable recording media using the same
CN112037006A (en) Credit risk identification method and device for small and micro enterprises
US20140317066A1 (en) Method of analysing data
Lopes et al. Predicting recovery of credit operations on a brazilian bank
CN112329862A (en) Decision tree-based anti-money laundering method and system
CN112634022A (en) Credit risk assessment method and system based on unbalanced data processing
CN112116197A (en) Adverse behavior early warning method and system based on supplier evaluation system
CN115526276A (en) Wind tunnel balance calibration load prediction method with robustness
CN115360703A (en) Practical power distribution network state estimation method
CN114757397A (en) Bad material prediction method, bad material prediction device and electronic equipment
CN113177733A (en) Medium and small micro-enterprise data modeling method and system based on convolutional neural network
CN109284320B (en) Automatic regression diagnosis method on big data platform
Cornaglia et al. Rating philosophy and dynamic properties of internal rating systems: A general framework and an application to backtesting
CN113034264A (en) Method and device for establishing customer loss early warning model, terminal equipment and medium
Vučinić et al. MEASURING SYSTEMIC BANKING RESILIENCE: A STRESS TESTING APPROACH

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant