CN116091206A - Credit evaluation method, credit evaluation device, electronic equipment and storage medium


Info

Publication number
CN116091206A
CN116091206A (application CN202310047227.0A)
Authority
CN
China
Prior art keywords
model
evaluation
credit
data
feature
Prior art date
Legal status
Granted
Application number
CN202310047227.0A
Other languages
Chinese (zh)
Other versions
CN116091206B (en)
Inventor
范晓忻
曹鸿强
李鼐
Current Assignee
3golden Beijing Information Technology Co ltd
Original Assignee
3golden Beijing Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by 3golden Beijing Information Technology Co ltd filed Critical 3golden Beijing Information Technology Co ltd
Priority to CN202310047227.0A priority Critical patent/CN116091206B/en
Publication of CN116091206A publication Critical patent/CN116091206A/en
Application granted granted Critical
Publication of CN116091206B publication Critical patent/CN116091206B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a credit evaluation method, a credit evaluation device, electronic equipment and a storage medium. The credit evaluation method comprises the following steps: acquiring evaluation indexes of a business entity, wherein the evaluation indexes cover the entity's basic strength, operating capability, performance capability, debt repayment capability and development prospects; performing data processing and feature engineering on the evaluation indexes to obtain target data; and inputting the target data into an agricultural risk control model to output a credit evaluation result for the business entity. The agricultural risk control model is obtained by training a machine learning model on a training data set, where the training data set is built by applying the same data processing and feature engineering to the evaluation indexes of a plurality of business entities. Compared with the expert-experience approach, the method involves no subjective input from experts or business personnel anywhere in the model design process, which makes the model fully objective.

Description

Credit evaluation method, credit evaluation device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of credit evaluation technologies, and in particular, to a credit evaluation method, apparatus, electronic device, and storage medium.
Background
Business entities constantly face financing difficulties in the course of their development, and the financing situation of agricultural business entities is even harder to resolve.
Currently, credit evaluation of a business entity mostly relies on a weighting method based on expert experience. Business personnel sort out the key indexes, and an expert assigns an initial weight to each key index. Consistency checks, hierarchical single ranking, hierarchical total ranking and the like are then performed to verify the feasibility of the index weights, and the final index weights are output. Clearly, the expert's initial weighting of the indexes is the foundation of the subsequent workflow, so the expert's subjective weighting strongly influences the final model performance. The model's reliance on subjective expert experience makes it limited to a certain extent and biased toward subjectivity.
Disclosure of Invention
The invention provides a credit evaluation method, a credit evaluation device, electronic equipment and a storage medium, which address the prior-art defect that credit evaluation of business entities depends on subjective expert experience, leaving the model limited and biased toward subjectivity, and which make credit evaluation objective.
The invention provides a credit evaluation method, which comprises the following steps:
acquiring evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, performance capability, debt repayment capability and development prospects of the business entity;
performing data processing and feature engineering on the evaluation indexes to obtain target data;
inputting the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model on a training data set, and the training data set is obtained by performing data processing and feature engineering on the evaluation indexes of a plurality of business entities.
According to the credit evaluation method provided by the invention, the data processing of the evaluation indexes comprises:
performing data cleaning on the evaluation indexes, wherein the data cleaning removes high-missing-rate data, duplicated data and high/low-cardinality data;
and performing balance processing on the evaluation indexes, wherein the balance processing comprises minority-class oversampling, majority-class undersampling and the SMOTE method.
According to the credit evaluation method provided by the invention, the feature engineering comprises feature derivation, feature binning, feature encoding and feature screening, and performing the feature engineering operation on the evaluation indexes comprises:
selecting one or more of the feature engineering operations to apply to the evaluation indexes.
According to the credit evaluation method provided by the invention, the feature binning comprises chi-square binning, and the chi-square binning comprises:
taking all unique values of a feature, arranging them in ascending order, and calculating the chi-square value of each pair of adjacent unique-value sample sets against the good/bad label column;
merging the pair with the smallest chi-square value each time until the number of bins reaches a preset bin count;
and merging bins according to the number of samples in each bin to obtain the chi-square binning result.
According to the credit evaluation method provided by the invention, the feature encoding uses WOE encoding, which calculates the WOE value of each bin over the binned sample set, wherein a higher WOE value indicates a higher probability of being a bad customer and results in a lower final credit score.
According to the credit evaluation method provided by the invention, the feature screening comprises monotonicity screening, IV value screening, VIF screening, correlation screening, model importance screening and AHP (analytic hierarchy process) screening.
According to the credit evaluation method provided by the invention, the AHP screening comprises:
establishing a hierarchical structure model;
constructing a judgment matrix;
performing hierarchical single ranking and checking its consistency;
and performing hierarchical total ranking and checking its consistency.
The invention also provides a credit evaluation device, which comprises:
an acquisition module, configured to acquire evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, performance capability, debt repayment capability and development prospects of the business entity;
a data processing module, configured to perform data processing and feature engineering on the evaluation indexes to obtain target data;
an evaluation module, configured to input the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model on a training data set, and the training data set is obtained by performing data processing and feature engineering on the evaluation indexes of a plurality of business entities.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a credit evaluation method as described in any of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a credit assessment method as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the credit evaluation method described in any of the above.
Compared with the expert-experience approach, the credit evaluation method, device, electronic equipment and storage medium provided by the invention involve no subjective input from experts or business personnel anywhere in the model design process, which makes the model fully objective. By means of model fusion, credit scores output by other black-box models are added as derived features to the training of the credit evaluation model, enriching the feature dimensions. Before model training, the features that actually enter training are selected through multiple layers of feature screening, each layer with its own principles and hard criteria for whether a feature passes. The whole model building process is detailed, the logic is clear, and each feature is highly interpretable.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a credit evaluation method provided by the invention;
FIG. 2 is a diagram of the visualized credit score output provided by the invention;
FIG. 3 is a schematic structural diagram of the credit evaluation device provided by the invention;
fig. 4 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
310. an acquisition module; 320. a data processing module; 330. an evaluation module; 410. a processor; 420. a communication interface; 430. a memory; 440. a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The credit evaluation method, apparatus, electronic device, and storage medium are described below with reference to fig. 1 to 4.
As shown in fig. 1, in one embodiment, the credit evaluation method includes the following steps:
Step S110: obtaining evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, performance capability, debt repayment capability and development prospects of the business entity.
The guarantee business is divided into three stages: admission, credit granting, and post-loan. The credit evaluation model is used in the credit granting stage to decide whether to guarantee an applying entity: the model evaluates the credit of each applicant and outputs a credit score and credit grade, and the process moves to the next step only when the score and grade exceed preset thresholds; otherwise the whole application flow is terminated. The credit score is assigned according to the characteristics of the agricultural business entity, combining the five aspects of basic strength, operating capability, performance capability, debt repayment capability and development prospects. Multi-dimensional information such as in-house data, government big-data bureau data and credit bureau data is fully retrieved, multiple indexes are combined with scientifically determined weights, the applicant's probability of future default is predicted, and a quantitative pre-loan assessment of the applicant's credit risk is obtained. A threshold separating good and bad customers is defined according to the current risk tolerance to decide whether the customer is admitted, and the model outputs a credit grade and credit score, providing standardized, batch and automated decision support for business development.
Step S120: performing data processing and feature engineering on the evaluation indexes to obtain target data.
Specifically, the data processing of the evaluation indexes includes: performing data cleaning on the evaluation indexes, where the data cleaning removes high-missing-rate data, duplicated data and high/low-cardinality data; and performing balance processing on the evaluation indexes, where the balance processing includes minority-class oversampling, majority-class undersampling and the SMOTE method. The feature engineering includes feature derivation, feature binning, feature encoding and feature screening, and performing the feature engineering operation on the evaluation indexes includes: selecting one or more of these feature engineering operations to apply to the evaluation indexes.
Step S130: inputting the target data into the agricultural risk control model to output the credit evaluation result of the business entity.
The agricultural risk control model is obtained by training a machine learning model on a training data set, where the training data set is obtained by performing data processing and feature engineering on the evaluation indexes of a plurality of business entities.
Machine learning model development covers stages such as target definition, data evaluation, feature engineering, model training and model evaluation, and each stage can be iterated according to its output until a machine learning model that meets the requirements is obtained. Throughout the development scheme, the data-related work, including extraction, transformation, loading and feature engineering, is the most important part.
In the target definition stage, the modeling target is usually driven by business requirements and is classified by output type: classification, clustering, prediction, recommendation, and so on. For example, the common credit scorecard model is a prediction problem cast as classification. The work in this stage determines the sample-labeling logic for supervised learning, which generally has to meet the actual business requirements. In the risk control model, bad customers fall into two categories: overdue and compensated. Overdue means the customer fails to repay on time within the specified period of an installment; compensated means the customer fails to complete repayment over the whole loan term and the guarantee company repays on the customer's behalf. The labeling logic is generally determined from the proportions of overdue and compensated customers in the historical data and the losses they caused. The final output of this stage is a label data set containing entity identities and good/bad customer labels.
In the initial index acquisition and evaluation stage, machine learning modeling does not require manual prior knowledge but depends heavily on data quality, so quality evaluation of the modeling data is crucial. Common evaluation targets include the degree of missing data, restoration of noisy data, explanation of abnormal data, and fluctuations caused by promotional activity or business changes. This step is important: if data-related problems are not analyzed here, they are likely to surface later in the modeling process.
1. Initial index acquisition
According to the main characteristics of agricultural business, the five aspects of basic strength, operating capability, performance capability, debt repayment capability and development prospects are combined. By fully retrieving multi-dimensional information such as in-house data, government big-data bureau data and credit bureau data, multiple indexes are synthesized into basic information, credit risk, asset and income, counter-guarantee information and so on, which serve as the initial index items for model construction. When selecting specific index items, two principles are followed: unification and standardization. Unification means that, as far as possible, no sample customer has an empty value on the index; standardization means that, because different channels adopt different data standards, a single index retrieved from third-party data should use a single channel rather than a mixture of channels.
In the index acquisition stage, the data association method differs by data channel. The initial indexes are the original source of all subsequent indexes, so their quality determines the quality of the whole model. When processing the initial indexes, as many relevant indexes as possible are collected, and later feature screening determines which features may enter the model.
Understanding the original tables is the basis of initial index processing. For example, for an entity's records of unfulfilled court judgments, the amounts involved in each entity's specific cases over periods such as 1, 3 and 5 years are aggregated by grouping within the table to form processed indexes.
2. Data cleaning
(1) Missing data
Data cleaning mainly targets three kinds of data: data with a high missing rate, data with many duplicated values, and high/low-cardinality data. Data can be missing for many reasons. In general, if a field has a high missing rate, its usability is poor, and indexes related to that field should be avoided in modeling as much as possible. If the missing rate is low, the missing values are handled with a reasonable method, mainly substitution, inference estimation, interpolation and the like; more complex methods such as random forests, regression analysis and GBDT (Gradient Boosting Decision Tree) can also be used. At this stage, different missing-value treatments are tried and the one that gives the best model performance is chosen for actual use.
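As an illustration only, a minimal sketch of these missing-value treatments is given below, assuming a pandas DataFrame and a hypothetical drop threshold of 0.8; the column names and threshold are not prescribed by the method.

import numpy as np
import pandas as pd

def handle_missing(df: pd.DataFrame, drop_threshold: float = 0.8) -> pd.DataFrame:
    """Drop high-missing-rate columns, then fill the rest with simple substitutes."""
    out = df.copy()
    miss_rate = out.isna().mean()
    # Fields whose missing rate exceeds the threshold are treated as unusable.
    out = out.drop(columns=miss_rate[miss_rate > drop_threshold].index)
    for col in out.columns:
        if out[col].dtype.kind in "if":            # numeric field: substitute the median
            out[col] = out[col].fillna(out[col].median())
        else:                                      # categorical field: substitute the mode
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Usage with made-up data:
df = pd.DataFrame({"revenue": [1.0, np.nan, 3.0], "region": ["east", None, "east"]})
clean = handle_missing(df)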
(2) Duplicated data
The causes of data duplication are complex, and part of it arises in the data maintenance stage. In practice, duplicated data strongly affects model performance: it lowers computational efficiency and distorts the interpretation of the output model. Data cleaning methods for duplicated values mainly include the duplicate-record matching method, expert-system methods and the like.
(3) High and low cardinality
Among the acquired features, high- and low-cardinality features are very common. High cardinality means that, for a given feature, the number of distinct values is a large share of all its records, so the values are too scattered for useful information to be extracted from the feature. Low cardinality means the number of distinct values is very small and most values are identical, so the information the feature provides is limited.
(4) Data transformation
Data transformation removes the influence of dimension and can normalize the data distribution. Common transformations include the Box-Cox transformation and logarithmic transformation; min-max normalization and z-score standardization can also be used, and the method is chosen case by case, for example log transformation, arctan transformation, decimal scaling normalization, logistic/softmax transformation, and so on.
At the same time, the data can be integrated according to the model requirements, including joining the relevant tables, aggregating synonymous fields, and so on.
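As an illustration, a small sketch of the transformations named above follows; which transformation to apply, and to which fields, is a modeling choice rather than something fixed by the method.

import numpy as np
import pandas as pd

def transform_numeric(s: pd.Series, how: str = "zscore") -> pd.Series:
    """Common transformations mentioned above; the choice depends on the data."""
    if how == "minmax":                       # scale into [0, 1]
        return (s - s.min()) / (s.max() - s.min())
    if how == "zscore":                       # zero mean, unit variance
        return (s - s.mean()) / s.std()
    if how == "log":                          # compress right-skewed positive values
        return np.log1p(s)
    raise ValueError(f"unknown transformation: {how}")

x = pd.Series([100.0, 250.0, 4000.0, 52000.0])
print(transform_numeric(x, "log"))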
(5) Abnormal data
Anomaly detection mainly cleans noisy data. Besides treating data outside a specified range as noise, clustering methods can be used to find abnormal data belonging to outlier groups. The abnormal data are then deleted or transformed according to the data and the chosen model.
3. Handling data imbalance
The goal of this step is to rebalance the original data so that the ratio of positive to negative samples in the whole sample lies in a reasonable interval. When positive samples are too few, three common balancing methods are generally used: minority-class oversampling, majority-class undersampling, and SMOTE.
(1) Minority-class oversampling
Minority-class oversampling creates copies of minority samples, or manually imitates their distribution to create new samples, essentially without introducing additional information into the model. When new samples are created, a k-nearest-neighbor algorithm can be used to screen them, keeping the new samples most similar to the positive samples. Over-emphasizing the positive samples, however, amplifies the influence of noise within them and carries a risk of overfitting, so the oversampling ratio should not be too large.
(2) Majority-class undersampling
Majority-class undersampling extracts only part of the negative samples, which make up the larger share, and combines them with all the positive samples as the model's whole sample. Its drawback is that a large amount of negative-sample data is discarded, so the information contained in that data is lost. Meanwhile, because many negative samples are discarded, the sample size drops noticeably and the model tends to overfit.
(3) The SMOTE method
SMOTE does not balance the positive/negative ratio by simply copying minority samples; it undersamples the majority class while creating new samples by analyzing the minority class. It is a compromise between the two previous methods and, to a certain extent, alleviates the overfitting caused by insufficient generalization.
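A minimal sketch of the balancing step is given below, assuming the imbalanced-learn library is available; the sampling ratios are illustrative and not prescribed by the method.

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                # hypothetical feature matrix
y = (rng.random(1000) < 0.05).astype(int)     # roughly 5% bad customers: a skewed label

# SMOTE synthesises new minority samples from nearest neighbours instead of
# copying them; the majority class can additionally be undersampled afterwards.
X_over, y_over = SMOTE(sampling_strategy=0.3, random_state=0).fit_resample(X, y)
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.5, random_state=0).fit_resample(X_over, y_over)
print(np.bincount(y), np.bincount(y_bal))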
4. Remaining data processing work
Besides the necessary data processing, data exploration is also needed, extracting effective information through univariate analysis, bivariate analysis and the like. Statistical tests (normality tests, chi-square tests, etc.), correlation tests and monotonicity checks can also be introduced, and variables that fail the tests can be filtered out automatically.
In this stage, the data on basic strength, operating capability, performance capability, debt repayment capability and development prospects are integrated, quality-evaluated and cleaned to obtain a clean data set that meets the model's basic conditions, which then enters the next stage.
The goal of the feature engineering stage is to ensure that the input feature vectors effectively represent the information required by the modeling target. This work often needs some prior knowledge, for example which derived indexes may be strongly correlated with the target, or at least a preliminary direction; the feature vectors required by the model are produced from the raw data as exhaustively as possible, and feature screening follows.
In the traditional modeling process based on statistical learning theory, the requirements on the input feature vectors are strict: hypothesis tests such as independence must be performed to avoid problems like collinearity, or operations such as normalization are needed. The algorithms used in current machine learning modeling, including logistic regression, random forests, gradient boosting trees, support vector machines and neural networks, impose few such restrictions on the feature vectors and usually only require logical correctness. This stage produces the training data set required for modeling, typically a text file of the form (X, y), where X is the feature matrix and y is the business-logic label.
Based on the training data and the preliminary assumptions about the model algorithm, feature engineering operations such as feature derivation, feature binning, feature encoding and feature screening are selectively performed to obtain the data set that finally enters model training.
1. Feature derivation
Feature derivation at this stage goes in three directions. First, starting from prior knowledge, derivable directions are listed according to business experience, and the correlation of the derived indexes is then checked against real data. Second, in the machine learning direction, automatic feature derivation is performed with featuretools in combination with time series. Third, other models, such as random forests, GBDT or LightGBM (Light Gradient Boosting Machine, a distributed gradient boosting framework based on decision trees), are used to predict the default probability, and the prediction is fed as a feature into the subsequent scorecard workflow. This is essentially model fusion: the outputs of the other models are ultimately applied to the credit evaluation model.
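A sketch of the third (model-fusion) direction follows, using scikit-learn's gradient boosting classifier as a stand-in for the black-box model; out-of-fold prediction is an assumption added here to avoid label leakage and is not stated in the original text.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))              # processed evaluation indexes (illustrative)
y = (rng.random(500) < 0.2).astype(int)     # good/bad label

# The black-box model predicts the default probability; cross_val_predict keeps the
# derived feature out-of-fold so the downstream scorecard does not leak the label.
gbdt = GradientBoostingClassifier(random_state=0)
default_prob = cross_val_predict(gbdt, X, y, cv=5, method="predict_proba")[:, 1]

# The probability (or a score derived from it) is appended as one more feature
# for the subsequent credit evaluation (scorecard) model.
X_with_derived = np.column_stack([X, default_prob])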
2. Feature binning
Data indexes fall into numeric and categorical types. Feature binning is often required for numeric features and for high-cardinality categorical features, and it is generally done in several ways: equal-width, equal-frequency, chi-square and decision-tree binning. Equal-width and equal-frequency binning are unsupervised; chi-square and decision-tree binning are supervised.
(1) Decision-tree binning
Decision-tree binning is applied to the sample users' information data according to a preset maximum number of bins, a preset minimum number of leaf-node samples and a preset minimum leaf-node sample ratio, yielding a set of bins.
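A sketch of decision-tree binning under the three presets described above, assuming a scikit-learn decision tree; the concrete preset values are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_bins(x: np.ndarray, y: np.ndarray, max_bins: int = 5,
              min_leaf_ratio: float = 0.05) -> np.ndarray:
    """Return bin edges for one numeric feature using a shallow decision tree."""
    tree = DecisionTreeClassifier(
        max_leaf_nodes=max_bins,                                  # preset maximum number of bins
        min_samples_leaf=max(1, int(min_leaf_ratio * len(x))),    # preset minimum leaf-node share
        random_state=0,
    ).fit(x.reshape(-1, 1), y)
    # Internal nodes carry the split thresholds; leaf nodes are marked with -2.
    thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]
    return np.sort(np.concatenate([[-np.inf], thresholds, [np.inf]]))

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = (x + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)
print(tree_bins(x, y))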
(2) Chi-square binning
For each feature, the chi-square values are sorted: all unique values of the feature are taken and arranged in ascending order, and the chi-square value of each pair of adjacent unique-value sample sets is calculated against the good/bad label column. The pair with the smallest chi-square value is merged each time until the number of bins reaches a preset bin count (typically several counts are tried, e.g. 4, 5 or 6 bins), at which point the samples within each bin of the binning result are the most similar. Meanwhile, bins with too few samples are merged with other bins according to a preset minimum in-bin sample ratio, giving the final chi-square binning result.
The specific binning logic is given in the formulas of the original figures, which are not reproduced here.
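A sketch of the chi-square (ChiMerge-style) procedure described above is given below for a single numeric feature and a 0/1 good/bad label; the merging of under-populated bins mentioned above is omitted for brevity, and the helper names are illustrative.

import numpy as np
import pandas as pd

def chi2_of(a, b) -> float:
    """Chi-square statistic of two adjacent bins' (good, bad) counts."""
    obs = np.array([a, b], dtype=float)          # 2 x 2 table: rows = bins, cols = good/bad
    row, col, total = obs.sum(1, keepdims=True), obs.sum(0, keepdims=True), obs.sum()
    exp = row * col / total
    with np.errstate(divide="ignore", invalid="ignore"):
        chi = np.where(exp > 0, (obs - exp) ** 2 / exp, 0.0)
    return chi.sum()

def chimerge(x: pd.Series, y: pd.Series, n_bins: int = 5) -> list:
    """Merge adjacent unique values with the smallest chi-square until n_bins remain."""
    df = pd.DataFrame({"x": x, "y": y})
    grouped = df.groupby("x")["y"].agg(bad="sum", total="count").sort_index()
    # Each initial bin: [unique value, good count, bad count]
    bins = [[v, row["total"] - row["bad"], row["bad"]] for v, row in grouped.iterrows()]
    while len(bins) > n_bins:
        chis = [chi2_of(bins[i][1:], bins[i + 1][1:]) for i in range(len(bins) - 1)]
        i = int(np.argmin(chis))                 # merge the most similar adjacent pair
        bins[i] = [bins[i][0], bins[i][1] + bins[i + 1][1], bins[i][2] + bins[i + 1][2]]
        del bins[i + 1]
    return [b[0] for b in bins]                  # left edges of the final bins

rng = np.random.default_rng(3)
x = pd.Series(rng.integers(0, 50, size=2000))
y = pd.Series((x + rng.normal(scale=10, size=2000) > 30).astype(int))
print(chimerge(x, y, n_bins=5))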
3. Feature encoding
Optional encoding methods include WOE (Weight of Evidence, an encoding of the independent variables), CatBoost encoding and target encoding; low-cardinality categorical features can also be handled with one-hot or dummy encoding.
In the credit evaluation model, WOE encoding is generally used. For the binned sample set, the WOE value of each bin is calculated as: WOE = ln( (bad customers in the bin / all bad customers) / (good customers in the bin / all good customers) ). This converts a non-linear feature into a linear one: the higher the WOE value, the greater the probability of being a bad customer and the lower the resulting credit score. Consequently, the feature weights obtained after training the model on WOE-transformed features should be either all positive or all negative.
4. Feature screening
(1) Monotonicity screening and IV screening (IV, Information Value, represents a feature's contribution to predicting the target, i.e. its predictive power)
Because several preset bin counts may be tried, one feature can yield several binning results, so the monotonicity of each binning result is verified first: whether the feature, once binned, matches the monotonicity criterion; among a feature's several binnings, the most relevant one is kept. The specific monotonicity criterion is: after de-duplication, the WOE values of the bins are monotonically increasing, monotonically decreasing or U-shaped, that is, the second derivative is zero at most once.
The criterion for relevance is the IV value. The IV value is computed per feature: first the IV value of each bin is calculated, then the bin IVs are summed to obtain the feature's IV. The formula for a bin's IV is: IV = WOE of the bin * ( (bad customers in the bin / all bad customers) - (good customers in the bin / all good customers) ). An IV value below 0.02 is generally considered to indicate poor predictive power and the feature is discarded; IV > 0.5 may be due to the sample distribution or may reflect genuinely strong predictive power and requires manual inspection; an IV between 0.02 and 0.5 indicates good predictive power and the feature enters the next step.
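A sketch of the WOE and IV formulas above for one binned feature follows; the smoothing constant eps is an assumption added here to avoid division by zero and is not part of the original formulas.

import numpy as np
import pandas as pd

def woe_iv(binned: pd.Series, y: pd.Series, eps: float = 0.5) -> pd.DataFrame:
    """Per-bin WOE and IV for a binned feature and a 0/1 bad-customer label."""
    tab = pd.crosstab(binned, y).reindex(columns=[0, 1], fill_value=0)
    good = tab[0] + eps                              # eps avoids log(0) / division by zero
    bad = tab[1] + eps
    good_dist = good / good.sum()                    # share of all good customers in the bin
    bad_dist = bad / bad.sum()                       # share of all bad customers in the bin
    woe = np.log(bad_dist / good_dist)               # higher WOE -> more likely a bad customer
    iv = (bad_dist - good_dist) * woe                # per-bin IV; the feature IV is the sum
    return pd.DataFrame({"woe": woe, "iv": iv})

bins = pd.Series(["(0,10]", "(0,10]", "(10,30]", "(10,30]", "(30,inf]", "(30,inf]"])
label = pd.Series([0, 0, 0, 1, 1, 1])
result = woe_iv(bins, label)
print(result, "feature IV =", result["iv"].sum())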
The feature set produced at this stage is the first-round screened feature set.
(2) VIF screening
VIF is the variance inflation factor, the ratio of the variance when multicollinearity exists among the feature variables to the variance when it does not, defined as VIF = 1/(1 - R^2). The higher the VIF, the stronger the linear correlation between the feature variable and the other feature variables. When the VIF exceeds a preset threshold, strong collinearity exists among the feature variables and the model easily becomes unstable, so the VIF is required to be below the preset threshold.
The feature set produced at this stage is the second-round screened feature set.
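A sketch of VIF screening using statsmodels' variance inflation factor; the threshold of 10 stands in for the preset threshold and is illustrative.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
X["f4"] = X["f1"] * 0.9 + rng.normal(scale=0.1, size=300)   # deliberately collinear

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the others.
vif = pd.Series(
    [variance_inflation_factor(X.values, j) for j in range(X.shape[1])],
    index=X.columns,
)
keep = vif[vif < 10].index        # an illustrative threshold for the preset threshold
print(vif, "\nkept:", list(keep))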
(3) Correlation screening
Correlation has two parts: correlation among the independent variables, and correlation between the independent variables and the dependent variable. First, the correlation between each independent variable and the dependent variable is screened, and features whose correlation is below preset threshold 1 are removed. Then the correlations among the independent variables are examined: pairs whose correlation is above preset threshold 2 are picked out, the correlation of each member with the dependent variable is calculated, and the member with the higher correlation is kept. This finally forms the third-round screened feature set.
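A sketch of the two-step correlation screen; the thresholds 0.1 and 0.8 stand in for preset thresholds 1 and 2 and are illustrative.

import numpy as np
import pandas as pd

def correlation_screen(X: pd.DataFrame, y: pd.Series,
                       min_target_corr: float = 0.1,
                       max_pairwise_corr: float = 0.8) -> list:
    """Keep features correlated with the target, then break up highly correlated pairs."""
    target_corr = X.apply(lambda col: col.corr(y)).abs()
    kept = list(target_corr[target_corr >= min_target_corr].index)
    corr = X[kept].corr().abs()
    candidates = list(kept)
    # For each highly correlated pair, drop the member with the weaker target correlation.
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            if a in kept and b in kept and corr.loc[a, b] > max_pairwise_corr:
                kept.remove(a if target_corr[a] < target_corr[b] else b)
    return kept

rng = np.random.default_rng(5)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X["b"] = X["a"] + rng.normal(scale=0.05, size=200)      # near-duplicate of "a"
y = pd.Series((X["a"] > 0).astype(int))
print(correlation_screen(X, y))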
(4) Model importance
Features with high importance in black-box models tend to be the more predictive ones. More important features can be screened by training tree models and outputting their feature importances. Algorithms such as CatBoost, XGBoost, LightGBM, GBDT and random forests are commonly used; each model independently outputs a feature importance, the outputs are combined, and features that rank highly in several of them are selected.
The feature set produced at this stage is the fourth-round screened feature set.
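A sketch of model-importance screening with two scikit-learn tree ensembles; the text also names CatBoost, XGBoost and LightGBM, which would be used the same way, and the top-3 rank rule is illustrative.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(6)
X = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])
y = ((X["f0"] + 0.5 * X["f1"] + rng.normal(scale=0.5, size=500)) > 0).astype(int)

models = {
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbdt": GradientBoostingClassifier(random_state=0),
}
ranks = {}
for name, model in models.items():
    model.fit(X, y)
    # Rank features by each model's importance (1 = most important).
    ranks[name] = pd.Series(model.feature_importances_, index=X.columns).rank(ascending=False)

# Keep features that rank near the top in every model (top-3 here is illustrative).
rank_df = pd.DataFrame(ranks)
selected = rank_df[(rank_df <= 3).all(axis=1)].index
print(rank_df, "\nselected:", list(selected))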
(5) AHP (Analytic Hierarchy Process)
In the feature derivation step, models such as gradient boosting trees and random forests each output a credit score as a feature, which is essentially model fusion. For the credit-score features output by the black-box models, after passing the four screening steps of monotonicity, IV, VIF and correlation, a consistency check based on the AHP is also required. If the check passes, the features are added to the training of the credit evaluation model; if not, following the idea of backward search, each feature is removed in turn to find the optimal subset of credit-score features that can pass the check. Unlike the conventional AHP, the covariance matrix of the credit-score features is used instead of an expert-given judgment matrix, and the credit-score features do not need to be weighted.
(5.1) Establishing the hierarchical structure model
The decision target, the factors considered (decision criteria) and the decision objects are divided, according to their interrelationships, into a highest layer, a middle layer and a lowest layer, and a hierarchy diagram is drawn. The highest layer is the purpose of the decision, the problem to be solved; the lowest layer contains the alternative options; the middle layer contains the factors considered and the decision criteria. For two adjacent layers, the upper one is called the target layer and the lower one the factor layer.
(5.2) Constructing the judgment (pairwise comparison) matrix
A covariance matrix A is calculated from the quantitative index data of the elements.
The covariance matrix is then transformed to construct the judgment matrix. The judgment matrix should have two properties: entries symmetric about the diagonal are reciprocals (their product is 1), and the diagonal entries are 1.
(5.3) Hierarchical single ranking and consistency check
The eigenvalues of the judgment matrix are computed. The eigenvector corresponding to the maximum eigenvalue λ_max of the judgment matrix is normalized (so that its elements sum to 1) and denoted W. The elements of W are the ranking weights of the relative importance of the factors at this level with respect to a factor of the level above; this process is called hierarchical single ranking. Once the single ranking is confirmed, a consistency check is needed, i.e. an allowable range of inconsistency is determined for A. The only non-zero eigenvalue of an n-order consistent matrix is n, and the maximum eigenvalue of an n-order judgment matrix A satisfies λ_max >= n, with equality if and only if A is a consistent matrix.
Because λ depends continuously on the matrix entries, the more λ_max exceeds n, the more serious the inconsistency of A; the consistency is therefore measured with the consistency index CI, and the smaller CI is, the greater the consistency. The eigenvector corresponding to the maximum eigenvalue is used as the weight vector of the compared factors' influence on a factor of the upper layer, and the greater the inconsistency, the greater the judgment error it causes. The degree of inconsistency of A can thus be measured by the size of λ_max - n, and the consistency index is defined as:
CI = (λ_max - n) / (n - 1)
CI = 0 means perfect consistency; CI close to 0 means satisfactory consistency; the larger CI is, the more serious the inconsistency.
To judge the size of CI, the average random consistency index RI is introduced; RI is obtained by averaging the consistency indexes of a large number of randomly generated judgment matrices of the same order.
The random consistency index RI is related to the order of the judgment matrix: in general, the larger the order, the more likely a random consistency deviation appears. Commonly used standard values of RI are (the exact values differ slightly between references):
n:  1     2     3     4     5     6     7     8     9
RI: 0.00  0.00  0.58  0.90  1.12  1.24  1.32  1.41  1.45
Considering that the consistency deviation may be caused by random factors, when checking whether the judgment matrix has satisfactory consistency, CI is also compared with the random consistency index RI to obtain the check coefficient CR:
CR = CI / RI
In general, if CR < 0.1, the judgment matrix is considered to pass the consistency check; otherwise it does not have satisfactory consistency.
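A sketch of the eigenvalue-based consistency check described above; the RI table uses the commonly cited reference values and the example judgment matrix is made up.

import numpy as np

# Commonly cited average random consistency index RI by matrix order (the values
# vary slightly between references, as noted above).
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def consistency_check(A: np.ndarray, cr_threshold: float = 0.1):
    """Return (weights, CI, CR, passed) for a positive reciprocal judgment matrix A."""
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)                    # lambda_max and its eigenvector
    lam_max = eigvals.real[k]
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                                # normalised weight vector W
    ci = (lam_max - n) / (n - 1)                   # CI = (lambda_max - n) / (n - 1)
    cr = ci / RI[n] if RI[n] > 0 else 0.0          # CR = CI / RI
    return w, ci, cr, cr < cr_threshold

# Illustrative 3x3 judgment matrix (reciprocal entries, diagonal of ones).
A = np.array([[1.0, 2.0, 4.0],
              [0.5, 1.0, 2.0],
              [0.25, 0.5, 1.0]])
print(consistency_check(A))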
(5.4) Hierarchical total ranking and consistency check
The weights of all factors at a given level for their relative importance to the highest level (the overall target) are calculated; this is called hierarchical total ranking. The process proceeds in order from the highest level down to the lowest level.
The feature set produced at this stage is the fifth-round screened feature set, and it is the training data set that enters model training.
In the model training stage:
1. Data set partitioning
Before model training starts, the data set produced in the previous stage is usually split into a training set, a validation set and a test set. The training set is used for model training and parameter tuning, the validation set for selecting the model and its parameters, and the test set for evaluating the model's results. The data set is split mainly with the hold-out method, cross-validation, or the bootstrap method.
(1) Hold-out method
The data set is divided into three parts in a fixed ratio; the split is repeated (for example 100 times) and the average is taken as the overall evaluation of the model. This method suits scenarios with a large initial data sample.
(2) Cross-validation
The data set is divided into k mutually exclusive subsets; each time, k-1 subsets are used as the training sample set and the remaining subset as the validation set, so that model training and testing are performed k times, and the average of the k test results is returned. The stability and fidelity of the evaluation largely depend on the value of k, so the method is also called k-fold cross-validation.
(3) Bootstrap method
For a data set D of m samples, one sample is drawn from D at random and copied into D', and this is repeated m times. About 63.2% of the distinct samples of D then appear in D', which serves as the training set, while D \ D' (roughly m/e samples that never appear in D') serves as the validation set; the resulting test is called out-of-bag estimation. This method is useful when the data set is small and a training/test split is hard to make, but it changes the distribution of the initial training set and introduces estimation bias.
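A sketch of the hold-out and k-fold splits with scikit-learn; the split ratios and k are illustrative.

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 8))
y = (rng.random(1000) < 0.2).astype(int)

# Hold-out: a stratified train / validation / test split (ratios are illustrative).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# k-fold cross-validation: k mutually exclusive subsets, each used once for validation.
for fold, (train_idx, val_idx) in enumerate(StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y)):
    pass  # train on X[train_idx], evaluate on X[val_idx]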
2. Model training
Many algorithms can be used to build the model. Taking classification scenarios as an example, random forests, support vector machines, gradient boosting trees, neural networks, LASSO and other algorithms are all common and effective. Different algorithms have different strengths and weaknesses in predictive performance, tuning cost and training overhead, and can be chosen according to the scenario. In credit evaluation models, logistic regression is typically used in order to preserve the interpretability of the model.
Because the features are WOE-encoded after binning, the weight of each feature output by the final model should be either all positive or all negative. In addition, the credit evaluation model has two different directions in the training step, depending on the model requirements.
(1) p-value test
While ensuring that every feature weight of the model has the same sign and pursuing model performance, the p-value of every final model index must also be smaller than a preset threshold p. The training logic consists of an outer loop and an inner loop.
In the inner loop, after the model is trained, it is checked whether the feature weights are all positive or all negative and whether every p-value is below the preset threshold p. If the conditions are met, the model is output; the feature with the largest p-value is then deleted from the feature set and the model is re-run with the remaining features, after which the conditions are checked again. If the conditions are not met, the current p-value results are averaged with the previous ones, the feature with the largest p-value is deleted from the feature set, and the model is re-run with the remaining features.
In the outer loop, each iteration contains (number of features + 1) inner-loop iterations; after they finish, the feature with the largest p-value is deleted and the inner loop is run again. The loop continues until a model satisfying both conditions is found; otherwise an empty model is output.
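A simplified sketch of the p-value-driven training loop, using statsmodels logistic regression; it collapses the inner/outer loop structure above into a single elimination loop, and the threshold is illustrative.

import numpy as np
import pandas as pd
import statsmodels.api as sm

def fit_with_pvalue_filter(X: pd.DataFrame, y: pd.Series, p_threshold: float = 0.05):
    """Refit logistic regression, dropping the worst-p-value feature until the
    remaining coefficients all share one sign and all p-values pass the threshold."""
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        coefs = model.params.drop("const")
        same_sign = (coefs > 0).all() or (coefs < 0).all()
        if same_sign and (pvals < p_threshold).all():
            return model, cols                      # conditions met: output this model
        cols.remove(pvals.idxmax())                 # otherwise drop the weakest feature
    return None, []                                 # no admissible model found: empty model

rng = np.random.default_rng(8)
X = pd.DataFrame(rng.normal(size=(800, 5)), columns=[f"woe_{i}" for i in range(5)])
y = pd.Series((X["woe_0"] + X["woe_1"] + rng.normal(scale=1.0, size=800) > 0).astype(int))
model, kept = fit_with_pvalue_filter(X, y)
print(kept)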
(2) Lasso regression and ridge regression
Model overfitting is suppressed by adding a penalty term to the loss function: Lasso regression uses the L1 norm and ridge regression the L2 norm. The training logic is the same as in (1), except that the only condition is that the feature weights of the model are all positive or all negative.
3. Credit score conversion
The default probability produced after model training is, for interpretability, usually converted into a credit score, including a score for each bin of each feature. The specific conversion is: Score = A - B * log(odds), where A is the base score and B is the score corresponding to a doubling of the odds.
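A sketch of the score conversion Score = A - B * log(odds); the base score, base odds and points-to-double-odds (PDO) values are illustrative.

import numpy as np

def prob_to_score(p_bad: np.ndarray, base_score: float = 600.0,
                  base_odds: float = 1 / 50, pdo: float = 50.0) -> np.ndarray:
    """Convert default probability to a credit score: Score = A - B * ln(odds)."""
    b = pdo / np.log(2)                       # B: points needed to double the odds
    a = base_score + b * np.log(base_odds)    # A: base score anchored at base_odds
    odds = p_bad / (1 - p_bad)                # odds of being a bad customer
    return a - b * np.log(odds)

print(prob_to_score(np.array([0.01, 0.05, 0.2, 0.5])))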
In the model evaluation stage:
Model evaluation is usually part of the modeling process, but a model's real effect is closely tied to the business scenario, and reporting only generic evaluation metrics often cannot objectively describe the model's impact on and help to the business. Generic metrics include Precision, Recall, AUC and the KS value for classification scenarios, and R-squared and RMSE for prediction scenarios. The generic evaluation is based on an offline test set; after the model goes live, an observation period is needed to monitor its effect, and the model is evaluated periodically to confirm that no business change or promotional activity has caused it to fail.
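A sketch of computing AUC and the KS value on a test set with scikit-learn; the synthetic scores are for illustration only.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(9)
y_true = (rng.random(1000) < 0.2).astype(int)
# A fake model score: informative but noisy, purely for illustration.
y_score = 0.7 * y_true + rng.random(1000) * 0.6

auc = roc_auc_score(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)                 # KS statistic: max separation of the good/bad curves
print(f"AUC = {auc:.3f}, KS = {ks:.3f}")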
Interpretation of the model results
From the data and the features, a model can be built and read out as a scorecard that visualizes every index, achieving the grading effect.
Taking the credit evaluation model of a certain agricultural guarantee project as an example, the visualized output of the model is shown in fig. 2. In fig. 2, the scoring interval is 300 to 1000; the higher the score, the lower the credit risk and the lower the probability of default.
The scoring intervals correspond to credit risk grades as follows:
[300, 500): high risk, rejection advised;
[500, 550): risk, attention advised;
[650, 1000]: low risk, approval advised.
Compared with the expert-experience approach, the credit evaluation method above involves no subjective input from experts or business personnel anywhere in the model design process, which makes the model fully objective. By means of model fusion, credit scores output by other black-box models are added as derived features to the training of the credit evaluation model, enriching the feature dimensions. Before model training, the features that actually enter training are selected through multiple layers of feature screening, each layer with its own principles and hard criteria for whether a feature passes. The whole model building process is detailed, the logic is clear, and each feature is highly interpretable.
The credit evaluation device provided by the invention is described below, and the credit evaluation device described below and the credit evaluation method described above can be referred to correspondingly to each other.
As shown in fig. 3, in one embodiment, the credit evaluation device of the present invention includes:
an acquisition module 310, configured to acquire evaluation indexes of a business entity, where the evaluation indexes include the basic strength, operating capability, performance capability, debt repayment capability and development prospects of the business entity;
a data processing module 320, configured to perform data processing and feature engineering on the evaluation indexes to obtain target data;
an evaluation module 330, configured to input the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model on a training data set, and the training data set is obtained by performing data processing and feature engineering on the evaluation indexes of a plurality of business entities.
In this embodiment, the data processing module is specifically configured to:
perform data cleaning on the evaluation indexes, where the data cleaning removes high-missing-rate data, duplicated data and high/low-cardinality data;
and perform balance processing on the evaluation indexes, where the balance processing includes minority-class oversampling, majority-class undersampling and the SMOTE method.
In this embodiment, the feature engineering includes feature derivation, feature binning, feature encoding and feature screening, and the data processing module is specifically configured to:
select one or more of the feature engineering operations to apply to the evaluation indexes.
In this embodiment, the feature binning includes decision-tree binning, which bins the evaluation indexes according to a preset maximum number of bins, a preset minimum number of leaf-node samples and a preset minimum leaf-node sample ratio.
In this embodiment, the feature encoding uses WOE encoding, which calculates the WOE value of each bin over the binned sample set, where a higher WOE value indicates a higher probability of being a bad customer and results in a lower final credit score.
In this embodiment, the feature screening includes monotonicity screening, IV value screening, VIF screening, correlation screening, model importance screening and AHP screening.
In this embodiment, the AHP screening includes:
establishing a hierarchical structure model;
constructing a judgment matrix;
performing hierarchical single ranking and checking its consistency;
and performing hierarchical total ranking and checking its consistency.
Compared with the expert-experience approach, the credit evaluation device of this embodiment involves no subjective input from experts or business personnel anywhere in the model design process, which makes the model fully objective. By means of model fusion, credit scores output by other black-box models are added as derived features to the training of the credit evaluation model, enriching the feature dimensions. Before model training, the features that actually enter training are selected through multiple layers of feature screening, each layer with its own principles and hard criteria for whether a feature passes. The whole model building process is detailed, the logic is clear, and each feature is highly interpretable.
Fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, which may include: processor 410, communication interface (Communications Interface) 420, memory 430 and communication bus 440, wherein processor 410, communication interface 420 and memory 430 communicate with each other via communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a credit evaluation method comprising:
acquiring evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, performance capability, debt repayment capability and development prospects of the business entity;
performing data processing and feature engineering on the evaluation indexes to obtain target data;
inputting the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model on a training data set, and the training data set is obtained by performing data processing and feature engineering on the evaluation indexes of a plurality of business entities.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the credit evaluation method provided by the above methods, comprising:
acquiring evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, performance capability, debt repayment capability and development prospects of the business entity;
performing data processing and feature engineering on the evaluation indexes to obtain target data;
inputting the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model on a training data set, and the training data set is obtained by performing data processing and feature engineering on the evaluation indexes of a plurality of business entities.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the credit evaluation method provided by the above methods, comprising:
acquiring evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, performance capability, debt repayment capability and development prospects of the business entity;
performing data processing and feature engineering on the evaluation indexes to obtain target data;
inputting the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model on a training data set, and the training data set is obtained by performing data processing and feature engineering on the evaluation indexes of a plurality of business entities.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A credit evaluation method, the method comprising:
acquiring evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, contract performance capability, debt repayment capability and development prospects of the business entity;
performing data processing and feature engineering operations on the evaluation indexes to obtain target data;
inputting the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model with a training data set, and the training data set is obtained by performing data processing and feature engineering operations on the evaluation indexes of a plurality of business entities.
2. The credit evaluation method according to claim 1, wherein performing data processing on the evaluation indexes includes:
performing data cleaning on the evaluation indexes, wherein the data cleaning removes data with a high missing rate, duplicate data, and high- or low-cardinality data;
and performing balancing on the evaluation indexes, wherein the balancing includes minority-class oversampling, majority-class undersampling, and the SMOTE method.
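A minimal sketch of this balancing step, assuming the imbalanced-learn package and a 0/1 good-bad label, might look as follows; the three samplers map one-to-one onto the techniques listed above:

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

def balance(X, y, strategy="smote", random_state=42):
    samplers = {
        "oversample":  RandomOverSampler(random_state=random_state),   # minority-class oversampling
        "undersample": RandomUnderSampler(random_state=random_state),  # majority-class undersampling
        "smote":       SMOTE(random_state=random_state),               # synthetic minority samples
    }
    return samplers[strategy].fit_resample(X, y)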
3. The credit evaluation method according to claim 1, wherein the feature engineering includes feature derivation, feature binning, feature encoding and feature screening, and performing the feature engineering operation on the evaluation indexes includes:
selecting one or more of the feature engineering operations and applying them to the evaluation indexes.
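One possible way to let a caller select among these operations, sketched with hypothetical callables and step names, is a simple dispatcher:

def apply_feature_engineering(data, steps, operations):
    # operations maps step name -> callable; any subset of steps may be chosen,
    # e.g. steps = ["derivation", "binning", "encoding", "screening"].
    for name in steps:
        data = operations[name](data)
    return data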
4. The credit evaluation method according to claim 3, wherein the feature binning includes chi-square binning, the chi-square binning comprising:
taking all unique values of a feature, arranging them in ascending order, and calculating the chi-square value of each pair of adjacent unique-value sample sets against the good/bad label column;
merging the adjacent pair with the minimum chi-square value until the number of bins reaches a preset number of bins;
and merging bins according to the number of samples in each bin to obtain the chi-square binning result.
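An illustrative sketch of this chi-square binning procedure (assuming a 0 = good, 1 = bad label and omitting the final small-sample merge of the last step) could be:

import numpy as np

def chi2_of_pair(counts_a, counts_b):
    # counts_* are (n_good, n_bad) tuples; chi-square statistic of the 2x2 table.
    table = np.array([counts_a, counts_b], dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    diff2 = (table - expected) ** 2
    terms = np.divide(diff2, expected, out=np.zeros_like(diff2), where=expected > 0)
    return float(terms.sum())

def chi_square_binning(values, labels, n_bins=5):
    # Step 1: one initial bin per unique value, in ascending order.
    values, labels = np.asarray(values), np.asarray(labels)
    bins = []
    for v in np.sort(np.unique(values)):
        mask = values == v
        bins.append([v, (int(np.sum(labels[mask] == 0)), int(np.sum(labels[mask] == 1)))])
    # Step 2: repeatedly merge the adjacent pair with the smallest chi-square value.
    while len(bins) > n_bins:
        chi2s = [chi2_of_pair(bins[i][1], bins[i + 1][1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(chi2s))
        merged = (bins[i][1][0] + bins[i + 1][1][0], bins[i][1][1] + bins[i + 1][1][1])
        bins[i] = [bins[i + 1][0], merged]        # keep the right edge of the merged bin
        del bins[i + 1]
    # (Step 3, merging bins with too few samples, is omitted in this sketch.)
    return [edge for edge, _ in bins]             # upper cut point of each bin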
5. The credit evaluation method according to claim 4, wherein the feature encoding is WOE encoding, and the WOE encoding calculates, for the binned sample sets, the WOE value corresponding to each bin, wherein a higher WOE value indicates a higher probability of being a bad customer and leads to a lower final credit score.
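Under the sign convention stated above (higher WOE means a higher bad-customer share), the per-bin WOE values might be computed along these lines; the eps smoothing term and the 0/1 label convention are assumptions:

import numpy as np
import pandas as pd

def woe_by_bin(binned_feature, bad_label, eps=1e-6):
    # bad_label: 1 = bad customer, 0 = good customer (assumed convention).
    df = pd.DataFrame({"bin": binned_feature, "bad": bad_label})
    counts = df.groupby("bin")["bad"].agg(bad="sum", total="count")
    counts["good"] = counts["total"] - counts["bad"]
    bad_share = counts["bad"] / max(counts["bad"].sum(), 1)
    good_share = counts["good"] / max(counts["good"].sum(), 1)
    # With this sign convention, a larger WOE means a larger share of bad customers in the bin.
    return np.log((bad_share + eps) / (good_share + eps))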
6. The credit evaluation method according to claim 4, wherein the feature screening includes monotonicity screening, IV value screening, VIF screening, correlation screening, model importance screening and AHP (analytic hierarchy process) screening.
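Two of these screening criteria can be illustrated briefly: IV aggregated from the per-bin shares and WOE values of the previous sketch, and VIF screening via statsmodels; the 10.0 cut-off is an assumed threshold, not one stated by the patent:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def information_value(bad_share, good_share, woe):
    # IV = sum over bins of (bad share - good share) * WOE; larger IV = more predictive.
    return float(((bad_share - good_share) * woe).sum())

def vif_screen(features: pd.DataFrame, max_vif=10.0):
    # Keep only features whose variance inflation factor is below the (assumed) threshold.
    X = features.assign(_const=1.0)                # add an intercept column for the regressions
    kept = []
    for i, col in enumerate(X.columns):
        if col != "_const" and variance_inflation_factor(X.values, i) <= max_vif:
            kept.append(col)
    return kept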
7. The credit evaluation method according to claim 6, wherein the AHP screening includes:
establishing a hierarchical structure model;
constructing a judgment matrix;
performing single-level ranking and checking its consistency;
and performing total hierarchical ranking and checking its consistency.
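A compact sketch of the judgment-matrix weighting and the single-level consistency check might look as follows; the 3x3 judgment matrix and the 0.1 consistency threshold are illustrative assumptions, and the RI table uses Saaty's commonly cited values:

import numpy as np

# Saaty's random consistency index, indexed by matrix order n = 1..9.
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def ahp_weights(judgment_matrix):
    A = np.asarray(judgment_matrix, dtype=float)
    eigvals, eigvecs = np.linalg.eig(A)
    k = int(np.argmax(eigvals.real))
    weights = np.abs(eigvecs[:, k].real)
    weights /= weights.sum()
    # Consistency check: CI = (lambda_max - n) / (n - 1), CR = CI / RI[n], accepted if CR < 0.1.
    n = A.shape[0]
    ci = (eigvals.real[k] - n) / (n - 1)
    cr = ci / RI[n] if RI[n] else 0.0
    return weights, cr

# Hypothetical 3x3 judgment matrix comparing three criteria pairwise.
w, cr = ahp_weights([[1, 3, 5], [1/3, 1, 3], [1/5, 1/3, 1]])
print(w, "consistent" if cr < 0.1 else "inconsistent")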
8. A credit evaluation device, the device comprising:
an acquisition module, configured to acquire evaluation indexes of a business entity, wherein the evaluation indexes comprise the basic strength, operating capability, contract performance capability, debt repayment capability and development prospects of the business entity;
a data processing module, configured to perform data processing and feature engineering operations on the evaluation indexes to obtain target data;
an evaluation module, configured to input the target data into an agricultural risk control model to output a credit evaluation result for the business entity;
wherein the agricultural risk control model is obtained by training a machine learning model with a training data set, and the training data set is obtained by performing data processing and feature engineering operations on the evaluation indexes of a plurality of business entities.
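Read as software, the three claimed modules could be organized roughly as below; all class and method names are hypothetical:

class AcquisitionModule:
    def acquire(self, entity_id):
        # Fetch the entity's evaluation indexes (data source is deployment-specific).
        raise NotImplementedError

class DataProcessingModule:
    def transform(self, indexes):
        # Data cleaning, balancing, derivation, binning, encoding, screening.
        raise NotImplementedError

class EvaluationModule:
    def __init__(self, risk_model):
        self.risk_model = risk_model              # trained agricultural risk control model

    def evaluate(self, target_data):
        return self.risk_model.predict(target_data)

class CreditEvaluationDevice:
    def __init__(self, acquisition, processing, evaluation):
        self.acquisition, self.processing, self.evaluation = acquisition, processing, evaluation

    def run(self, entity_id):
        indexes = self.acquisition.acquire(entity_id)
        target_data = self.processing.transform(indexes)
        return self.evaluation.evaluate(target_data)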
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the credit evaluation method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the credit evaluation method according to any one of claims 1 to 7.
CN202310047227.0A 2023-01-31 2023-01-31 Credit evaluation method, credit evaluation device, electronic equipment and storage medium Active CN116091206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310047227.0A CN116091206B (en) 2023-01-31 2023-01-31 Credit evaluation method, credit evaluation device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310047227.0A CN116091206B (en) 2023-01-31 2023-01-31 Credit evaluation method, credit evaluation device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116091206A 2023-05-09
CN116091206B 2023-10-20

Family

ID=86211662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310047227.0A Active CN116091206B (en) 2023-01-31 2023-01-31 Credit evaluation method, credit evaluation device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116091206B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104299105A (en) * 2014-11-02 2015-01-21 中国科学院软件研究所 Credit data management system supporting complex enterprise environment and credit data management method
CN106611099A (en) * 2015-10-16 2017-05-03 中国传媒大学 Program evaluation system and method based on analytic hierarchy process
CN108288170A (en) * 2017-12-07 2018-07-17 上海电力学院 A kind of evaluation method of the Demand Side Response project based on analytic hierarchy process (AHP)
CN109636591A (en) * 2018-12-28 2019-04-16 浙江工业大学 A kind of credit scoring card development approach based on machine learning
CN113011624A (en) * 2019-12-18 2021-06-22 中移(上海)信息通信科技有限公司 User default prediction method, device, equipment and medium
CN112434886A (en) * 2020-12-17 2021-03-02 北京环信简益科技有限公司 Method for predicting client mortgage loan default probability
CN112801775A (en) * 2021-01-29 2021-05-14 中国工商银行股份有限公司 Client credit evaluation method and device
CN113011752A (en) * 2021-03-19 2021-06-22 天道金科股份有限公司 Enterprise credit evaluation index system based on big data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAI Jingbo et al.: "Predicting high stock-dividend and share-transfer stocks in the A-share market based on machine learning models", Mathematical Modeling and Its Applications, vol. 9, no. 4, pages 74-84 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196776A (en) * 2023-09-09 2023-12-08 广东德澳智慧医疗科技有限公司 Cross-border electronic commerce product credit marking and after-sale system based on random gradient lifting tree algorithm

Also Published As

Publication number Publication date
CN116091206B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110009479B (en) Credit evaluation method and device, storage medium and computer equipment
WO2019236997A1 (en) Systems and methods for decomposition of non-differentiable and differentiable models
CN108876034B (en) Improved Lasso + RBF neural network combination prediction method
CN107168995B (en) Data processing method and server
CN113360358B (en) Method and system for adaptively calculating IT intelligent operation and maintenance health index
CN111950585A (en) XGboost-based underground comprehensive pipe gallery safety condition assessment method
CN113537807B (en) Intelligent wind control method and equipment for enterprises
CN111738331A (en) User classification method and device, computer-readable storage medium and electronic device
CN116091206B (en) Credit evaluation method, credit evaluation device, electronic equipment and storage medium
CN111079937A (en) Rapid modeling method
CN114219096A (en) Training method and device of machine learning algorithm model and storage medium
CN113177643A (en) Automatic modeling system based on big data
CN114638498A (en) ESG evaluation method, ESG evaluation system, electronic equipment and storage equipment
CN113642922A (en) Small and medium-sized micro enterprise credit evaluation method and device
CN116468273A (en) Customer risk identification method and device
Tounsi et al. CSMAS: Improving multi-agent credit scoring system by integrating big data and the new generation of gradient boosting algorithms
CN116911994B (en) External trade risk early warning system
CN117114812A (en) Financial product recommendation method and device for enterprises
CN114282875A (en) Flow approval certainty rule and semantic self-learning combined judgment method and device
US20210356920A1 (en) Information processing apparatus, information processing method, and program
Sam et al. Customer churn prediction using machine learning models
CN113935819A (en) Method for extracting checking abnormal features
CN112215514A (en) Operation analysis report generation method and system
Oladipo et al. Customer Churn Prediction in Telecommunications Using Ensemble Technique
Putina et al. AutoAD: an automated framework for unsupervised anomaly detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant