CN113724060A - Credit risk assessment method and system

Info

Publication number: CN113724060A
Application number: CN202110245073.7A
Authority: CN
Inventors: 陈秀华; 宫辰
Original assignee: Nanjing Haoxiang Basic Software Research Institute Co ltd; Nanjing University of Science and Technology
Current assignee: Nanjing Haoxiang Basic Software Research Institute Co ltd; Nanjing University of Science and Technology
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2021-11-30

Abstract

The invention discloses a credit risk assessment method and a credit risk assessment system. The method comprises the following steps: acquiring credit risk assessment data and a current projection matrix; determining a classifier according to the credit risk assessment data and the current projection matrix, with the goal of minimizing the empirical misclassification risk; classifying the unlabeled credit risk data with the classifier, and assigning a pseudo label to each unlabeled sample to obtain pseudo label data; performing linear discriminant analysis on the pseudo label data and the positive sample data to obtain an updated projection matrix; if the iteration end condition is met, outputting the classifier and the updated projection matrix; and performing credit risk assessment on the credit risk assessment data according to the classifier and the updated projection matrix to obtain a credit risk assessment result. By introducing linear discriminant analysis, the method and the system help to construct a robust classifier and improve the credit risk assessment effect.

Description

Credit risk assessment method and system

Technical Field

The invention relates to the technical field of credit risk assessment, in particular to a credit risk assessment method and a credit risk assessment system.

Background

In machine learning, classification is a fundamental research task. The data set of a binary classification task typically contains both positively labeled and negatively labeled examples. In reality, however, labels for negative examples are often difficult to obtain. In credit risk assessment, for example, bad credits can be unambiguously treated as positive examples, while credit risk data that has not been evaluated is not necessarily a negative example (i.e., a good credit). In recent years, fraudulent credit card transactions have grown at an unprecedented rate and have become a major problem in the financial sector. These fraudulent activities cause significant losses to both merchants and financial institutions. Credit risk assessment is therefore an indispensable part of credit loan approval in financial departments.

Most existing credit risk assessment methods rely on supervised learning, which does not fully match the real-world conditions of credit risk assessment. Although such methods can achieve good classification results, in practice negative samples are difficult and expensive to obtain, and the data are often not separable, which makes it very difficult to build a robust classifier. How to improve the credit risk assessment effect is therefore a problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to provide a credit risk assessment method and a credit risk assessment system which, by introducing linear discriminant analysis, help to construct a robust classifier and improve the credit risk assessment effect.

In order to achieve the purpose, the invention provides the following scheme:

a credit risk assessment method, comprising:

acquiring credit risk assessment data and a current projection matrix; the credit risk assessment data comprises single-class credit risk data and unlabeled credit risk data; the single class credit risk data comprises a plurality of positive sample data, and the unlabeled credit risk data comprises a plurality of unlabeled sample data; the current projection matrix is obtained by performing linear discriminant analysis on the credit risk assessment data;

determining a classifier according to the credit risk assessment data and the current projection matrix, with the goal of minimizing the empirical misclassification risk;

classifying the unlabeled credit risk data by adopting the classifier, and distributing pseudo labels to the unlabeled sample data to obtain pseudo label data;

performing linear discriminant analysis on the pseudo label data and the positive sample data to obtain an updated projection matrix;

judging whether an iteration end condition is met; if yes, outputting the classifier and the updated projection matrix; if not, taking the updated projection matrix as the current projection matrix, and then returning to the step of determining a classifier according to the credit risk assessment data and the current projection matrix with the goal of minimizing the empirical misclassification risk;

and performing credit risk assessment on the credit risk assessment data according to the classifier and the updated projection matrix to obtain a credit risk assessment result.

Optionally, after acquiring the credit risk assessment data, further comprising:

normalizing the credit risk assessment data to obtain normalized credit risk assessment data.

Optionally, the determining a classifier according to the credit risk assessment data and the current projection matrix, with the goal of minimizing the empirical misclassification risk, specifically includes:

determining a classifier according to the credit risk assessment data and the current projection matrix by adopting the following formula:

in the formula, the objective is the empirical misclassification risk; f is the classifier and f(·) is its output; π is the prior probability of the positive class; the two sample terms denote the positive sample data and the unlabeled sample data, respectively; l(·) is a loss function; λ is a trade-off parameter; n_p is the number of positive samples; n_u is the number of unlabeled samples; i is the sample index; and R is the projection matrix.
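For orientation, an objective that uses the symbols listed above can be sketched as follows. This is an assumed form, based on the unbiased positive-unlabeled risk that is common in the literature for a loss satisfying l(z) - l(-z) = -z; the superscripts p and u on x and the regularizer Ω(f) are notational assumptions, and the exact expression used by the invention may differ.

```latex
\hat{f} = \arg\min_{f}\; \widehat{R}(f) + \lambda\,\Omega(f),
\qquad
\widehat{R}(f) = -\frac{\pi}{n_p}\sum_{i=1}^{n_p} f\!\left(R^{\top}x_i^{p}\right)
               + \frac{1}{n_u}\sum_{i=1}^{n_u} \ell\!\left(-f\!\left(R^{\top}x_i^{u}\right)\right)
```

Here the first term rewards large classifier outputs on the known positive samples, weighted by the class prior π, and the second term treats every unlabeled sample as a provisional negative.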

Optionally, the performing linear discriminant analysis on the pseudo tag data and the positive sample data to obtain an updated projection matrix specifically includes:

performing linear discriminant analysis on the pseudo label data and the positive sample data by adopting the following formula to obtain an updated projection matrix:

wherein,

S_b = (μ_p - μ_n)(μ_p - μ_n)^T

wherein R is the projection matrix, S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, μ_p is the mean vector of the positive sample data, μ_n is the mean vector of the negative sample data, x is a sample, X_p is the positive sample set, and X_n is the negative sample set; the positive sample set is the data with credit risk, and the negative sample set is the data without credit risk.
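The quantities named above can be written out in the usual two-class linear discriminant analysis form. The expression for S_b is the one stated in the text; the expression for S_w and the product-of-ratios objective over the columns r_i of R are the standard textbook forms and are given here as an assumption about what the formula contains.

```latex
S_b = (\mu_p-\mu_n)(\mu_p-\mu_n)^{\top},
\qquad
S_w = \sum_{x\in X_p}(x-\mu_p)(x-\mu_p)^{\top} + \sum_{x\in X_n}(x-\mu_n)(x-\mu_n)^{\top}

R^{*} = \arg\max_{R}\; \prod_{i=1}^{m}\frac{r_i^{\top} S_b\, r_i}{r_i^{\top} S_w\, r_i},
\qquad R = [\,r_1,\dots,r_m\,]
```

Maximizing each ratio pushes the projected class means apart while keeping the projected samples of each class close to their own mean.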

Optionally, the performing credit risk assessment on the credit risk assessment data according to the classifier and the updated projection matrix to obtain a credit risk assessment result specifically includes:

and according to the updated projection matrix and the credit risk assessment data, performing credit risk classification by using the classifier to obtain a credit risk classification result.

A credit risk assessment system, comprising:

the acquisition module is used for acquiring credit risk assessment data and a current projection matrix; the credit risk assessment data comprises single-class credit risk data and unlabeled credit risk data; the single class credit risk data comprises a plurality of positive sample data, and the unlabeled credit risk data comprises a plurality of unlabeled sample data; the current projection matrix is obtained by performing linear discriminant analysis on the credit risk assessment data;

a classifier determining module, configured to determine a classifier according to the credit risk assessment data and the current projection matrix, with the goal of minimizing the empirical misclassification risk;

a pseudo label data generating module, configured to classify the unlabeled credit risk data by using the classifier, and assign a pseudo label to each unlabeled sample to obtain pseudo label data;

the linear discriminant analysis module is used for performing linear discriminant analysis on the pseudo label data and the positive sample data to obtain an updated projection matrix;

the judging module is used for judging whether the iteration ending condition is met or not; if yes, executing an output module; if not, executing an updating module;

the updating module is used for taking the updated projection matrix as a current projection matrix and then executing the classifier determining module;

an output module for outputting the classifier and the updated projection matrix;

and the credit risk evaluation module is used for performing credit risk evaluation on the credit risk evaluation data according to the classifier and the updated projection matrix to obtain a credit risk evaluation result.

Optionally, the system further comprises:

a processing module, configured to normalize the credit risk assessment data to obtain normalized credit risk assessment data.

Optionally, the classifier determining module specifically includes:

a classifier determining unit, configured to determine a classifier according to the credit risk assessment data and the current projection matrix by using the following formula:

in the formula, the symbols have the same meanings as defined above for the credit risk assessment method.

Optionally, the linear discriminant analysis module specifically includes:

a linear discriminant analysis unit, configured to perform linear discriminant analysis on the pseudo tag data and the positive sample data by using the following formula, so as to obtain an updated projection matrix:

wherein,

S_b = (μ_p - μ_n)(μ_p - μ_n)^T

in the formula, R is the projection matrix, S_b is the between-class scatter matrix, S_w is the within-class scatter matrix, μ_p is the mean vector of the positive sample data, μ_n is the mean vector of the negative sample data, x is a sample, X_p is the positive sample set, and X_n is the negative sample set; the positive sample set is the data with credit risk, and the negative sample set is the data without credit risk.

Optionally, the credit risk assessment module specifically includes:

and the credit risk evaluation unit is used for carrying out credit risk classification by adopting the classifier according to the updated projection matrix and the credit risk evaluation data to obtain a credit risk classification result.

Compared with the prior art, the invention has the beneficial effects that:

the invention provides a credit risk assessment method and a credit risk assessment system, which acquire credit risk assessment data and a current projection matrix; determine a classifier according to the credit risk assessment data and the current projection matrix, with the goal of minimizing the empirical misclassification risk; classify the unlabeled credit risk data with the classifier and assign a pseudo label to each unlabeled sample to obtain pseudo label data; perform linear discriminant analysis on the pseudo label data and the positive sample data to obtain an updated projection matrix; output the classifier and the updated projection matrix if the iteration end condition is met; and perform credit risk assessment on the credit risk assessment data according to the classifier and the updated projection matrix to obtain a credit risk assessment result. The method greatly reduces the cost of labeling samples and better matches the real-world situation in which risk assessment lacks negative sample data. It also takes the distribution of the data into account and uses linear discriminant analysis to make the data more discriminable, which helps in constructing a robust classifier. It performs the assessment directly from the single-class credit risk data and the unlabeled credit risk data, and classifies accurately with stable results.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a credit risk assessment method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a credit risk assessment system according to an embodiment of the present invention;

FIG. 3 is a graph comparing the effects of the embodiments of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Examples

Fig. 1 is a flowchart of a credit risk assessment method according to an embodiment of the present invention, and as shown in fig. 1, a credit risk assessment method includes:

Step 101: acquiring credit risk assessment data and a current projection matrix; the credit risk assessment data comprises single-class credit risk data and unlabeled credit risk data; the single-class credit risk data comprises a plurality of positive sample data, and the unlabeled credit risk data comprises a plurality of unlabeled sample data; the current projection matrix is obtained by performing linear discriminant analysis on the credit risk assessment data.

After step 101, the method further includes: normalizing the credit risk assessment data to obtain normalized credit risk assessment data.

Step 102: determining a classifier according to the credit risk assessment data and the current projection matrix, with the goal of minimizing the empirical misclassification risk.

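Step 102 specifically comprises:

determining the classifier according to the credit risk assessment data and the current projection matrix, using the formula referred to above for the method, in which the symbols have the meanings already defined.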

Step 103: classifying the unlabeled credit risk data with the classifier, and assigning a pseudo label to each unlabeled sample to obtain pseudo label data.

Step 104: performing linear discriminant analysis on the pseudo label data and the positive sample data to obtain an updated projection matrix.

Step 104 specifically comprises:

wherein,

S_b = (μ_p - μ_n)(μ_p - μ_n)^T

Step 105: judging whether an iteration end condition is met; if yes, go to step 107; if not, go to step 106.

Step 106: the updated projection matrix is used as the current projection matrix, and then the process returns to step 102.

Step 107: outputting the classifier and the updated projection matrix.

Step 108: performing credit risk assessment on the credit risk assessment data according to the classifier and the updated projection matrix to obtain a credit risk assessment result.

Step 108 specifically comprises:

performing credit risk classification with the classifier, according to the updated projection matrix and the credit risk assessment data, to obtain a credit risk classification result.
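The control flow of steps 102 to 107 can be sketched as below. This is a minimal illustration rather than the implementation of the invention: `fit_classifier` and `update_projection` are hypothetical callables standing in for steps 102 and 104, and the stopping rule (pseudo labels unchanged between two passes, or `max_iter` reached) is an assumption, because the text does not specify the iteration end condition.

```python
def alternate_classifier_and_projection(x_pos, x_unl, r_init,
                                        fit_classifier, update_projection,
                                        max_iter=20):
    """Control-flow sketch of steps 102-107 (max_iter must be at least 1).

    fit_classifier(x_pos, x_unl, r) -> f, where f(x, r) returns +1 or -1.
    update_projection(positives, negatives) -> new projection.
    Both callables are hypothetical placeholders supplied by the caller.
    """
    r = r_init
    previous_labels = None
    for _ in range(max_iter):
        f = fit_classifier(x_pos, x_unl, r)            # step 102
        labels = [f(x, r) for x in x_unl]              # step 103: pseudo labels
        positives = list(x_pos) + [x for x, y in zip(x_unl, labels) if y > 0]
        negatives = [x for x, y in zip(x_unl, labels) if y <= 0]
        r = update_projection(positives, negatives)    # step 104
        if labels == previous_labels:                  # step 105 (assumed criterion)
            break
        previous_labels = labels                       # step 106: iterate again
    return f, r                                        # step 107
```

Passing the two steps in as callables keeps the loop independent of the particular classifier and discriminant analysis routine that are plugged in.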

FIG. 2 is a block diagram of a credit risk assessment system according to an embodiment of the present invention. As shown in fig. 2, a credit risk assessment system includes:

an obtaining module 201, configured to obtain credit risk assessment data and a current projection matrix; the credit risk assessment data comprises single-class credit risk data and unlabeled credit risk data; the single-class credit risk data comprises a plurality of positive sample data, and the non-label credit risk data comprises a plurality of non-label sample data; the current projection matrix is obtained by performing linear discriminant analysis on the credit risk assessment data;

a classifier determination module 202, configured to determine a classifier according to the credit risk assessment data and the current projection matrix, with the goal of minimizing the empirical misclassification risk;

the classifier determination module 202 specifically includes:

a classifier determining unit, configured to determine the classifier according to the credit risk assessment data and the current projection matrix by using the following formula:

in the formula, the symbols have the same meanings as defined above for the credit risk assessment method.

the pseudo label data generating module 203 is configured to classify the unlabeled credit risk data by using the classifier, and assign a pseudo label to each unlabeled sample to obtain pseudo label data;

the linear discriminant analysis module 204 is configured to perform linear discriminant analysis on the pseudo tag data and the positive sample data to obtain an updated projection matrix;

the linear discriminant analysis module 204 specifically includes:

the linear discriminant analysis unit is used for performing linear discriminant analysis on the pseudo label data and the positive sample data by adopting the following formula to obtain an updated projection matrix:

wherein,

S_b＝(μ_p-μ_n)(μ_p-μ_n)^T

A judging module 205, configured to judge whether an iteration end condition is met; if yes, executing an output module; if not, executing an updating module;

an update module 206, configured to use the updated projection matrix as a current projection matrix, and then execute the classifier determination module;

an output module 207 for outputting the classifier and the updated projection matrix;

and the credit risk evaluation module 208 is configured to perform credit risk evaluation on the credit risk evaluation data according to the classifier and the updated projection matrix to obtain a credit risk evaluation result.

The credit risk assessment module 208 specifically includes:

a credit risk assessment unit, configured to perform credit risk classification with the classifier, according to the updated projection matrix and the credit risk assessment data, to obtain a credit risk classification result.

To further illustrate the discriminative credit risk assessment method based on single-class classification provided by the present invention, a specific description follows:

The invention searches for an optimal projection matrix by iteratively solving a bi-level optimization problem, so that in the new feature space the between-class distance of the original data is increased and the within-class distance is reduced. This makes the data more discriminable, so that a robust classifier can be constructed and credit risk can be assessed intelligently using only single-class samples and unlabeled samples.

The specific implementation steps are as follows:

Step 1: Data preprocessing and normalization. The credit risk sample data set is divided to obtain a positive sample set

and an unlabeled sample set

where n_p and n_u are the numbers of samples in the positive and unlabeled sample sets, respectively. In credit risk assessment, bad credits are taken as the positive sample set, while collected good credits and credits whose risk has not been detected are taken as the unlabeled sample set, so a sample in the unlabeled sample set may be either a good credit or a bad credit. The sample features are then normalized so that every feature value lies in the interval [0, 1].
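One simple way to bring every feature into [0, 1] is per-feature min-max scaling, sketched below. The text only requires that the normalized values lie in [0, 1]; the choice of min-max scaling, the function name `min_max_normalize`, and the handling of constant columns are assumptions made for illustration.

```python
def min_max_normalize(rows):
    """Scale each feature column of `rows` (a list of equal-length lists) into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has no spread; map it to 0.0 (assumed convention).
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled
```

In practice the minima and maxima would be computed once over the positive and unlabeled samples together, so that both sets share the same scale.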

Step 2: Train the classifier. The positive samples from step 1

and the unlabeled samples

are each projected by the projection matrix R into a new feature space, giving

and

An empirical misclassification risk is then constructed from the positive samples and the unlabeled samples:

For the function f(R^T x), a linear-in-parameter model is used:

where

is a set of basis functions, α is the coefficient vector of the classifier f, and b is the bias term of the classifier f. A Gaussian function, a linear function, or a polynomial function may be used as the basis function. With this model, equation (1) can be further expressed as:

To obtain the optimal classifier f, the empirical risk in the above equation is minimized, i.e.

Here, the squared loss

is used as the loss function of the above optimization problem, where z is a variable. The bias b in model (2) is absorbed into α, and

is augmented to

The objective function with an l₂ regularization term then becomes:

where Φ_p is the matrix of basis-function values for the positive samples, Φ_u is the matrix of basis-function values for the unlabeled samples,

is the basis function, and 1 is a column vector of all ones. To find the minimum of this objective function, its first derivative is set equal to zero, which yields an analytical solution for α:
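A closed-form solve of this kind can be sketched in a few lines of NumPy. The sketch rests on several assumptions that are not confirmed by the text: linear basis functions with a constant 1 appended (so that b is absorbed into α), the squared loss l(z) = (z - 1)^2 / 4, and the objective -π/n_p · 1^T Φ_p α + 1/(4 n_u) · α^T Φ_u^T Φ_u α + 1/(2 n_u) · 1^T Φ_u α + (λ/2) · α^T α. The function and parameter names are hypothetical.

```python
import numpy as np

def fit_pu_squared_loss(z_pos, z_unl, prior, lam):
    """Closed-form alpha for the assumed regularized squared-loss PU objective.

    z_pos, z_unl: projected samples R^T x, with shapes (n_p, m) and (n_u, m).
    prior: class prior pi of the positive class; lam: trade-off parameter lambda.
    """
    # Linear basis functions plus a constant column, so the bias lives in alpha.
    phi_p = np.hstack([z_pos, np.ones((len(z_pos), 1))])
    phi_u = np.hstack([z_unl, np.ones((len(z_unl), 1))])
    n_p, n_u = len(phi_p), len(phi_u)
    # Setting the gradient to zero and multiplying by 2 * n_u gives
    # (Phi_u^T Phi_u + 2 * lam * n_u * I) alpha = (2 * n_u * pi / n_p) Phi_p^T 1 - Phi_u^T 1.
    lhs = phi_u.T @ phi_u + 2.0 * lam * n_u * np.eye(phi_u.shape[1])
    rhs = 2.0 * n_u * prior / n_p * phi_p.sum(axis=0) - phi_u.sum(axis=0)
    return np.linalg.solve(lhs, rhs)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse; the classifier output for a projected sample z is then the inner product of α with z extended by a trailing 1.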

Step 3: Assign pseudo labels to the unlabeled samples. Substituting the α obtained in step 2 into

assigns a pseudo label to each sample in the unlabeled data set. The pseudo-labeled samples are then combined with the original positive sample set to obtain the positive and negative sample sets of the whole data set

and

wherein

and

respectively denote the sets of positive and negative samples obtained when the classifier from step 2 classifies the unlabeled data,

and

respectively denote the numbers of positive and negative samples in this case, then

and
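The regrouping described in step 3 can be sketched as follows. The function name `split_by_pseudo_label` is hypothetical, and using zero as the decision threshold on the classifier output is an assumption; the text only states that each unlabeled sample receives a pseudo label from the classifier of step 2.

```python
def split_by_pseudo_label(x_pos, x_unl, scores):
    """Regroup samples after pseudo labeling.

    scores[i] is the classifier output f(R^T x) for x_unl[i]; a non-negative
    score is read as a positive (bad credit) pseudo label (assumed threshold).
    """
    unl_pos = [x for x, s in zip(x_unl, scores) if s >= 0]
    unl_neg = [x for x, s in zip(x_unl, scores) if s < 0]
    positives = list(x_pos) + unl_pos   # original positives plus pseudo positives
    negatives = unl_neg                 # only pseudo negatives: no true negatives exist
    return positives, negatives
```

The resulting positive set therefore grows by the number of unlabeled samples judged positive, while the negative set consists entirely of pseudo-labeled samples.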

Step 4: Solve for the projection matrix, i.e. solve

Since R^T S_b R and R^T S_w R here are matrices rather than scalars, the objective cannot be optimized as a scalar function. However, an alternative optimization objective can be used, such as

where Π_diag A is the product of the main diagonal elements of A. The optimization of H(R) can be converted into

where m is the feature dimension after projection. Note that the rightmost side of the above equation is a generalized Rayleigh quotient, whose maximum is the largest eigenvalue of the matrix S_w^{-1} S_b. The maximum of the product of m such terms is the product of the m largest eigenvalues of S_w^{-1} S_b, and the corresponding matrix R is the matrix whose columns are the eigenvectors associated with those m largest eigenvalues. Using the positive and negative sample sets obtained in step 3

and

one can compute

and the projection matrix R is then obtained.
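The eigenvector construction of step 4 can be sketched with NumPy as below. The function name `lda_projection` and the small `ridge` term added to S_w (so that the linear solve is well defined when S_w is singular) are assumptions and are not part of the described method.

```python
import numpy as np

def lda_projection(x_pos, x_neg, m, ridge=1e-6):
    """Return a d x m matrix whose columns are the top-m eigenvectors of S_w^{-1} S_b."""
    x_pos = np.asarray(x_pos, dtype=float)
    x_neg = np.asarray(x_neg, dtype=float)
    mu_p, mu_n = x_pos.mean(axis=0), x_neg.mean(axis=0)
    diff = (mu_p - mu_n)[:, None]
    s_b = diff @ diff.T                                   # between-class scatter
    s_w = ((x_pos - mu_p).T @ (x_pos - mu_p)
           + (x_neg - mu_n).T @ (x_neg - mu_n))           # within-class scatter
    s_w = s_w + ridge * np.eye(s_w.shape[0])              # assumed regularization
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
    order = np.argsort(eigvals.real)[::-1][:m]            # indices of the m largest eigenvalues
    return eigvecs[:, order].real
```

With only two classes S_b has rank one, so only the first returned direction has a non-zero eigenvalue; any further columns are kept merely to reach the requested dimension m.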

Step 5: Repeat steps 2 to 4 until convergence to obtain the optimal classifier f^* and the optimal projection matrix R^*.

Finally, the credit risk test data are classified according to the obtained model parameters. The optimal projection matrix R^* is used to transform the credit risk test data set into the new feature space, the optimal classifier f^* is then used to classify the transformed data, and the accuracy of the final credit risk assessment result is obtained.
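The evaluation step can be sketched as follows, assuming a classifier with linear basis functions and an absorbed bias, a zero decision threshold, and labels encoded as +1 for bad credit and -1 for good credit. The function name `evaluate_accuracy` and its parameters are hypothetical.

```python
import numpy as np

def evaluate_accuracy(x_test, y_test, r_star, alpha_star):
    """Project test data with R*, classify with f*, and return the accuracy."""
    z = np.asarray(x_test, dtype=float) @ r_star          # transform into the new feature space
    phi = np.hstack([z, np.ones((len(z), 1))])            # append 1 for the absorbed bias
    predictions = np.where(phi @ alpha_star >= 0, 1, -1)  # assumed zero threshold
    return float(np.mean(predictions == np.asarray(y_test)))
```

Any normalization fitted on the training data would need to be applied to the test data before the projection.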

The German Credit real-world data set, which classifies credits as "good" or "bad" according to a set of attributes, is taken as an example of credit risk assessment. The attributes include the status of the existing checking account, credit record, credit usage, years of employment, property, personal identity, installment rate as a percentage of disposable income, and so on. To verify the robustness of the discriminative credit risk assessment method based on single-class classification, the unlabeled rate of the positive class is set to 20%, 30%, and 40% when the positive and unlabeled data sets are constructed; that is, 20%, 30%, or 40% of the bad-credit samples, together with all good-credit samples, form the unlabeled sample set. FIG. 3 compares the method of the present invention with an unbiased single-class classification method on the German Credit data set at these three unlabeled rates; the ordinate of FIG. 3 represents accuracy. As can be seen from FIG. 3, the method of the present invention improves on the credit risk assessment effect of the unbiased single-class classification method on this data set at positive-class unlabeled rates of 20%, 30%, and 40%.
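The construction of the positive and unlabeled sets at a given unlabeled rate can be sketched as follows. Choosing the hidden bad-credit samples uniformly at random with a fixed seed is an assumption, since the text does not say how they are selected; `make_pu_split` and its parameters are hypothetical names.

```python
import random

def make_pu_split(bad, good, unlabeled_rate, seed=0):
    """Hide a fraction of bad-credit samples among the good ones.

    Returns (positives, unlabeled): the bad credits that stay labeled, and the
    hidden bad credits mixed with all good credits.
    """
    rng = random.Random(seed)
    shuffled = list(bad)
    rng.shuffle(shuffled)                        # assumed: uniform random choice
    n_hidden = round(unlabeled_rate * len(shuffled))
    hidden_bad, labeled_bad = shuffled[:n_hidden], shuffled[n_hidden:]
    return labeled_bad, hidden_bad + list(good)
```

Calling the function with `unlabeled_rate` set to 0.2, 0.3, and 0.4 would produce the three settings described above.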

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help in understanding the method and core concept of the present invention. A person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the present invention.

Claims

1. A credit risk assessment method, comprising:

2. The credit risk assessment method of claim 1, further comprising, after obtaining the credit risk assessment data:

3. The method according to claim 1, wherein the determining a classifier based on the credit risk assessment data and the current projection matrix with the goal of minimizing the misclassification experience risk specifically comprises:

in the formula,

in order to be the positive sample data,

4. The method according to claim 1, wherein the performing linear discriminant analysis on the pseudo tag data and the positive sample data to obtain an updated projection matrix specifically comprises:

wherein,

S_b＝(μ_p-μ_n)(μ_p-μ_n)^T

5. The method according to claim 1, wherein the performing credit risk assessment on the credit risk assessment data according to the classifier and the updated projection matrix to obtain a credit risk assessment result comprises:

6. A credit risk assessment system, comprising:

7. The credit risk assessment system of claim 6, further comprising:

8. The credit risk assessment system of claim 6, wherein the classifier determination module specifically comprises:

in the formula,

in order to be the positive sample data,

9. The credit risk assessment system of claim 6, wherein the linear discriminant analysis module specifically comprises:

wherein,

S_b＝(μ_p-μ_n)(μ_p-μ_n)^T

10. The credit risk assessment system of claim 6, wherein the credit risk assessment module specifically comprises: