CN111062444B

CN111062444B - Credit risk prediction method, credit risk prediction system, credit risk prediction terminal and storage medium

Info

Publication number: CN111062444B
Application number: CN201911331410.3A
Authority: CN
Inventors: 李心儿; 刘彦; 张在美; 谢国琪
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2019-12-21
Filing date: 2019-12-21
Publication date: 2023-12-08
Anticipated expiration: 2039-12-21
Also published as: CN111062444A

Abstract

The invention discloses a credit risk prediction method, system, terminal and storage medium. The credit risk prediction method comprises the following steps: training a credit risk prediction model to be trained with information of users who have repayment behaviors to obtain a preliminarily trained credit risk prediction model; predicting, with the preliminarily trained credit risk prediction model, the credit risk level of users who failed the loan audit to obtain credit risk prediction levels for those users; and training the preliminarily trained credit risk prediction model with the information and corresponding actual credit risk levels of the users who have repayment behaviors, together with the information and corresponding credit risk prediction levels of the users who failed the loan audit, to obtain a final credit analysis prediction model. The invention solves the problem that existing credit scoring models predict the credit risk of users with low accuracy.

Description

Credit risk prediction method, credit risk prediction system, credit risk prediction terminal and storage medium

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular, to a credit risk prediction method, a credit risk prediction system, a credit risk prediction terminal, and a credit risk prediction computer readable storage medium.

Background

In recent years, with the continuous development of internet finance, the online loan market has gradually been integrated into the daily life of humans. The online loan market provides a convenient service that allows direct lending transactions between users. But this convenience also presents a great potential risk to users, especially investors. Therefore, how to predict the credit risk of borrowers is a major issue in the online loan market.

The emergence of credit scoring models alleviates this problem to a certain extent. However, a traditional credit scoring model is built only from the information of users whose loans were approved and lacks the information of users whose loans were refused, so its credit risk predictions remain biased and its risk prediction accuracy is low.

Disclosure of Invention

The invention mainly aims to provide a credit risk prediction method, a credit risk prediction system, a credit risk prediction terminal and a credit risk prediction computer readable storage medium, and aims to solve the technical problem that the accuracy rate of predicting credit risk of a user in an existing credit scoring model is low.

In order to achieve the above object, the present invention provides a credit risk prediction method, including the steps of:

Collecting information of a user with repayment behaviors as a first sample, and marking the actual credit risk level of the first sample according to a preset mapping relation between the repayment behaviors of the user and the credit risk level;

collecting user information which is not passed by loan audit as a second sample;

training the credit risk prediction model to be trained according to the plurality of first samples and the corresponding actual credit risk levels to obtain a primarily trained credit risk prediction model;

predicting the credit risk level of the second sample by adopting a primarily trained credit risk prediction model to obtain the credit risk prediction level of the second sample;

and training the primarily trained credit risk prediction model by using the plurality of first samples and the corresponding actual credit risk grades and the plurality of second samples and the corresponding credit risk prediction grades to obtain a final credit analysis prediction model.
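The five steps above can be sketched end to end as follows. This is a minimal illustration, not the patented model: a toy nearest-centroid classifier stands in for the credit risk prediction model, and the feature vectors, the level names and the helper names `train_centroids` and `predict` are all assumptions made for the example.

```python
from collections import defaultdict

def train_centroids(samples, labels):
    """Fit a nearest-centroid classifier: one mean feature vector per risk level."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the risk level whose centroid is closest to x (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: dist(centroids[y]))

# First samples: users with repayment behaviors and their actual risk levels (toy data).
first_samples = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
first_labels = ["low", "low", "high", "high"]
# Second samples: users who failed the loan audit; no actual level exists for them.
second_samples = [[0.3, 0.7], [0.7, 0.4]]

# Preliminary training on the first samples only.
preliminary = train_centroids(first_samples, first_labels)
# Predict credit risk levels for the second samples.
second_predicted = [predict(preliminary, x) for x in second_samples]
# Retrain on first samples (actual levels) plus second samples (predicted levels).
final_model = train_centroids(first_samples + second_samples,
                              first_labels + second_predicted)
```

The point of the sketch is the data flow: the second samples never receive an actual level, only the level predicted by the preliminarily trained model, and both groups are used together in the final training pass.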

Optionally, the step of collecting information of the user with repayment behaviors as the first sample includes:

collecting information of users who have repayment behaviors;

performing sensitive information filtering and/or preprocessing on the information of the users who have repayment behaviors;

and taking the filtered and/or preprocessed information of the users who have repayment behaviors as a first sample.

Optionally, the step of collecting the information of users who failed the loan audit as the second sample includes:

collecting the information of users who failed the loan audit;

performing sensitive information filtering and/or preprocessing on the information of the users who failed the loan audit;

and taking the filtered and/or preprocessed information of the users who failed the loan audit as a second sample.

Optionally, if the credit risk prediction model to be trained includes at least one preset different classification algorithm, at least one preset different clustering algorithm, and a fusion algorithm to be trained, the step of training the credit risk prediction model to be trained according to the plurality of first samples and the corresponding actual credit risk classes, and obtaining the preliminary trained credit risk prediction model includes:

inputting a plurality of first samples into each preset classification algorithm and each preset clustering algorithm respectively to obtain a first classification result output by each preset classification algorithm and a first clustering result output by each preset clustering algorithm, wherein the first classification result output by each preset classification algorithm comprises the probability that each first sample belongs to the corresponding credit risk class and the credit risk prediction class of each first sample, and the first clustering result output by each preset clustering algorithm comprises the clusters with the same number as the preset credit risk class, the probability that each cluster belongs to the corresponding credit risk class and the cluster class to which each first sample belongs;

Training the fusion algorithm to be trained according to a preset cluster actual probability matrix, actual credit risk levels of all first samples, first classification results output by all preset classification algorithms and first clustering results output by all preset clustering algorithms to obtain a primary training fusion algorithm.

Optionally, the step of training the fusion algorithm to be trained according to the preset cluster actual probability matrix, the actual credit risk level of each first sample, the first classification result output by each preset classification algorithm, and the first clustering result output by each preset clustering algorithm, and obtaining the preliminarily trained fusion algorithm comprises the following steps:

constructing a sample actual probability matrix according to the actual credit risk level of each first sample;

according to the first classification result output by each preset classification algorithm and the first clustering result output by each preset clustering algorithm, a sample prediction average probability matrix, a cluster prediction average probability matrix, a distribution matrix and a homogeneity matrix are constructed;

inputting a preset cluster actual probability matrix, a sample prediction average probability matrix, a cluster prediction average probability matrix, a distribution matrix and a homogeneity matrix into a fusion algorithm to be trained, and acquiring preliminary parameters in the fusion algorithm to be trained by adopting a block coordinate descent algorithm.

Optionally, the step of training the credit risk prediction model to be trained according to the plurality of first samples and the corresponding actual credit risk levels, and obtaining the preliminary trained credit risk prediction model further includes:

training the credit risk prediction model to be trained according to a K-fold cross-validation method, the plurality of first samples and the corresponding actual credit risk levels to obtain a preliminarily trained credit risk prediction model.
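As an illustration of how cross-validation partitions the first samples, the sketch below splits sample indices into K contiguous folds. The helper name `k_fold_indices` and the choice of 10 samples and 5 folds are assumptions for the example; the text does not fix a value of K or say whether the samples are shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# Hypothetical setting: 10 first samples, 5 folds.
folds = list(k_fold_indices(10, 5))
```

Each first sample appears in exactly one validation fold, so every sample is used for validation once and for training K - 1 times.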

Optionally, the step of predicting the credit risk level of the second sample by using the preliminary trained credit risk prediction model, and obtaining the credit risk prediction level of the second sample includes:

inputting a plurality of second samples into each preset classification algorithm and each preset clustering algorithm to obtain second classification results respectively output by each preset classification algorithm and second clustering results respectively output by each preset clustering algorithm;

and inputting the second classification results respectively output by the preset classification algorithms and the second clustering results respectively output by the preset clustering algorithms into the preliminarily trained fusion algorithm, and outputting the credit risk prediction level of each second sample.

In addition, to achieve the above object, the present invention further provides a credit risk prediction system, which includes:

The first acquisition module is used for acquiring information of the user with the repayment behaviors as a first sample and marking the actual credit risk level of the first sample according to a preset mapping relation between the repayment behaviors of the user and the credit risk level;

the second acquisition module is used for acquiring user information which is not passed by loan audit as a second sample;

the first training module is used for training the credit risk prediction model to be trained according to a plurality of first samples and corresponding actual credit risk levels to obtain a primarily trained credit risk prediction model;

the prediction module is used for predicting the credit risk level of the second sample by adopting a primarily trained credit risk prediction model to obtain the credit risk prediction level of the second sample;

and the second training module is used for training the credit risk prediction model which is initially trained by the plurality of first samples and the corresponding actual credit risk grades and the plurality of second samples and the corresponding credit risk prediction grades to obtain a final credit analysis prediction model.

In addition, to achieve the above object, the present invention also provides a terminal including a memory, a processor, and a computer program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the credit risk prediction method as described above.

In addition, to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the credit risk prediction method as described above.

The credit risk prediction method, the credit risk prediction system, the credit risk prediction terminal and the credit risk prediction computer readable storage medium provided by the invention are characterized in that information of a user with repayment behaviors is collected as a first sample, and the actual credit risk level of the first sample is marked according to a preset mapping relation between the repayment behaviors of the user and the credit risk level; collecting user information which is not passed by loan audit as a second sample; training the credit risk prediction model to be trained according to the plurality of first samples and the corresponding actual credit risk levels to obtain a primarily trained credit risk prediction model; predicting the credit risk level of the second sample by adopting a primarily trained credit risk prediction model to obtain the credit risk prediction level of the second sample; and training the primarily trained credit risk prediction model by using the plurality of first samples and the corresponding actual credit risk grades and the plurality of second samples and the corresponding credit risk prediction grades to obtain a final credit analysis prediction model. When the credit analysis prediction model is constructed, the user information of the allowed loan is firstly used for carrying out preliminary training on the model, and then the user information of the allowed loan and the user information of the refused loan are used together for carrying out retraining on the model, so that the obtained model has high risk prediction accuracy rate for potential users meeting the loan condition, and the accuracy rate of risk prediction for users not meeting the loan condition is improved, thereby integrally improving the credit risk prediction accuracy rate of the model for the users.

Drawings

FIG. 1 is a schematic diagram of a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a credit risk prediction method according to a first embodiment of the present invention;

FIG. 3 is a detailed flowchart of step S30 in the first embodiment of the credit risk prediction method according to the present invention;

FIG. 4 is a detailed flowchart of step S40 in the first embodiment of the credit risk prediction method of the present invention;

FIG. 5 is a schematic diagram of functional modules of the credit risk prediction system according to the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic hardware structure of a terminal according to various embodiments of the present invention. The terminal comprises a communication module 01, a memory 02, a processor 03 and other components. Those skilled in the art will appreciate that the terminal shown in fig. 1 may also include more or fewer components than shown, or may combine certain components, or may be arranged in a different manner. The processor 03 is connected to the memory 02 and the communication module 01, respectively, and a computer program is stored in the memory 02 and executed by the processor 03 at the same time.

The communication module 01 is connectable to an external device via a network. The communication module 01 can receive data sent by external equipment, and can also send data, instructions and information to the external equipment, wherein the external equipment can be electronic equipment such as a mobile phone, a tablet personal computer, a notebook computer, a desktop computer and the like.

The memory 02 is used for storing software programs and various data. The memory 02 may mainly include a storage program area that may store an operating system, an application program (building a distribution matrix) required for at least one function, and the like, and a storage data area; the storage data area may store data or information, etc. created according to the use of the terminal. In addition, memory 02 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.

The processor 03, which is a control center of the terminal, connects various parts of the entire terminal using various interfaces and lines, performs various functions of the terminal and processes data by running or executing software programs and/or modules stored in the memory 02 and calling data stored in the memory 02, thereby performing overall monitoring of the terminal. The processor 03 may include one or more processing units; preferably, the processor 03 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 03.

Although not shown in fig. 1, the terminal may further include a circuit control module, where the circuit control module is used to connect with a mains supply, to implement power control, and ensure normal operation of other components.

It will be appreciated by those skilled in the art that the terminal structure shown in fig. 1 is not limiting of the terminal and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.

According to the above hardware structure, various embodiments of the method of the present invention are presented.

Referring to fig. 2, in a first embodiment of the credit risk prediction method of the present invention, the credit risk prediction method includes the steps of:

step S10, collecting information of a user with repayment behaviors as a first sample, and marking the actual credit risk level of the first sample according to a preset mapping relation between the repayment behaviors of the user and the credit risk level;

In this scheme, information of users who have repayment behaviors is collected as first samples, where a user's repayment behaviors include on-time repayment, short-delay repayment, long-delay repayment and non-repayment. The terminal marks the actual credit risk level of each first sample according to the preset mapping relation between the repayment behaviors of the user and the credit risk level. The credit risk levels can be set to high, medium and low, or to five levels numbered 1 to 5; the setting of the credit risk levels is not limited in this example.
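The labeling step can be pictured as a simple lookup. The particular mapping below is hypothetical: the text only states that a preset mapping between repayment behaviors and credit risk levels exists, not what it is, and the behavior keys and the helper name `label_first_sample` are invented for the example.

```python
# Hypothetical preset mapping from repayment behavior to credit risk level.
BEHAVIOR_TO_RISK = {
    "on_time": "low",
    "short_delay": "medium",
    "long_delay": "high",
    "no_repayment": "high",
}

def label_first_sample(repayment_behavior):
    """Return the actual credit risk level for a user's repayment behavior."""
    try:
        return BEHAVIOR_TO_RISK[repayment_behavior]
    except KeyError:
        raise ValueError(f"unknown repayment behavior: {repayment_behavior!r}")

labels = [label_first_sample(b) for b in ["on_time", "long_delay", "short_delay"]]
```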

Specifically, the step of collecting information of the user with repayment behaviors as the first sample in the step S10 includes:

step S11, collecting information of users who have repayment behaviors;

step S12, performing sensitive information filtering and/or preprocessing on the information of the users who have repayment behaviors;

step S13, the filtered and/or preprocessed user information with repayment behaviors is taken as a first sample.

Because the collected user information may contain sensitive information that reveals the user's privacy, such as an identity card number, a name or family members, the sensitive information needs to be identified automatically with a keyword identification method and filtered, that is, deleted. To improve the training effect, the user information can be preprocessed before it is used for model training. The preprocessing comprises normalization or standardization; for example, one-hot encoding is applied to categorical data and normalization is applied to numerical data. It is understood that the data preprocessing methods include, but are not limited to, the one-hot encoding and normalization used in this example. After sensitive information filtering and/or preprocessing has been performed on the information of the users who have repayment behaviors, the filtered and/or preprocessed information can be used directly as a first sample. The user information involved in this application is collected lawfully and with the users' consent.
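The filtering and preprocessing described above might look like the following sketch. The keyword list, the field names, the sample records and the choice of min-max scaling as the normalization are all assumptions for illustration; the text does not specify them.

```python
SENSITIVE_KEYWORDS = ("id_number", "name", "family")  # hypothetical keyword list

def filter_sensitive(record):
    """Drop every field whose key contains a sensitive keyword."""
    return {k: v for k, v in record.items()
            if not any(word in k for word in SENSITIVE_KEYWORDS)}

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def min_max(value, lo, hi):
    """Scale a numerical value into [0, 1]; a constant column maps to 0."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

def preprocess(records, categorical, numerical):
    """Filter sensitive fields, then build one feature vector per record."""
    cleaned = [filter_sensitive(r) for r in records]
    cats = {f: sorted({r[f] for r in cleaned}) for f in categorical}
    ranges = {f: (min(r[f] for r in cleaned), max(r[f] for r in cleaned))
              for f in numerical}
    vectors = []
    for r in cleaned:
        vec = []
        for f in categorical:
            vec.extend(one_hot(r[f], cats[f]))
        for f in numerical:
            vec.append(min_max(r[f], *ranges[f]))
        vectors.append(vec)
    return vectors

# Toy records with invented fields.
records = [
    {"name": "A", "id_number": "x1", "employment": "salaried", "income": 3000},
    {"name": "B", "id_number": "x2", "employment": "self_employed", "income": 9000},
    {"name": "C", "id_number": "x3", "employment": "salaried", "income": 6000},
]
vectors = preprocess(records, categorical=["employment"], numerical=["income"])
```

The same routine would apply unchanged to the information of users who failed the loan audit, since both sample groups go through the same filtering and preprocessing.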

Step S20, collecting user information which is not passed by loan verification as a second sample;

The terminal collects the information of users who failed the loan audit as second samples. Because these users failed the audit when applying for a loan, they have no repayment behaviors, so the second samples have no actual credit risk level.

Specifically, the step S20 includes:

step S21, collecting user information which is not passed by loan audit;

step S22, sensitive information filtering and/or preprocessing is carried out on the user information which fails to pass the loan verification;

and S23, taking the filtered and/or preprocessed loan audit failed user information as a second sample.

Because the collected user information may contain sensitive information that reveals the user's privacy, such as an identity card number, a name or family members, the sensitive information needs to be identified automatically with a keyword identification method and filtered, that is, deleted. To improve the training effect, the user information can be preprocessed before it is used for model training. The preprocessing comprises normalization or standardization; for example, one-hot encoding is applied to categorical data and normalization is applied to numerical data. It is understood that the data preprocessing methods include, but are not limited to, the one-hot encoding and normalization used in this example. After sensitive information filtering and/or preprocessing has been performed on the information of the users whose loans were refused, the filtered and/or preprocessed information can be used directly as a second sample.

Step S30, training a credit risk prediction model to be trained according to a plurality of first samples and corresponding actual credit risk levels to obtain a primarily trained credit risk prediction model;

The credit risk prediction model to be trained is trained with a plurality of first samples and the actual credit risk level corresponding to each sample to obtain initial parameters of the model, and the credit risk prediction model with these initial parameters is taken as the preliminarily trained credit risk prediction model.

Specifically, referring to fig. 3, fig. 3 is a detailed flowchart, in an embodiment of the present invention, of the step of training the credit risk prediction model to be trained according to a plurality of first samples and the corresponding actual credit risk levels to obtain the preliminarily trained credit risk prediction model, for the case in which the credit risk prediction model to be trained includes at least one preset different classification algorithm and at least one preset different clustering algorithm. Based on the above embodiment, the step S30 includes:

step S31, a plurality of first samples are simultaneously and respectively input into each preset classification algorithm and each preset clustering algorithm to obtain a first classification result output by each preset classification algorithm and a first clustering result output by each preset clustering algorithm, wherein the first classification result output by each preset classification algorithm comprises the probability that each first sample belongs to the corresponding credit risk class and the credit risk prediction class of each first sample, and the first clustering result output by each preset clustering algorithm comprises the same number of clusters as the number of the preset credit risk classes, the probability that each cluster belongs to the corresponding credit risk class and the cluster class to which each first sample belongs;

In this case the constructed risk prediction model to be trained comprises at least one preset different classification algorithm, at least one preset different clustering algorithm and a fusion algorithm to be trained; that is, the risk prediction model is a semi-supervised learning model that combines classification algorithms and clustering algorithms. The classification algorithms adopted by the risk prediction model may be one or more of the NBC (Naive Bayes Classifier) algorithm, the logistic regression algorithm, various decision tree algorithms, the SVM (Support Vector Machine) algorithm, the K-nearest-neighbor algorithm, neural network algorithms and the like; the clustering algorithms may be one or more of the K-means clustering algorithm, the K-medoids clustering algorithm, hierarchical clustering algorithms, the GMM (Gaussian mixture model), graph community detection algorithms, density-based clustering algorithms and the like. All preset classification algorithms and clustering algorithms in the risk prediction model to be trained are already trained: each was trained and validated with a plurality of first samples before the risk prediction model to be trained was constructed.

The terminal inputs a plurality of first samples simultaneously into each preset classification algorithm and each preset clustering algorithm in the risk prediction model to be trained, and obtains a first classification result output by each preset classification algorithm and a first clustering result output by each preset clustering algorithm. The first classification result output by each preset classification algorithm comprises the probability that each first sample belongs to the corresponding credit risk level and the credit risk prediction level of each first sample. The first clustering result output by each preset clustering algorithm comprises cluster classes equal in number to the preset credit risk levels, the probability that each cluster class belongs to the corresponding credit risk level, and the cluster class to which each first sample belongs.

Step S32, training a fusion algorithm to be trained according to a preset cluster actual probability matrix, actual credit risk levels of all first samples, first classification results output by all preset classification algorithms and first clustering results output by all preset clustering algorithms to obtain a fusion algorithm for preliminary training.

The terminal trains the fusion algorithm to be trained using the preset cluster actual probability matrix, the actual credit risk levels of the first samples, the first classification results output by the preset classification algorithms and the first clustering results output by the preset clustering algorithms, and thereby obtains the preliminarily trained fusion algorithm.

The preset cluster actual probability matrix is as follows:

$c_{ij} = 1$ when $i = kL + j$, and $c_{ij} = 0$ when $i \neq kL + j$, where $i = 1, 2, \ldots, G$, $j = 1, 2, \ldots, L$ and $k = 0, 1, \ldots, K_1 + K_2$. Here $L$ is the number of preset credit risk levels and $G$ is the total number of clusters produced by all preset classification algorithms and all preset clustering algorithms. A classification algorithm normally outputs a credit risk prediction level for each sample, but in doing so it effectively divides the samples into as many clusters as there are preset credit risk levels; each clustering algorithm likewise divides the samples into as many clusters as there are preset credit risk levels. Therefore $K_1 \cdot L$ is the total number of clusters produced by the preset classification algorithms, $K_2 \cdot L$ is the total number of clusters produced by the preset clustering algorithms, and $G = K_1 \cdot L + K_2 \cdot L$, where $K_1$ and $K_2$ are respectively the number of preset classification algorithms and the number of preset clustering algorithms in the credit risk prediction model to be trained.
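A small sketch may make the structure of the preset cluster actual probability matrix concrete: one L-by-L identity block per algorithm, stacked vertically into G rows. The function name is hypothetical, and the loop runs k from 0 to K1 + K2 - 1, since only those values give a row index kL + j that stays within G.

```python
def cluster_actual_probability_matrix(num_levels, num_classifiers, num_clusterers):
    """Build the G x L matrix with c_ij = 1 exactly when i = k*L + j (1-based)."""
    L = num_levels
    G = (num_classifiers + num_clusterers) * L
    C = [[0] * L for _ in range(G)]
    for k in range(num_classifiers + num_clusterers):
        for j in range(1, L + 1):
            i = k * L + j            # 1-based row index from the definition
            C[i - 1][j - 1] = 1      # shift to 0-based list indices
    return C

# Hypothetical setting: 3 risk levels, 1 classification and 1 clustering algorithm.
C = cluster_actual_probability_matrix(num_levels=3, num_classifiers=1, num_clusterers=1)
```

Read row by row, the j-th cluster of every algorithm is assigned to the j-th credit risk level with probability 1.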

Specifically, the step S32 includes:

step S321, constructing a sample actual probability matrix according to the actual credit risk level of each first sample;

according to the actual credit risk level of each first sample, a sample actual probability matrix is constructed, and the sample actual probability matrix is as follows:

$b_{ij} = 1$ when the actual credit risk level of the $i$-th sample is the $j$-th credit risk level, and $b_{ij} = 0$ when it is not, where $i = 1, 2, \ldots, O$, $j = 1, 2, \ldots, L$, and $O$ is the total number of first samples input into the risk prediction model to be trained. For example, if the actual credit risk level of the 5th first sample is high and the 2nd credit risk level is high, then $b_{52} = 1$; if the actual credit risk level of the 5th first sample is high and the 2nd credit risk level is not high, then $b_{52} = 0$.
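The sample actual probability matrix is a one-hot encoding of the actual levels, which the sketch below reproduces. The ordering of the levels and the five toy labels are assumptions, chosen so that "high" is the 2nd level as in the example above.

```python
def sample_actual_probability_matrix(actual_levels, levels):
    """Return the O x L matrix with b_ij = 1 when sample i has the j-th level."""
    return [[1 if actual == level else 0 for level in levels]
            for actual in actual_levels]

# Hypothetical level order in which "high" is the 2nd credit risk level.
levels = ["low", "high", "medium"]
actual_levels = ["low", "medium", "low", "medium", "high"]
B = sample_actual_probability_matrix(actual_levels, levels)
b_52 = B[4][1]  # 5th sample, 2nd level, using 0-based list indices
```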

Step S322, a sample prediction average probability matrix, a cluster prediction average probability matrix, a distribution matrix and a homogeneity matrix are constructed according to a first classification result output by each preset classification algorithm and a first clustering result output by each preset clustering algorithm;

According to the first classification result output by each preset classification algorithm and the first clustering result output by each preset clustering algorithm, a sample prediction average probability matrix is constructed, wherein the sample prediction average probability matrix is as follows:

The element in row $i$ and column $j$ of the sample prediction average probability matrix is the average of the predicted probabilities that the $i$-th sample belongs to the $j$-th credit risk level, where $p_{i k_1 j}$ is the probability, calculated by the $k_1$-th preset classification algorithm, that the $i$-th sample belongs to the $j$-th credit risk level, and $p_{i k_2 j}$ is the probability, calculated by the $k_2$-th preset clustering algorithm, that the $i$-th sample belongs to the $j$-th credit risk level. $O$ is the total number of first samples input into the risk prediction model to be trained, $L$ is the number of credit risk levels, $K_1$ is the total number of preset classification algorithms, $K_2$ is the total number of preset clustering algorithms, $i = 1, 2, \ldots, O$ and $j = 1, 2, \ldots, L$.
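One plausible reading of this average is an equal-weight mean over all K1 + K2 algorithms, sketched below. The equal weighting is an assumption, because the formula itself is not reproduced in the text; the function name and the probabilities are invented for the example.

```python
def sample_prediction_average(prob_matrices):
    """Average a list of O x L probability matrices element by element."""
    n_algorithms = len(prob_matrices)
    n_samples = len(prob_matrices[0])
    n_levels = len(prob_matrices[0][0])
    return [[sum(p[i][j] for p in prob_matrices) / n_algorithms
             for j in range(n_levels)]
            for i in range(n_samples)]

# Toy outputs: two classification algorithms and one clustering algorithm,
# each giving P(sample i belongs to level j) for 2 samples and 2 levels.
classifier_probs = [
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.6, 0.4], [0.1, 0.9]],
]
clusterer_probs = [
    [[0.7, 0.3], [0.2, 0.8]],
]
P_avg = sample_prediction_average(classifier_probs + clusterer_probs)
```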

According to the first classification result output by each preset classification algorithm and the first clustering result output by each preset clustering algorithm, a cluster prediction average probability matrix is constructed, wherein the cluster prediction average probability matrix is as follows:

The cluster prediction average probability matrix is made up of one submatrix per algorithm, with $k_1 = 1, 2, \ldots, K_1$ indexing the classification algorithms and $k_2 = 1, 2, \ldots, K_2$ indexing the clustering algorithms. In the cluster prediction average probability submatrix of the $k_1$-th classification algorithm, the element in row $i$ and column $j$ is obtained by first summing, over all samples whose credit risk prediction level in the first classification result output by the $k_1$-th classification algorithm is the $i$-th level, the probabilities that those samples belong to the $j$-th credit risk level, and then dividing the sum by the number of samples whose credit risk prediction level is the $i$-th level. For example, suppose that in the first classification result output by the 5th classification algorithm four samples S1, S2, S3 and S4 have the 4th credit risk prediction level, and that the probabilities calculated by the 5th classification algorithm that these four samples belong to the 2nd credit risk level are 0.12, 0.13, 0.12 and 0.11 respectively. Then the element in row 4 and column 2 of the cluster prediction average probability submatrix of the 5th classification algorithm is $(0.12 + 0.13 + 0.12 + 0.11) / 4 = 0.12$.

In the cluster prediction average probability submatrix of the $k_2$-th clustering algorithm, the element in row $i$ and column $j$ is the probability that the $i$-th cluster class belongs to the $j$-th credit risk level in the clustering result output by the $k_2$-th clustering algorithm.
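The submatrix for a classification algorithm can be computed as below, using the four-sample example from the text. Only the probabilities for the 2nd level (0.12, 0.13, 0.12, 0.11) come from the text; the remaining probabilities, the use of five levels and the function name are assumptions, and leaving the row of an empty cluster at zero is also a choice made for this sketch.

```python
def classifier_cluster_average(probs, predicted, num_levels):
    """Row i holds the mean probability vector of samples predicted as level i.

    probs[s][j] is the probability that sample s belongs to level j;
    predicted[s] is the 0-based index of the level predicted for sample s.
    """
    sub = [[0.0] * num_levels for _ in range(num_levels)]
    for i in range(num_levels):
        members = [s for s, level in enumerate(predicted) if level == i]
        if not members:
            continue  # no sample predicted as level i: row stays zero (assumption)
        for j in range(num_levels):
            sub[i][j] = sum(probs[s][j] for s in members) / len(members)
    return sub

# Samples S1..S4, all predicted as the 4th level (index 3) by one classifier.
probs = [
    [0.05, 0.12, 0.10, 0.63, 0.10],  # S1
    [0.05, 0.13, 0.10, 0.62, 0.10],  # S2
    [0.05, 0.12, 0.10, 0.63, 0.10],  # S3
    [0.05, 0.11, 0.10, 0.64, 0.10],  # S4
]
predicted = [3, 3, 3, 3]
sub = classifier_cluster_average(probs, predicted, num_levels=5)
row4_col2 = sub[3][1]  # row 4, column 2 in the text's 1-based numbering
```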

Constructing a cluster distribution matrix according to a first classification result output by each preset classification algorithm and a first clustering result output by each preset clustering algorithm, wherein the distribution matrix is as follows:

where k₁ = 1, 2, ..., K₁ and k₂ = 1, 2, ..., K₂. Each classification algorithm k₁ has a corresponding distribution submatrix: its element in row i, column j is 1 when the credit risk prediction level of the i-th sample predicted by the k₁-th classification algorithm is the j-th level, and 0 otherwise;

each clustering algorithm k₂ likewise has a corresponding distribution submatrix: its element in row i, column j is 1 when the i-th sample is assigned by the k₂-th clustering algorithm to the j-th cluster class for predicting credit risk level, and 0 otherwise.
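A distribution submatrix is in effect a one-hot encoding of each sample's assigned level or cluster class. The sketch below shows this for a single algorithm; the function name `distribution_submatrix` and the 0-based indices are assumptions for illustration only.

```python
def distribution_submatrix(assignments, num_columns):
    """Build a 0/1 matrix with one row per sample.

    assignments[i] is the 0-based index of the credit risk level (for a
    classification algorithm) or cluster class (for a clustering algorithm)
    assigned to sample i. Entry [i][j] is 1 when assignments[i] == j, else 0.
    """
    return [
        [1 if assigned == j else 0 for j in range(num_columns)]
        for assigned in assignments
    ]


# Three samples assigned to levels 0, 2 and 1 out of three possible levels.
example = distribution_submatrix([0, 2, 1], num_columns=3)
```

Because every sample is assigned to exactly one level or cluster class by a given algorithm, each row of the result contains a single 1.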

Constructing a homogeneity matrix according to the first classification result output by each preset classification algorithm and the first clustering result output by each preset clustering algorithm, wherein the homogeneity matrix is as follows:

where d_ij is the number of times the credit risk prediction levels of the i-th sample and the j-th sample predicted by the classification algorithms are the same, plus the number of times the two samples are divided into the same cluster class by the clustering algorithms; when i = j, d_ij = (K₁+K₂) L. For example, suppose a first sample S1 is predicted by 3 classification algorithms to have credit risk prediction levels high, high and medium respectively, and is divided by 2 clustering algorithms into a cluster with a high credit risk level and a cluster with a medium credit risk level respectively, while another first sample S2 is predicted by the 3 classification algorithms to have credit risk prediction levels low, low and medium respectively, and is divided by the 2 clustering algorithms into a cluster with a low credit risk level and a cluster with a medium credit risk level respectively. The two samples agree only under the third classification algorithm and the second clustering algorithm, so d₁₂ = 2.
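The off-diagonal counting rule can be checked with a short sketch that reproduces the S1/S2 example. The function name `homogeneity_matrix` and the list-of-lists layout are assumptions. The sketch simply counts agreements, so its diagonal comes out as K₁+K₂; the additional factor L that the text attaches to the diagonal is not defined in this passage and is therefore not applied here.

```python
def homogeneity_matrix(class_labels, cluster_labels):
    """Count, for every pair of samples, how many algorithms treat them alike.

    class_labels[k][i]   : level predicted for sample i by classifier k
    cluster_labels[k][i] : cluster class of sample i under clustering algorithm k
    Returns d where d[i][j] is the number of classifiers giving i and j the
    same level plus the number of clustering algorithms placing them in the
    same cluster class.
    """
    all_labels = list(class_labels) + list(cluster_labels)
    n = len(all_labels[0])
    return [
        [sum(1 for labels in all_labels if labels[i] == labels[j])
         for j in range(n)]
        for i in range(n)
    ]


# Column 0 is sample S1, column 1 is sample S2.
class_labels = [["high", "low"], ["high", "low"], ["medium", "medium"]]
cluster_labels = [["high", "low"], ["medium", "medium"]]
d = homogeneity_matrix(class_labels, cluster_labels)
d_12 = d[0][1]  # 2: third classifier and second clustering algorithm agree
```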

Step S323, inputting a preset cluster actual probability matrix, a sample prediction average probability matrix, a cluster prediction average probability matrix, a distribution matrix and a homogeneity matrix into a fusion algorithm to be trained, and acquiring preliminary parameters in the fusion algorithm to be trained by adopting a block coordinate descent algorithm.

Inputting a preset cluster actual probability matrix, a sample prediction average probability matrix, a cluster prediction average probability matrix, a distribution matrix and a homogeneity matrix into a fusion algorithm to be trained, solving the fusion algorithm to be trained by adopting a block coordinate descent algorithm to obtain preliminary parameters, and taking the fusion algorithm comprising the preliminary parameters as the fusion algorithm of the preliminary training.

The fusion algorithm is as follows:

0 ≤ y, z, w ≤ 1, 0 < x ≤ 1,

x + y + z + w = 1,

where x, y, z and w are the parameters of the preset fusion algorithm; O is the total number of first samples input into the risk prediction model to be trained; K₁ is the number of preset classification algorithms; K₂ is the number of preset clustering algorithms; e_ij is the element in row i, column j of the distribution matrix; d_ij is the element in row i, column j of the homogeneity matrix; and the remaining vector symbols denote, in order, the i-th and j-th row vectors of the sample actual probability matrix, the j-th row vector of the preset cluster actual probability matrix, the i-th row vector of the sample prediction average probability matrix, and the j-th row vector of the cluster prediction average probability matrix.
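The constraints on the four parameters can be expressed as a small feasibility check, which a solver could use to validate candidate values. This sketch covers only the constraints stated above, not the objective function or the block coordinate descent procedure; the function name `is_feasible` and the tolerance value are assumptions.

```python
import math


def is_feasible(x, y, z, w, tol=1e-9):
    """Check 0 < x <= 1, 0 <= y, z, w <= 1 and x + y + z + w = 1.

    The sum is compared with an absolute tolerance because floating-point
    addition rarely yields exactly 1.0.
    """
    if not (0.0 < x <= 1.0):
        return False
    if not all(0.0 <= value <= 1.0 for value in (y, z, w)):
        return False
    return math.isclose(x + y + z + w, 1.0, abs_tol=tol)


ok = is_feasible(0.4, 0.3, 0.2, 0.1)        # all constraints hold
zero_x = is_feasible(0.0, 0.5, 0.3, 0.2)    # violates 0 < x
negative = is_feasible(0.5, 0.5, 0.5, -0.5)  # violates 0 <= w
```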

The process of training the fusion algorithm is to solve and optimize the parameters of the fusion algorithm.

Step S40, predicting the credit risk level of the second sample by using the preliminarily trained credit risk prediction model to obtain the credit risk prediction level of the second sample;

After the preliminarily trained credit risk prediction model is obtained, it is used to make a preliminary prediction of the credit risk levels of the plurality of second samples, and the result of the preliminary prediction is taken as the credit risk prediction level of each second sample.

Specifically, referring to fig. 4, fig. 4 is a detailed flowchart, according to another embodiment of the present invention, of the step of predicting the credit risk level of the second sample by using the preliminarily trained credit risk prediction model to obtain the credit risk prediction level of the second sample. Based on the above embodiment, step S40 includes:

step S41, inputting a plurality of second samples into each preset classification algorithm and each preset clustering algorithm to obtain a second classification result respectively output by each preset classification algorithm and a second clustering result respectively output by each preset clustering algorithm;

The plurality of second samples are input into each preset classification algorithm and each preset clustering algorithm in the preliminarily trained credit risk prediction model, to obtain the second classification result output by each preset classification algorithm and the second clustering result output by each preset clustering algorithm.

Step S42, inputting the second classification results respectively output by the preset classification algorithms and the second clustering results respectively output by the preset clustering algorithms into the preliminarily trained fusion algorithm, and outputting the credit risk prediction level of each second sample.

The second classification results and the second clustering results are then input into the preliminarily trained fusion algorithm in the preliminarily trained credit risk prediction model, which calculates, for each second sample, the probability values of belonging to the different credit risk prediction levels; the credit risk prediction level corresponding to the largest of a sample's probability values is taken as the credit risk prediction level of that sample.
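The final selection in step S42 is a per-sample maximum over the fused probabilities. A minimal sketch follows, assuming the fusion algorithm has already produced one probability per level for each second sample; the function name `predict_levels`, the level names and the probability values are illustrative assumptions.

```python
def predict_levels(fused_probs, levels):
    """Pick, for each sample, the level with the largest fused probability.

    fused_probs[s][j] is the probability that sample s belongs to levels[j].
    If several levels tie for the maximum, the first one in `levels` is
    chosen (a tie-breaking assumption; the source does not specify one).
    """
    return [
        levels[max(range(len(levels)), key=lambda j: row[j])]
        for row in fused_probs
    ]


levels = ["low", "medium", "high"]
fused_probs = [
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
second_sample_levels = predict_levels(fused_probs, levels)
```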

Step S50, training the preliminarily trained credit risk prediction model with the plurality of first samples and their corresponding actual credit risk levels and the plurality of second samples and their corresponding credit risk prediction levels, to obtain a final credit analysis prediction model.

The obtained first samples with their corresponding actual credit risk levels, together with the obtained second samples with their corresponding credit risk prediction levels, are used as training samples for the preliminarily trained credit risk prediction model. The specific training process is the same as that of step S30 and is not repeated here. The final credit analysis prediction model is thereby obtained.

In this embodiment, information of users who have performed repayment behaviors is collected as first samples, and the actual credit risk level of each first sample is marked according to a preset mapping relationship between user repayment behaviors and credit risk levels; information of users who failed the loan audit is collected as second samples; the credit risk prediction model to be trained is trained according to the plurality of first samples and the corresponding actual credit risk levels to obtain a preliminarily trained credit risk prediction model; the credit risk level of each second sample is predicted by the preliminarily trained credit risk prediction model to obtain the credit risk prediction level of the second sample; and the preliminarily trained credit risk prediction model is trained with the plurality of first samples and the corresponding actual credit risk levels and the plurality of second samples and the corresponding credit risk prediction levels to obtain the final credit analysis prediction model. When the credit analysis prediction model is constructed, the model is first preliminarily trained with the information of users who were granted loans, and is then retrained with the information of users who were granted loans together with the information of users who were refused loans. The resulting model therefore retains high risk prediction accuracy for potential users who meet the loan conditions while improving risk prediction accuracy for users who do not meet them, which improves the model's overall credit risk prediction accuracy.
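The overall flow of steps S30, S40 and S50 can be sketched with a deliberately simple stand-in model. The nearest-centroid classifier below is not the classification/clustering/fusion model of this embodiment; it is substituted only so that the train, predict and retrain sequence is runnable. All function names, feature values and level names are assumptions.

```python
def train_centroids(samples, levels):
    """Stand-in 'model': one mean feature vector per credit risk level."""
    sums, counts = {}, {}
    for features, level in zip(samples, levels):
        acc = sums.setdefault(level, [0.0] * len(features))
        for idx, value in enumerate(features):
            acc[idx] += value
        counts[level] = counts.get(level, 0) + 1
    return {lvl: [v / counts[lvl] for v in acc] for lvl, acc in sums.items()}


def predict(model, features):
    """Return the level whose centroid is nearest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda lvl: dist(model[lvl]))


def two_stage_training(first_samples, actual_levels, second_samples):
    # Step S30: preliminary training on users with repayment behaviors.
    preliminary = train_centroids(first_samples, actual_levels)
    # Step S40: predict levels for users who failed the loan audit.
    predicted_levels = [predict(preliminary, s) for s in second_samples]
    # Step S50: retrain on both sample sets together.
    final = train_centroids(first_samples + second_samples,
                            actual_levels + predicted_levels)
    return final, predicted_levels


first = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.2]]
actual = ["low", "low", "high", "high"]
second = [[0.3, 0.2], [0.0, 0.1]]
final_model, pseudo_levels = two_stage_training(first, actual, second)
```

In this toy data both second samples lie closer to the "high" centroid, so they enter the retraining set labelled "high" alongside the first samples that carry actual levels.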

Further, a second embodiment of the credit risk prediction method of the present application is proposed based on the first embodiment. In this embodiment, step S30 further includes:

Step S33, training the credit risk prediction model to be trained according to a ten-fold (K = 10) cross-validation method, the plurality of first samples and the corresponding actual credit risk levels, to obtain a preliminarily trained credit risk prediction model.

In this embodiment, the plurality of first samples are divided into ten sample subsets (generally of equal size). Each sample subset is used in turn as the verification sample set while the remaining K-1 (nine) sample subsets are used as the training sample set, giving 10 training sample set/verification sample set pairs. The credit risk prediction model to be trained is trained on each training sample set to obtain a corresponding candidate credit risk prediction model, and the stability of each candidate model is evaluated to obtain a stability evaluation result. The candidate credit risk prediction model with the best stability evaluation result is then selected as the preliminarily trained credit risk prediction model. K-fold cross-validation helps guard against under-fitting and over-fitting, so the model finally obtained is more convincing.
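The ten-way split can be sketched as follows. This minimal example only generates the 10 training/verification index pairs; model training and the stability evaluation are left out because the text does not specify the stability metric. The function name `ten_fold_pairs` and the interleaved, unshuffled assignment of samples to subsets are assumptions.

```python
def ten_fold_pairs(num_samples, k=10):
    """Return k (training_indices, verification_indices) pairs.

    Sample indices are dealt into k subsets in round-robin order, so subset
    sizes differ by at most one. Each subset serves once as the verification
    set while the other k - 1 subsets form the training set.
    """
    indices = list(range(num_samples))
    folds = [indices[f::k] for f in range(k)]
    pairs = []
    for f in range(k):
        verification = folds[f]
        training = [i for g in range(k) if g != f for i in folds[g]]
        pairs.append((training, verification))
    return pairs


pairs = ten_fold_pairs(25)
```

With 25 first samples this produces subsets of two or three samples each, and every sample appears in exactly one verification set.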

The invention also provides a credit risk prediction system, which comprises:

the first collection module 10 is configured to collect information of users who have performed repayment behaviors as first samples, and to mark the actual credit risk level of each first sample according to a preset mapping relationship between user repayment behaviors and credit risk levels;

a second collection module 20, configured to collect information of users who failed the loan audit as second samples;

the first training module 30 is configured to train the credit risk prediction model to be trained according to the plurality of first samples and the corresponding actual credit risk levels, to obtain a preliminarily trained credit risk prediction model;

the prediction module 40 is configured to predict the credit risk level of the second sample by using the preliminarily trained credit risk prediction model, to obtain the credit risk prediction level of the second sample;

the second training module 50 is configured to train the preliminarily trained credit risk prediction model with the plurality of first samples and the corresponding actual credit risk levels and the plurality of second samples and the corresponding credit risk prediction levels, to obtain a final credit analysis prediction model.

Optionally, the first collection module 10 further includes:

the first acquisition unit is used for acquiring information of users who have performed repayment behaviors;

the first processing unit is used for filtering sensitive information out of, and/or preprocessing, the information of users who have performed repayment behaviors;

and the first sample generation unit is used for taking the filtered and/or preprocessed information of users who have performed repayment behaviors as a first sample.

Optionally, the second collection module 20 further includes:

the second acquisition unit is used for acquiring information of users who failed the loan audit;

the second processing unit is used for filtering sensitive information out of, and/or preprocessing, the information of users who failed the loan audit;

and the second sample generation unit is used for taking the filtered and/or preprocessed information of users who failed the loan audit as a second sample.

Optionally, if the credit risk prediction model to be trained includes at least one preset different classification algorithm, at least one preset different clustering algorithm, and a fusion algorithm to be trained, the first training module 30 includes:

the first input unit is used for inputting a plurality of first samples into each preset classification algorithm and each preset clustering algorithm respectively to obtain a first classification result output by each preset classification algorithm and a first clustering result output by each preset clustering algorithm, wherein the first classification result output by each preset classification algorithm comprises the probability that each first sample belongs to the corresponding credit risk class and the credit risk prediction class of each first sample, and the first clustering result output by each preset clustering algorithm comprises the clusters with the same number as the preset credit risk class, the probability that each cluster belongs to the corresponding credit risk class and the cluster class to which each first sample belongs;

The first training unit is used for training the fusion algorithm to be trained according to the preset cluster actual probability matrix, the actual credit risk level of each first sample, the first classification result output by each preset classification algorithm and the first clustering result output by each preset clustering algorithm to obtain a fusion algorithm for preliminary training.

Optionally, the first training unit includes:

the first construction subunit is used for constructing a sample actual probability matrix according to the actual credit risk level of each first sample;

the second construction subunit is used for constructing a sample prediction average probability matrix, a cluster prediction average probability matrix, a distribution matrix and a homogeneity matrix according to the first classification result output by each preset classification algorithm and the first clustering result output by each preset clustering algorithm;

the acquisition subunit is used for inputting a preset cluster actual probability matrix, a sample prediction average probability matrix, a cluster prediction average probability matrix, a distribution matrix and a homogeneity matrix into a fusion algorithm to be trained, and acquiring preliminary parameters in the fusion algorithm to be trained by adopting a block coordinate descent algorithm.

Optionally, the first training module 30 further includes:

the second training unit is used for training the credit risk prediction model to be trained according to a ten-fold (K = 10) cross-validation method, the plurality of first samples and the corresponding actual credit risk levels, to obtain a preliminarily trained credit risk prediction model.

Optionally, the prediction module 40 includes:

the second input unit is used for inputting a plurality of second samples into each preset classification algorithm and each preset clustering algorithm to obtain a second classification result respectively output by each preset classification algorithm and a second clustering result respectively output by each preset clustering algorithm;

the output unit is used for inputting the second classification results respectively output by the preset classification algorithms and the second clustering results respectively output by the preset clustering algorithms into the preliminarily trained fusion algorithm, and outputting the credit risk prediction level of each second sample.

The present invention also proposes a computer-readable storage medium on which a computer program is stored. The computer-readable storage medium may be the memory 02 in the credit risk prediction terminal of fig. 1, or may be at least one of a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, an optical disk, and the like. The computer-readable storage medium includes several pieces of information for causing the credit risk prediction terminal to perform the method according to the embodiments of the present invention.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment.

The foregoing describes only preferred embodiments of the present invention and does not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of this description and the accompanying drawings, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method of credit risk prediction, the method comprising the steps of:

training the preliminarily trained credit risk prediction model by using a plurality of first samples and corresponding actual credit risk levels and a plurality of second samples and corresponding credit risk prediction levels to obtain a final credit analysis prediction model;

if the credit risk prediction model to be trained includes at least one preset different classification algorithm, at least one preset different clustering algorithm and a fusion algorithm to be trained, the step of training the credit risk prediction model to be trained according to the plurality of first samples and the corresponding actual credit risk classes to obtain a preliminary trained credit risk prediction model includes:

training a fusion algorithm to be trained according to a preset cluster actual probability matrix, actual credit risk levels of all first samples, first classification results output by all preset classification algorithms and first clustering results output by all preset clustering algorithms to obtain a fusion algorithm for preliminary training;

training the fusion algorithm to be trained according to a preset cluster actual probability matrix, actual credit risk levels of all first samples, first classification results output by all preset classification algorithms and first clustering results output by all preset clustering algorithms, and obtaining a primary training fusion algorithm comprises the following steps:

the homogeneity matrix is:

wherein d_ij is the number of times the credit risk prediction levels of the i-th sample and the j-th sample predicted by each classification algorithm are the same, plus the number of times the i-th sample and the j-th sample are divided into the same cluster class by each clustering algorithm, and O is the total number of first samples input into the risk prediction model to be trained; wherein when i = j, d_ij = (K₁+K₂) L;

2. The credit risk prediction method according to claim 1, wherein the step of collecting information of the user who has performed the repayment action as the first sample includes:

collecting information of users who have performed repayment behaviors;

3. The credit risk prediction method according to claim 1, wherein the step of collecting, as the second sample, user information that the loan audit fails includes:

collecting user information which is not passed by loan audit;

4. A credit risk prediction method according to claim 3, wherein the step of predicting the credit risk level of the second sample using the preliminary trained credit risk prediction model, and obtaining the credit risk prediction level of the second sample comprises:

5. A credit risk level prediction system, the system comprising:

the second training module is used for training the credit risk prediction model of the preliminary training to obtain a final credit analysis prediction model by using a plurality of first samples and corresponding actual credit risk grades and a plurality of second samples and corresponding credit risk prediction grades;

the homogeneity matrix is:

wherein d_ij is the number of times the credit risk prediction levels of the i-th sample and the j-th sample predicted by each classification algorithm are the same, plus the number of times the i-th sample and the j-th sample are divided into the same cluster class by each clustering algorithm, and O is the total number of first samples input into the risk prediction model to be trained; wherein when i = j, d_ij = (K₁+K₂) L;

6. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the credit risk prediction method of any of claims 1 to 4.

7. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the credit risk prediction method according to any of claims 1 to 4.