CN114722941A - Credit default identification method, apparatus, device and medium

Info

Publication number
CN114722941A
Authority
CN
China
Prior art keywords
data
financial transaction
transaction data
layer
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210374880.3A
Other languages
Chinese (zh)
Inventor
温丽娜
王丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202210374880.3A
Publication of CN114722941A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Medical Informatics (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention discloses a credit default identification method, apparatus, device and medium. The credit default identification method comprises: training a convolutional neural network algorithm on financial transaction data to extract effective features, thereby reducing the dimension of the financial transaction data; and constructing a data set from the effective features, then performing model training and model testing on the data set with a random forest algorithm to perform default prediction. Compared with the prior art, the method is suited to feature extraction from high-dimensional data and improves the accuracy of credit default identification.

Description

Credit default identification method, apparatus, device and medium
Technical Field
The invention relates to the technical field of data processing, in particular to a credit default identification method, device, equipment and medium.
Background
With the rapid development of internet technology, the spread of convenient payment methods, and the maturing of the related security technologies, the financial field has gained room to grow, and internet finance has emerged and expanded rapidly. Traditional financial service institutions, including banks, insurers and securities firms, play a vital role in promoting internet-based financial services: they advance traditional financial business while widening the range of services the financial field offers. As financial transaction modes become more intelligent and convenient and advanced online payment platforms appear, financial services become efficient and convenient, and the customer experience improves. Today the internet financial market generates online and offline transaction records on a large scale in a short time. The convenience of intelligent payment and the variety of online financial transaction types have driven the explosive growth of internet finance, but they have also brought risks and challenges: traditional credit risk accounts for only a small share, and most of the risk comes from fraudulent transactions.
Artificial intelligence technology, represented by deep learning, can be combined with the massive transaction data of a big data environment to address the difficulty of analyzing high-dimensional, complex data. It markedly reduces labor costs, strengthens risk control and business processing capabilities, and has been applied successfully in intelligent finance and big data risk control. Compared with manual feature description and construction, deep learning can make full use of massive data resources to train a model, can capture the rich information hidden in the raw data, obtains the most effective feature representation, and mines the associations between data by building a relatively complex network structure. Identifying fraudulent transactions quickly and accurately is a major challenge shared across the financial field. Reasonably controlling fraud risk in the financial field, stopping illegal financial behavior, avoiding economic loss, and using technical means to prevent and control excessive consumption and excessive debt by users are of great theoretical significance and practical value for the competitiveness and risk management level of domestic financial institutions.
However, most of the prior art uses conventional machine learning methods, which have limited feature learning capability, require manual feature extraction supported by business knowledge, and are not suited to feature extraction from high-dimensional data. As a result, the prior art identifies credit defaults with low accuracy.
Disclosure of Invention
The invention provides a credit default identification method, a credit default identification device and a credit default identification medium, which are suitable for feature extraction of high-dimensional data and improve the accuracy of credit default identification.
According to an aspect of the present invention, there is provided a credit default identification method, including:
training financial transaction data by adopting a convolutional neural network algorithm to extract effective characteristics so as to reduce the dimension of the financial transaction data;
and constructing a data set according to the effective characteristics, and performing model training and model testing on the data set by adopting a random forest algorithm to perform default prediction.
Optionally, the financial transaction data is post-loan data.
Optionally, the convolutional neural network algorithm comprises:
converting the financial transaction data into a matrix at an input layer and passing the matrix to a convolutional layer;
performing feature extraction on the financial transaction data at the convolutional layer and passing the extracted features to a max pooling layer;
down-sampling the input data at the max pooling layer to reduce its size, and passing the result to a global max pooling layer;
further reducing the dimension of the input multi-dimensional vector at the global max pooling layer, converting it into a one-dimensional vector, and passing the one-dimensional vector to a fully connected layer;
passing the data through the fully connected layer, which comprises a preset number of output neurons, to a dropout layer;
applying a regularization method at the dropout layer to prevent overfitting, and passing the result to a sigmoid layer;
and replacing the sigmoid layer with a random forest algorithm to classify the prediction results, and passing the prediction results to an output layer.
Optionally, performing model training and model testing on the data set by using a random forest algorithm, including:
resampling the financial transaction data whose features have been extracted by the convolutional neural network algorithm, drawing samples with replacement to obtain a plurality of data sets;
establishing a decision tree for each data set, wherein the decision tree is generated by recursively constructing a binary classification tree, and training sets are randomly drawn from the plurality of obtained training sets to form the subsets on which the decision tree is constructed;
repeating the step of obtaining the training set and the step of establishing a decision tree for the training set, and forming a random forest classifier for credit default identification from the trained decision trees;
and performing classification prediction with each decision tree of the generated credit default classification model, counting the classification results of all decision trees for the same user transaction, and selecting the category with the most votes as the final classification category.
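The resampling and voting steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the applicant's implementation; the helper names `bootstrap_sample` and `majority_vote`, the toy rows, and the vote list are all hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) samples with replacement to form one bootstrap data set."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the label that the most decision trees predicted for one transaction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
# Hypothetical (feature vector, label) pairs; label 1 means default.
rows = [([0.1, 0.9], 0), ([0.8, 0.2], 1), ([0.7, 0.3], 1), ([0.2, 0.6], 0)]

# One bootstrap data set per decision tree (five trees here).
datasets = [bootstrap_sample(rows, rng) for _ in range(5)]

# Hypothetical per-tree predictions for a single user transaction.
votes = [1, 0, 1, 1, 0]
final_label = majority_vote(votes)
```

Because sampling is done with replacement, a given row can appear several times in one data set and not at all in another, which is what makes the individual trees differ.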
Optionally, before training the financial transaction data by using a convolutional neural network algorithm to extract valid features, the method further includes:
and performing data preprocessing on the financial transaction data.
Optionally, the data preprocessing comprises:
performing data cleaning on the financial transaction data, performing data missing value filling on the financial transaction data, and performing data normalization on the financial transaction data.
Optionally, public data provided by a Kaggle competition is used as the financial transaction data to verify the credit default identification method experimentally.
According to another aspect of the present invention, there is provided a credit default identification apparatus, comprising:
the feature learning module is used for training financial transaction data by adopting a convolutional neural network algorithm to extract effective features so as to reduce the dimension of the financial transaction data;
and the default prediction module is used for constructing a data set according to the effective features, and performing model training and model testing on the data set by adopting a random forest algorithm so as to perform default prediction.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the credit default identification method of any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to execute the credit default identification method according to any one of the embodiments of the present invention.
The embodiments of the invention provide a credit default identification method that combines a convolutional neural network with a random forest algorithm. A CNN-RF model is constructed to identify fraudulent transactions in high-dimensional, imbalanced credit financial transaction data. Specifically, the core of the method is a two-stage task: in the first stage, a convolutional neural network model is constructed and trained on the raw financial transaction data to extract features and achieve effective dimension reduction; in the second stage, a random forest classifier performs the classification prediction. The embodiments thus exploit both the ability of the CNN to extract features from complex data and the stronger generalization capability of the random forest algorithm, and perform better at credit default identification. To address the time and labor cost of manual feature extraction and the limited feature learning capability of traditional credit default identification models, the embodiments use the CNN to extract effective features from high-dimensional financial transaction data, which makes feature extraction more efficient. To address the weak generalization capability of the convolutional neural network classifier, the embodiments combine the CNN with the random forest algorithm to classify the credit transaction data.
Experimental results show that the model maintains high classification precision and substantially improves the AUC value, and that it is effective for credit default identification.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating a method for identifying a credit default in accordance with an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating another method for identifying a credit default in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a model of a credit default identification method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a convolutional neural network model training process according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of model training and model testing for a data set by using a random forest algorithm according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a method for using a random forest algorithm according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a credit default identification apparatus according to an embodiment of the present invention;
FIG. 9 shows a schematic diagram of an electronic device that may be used to implement an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Embodiments of the present invention provide a credit default identification method, which may be performed by a credit default identification apparatus, which may be implemented by software and/or hardware. Fig. 1 is a flowchart illustrating a credit default identification method according to an embodiment of the present invention. Referring to fig. 1, the credit default identification method includes the steps of:
S110, training the financial transaction data by adopting a convolutional neural network algorithm to extract effective features so as to reduce the dimension of the financial transaction data.
A convolutional neural network (CNN) does not depend on financial business knowledge. It omits complex steps such as feature engineering and feature selection and learns features directly from the raw data, and the features learned by a CNN model represent the essential and diverse characteristics of the raw data well. The model also has strong function approximation, adaptive learning and fault-tolerance capabilities, and it can use the data actually obtained to establish detection criteria suited to the current data and so judge how likely a transaction is to be abnormal. CNN technology has achieved good classification and recognition results in fields such as behavior recognition, natural language processing and speech recognition, which has widened its range of application. With its strong feature extraction capability and adaptability, the CNN is well suited to credit default identification and can support the development of internet finance.
Credit loan risk control comprises three stages: before the loan, during the loan, and after the loan. Traditional rule-based or machine learning credit default identification models concentrate on the pre-loan stage: credit investigation data, text data, human-society data and the like are collected, analyzed and processed; feature engineering and feature selection are then carried out using professional knowledge of the financial field; and finally a machine learning model is built. This approach can usually filter out some fraudulent users, but it cannot monitor post-loan risk. With the rapid development of internet finance and the wide use of related services, trillions of transaction records are generated continuously, transaction types are varied, and transaction fraud is increasingly common. Monitoring customers' transaction behavior during the loan can reduce the losses of financial institutions, so accurately identifying abnormal customer transaction behavior and keeping defaults under control is an important goal of the financial industry. The embodiment of the invention monitors users' post-loan transaction data and mines fraudulent behavior from it to reduce the losses of financial institutions.
S120, constructing a data set according to the effective features, and performing model training and model testing on the data set by adopting a random forest algorithm to perform default prediction.
In 2001, Leo Breiman combined Bagging with the random subspace idea to define the Random Forest (RF): the training samples of the model are obtained by random sampling, the features each decision tree uses to split a node are also drawn from a randomly sampled feature subspace, and the optimal split value is then obtained by calculation. In essence, the RF algorithm trains a number of weak decision tree classifiers and integrates them into a classifier with strong classification capability; the final classification result is computed from all the base classifiers according to a specified rule. The random forest algorithm has good prediction performance and a small generalization error, classifies large-scale imbalanced data particularly well, and mitigates overfitting. It is also insensitive to outliers and noisy data, maintains good accuracy, and processes high-dimensional data in a parallel and scalable manner. The random forest algorithm has therefore been widely applied to gene sequence classification and regression in bioinformatics, customer information analysis and anti-fraud in economics and finance, speech discrimination and speech synthesis in the speech field, and so on. As an important branch of ensemble learning, the random forest algorithm adapts well to the data set and can effectively balance errors when processing an imbalanced data set. Compared with a single classifier, it also generalizes better and is more robust to noise.
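The node-splitting idea described above (a randomly sampled feature subspace, then a computed optimal split value) can be sketched as follows. The source does not name a split criterion, so Gini impurity is an assumption here, and the function names and toy data are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y, n_features, rng):
    """Search a random feature subset for the (feature, threshold) pair
    with the lowest weighted Gini impurity. Assumed criterion, see lead-in."""
    candidates = rng.sample(range(len(X[0])), n_features)
    best = None  # (weighted impurity, feature index, threshold)
    for f in candidates:
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Hypothetical toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
y = [0, 0, 1, 1]
impurity, feature, threshold = best_split(X, y, n_features=2, rng=random.Random(0))
```

In a forest, `n_features` would normally be smaller than the total number of features so that different trees see different subspaces; it is set to the full count here only to keep the toy result deterministic.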
The embodiments of the invention provide a credit default identification method that combines a convolutional neural network with a random forest algorithm. A CNN-RF model is constructed to identify fraudulent transactions in high-dimensional, imbalanced credit financial transaction data. Specifically, the core of the method is a two-stage task: in the first stage, a convolutional neural network model is constructed and trained on the raw financial transaction data to extract features and achieve effective dimension reduction; in the second stage, a random forest classifier performs the classification prediction. The embodiments thus exploit both the ability of the CNN to extract features from complex data and the stronger generalization capability of the random forest algorithm, and perform better at credit default identification. To address the time and labor cost of manual feature extraction and the limited feature learning capability of traditional credit default identification models, the embodiments use the CNN to extract effective features from high-dimensional financial transaction data, which makes feature extraction more efficient. To address the weak generalization capability of the convolutional neural network classifier, the embodiments combine the CNN with the random forest algorithm to classify the credit transaction data.
Experimental results show that the model maintains high classification precision and substantially improves the AUC value, and that it is effective for credit default identification.
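For readers unfamiliar with the AUC metric mentioned above, the following minimal sketch computes it as the fraction of (default, normal) pairs in which the default sample receives the higher score, with ties counted as half. The labels and scores are made-up illustration values, not results from the source.

```python
def auc(labels, scores):
    """Probability that a random default (1) is scored above a random normal (0) sample."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = default) and predicted default scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auc(labels, scores)  # 3 of the 4 (default, normal) pairs are ordered correctly
```

Because AUC depends only on the ranking of scores, it is less sensitive to class imbalance than plain accuracy, which is presumably why it is reported for imbalanced credit data.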
Fig. 2 is a flowchart illustrating another credit default identification method according to an embodiment of the present invention. Referring to fig. 2, on the basis of the foregoing embodiments, before the financial transaction data is trained with a convolutional neural network algorithm to extract effective features, the method further includes performing data preprocessing on the financial transaction data to remove interference items, which makes the default prediction model more accurate and further improves the accuracy of credit default identification. The specific steps are as follows:
S210, performing data preprocessing on the financial transaction data.
Illustratively, the data preprocessing includes performing data cleaning, missing value filling and data normalization on the financial transaction data. Data normalization maps the data in each column to the range [0, 1], yielding a normalized data set. The normalized data set is then divided into training samples and test samples.
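The preprocessing described above can be sketched with the standard library alone. The source does not say how missing values are filled or how the split is made, so mean imputation, the 75/25 split ratio, and every function name below are assumptions made for illustration.

```python
import random

def fill_missing(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max(column):
    """Map a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def preprocess(rows):
    """Fill and normalize column by column, then return row-major data again."""
    columns = [min_max(fill_missing(list(col))) for col in zip(*rows)]
    return [list(r) for r in zip(*columns)]

def train_test_split(rows, test_ratio, seed=0):
    """Shuffle a copy of the rows and cut it into training and test samples."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical raw transactions with missing entries marked as None.
raw = [[1.0, 200.0], [None, 400.0], [3.0, None], [5.0, 800.0]]
clean = preprocess(raw)
train, test = train_test_split(clean, test_ratio=0.25)
```

The sketch follows the order given in the text (normalize, then split). A common alternative is to compute the fill values and the column minima and maxima on the training samples only, so that no information from the test samples influences the model.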
S220, training the financial transaction data by adopting a convolutional neural network algorithm to extract effective characteristics so as to reduce the dimension of the financial transaction data.
S230, constructing a data set according to the effective features, and performing model training and model testing on the data set by adopting a random forest algorithm to perform default prediction.
Steps S210 to S230 complete the task of identifying fraudulent transactions in high-dimensional, imbalanced credit financial transaction data. The method is suited to feature extraction from high-dimensional data and improves the accuracy of credit default identification.
Fig. 3 is a schematic model diagram of a credit default identification method according to an embodiment of the present invention. Referring to fig. 3, on the basis of the foregoing embodiments, optionally, the model formed by the convolutional neural network algorithm is a CNN feature extractor, and the model formed by the random forest algorithm is a random forest classifier. The model is executed as follows. First, data preprocessing is performed on the acquired raw financial transaction data set to obtain the financial transaction data. Effective features are then obtained through the CNN feature extractor and form a set of feature vectors. The feature vector set comprises a training set, used for model training, and a test set, used for model testing. Finally, the training set and the test set are passed through the random forest classifier to obtain the default prediction result, which is either a default label or a normal label.
On the basis of the above embodiments, optionally, the convolutional neural network algorithm includes a basic structure determining step and a hyper-parameter adjusting step.
Fig. 4 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present invention. Referring to fig. 4, the model comprises an input layer, a convolutional layer, a max pooling layer, a global max pooling layer, a fully connected layer, a dropout layer, a sigmoid layer, and an output layer. On the basis of the foregoing embodiments, optionally, the convolutional neural network algorithm includes:
The input layer, which converts the financial transaction data into a matrix and passes the matrix to the convolutional layer. Illustratively, the embodiment of the invention can convert 740-dimensional raw data into a matrix, which makes it easier for the convolutional layer to compute and extract features.
The convolutional layer, which performs feature extraction on the financial transaction data and passes the result to the max pooling layer. Illustratively, feature extraction is performed on the input data using 32 convolution kernels of size 1 × 74 with a stride of 1, and the resulting feature map has size (10, 32).
The max pooling layer, which down-samples the input data to reduce its size and passes the result to the global max pooling layer. Illustratively, the pooling kernel size is 1 × 32, the stride is 1, and the resulting output is (10, 32). The convolutional layer and the max pooling layer constitute the hidden layer of the CNN, and the number of hidden layers of the CNN is set to 1; for example, the CNN has 7 layers.
The global max pooling layer, which further reduces the dimension of the input multi-dimensional vector, converts it into a one-dimensional vector, and passes the one-dimensional vector to the fully connected layer. Illustratively, a one-dimensional vector of length 32 is output.
The fully connected layer, which comprises a preset number of output neurons and passes its output to the dropout layer. Illustratively, the number of output neurons is 128.
The dropout layer, which applies a regularization method to prevent overfitting and passes its output to the sigmoid layer. Specifically, neurons are deactivated with a certain probability during training, which prevents the model from overfitting.
The sigmoid layer, which is replaced by a random forest algorithm that classifies the prediction results and passes them to the output layer. Both the sigmoid layer and the random forest algorithm classify the prediction result, so the random forest algorithm can take the place of the sigmoid layer.
After the above operations, the feature dimension changes from 740 dimensions before entering the convolutional neural network to 128 dimensions.
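As a non-limiting illustration, the dimension changes described above can be traced with the following Python sketch. It assumes that the 740-dimensional record is reshaped row by row into the 10 × 74 matrix listed for the input layer in table 2, uses random placeholder weights instead of trained ones, and omits biases, activations, dropout, and the local max pooling layer (whose output size equals its input size in the example above). The helper names conv_rows, global_max_pool and dense are hypothetical and are not taken from the embodiment.

```python
import random

def conv_rows(matrix, kernels):
    """Slide each 1 x n_cols kernel over the rows with step 1 (one dot product per row)."""
    return [[sum(w * x for w, x in zip(kernel, row)) for kernel in kernels]
            for row in matrix]

def global_max_pool(feature_map):
    """Keep the maximum of every channel across all rows."""
    return [max(channel) for channel in zip(*feature_map)]

def dense(vector, weights):
    """Fully connected layer without bias or activation (shape tracing only)."""
    return [sum(w * v for w, v in zip(unit, vector)) for unit in weights]

rng = random.Random(0)
record = [rng.random() for _ in range(740)]                  # one 740-dimensional sample
matrix = [record[i * 74:(i + 1) * 74] for i in range(10)]    # input layer: 10 x 74
kernels = [[rng.uniform(-1, 1) for _ in range(74)] for _ in range(32)]
feature_map = conv_rows(matrix, kernels)                     # convolutional layer: (10, 32)
pooled = global_max_pool(feature_map)                        # global max pooling: 32
fc_weights = [[rng.uniform(-1, 1) for _ in range(32)] for _ in range(128)]
features = dense(pooled, fc_weights)                         # full connection layer: 128

print((len(matrix), len(matrix[0])), (len(feature_map), len(feature_map[0])),
      len(pooled), len(features))
```

The 128 values in features correspond to the reduced feature vector that would be handed to the random forest classifier in place of the sigmoid layer.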
After the basic structure of the model is determined, different values of the model hyper-parameters are tried in order to find the optimal initial parameter settings of the CNN. The parameters involved include the length of the convolution kernel, the sliding step length, the number of hidden layers, the number of epochs, the CNN network structure, and the like. Illustratively, the training procedure of the CNN model covers four phases, namely a forward propagation phase, an error back propagation phase, a gradient calculation phase, and a gradient application and parameter update phase, and training ends after a certain number of iterations.
Fig. 5 is a schematic flowchart of a convolutional neural network model training process according to an embodiment of the present invention. Referring to fig. 5, on the basis of the above embodiments, optionally, the forward propagation stage in the convolutional neural network model training process includes: initializing; giving an input vector and a target output; calculating the output of each unit of the hidden layer and the output layer; computing the deviation e between the target value and the actual output; and judging whether e is within the allowable range. If so, training ends and the weights and thresholds are fixed; otherwise, the error back propagation stage is performed. The error back propagation stage comprises: calculating the error of the neurons in each network layer; computing the error gradient; updating the weights; and recalculating the output of each unit of the hidden layer and the output layer with the updated weights, that is, continuing with the forward propagation stage.
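The alternation between forward propagation, the check of the deviation e, and error back propagation can be sketched as follows. This is a minimal illustration on a single sigmoid unit trained by gradient descent, not the CNN of the embodiment; the toy data, the learning rate, the tolerance of 0.05 and the function name train are assumptions made only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, targets, lr=0.5, tolerance=0.05, max_iters=10000):
    weights = [0.0] * len(samples[0])          # initialisation
    bias = 0.0
    for iteration in range(max_iters):
        # Forward propagation: compute the output for every input vector.
        outputs = [sigmoid(sum(w * x for w, x in zip(weights, s)) + bias)
                   for s in samples]
        errors = [o - t for o, t in zip(outputs, targets)]
        e = max(abs(err) for err in errors)    # deviation between target and output
        if e <= tolerance:                     # e within the allowable range:
            return weights, bias, iteration    # stop and fix weights and threshold
        # Error back propagation: for a sigmoid unit with cross-entropy loss,
        # the gradient with respect to the pre-activation is (output - target).
        n = len(samples)
        grad_w = [sum(err * s[i] for err, s in zip(errors, samples)) / n
                  for i in range(len(weights))]
        grad_b = sum(errors) / n
        # Apply the gradients, then return to forward propagation.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias, max_iters

samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 1]
weights, bias, iterations = train(samples, targets)
predictions = [sigmoid(sum(w * x for w, x in zip(weights, s)) + bias) for s in samples]
print(iterations, [round(p, 3) for p in predictions])
```

The loop returns as soon as the largest deviation falls within the tolerance, so the returned weights are exactly those that passed the check.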
After the basic structure determining step and the hyper-parameter adjusting step, the hyper-parameter settings involved in the obtained convolutional neural network model are shown in table 1.
TABLE 1
(Table 1 is provided as image BDA0003589921170000111 in the original publication.)
The training parameters and outputs for each layer in the convolutional neural network model are shown in table 2.
TABLE 2
Network layer name          Layer output      Training parameters
Input layer                 (None, 10, 74)    0
Convolutional layer         (None, 3, 64)     9536
Max pooling layer           (None, 3, 64)     0
Convolutional layer         (None, 1, 64)     8256
Max pooling layer           (None, 1, 64)     0
Global max pooling layer    (None, 64)        0
Full connection layer       (None, 128)       8320
Dropout layer               (None, 128)       0
Sigmoid layer               (None, 1)         129
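The training parameter counts in table 2 can be checked with the usual formulas for one-dimensional convolution and fully connected layers (weights plus one bias per output unit). The sketch below is illustrative only. The convolution kernel length of 2 is an assumption inferred from the counts themselves, since table 2 does not state it, and the function names are hypothetical.

```python
def conv1d_params(kernel_length, in_channels, filters):
    """Weights (kernel_length * in_channels per filter) plus one bias per filter."""
    return kernel_length * in_channels * filters + filters

def dense_params(in_units, out_units):
    """One weight per input-output pair plus one bias per output unit."""
    return in_units * out_units + out_units

counts = {
    "first convolutional layer": conv1d_params(2, 74, 64),    # assumed kernel length 2
    "second convolutional layer": conv1d_params(2, 64, 64),   # assumed kernel length 2
    "full connection layer": dense_params(64, 128),
    "sigmoid layer": dense_params(128, 1),
}
for name, value in counts.items():
    print(name, value)
print("total", sum(counts.values()))
```

Under this assumption the four values are 9536, 8256, 8320 and 129, which agree with the table; the pooling and dropout layers have no trainable parameters.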
Fig. 6 is a schematic flow chart of performing model training and model testing on a data set by using a random forest algorithm according to an embodiment of the present invention. Referring to fig. 6, on the basis of the foregoing embodiments, optionally, after training of the feature extractor based on the convolutional neural network is completed, the sample set is input into the feature extractor to learn the feature set. For the last-layer output of the whole framework, the sigmoid layer in the convolutional neural network is replaced by a random forest algorithm, which serves as the final classifier, and the Gini impurity is used as the criterion for decision tree feature selection. Specifically, performing model training and model testing on the data set with the random forest algorithm comprises the following steps:
S310, resampling the financial transaction data after feature extraction by the convolutional neural network algorithm, drawing samples with replacement to obtain a plurality of data sets.
Referring to fig. 7, illustratively, let D be the financial transaction data set after CNN feature extraction and Dk the kth data set drawn from it. A random vector θk is generated for the kth decision tree, and the embodiment of the present invention uses h(Dk, θk) to denote the kth decision tree model.
S320, establishing a decision tree for each data set; the decision tree is generated by recursively constructing a binary classification tree: training sets are randomly drawn from the plurality of obtained training sets to form subsets, and a decision tree is constructed on each.
Illustratively, a decision tree is constructed separately for each of the k data sets. The splitting attributes are selected at random: a feature number H ≤ X is determined, where X denotes the total number of attributes (here, the 128 attributes of the financial transaction data extracted by the CNN); for every internal node, H attributes are randomly selected from the 128 attribute features as the splitting attribute set, and the best split is chosen for the node under consideration. The classification attribute is determined by the Gini index: if the sample set T contains N categories, the Gini coefficient is calculated according to the following formula.
$$\mathrm{Gini}(T) = 1 - \sum_{j=1}^{N} p_j^{2}$$
In the formula, $p_j$ denotes the relative frequency of class j in T. Assume that the set T is divided into m subsets $N_1, N_2, \ldots, N_m$. The Gini coefficient of this split is then calculated as shown in the following formula.
$$\mathrm{Gini}_{\mathrm{split}}(T) = \sum_{i=1}^{m} \frac{|N_i|}{|T|}\,\mathrm{Gini}(N_i)$$
After traversing all possible split points of all attributes, the feature split point with the minimum Gini coefficient is selected as the final splitting criterion.
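The selection of a split point by minimum Gini coefficient can be illustrated with a short Python sketch. It uses the standard Gini impurity (one minus the sum of squared class frequencies) and the size-weighted impurity of the subsets produced by a split; the function names and the four-sample toy feature are hypothetical and serve only as an example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(subsets):
    """Size-weighted Gini impurity of the subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

def best_threshold(values, labels):
    """Try every candidate threshold of one attribute and keep the lowest score."""
    best = None
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = gini_split([left, right])
        if best is None or score < best[1]:
            best = (t, score)
    return best

values = [0.1, 0.4, 0.5, 0.9]                       # one extracted feature
labels = ["normal", "normal", "default", "default"]
print(gini(labels))                                  # impurity before splitting
print(best_threshold(values, labels))                # threshold with minimum Gini
```

For this toy feature the impurity before splitting is 0.5, and the threshold 0.4 separates the two classes completely, giving a split score of 0.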
S330, repeating the step of obtaining the training set and the step of establishing a decision tree for the training set, and forming a random forest classifier for credit default identification from the trained decision trees.
Illustratively, the k trained decision trees form a random forest classifier H = {h1, h2, …, hk} for credit default identification.
S340, performing classification prediction with each decision tree of the generated credit default classification model, counting the classification results of all decision trees for the same user transaction, and selecting the category with the most votes as the final classification category.
Illustratively, input: credit training data set S, test data set T, number of decision trees in the random forest nTree, and feature number k.
Output: RF classifier and classification results.
1: For i = 1 : nTree
2: draw samples with replacement using the bootstrap method, providing each decision tree with a training set of N selected samples;
3: randomly select k attributes at each node, determine the best of the k attributes by comparison, and split the training samples;
4: recursively construct the decision tree without pruning;
5: End
6: on the test sample set, calculate the probability of classifying the data x under test into class c, where P(c | x) = (1/nTree) Σj hj(c | x);
7: classify c ← argmax P(c | x) according to the majority vote principle, and calculate the classification error;
8: return the RF model and the predicted classification results.
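The steps listed above can be sketched in plain Python as follows. This is a simplified illustration rather than the implementation of the embodiment: the two-feature toy data stand in for the 128 CNN-extracted features, labels 0 and 1 denote normal and default, and the function names and the values nTree = 25 and k = 1 are assumptions for the example.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, k, rng):
    """Recursively grow an unpruned binary tree; each node tries k random features."""
    if len(set(labels)) == 1:
        return labels[0]                                   # pure leaf
    best = None
    for f in rng.sample(range(len(rows[0])), k):
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                                       # no usable split: majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], k, rng),
            build_tree([rows[i] for i in right], [labels[i] for i in right], k, rng))

def tree_predict(tree, row):
    while isinstance(tree, tuple):                         # internal node: (f, t, left, right)
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def train_forest(rows, labels, n_tree, k, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_tree):
        idx = [rng.randrange(len(rows)) for _ in rows]     # bootstrap: N draws with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], k, rng))
    return forest

def forest_predict(forest, row):
    votes = Counter(tree_predict(tree, row) for tree in forest)
    probs = {c: v / len(forest) for c, v in votes.items()}  # P(c|x) = (1/nTree) * sum of votes
    return max(probs, key=probs.get), probs                 # majority vote

data_rng = random.Random(1)
normal = [[data_rng.uniform(0.0, 0.4), data_rng.uniform(0.0, 0.4)] for _ in range(20)]
default = [[data_rng.uniform(0.6, 1.0), data_rng.uniform(0.6, 1.0)] for _ in range(20)]
rows, labels = normal + default, [0] * 20 + [1] * 20
forest = train_forest(rows, labels, n_tree=25, k=1)
label, probs = forest_predict(forest, [1.0, 1.0])
print(label, probs)
```

Each tree sees its own bootstrap sample and its own random feature choices, so the trees differ from one another; the forest output is the class that receives the largest share of the votes.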
On the basis of the above embodiments, optionally, the embodiment of the present invention adopts public data provided by a Kaggle competition as the financial transaction data, and performs experimental verification of the credit default identification method.
The public data set is Loan-Default-preview, provided by the Kaggle competition. The data set comes from a loan default prediction contest and lists financial transactions associated with individuals. The training set contains 105471 samples in total, with 95688 positive samples, 1145 negative samples, and 771 features; the labels fall into two classes, default and non-default. Because the original data are extremely imbalanced, the embodiment of the invention deletes majority-class examples by undersampling to raise the sampling proportion of negative samples: 11100 positive samples and 900 negative samples are randomly drawn and finally split 8:2, so that 9600 transactions serve as training samples and 2400 transactions as test samples.
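The undersampling and the division into training and test samples can be sketched as follows. The records here are synthetic placeholders generated only to match the sample counts quoted above; the field name cls, the function name and the fixed random seed are assumptions, and the 8:2 ratio is the one implied by 9600 training and 2400 test samples.

```python
import random

def undersample_and_split(records, n_positive, n_negative,
                          train_parts=8, total_parts=10, seed=0):
    """Randomly keep n_positive and n_negative records, shuffle, then split."""
    rng = random.Random(seed)
    positive = [r for r in records if r["cls"] == "positive"]
    negative = [r for r in records if r["cls"] == "negative"]
    chosen = rng.sample(positive, n_positive) + rng.sample(negative, n_negative)
    rng.shuffle(chosen)
    cut = len(chosen) * train_parts // total_parts       # integer arithmetic for the 8:2 cut
    return chosen[:cut], chosen[cut:]

# Synthetic placeholder records with the class counts quoted in the text.
records = ([{"id": i, "cls": "positive"} for i in range(95688)] +
           [{"id": 95688 + i, "cls": "negative"} for i in range(1145)])
train_set, test_set = undersample_and_split(records, n_positive=11100, n_negative=900)
print(len(train_set), len(test_set))
```

Shuffling before the cut keeps both classes present in the training and test samples without fixing their exact proportion in each part.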
Table 3 shows the accuracy, recall rate and score of each model on the normal and default classes. The SVM classifier and the Logistic Regression (LR) classifier use default parameters, the number of decision trees of the RF classifier is set to 200, and the CNN hyperparameters are set according to table 1.
TABLE 3
(Table 3 is provided as image BDA0003589921170000141 in the original publication.)
As can be seen from table 3, conventional machine learning algorithms perform well at identifying on-time repayment but are weak at identifying loan defaults, and for a credit default identification model the identification of defaults is extremely important. The accuracy rate and the recall rate of the embodiment of the invention on both the normal and default categories are higher than those of the other models, and this comprehensive evaluation of classification performance demonstrates the advantage of the embodiment of the invention.
The results for the different models are shown in table 4.
TABLE 4
Classifier                     Accuracy       AUC
RF                             0.923479167    0.54094224
SVM                            0.924791667    0.5
LR                             0.926979167    0.60218914
CNN                            0.9045833      0.64049159
Embodiment of the invention    0.9705         0.861145207
From the experimental results in table 4, although the classification accuracy of the three conventional credit default identification methods SVM, LR, and RF exceeds 92%, their AUC values are low, which indicates that the classification performance of these models is not ideal. This illustrates the limitation of conventional machine learning algorithms on high-dimensional imbalanced data sets, namely insufficient capacity to learn high-dimensional features. Secondly, although the classification accuracy of the CNN model is lower than that of the conventional machine learning methods, its AUC is improved by at least 5%, which illustrates the advantage of the CNN model and the feasibility of autonomous feature learning with a CNN in the embodiments of the present invention. Finally, further comparison of the experimental results shows that the accuracy of the CNN-RF model is improved by 4.63% and its AUC value by 0.313 compared with the RF model; compared with the CNN model, the accuracy is improved by 6.52% and the AUC value by 0.2134. The combination of the convolutional neural network and the random forest algorithm thus gives a better classification effect on high-dimensional imbalanced financial transaction data. In summary, among the compared models, the CNN-RF model provided by the embodiment of the invention achieves the highest accuracy and the highest AUC on this credit default identification task.
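Why a high accuracy can coincide with a low AUC on imbalanced data can be seen from a small Python sketch. It assumes a hypothetical test set of 2400 transactions that keeps the 11100:900 sampling ratio (2220 normal and 180 default) and a degenerate model that assigns every transaction the same score; neither assumption is stated in the embodiment, and the AUC is computed here from its rank interpretation rather than with any particular library.

```python
def accuracy(labels, predictions):
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

def auc(labels, scores):
    """Probability that a random default case is scored above a random
    normal case, with ties counted as one half."""
    default_scores = [s for l, s in zip(labels, scores) if l == 1]
    normal_scores = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((d > n) + 0.5 * (d == n)
               for d in default_scores for n in normal_scores)
    return wins / (len(default_scores) * len(normal_scores))

labels = [0] * 2220 + [1] * 180          # hypothetical test set, 1 = default
scores = [0.0] * len(labels)             # every transaction scored as normal
predictions = [int(s >= 0.5) for s in scores]
print(accuracy(labels, predictions), auc(labels, scores))   # 0.925 0.5
```

Such a model never identifies a default, yet its accuracy is 92.5% because of the class ratio, while its AUC of 0.5 is no better than random ranking. This is the reason the AUC column of table 4 is the more informative measure for the default class.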
In conclusion, the experimental results show that the embodiment of the invention maintains high classification accuracy while greatly improving the AUC value, and is therefore effective for credit default identification.
Embodiments of the present invention also provide a credit default identification apparatus, which may be implemented by software and/or hardware. Fig. 8 is a schematic structural diagram of a credit default identification apparatus according to an embodiment of the present invention. Referring to fig. 8, the credit default identification means includes:
the feature learning module 410 is configured to train the financial transaction data by using a convolutional neural network algorithm to extract effective features, so as to perform dimension reduction on the financial transaction data;
and the default prediction module 420 is configured to construct a data set according to the effective features, and perform model training and model testing on the data set by using a random forest algorithm to perform default prediction.
The credit default identification device provided by the embodiment of the invention can execute the credit default identification method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
FIG. 9 shows a schematic diagram of an electronic device that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 9, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM)12, a Random Access Memory (RAM)13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM)12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as the credit default identification method.
In some embodiments, the credit default identification method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the credit default identification method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the credit default identification method by any other suitable means (e.g., by way of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A credit default identification method, comprising:
training financial transaction data by adopting a convolutional neural network algorithm to extract effective characteristics so as to reduce the dimension of the financial transaction data;
and constructing a data set according to the effective characteristics, and performing model training and model testing on the data set by adopting a random forest algorithm to perform default prediction.
2. The method of claim 1, wherein the financial transaction data is post-loan data.
3. The method of claim 1, wherein the convolutional neural network algorithm comprises:
converting the financial transaction data into a matrix at an input layer and transmitting the matrix to a convolutional layer;
performing feature extraction on the financial transaction data at the convolutional layer, and transmitting the feature extraction to a maximum pooling layer;
the input data is down-sampled in the maximum pooling layer, the size of the input data is reduced, and the input data is transmitted to the global maximum pooling layer;
further reducing the dimension of the input multi-dimensional vector in the global maximum pooling layer, converting the input multi-dimensional vector into a one-dimensional vector and transmitting the one-dimensional vector to the full connection layer;
the full connection layer comprises a preset number of output neurons and transmits its output to the dropout layer;
a regularization method is adopted in the dropout layer to prevent overfitting, and the output is transmitted to the sigmoid layer;
and adopting a random forest algorithm to replace the sigmoid layer to classify the prediction results, and transmitting the prediction results to an output layer.
4. The method of claim 3, wherein performing model training and model testing on the dataset using a random forest algorithm comprises:
resampling the financial transaction data after feature extraction by the convolutional neural network algorithm, drawing samples with replacement to obtain a plurality of data sets;
establishing a decision tree for the data set; the generation of the decision tree adopts a process of recursively constructing a binary classification tree, training sets are randomly extracted from a plurality of obtained training sets to form subsets, and the decision tree is constructed;
repeating the obtaining step of the training set and the step of establishing a decision tree for the training set, and forming a random forest classifier for credit default identification by using the trained decision tree;
and performing classification prediction with each decision tree of the generated credit default classification model, counting the classification results of all decision trees for the same user transaction, and selecting the category with the most votes as the final classification category.
5. The method of any one of claims 1-4, further comprising, prior to training financial transaction data using a convolutional neural network algorithm to extract valid features:
and performing data preprocessing on the financial transaction data.
6. The method of claim 5, wherein the data preprocessing comprises:
performing data cleaning on the financial transaction data, performing data missing value filling on the financial transaction data, and performing data normalization on the financial transaction data.
7. The method of claim 1, wherein the credit default identification method is experimentally validated using public data provided by the Kaggle contest for the financial transaction data.
8. A credit default identification device, comprising:
the characteristic learning module is used for training financial transaction data by adopting a convolutional neural network algorithm to extract effective characteristics so as to reduce the dimension of the financial transaction data;
and the default prediction module is used for constructing a data set according to the effective characteristics, and performing model training and model testing on the data set by adopting a random forest algorithm to perform default prediction.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the credit default identification method of any of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the credit default identification method of any of claims 1-7 when executed.
CN202210374880.3A 2022-04-11 2022-04-11 Credit default identification method, apparatus, device and medium Pending CN114722941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210374880.3A CN114722941A (en) 2022-04-11 2022-04-11 Credit default identification method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210374880.3A CN114722941A (en) 2022-04-11 2022-04-11 Credit default identification method, apparatus, device and medium

Publications (1)

Publication Number Publication Date
CN114722941A true CN114722941A (en) 2022-07-08

Family

ID=82243876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210374880.3A Pending CN114722941A (en) 2022-04-11 2022-04-11 Credit default identification method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN114722941A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035750A (en) * 2024-04-11 2024-05-14 杭银消费金融股份有限公司 PANEL DATA-based credit model sample construction method
CN118035750B (en) * 2024-04-11 2024-07-02 杭银消费金融股份有限公司 PANEL DATA-based credit model sample construction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination