CN115983982A

CN115983982A - Credit risk identification method, credit risk identification device, credit risk identification equipment and computer readable storage medium

Info

Publication number: CN115983982A
Application number: CN202310025119.3A
Authority: CN
Inventors: 黄茂湘; 壮青; 陈婷; 吴三平; 庄伟亮; 王永兴; 谭蕴琨; 要卓
Original assignee: WeBank Co Ltd
Current assignee: WeBank Co Ltd
Priority date: 2023-01-09
Filing date: 2023-01-09
Publication date: 2023-04-18

Abstract

The application provides a credit risk identification method, apparatus, device and computer-readable storage medium. The credit risk identification method comprises the following steps: acquiring historical behavior data of a user at different time nodes, and sorting and concatenating the historical behavior data of the time nodes in chronological order to generate a user behavior sequence; performing natural language processing on the user behavior sequence to obtain a first behavior sequence coding vector, and performing Embedding sparse data processing on the first behavior sequence coding vector to obtain a low-dimensional, dense second behavior sequence coding vector; inputting the second behavior sequence coding vector into a pre-constructed recurrent neural network model, and decoding to obtain behavior sequence characterization data; and identifying the credit risk of the user according to the behavior sequence characterization data. The method and apparatus can improve the accuracy of credit risk identification.

Description

Credit risk identification method, credit risk identification device, credit risk identification equipment and computer readable storage medium

Technical Field

The present application relates to the field of financial technology (Fintech), and in particular, to a credit risk identification method, apparatus, device, and computer-readable storage medium.

Background

In the financial credit industry, a lender often needs to assess whether a customer poses a risk of default or fraud and, on that basis, decide whether to extend credit to the customer.

At present, in a credit scenario, a borrower's behavior sequence covers information dimensions such as time-stamped account openings, loan inquiries and historical repayment records. The most common credit risk identification method extracts features from each information dimension over a fixed time window, processes every dimension into structured data in the same way, and then performs subsequent modeling. That is, the current method works on single-dimensional behavior time series: structured features are derived with a time window and an aggregation function and are then fed into decision tree or tree ensemble models, which handle the interaction between different features.

However, the current approach of processing the borrower's behavior sequence into structured features often faces the following problems:

First, information is lost. Common aggregation methods, such as the average, ratio and standard deviation, all aggregate the original information, and these operations inevitably discard part of the raw data.

Second, the data is highly sparse. When unstructured behavior sequences are processed into structured information, the resulting aggregated features have a high missing rate because the behavior sequences of different borrowers differ greatly.

Third, the interaction among behaviors and their order are ignored. For example, two customers each have 3 loans and 3 repayments in the last 6 months. One customer takes out the 3 loans first and then makes the 3 repayments, while the other needs to borrow again before each repayment; on this dimension the credit risk of the latter is clearly higher than that of the former. With the traditional feature derivation method, however, the difference in behavior between the two customers cannot be identified.
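The third problem can be illustrated with a minimal Python sketch. The event labels ("B" for borrowing, "R" for repayment), the two sequences, and the helper names are hypothetical and chosen only to mirror the example of the two customers; they are not taken from the application.

```python
from collections import Counter

# Hypothetical event labels: "B" = borrow, "R" = repay.
customer_a = ["B", "B", "B", "R", "R", "R"]  # borrows three times, then repays three times
customer_b = ["B", "R", "B", "R", "B", "R"]  # borrows again before each repayment


def window_aggregates(events):
    """Count-based features of the kind a fixed time window would produce."""
    counts = Counter(events)
    return {"n_borrow": counts["B"], "n_repay": counts["R"]}


def transition_counts(events):
    """Order-aware view: counts of consecutive event pairs."""
    return Counter(zip(events, events[1:]))


same_aggregates = window_aggregates(customer_a) == window_aggregates(customer_b)
same_transitions = transition_counts(customer_a) == transition_counts(customer_b)
```

Under these assumptions the count-based features of the two customers coincide, while the order-aware transition counts differ, which is the kind of information a sequence model can use.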

Therefore, how to overcome the above problems and improve the accuracy of credit risk identification has become a technical problem to be solved urgently in the field of financial credit.

Disclosure of Invention

The main purpose of the present application is to provide a credit risk identification method, apparatus, device and computer readable storage medium, aiming to improve the accuracy of credit risk identification.

To achieve the above object, the present application provides a credit risk identification method, including:

acquiring historical behavior data of a user at different time nodes, and sorting and concatenating the historical behavior data of the time nodes in chronological order to generate a user behavior sequence, wherein the historical behavior data is unstructured data representing historical credit behaviors, and the user behavior sequence comprises behavior features and timing information of the behavior features;

performing natural language processing on the user behavior sequence to obtain a first behavior sequence coding vector, and performing Embedding sparse data processing on the first behavior sequence coding vector to obtain a low-dimensional, dense second behavior sequence coding vector;

inputting the second behavior sequence coding vector into a pre-constructed recurrent neural network model, and decoding to obtain behavior sequence characterization data;

and identifying credit risks of the user according to the behavior sequence characterization data.

In some embodiments, the step of performing natural language processing on the user behavior sequence to obtain a first behavior sequence encoding vector includes:

mapping each behavior feature in the user behavior sequence into a behavior code according to a preset behavior action dictionary to obtain a behavior code sequence;

and carrying out One-hot coding on the behavior coding sequence to obtain a first behavior sequence coding vector.

In some embodiments, the step of identifying credit risk to the user according to the behavioral sequence characterization data comprises:

acquiring structured characterization data of the user, wherein the structured characterization data is structured data representing attribute features of the user;

fusing the structured characterization data and the behavior sequence characterization data through a preset classification algorithm to obtain fused characterization data, and jointly training a credit risk model;

inputting the fused characterization data into the trained credit risk model, predicting the default probability of the user, and identifying the credit risk of the user according to the default probability.

In some embodiments, prior to the step of inputting the fused characterization data to the trained credit risk model, the method further comprises:

acquiring behavior sequence sample data and structure sample data, and fusing the behavior sequence sample data and the structure sample data to obtain a training set and a verification set, wherein the behavior sequence sample data is sample data representing historical credit behaviors, and the structure sample data is sample data representing user attribute characteristics;

performing iterative training on the credit risk model based on the training set, and evaluating the effect of the credit risk model according to the verification set to obtain a judgment result;

if the judgment result does not meet the preset standard, continuing to carry out iterative training on the credit risk model;

and if the judgment result meets the preset standard, finishing the iterative training to obtain a trained credit risk model.

In some embodiments, the step of fusing the behavior sequence sample data and the structure sample data to obtain a training set and a verification set includes:

fusing the behavior sequence sample data and the structure sample data through a preset classification algorithm to obtain a fused sample set, wherein the fused sample set comprises a plurality of training samples and default labels associated with the training samples;

and dividing the fusion sample set into a training set and a verification set according to a preset proportion.

In some embodiments, the predetermined classification algorithm is a logistic regression algorithm.
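As a rough illustration of how logistic regression could turn fused characterization data into a default probability, the sketch below concatenates hypothetical structured features with a hypothetical behavior sequence characterization vector and scores the result. The concatenation step, the function names, the weights and the 0.5 threshold are all assumptions made for illustration; the application does not specify them.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fuse(structured_features, sequence_representation):
    """Assumed fusion: plain concatenation of the two feature groups."""
    return list(structured_features) + list(sequence_representation)


def default_probability(fused_features, weights, bias):
    """Logistic regression score: sigmoid of a weighted sum plus a bias."""
    z = bias + sum(w * x for w, x in zip(weights, fused_features))
    return sigmoid(z)


# Hypothetical inputs: two structured attributes and a 3-dimensional sequence vector.
fused = fuse([0.3, 1.0], [0.12, -0.40, 0.05])
weights = [0.5, -0.2, 1.1, 0.7, -0.3]  # illustrative values, not trained
p = default_probability(fused, weights, bias=-1.0)
is_high_risk = p >= 0.5  # the 0.5 threshold is also an assumption
```

In practice the weights and bias would be learned from the training set described above rather than written by hand.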

In some embodiments, before the step of inputting the second behavior sequence coding vector into a pre-constructed recurrent neural network model and decoding the second behavior sequence coding vector to obtain behavior sequence characterization data, the method further includes:

training the recurrent neural network model based on a long short-term memory (LSTM) network algorithm.
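To make the role of the LSTM concrete, the following is a minimal pure-Python sketch of the forward pass of a single LSTM layer over a short sequence. It uses randomly initialized weights and no training, and it takes the final hidden state as the sequence characterization; both choices, along with all function names and dimensions, are assumptions for illustration rather than the model described in the application.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim, hidden_dim, seed=0):
    """Random, untrained parameters: one weight matrix and bias per gate."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    # Gates: input (i), forget (f), output (o), candidate (g).
    return {g: (matrix(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim)
            for g in "ifog"}


def lstm_step(params, x, h, c):
    """One time step: update the cell state c and hidden state h."""
    z = list(x) + list(h)  # concatenate the input with the previous hidden state

    def gate(name, activation):
        weights, bias = params[name]
        return [activation(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(weights, bias)]

    i = gate("i", sigmoid)
    f = gate("f", sigmoid)
    o = gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def encode_sequence(params, sequence, hidden_dim):
    """Run the sequence through the LSTM and return the final hidden state."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return h


params = init_lstm(input_dim=3, hidden_dim=4)
sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # hypothetical 3-step input
representation = encode_sequence(params, sequence, hidden_dim=4)
```

Because the hidden and cell states are carried from step to step, presenting the same behaviors in a different order generally yields a different representation, which is how the model can retain order information.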

Further, the present application provides a credit risk identification apparatus, comprising:

a behavior sequence extraction module, configured to acquire historical behavior data of a user at different time nodes, and to sort and concatenate the historical behavior data of each time node in chronological order to generate a user behavior sequence, wherein the historical behavior data is unstructured data representing historical credit behaviors, and the user behavior sequence comprises behavior features and timing information of each behavior feature;

an unstructured data processing module, configured to perform natural language processing on the user behavior sequence to obtain a first behavior sequence coding vector, and to perform Embedding sparse data processing on the first behavior sequence coding vector to obtain a low-dimensional, dense second behavior sequence coding vector;

a behavior sequence characterization module, configured to input the second behavior sequence coding vector into a pre-constructed recurrent neural network model and decode to obtain behavior sequence characterization data;

and the credit risk identification module is used for identifying the credit risk of the user according to the behavior sequence characterization data.

Further, to achieve the above object, the present application also provides a credit risk identification device including: a memory, a processor, and a credit risk identification program stored on the memory and executable on the processor, the credit risk identification program when executed by the processor implementing the steps of the credit risk identification method as described above.

The present application further provides a computer readable storage medium having stored thereon a credit risk identification program, which when executed by a processor, implements the steps of the credit risk identification method as described above.

In the present application, historical behavior data of a user at different time nodes is acquired, and the historical behavior data of the time nodes is sorted and concatenated in chronological order to generate a user behavior sequence, wherein the historical behavior data is unstructured data representing historical credit behaviors, and the user behavior sequence comprises behavior features and timing information of the behavior features; natural language processing is performed on the user behavior sequence to obtain a first behavior sequence coding vector, and Embedding sparse data processing is performed on the first behavior sequence coding vector to obtain a low-dimensional, dense second behavior sequence coding vector; the second behavior sequence coding vector is input into a pre-constructed recurrent neural network model and decoded to obtain behavior sequence characterization data; and the credit risk of the user is identified according to the behavior sequence characterization data.

That is, the original behavior sequence is extracted and, working from the unstructured data, technologies such as natural language processing and sparse data processing convert it into the input of the recurrent neural network model. The model extracts a behavior sequence characterization of the key behavior timing and fully captures the order relationships among the user's past behaviors. As a result, the features representing the user's historical credit behavior are finer-grained and the information is expressed more accurately, so the user's credit default probability is predicted more accurately, the credit risk model achieves a better prediction effect, and the accuracy of credit risk identification is improved.

The current credit risk identification approach works on single-dimensional behavior time series: structured features are derived with a time window and an aggregation function and are fed into decision tree or tree ensemble models for the interaction of different features.

Compared with the prior art, the present application collects the borrower's original behavior data (that is, the historical behavior data serving as unstructured features) and converts it with technologies such as natural language processing and sparse data processing. This avoids aggregating the original behavior information in a conventional way (such as by average, ratio or standard deviation) and therefore avoids losing part of the raw data, so that the features representing the user's historical credit behavior are finer-grained, the information is expressed more accurately, and the user's credit default probability can be predicted more accurately. That is, the present application moves from the traditional practice of extracting structured features from the behavior sequence to modeling directly on unstructured data such as action text and action flows. Because more of the original data is used, and because natural language processing is combined with a specific model application scheme to handle unstructured data processing, sparse data conversion and deep sequence feature mining, more hidden and valuable information is captured and better credit default probability prediction performance is achieved.

In addition, the first behavior sequence coding vector, which is high-dimensional sparse data, is reduced in dimensionality by the Embedding sparse data processing technique to obtain a low-dimensional, dense second behavior sequence coding vector, which is then used as the input of the subsequent recurrent neural network model. This effectively reduces data sparsity, lowers the computational complexity of extracting the behavior sequence characterization from the coding vector, and further improves the accuracy of predicting the user's credit default probability.

Furthermore, the present application uses the sequence of the borrower's historical behaviors and builds a complete behavior sequence, taking the overall view of the customer rather than a single dimension and taking the order of different behaviors into account. The main purpose is to construct a credit default model without losing the original behavior sequence information, fully capturing the interaction among behaviors and their order, thereby achieving better credit default probability prediction performance and improving the accuracy of credit risk identification.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a flow chart illustrating the implementation of a first embodiment of the credit risk identification method of the present application;

FIG. 2 is a schematic diagram of behavior code mapping according to an embodiment of the present application;

FIG. 3 is a diagram illustrating one-hot encoding according to an embodiment of the present application;

FIG. 4 is a schematic diagram illustrating the operation of Embedding sparse data processing according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an LSTM model according to an embodiment of the present application;

FIG. 6 is a block diagram illustrating the logic processing of a hidden layer vector according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of the credit risk identification device in the hardware operating environment according to an embodiment of the present disclosure;

FIG. 8 is a functional block diagram of the credit risk identification device of the present application.

The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The main solution of the embodiments of the present invention is as follows: acquiring historical behavior data of a user at different time nodes, and sorting and concatenating the historical behavior data of the time nodes in chronological order to generate a user behavior sequence, wherein the historical behavior data is unstructured data representing historical credit behaviors, and the user behavior sequence comprises behavior features and timing information of the behavior features; performing natural language processing on the user behavior sequence to obtain a first behavior sequence coding vector, and performing Embedding sparse data processing on the first behavior sequence coding vector to obtain a low-dimensional, dense second behavior sequence coding vector; inputting the second behavior sequence coding vector into a pre-constructed recurrent neural network model, and decoding to obtain behavior sequence characterization data; and identifying the credit risk of the user according to the behavior sequence characterization data.

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments are explained below; these explanations apply throughout the embodiments.

Structured data: also called row data, structured data is data that is logically expressed and implemented with a two-dimensional table structure, strictly follows data format and length specifications, and is mainly stored and managed in a relational database.

Unstructured data: data whose structure is irregular or incomplete, that has no predefined data model, and that is inconvenient to represent with the two-dimensional logical tables of a database. It includes office documents in all formats, text, pictures, HTML, various reports, images, and audio/video information. In this patent, it mainly refers to text data.

Recurrent Neural Network (RNN): a class of recursive neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and whose nodes (recurrent units) are all connected in a chain.

Logistic regression: a generalized linear model for solving binary classification problems. It assumes that the dependent variable follows a Bernoulli distribution and uses the sigmoid function to map the linear predictor to a probability. It is a supervised machine learning method.

One-Hot encoding: also known as one-bit-effective encoding. It uses an N-bit state register to encode N states; each state has its own register bit, and only one bit is active at any time. One-hot encoding represents a categorical variable as a binary vector. The categorical values are first mapped to integer values; each integer value is then represented as a binary vector that is all zeros except at the index of the integer, which is set to 1.

Embedding: an object, such as a word, is represented by a low-dimensional vector. Embedding vectors have the property that objects whose vectors are close to each other have similar meanings.

word2vector (W2V): represents each word as a fixed-length vector such that the vectors express the similarity and analogy relationships between different words well.

Credit risk: the risk that a counterparty fails to fulfil a debt that has fallen due. Credit risk is also called default risk: the possibility that a lender, investor or counterparty suffers a loss because a borrower, securities issuer or counterparty is, for whatever reason, unwilling or unable to fulfil the terms of a contract.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the current credit risk identification approach works on single-dimensional behavior time series: structured features are derived with a time window and an aggregation function and are fed into decision tree or tree ensemble models for the interaction of different features. However, processing the borrower's behavior sequence into structured features in this way often faces the following problems:

Second, the data is highly sparse. When unstructured behavior sequences are processed into structured information, the resulting aggregated features have a high missing rate because the behavior sequences of different borrowers differ greatly. For example, if only 1% of borrowers have a credit card repayment record in the month before the loan application and the other borrowers have none, then a feature such as "normal credit card repayment in the month before the loan" has 99% missing values.

Based on the above, in order to overcome the above problems and improve the accuracy of credit risk identification, various embodiments of the credit risk identification method of the application are provided. Referring to fig. 1, fig. 1 is a flowchart illustrating implementation steps of a first embodiment of the credit risk identification method of the present application. In this embodiment, the credit risk identification method of the present application may include:

step S100, acquiring historical behavior data of different time nodes of a user, sequencing and splicing the historical behavior data of the time nodes according to time sequence to generate a user behavior sequence;

In this embodiment, the historical behavior data is unstructured data characterizing historical credit behavior, and the user behavior sequence includes behavior features and the timing information of each behavior feature. A behavior feature describes a financial credit interaction performed by the user; the types include, but are not limited to, account opening, borrowing and repayment.

In one embodiment, the historical behavior data may be data on all of the user's past financial credit interactions. In another embodiment, it may be data on the user's financial credit interactions over the past 24 months. In yet another embodiment, it may cover the past 36 months. This embodiment places no particular limitation on this. It is easy to understand that behavior features such as account opening, borrowing and repayment in the user's historical behavior data usually carry corresponding time node information. For example, suppose the user inquires about a credit card on February 1, 2021, opens a credit card account on March 5, 2021, inquires about a loan on March 15, 2021, makes a normal credit card repayment on April 30, 2021 (the more normal repayments, the lower the credit risk; the more overdue repayments, the higher the credit risk), and makes a normal consumer loan repayment on April 30, 2021. The behavior features and their time nodes are then: credit card inquiry, February 1, 2021; credit card account opening, March 5, 2021; loan inquiry, March 15, 2021; normal credit card repayment, April 30, 2021; normal consumer loan repayment, April 30, 2021. Sorting and concatenating the historical behavior data in chronological order then gives the user behavior sequence: inquire about credit card, open credit card account, inquire about loan, normal credit card repayment, normal consumer loan repayment.
Of course, the user behavior sequence may also carry time node identifiers, for example: 20210201 inquire about credit card, 20210305 open credit card account, 20210315 inquire about loan, 20210430 normal credit card repayment, 20210430 normal consumer loan repayment.
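Step S100 can be sketched in Python as follows. The record layout (a date paired with a behavior label), the labels themselves and the function name `build_behavior_sequence` are hypothetical; they only mirror the example dates given above.

```python
from datetime import date

# Hypothetical raw records in arbitrary order: (time node, behavior feature).
records = [
    (date(2021, 4, 30), "credit card normal repayment"),
    (date(2021, 2, 1), "inquire credit card"),
    (date(2021, 3, 15), "inquire loan"),
    (date(2021, 3, 5), "open credit card account"),
    (date(2021, 4, 30), "consumer loan normal repayment"),
]


def build_behavior_sequence(records, with_time=False):
    """Sort records chronologically and concatenate them into one sequence."""
    # sorted() is stable, so records sharing a date keep their input order.
    ordered = sorted(records, key=lambda record: record[0])
    if with_time:
        return [f"{day:%Y%m%d} {behavior}" for day, behavior in ordered]
    return [behavior for _, behavior in ordered]


sequence = build_behavior_sequence(records)
stamped_sequence = build_behavior_sequence(records, with_time=True)
```

Here `sequence` holds the behaviors in chronological order, and `stamped_sequence` additionally prefixes each behavior with its time node identifier in `YYYYMMDD` form.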

Step S200, performing natural language processing on the user behavior sequence to obtain a first behavior sequence coding vector, and performing Embedding sparse data processing on the first behavior sequence coding vector to obtain a low-dimensional, dense second behavior sequence coding vector;

further, in a possible embodiment, in step S200, the step of performing natural language processing on the behavior sequence of the user to obtain a first behavior sequence coding vector includes:

step A10, mapping each behavior feature in a user behavior sequence into a behavior code according to a preset behavior action dictionary to obtain a behavior code sequence;

In this embodiment, a behavior action dictionary is defined according to the various actions of financial behavior, and each behavior feature in the borrower's behavior sequence (i.e., the user behavior sequence) is mapped to a character code in the dictionary. For example, if "inquiry" maps to the code "q" and "credit card" maps to the code "03" in the behavior action dictionary, the behavior feature "inquire about credit card" maps to the behavior code "q-03". As another example, if "credit card" maps to the code "81" and "account opening" maps to the code "k", the behavior feature "open credit card account" maps to the behavior code "k-81". Likewise, if "credit card" maps to the code "81" and "normal repayment" maps to the code "N", the behavior feature "normal credit card repayment" maps to the behavior code "81-N". FIG. 2 is a schematic diagram of behavior code mapping according to an embodiment of the present application. In FIG. 2, the user behavior sequence is: inquire about credit card, open credit card account, inquire about loan, normal credit card repayment and normal consumer loan repayment. Mapping each behavior feature in this sequence to a behavior code according to the preset behavior action dictionary gives the behavior code sequence [q-03], [k-81], [q-02], [81-N, 91-N].
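A minimal Python sketch of step A10 follows. The behavior codes are the ones quoted in the example; the dictionary keys, the flat (ungrouped) output, the function name and the fallback code for unknown behaviors are assumptions for illustration.

```python
# Hypothetical behavior action dictionary using the codes from the example.
BEHAVIOR_DICTIONARY = {
    "inquire credit card": "q-03",
    "open credit card account": "k-81",
    "inquire loan": "q-02",
    "credit card normal repayment": "81-N",
    "consumer loan normal repayment": "91-N",
}


def encode_behaviors(sequence, dictionary, unknown="UNK"):
    """Map each behavior feature to its behavior code, in sequence order."""
    # Behaviors missing from the dictionary fall back to an assumed "UNK" code.
    return [dictionary.get(behavior, unknown) for behavior in sequence]


user_sequence = [
    "inquire credit card",
    "open credit card account",
    "inquire loan",
    "credit card normal repayment",
    "consumer loan normal repayment",
]
code_sequence = encode_behaviors(user_sequence, BEHAVIOR_DICTIONARY)
```

The sketch returns one flat list of codes; grouping behaviors that share a time node, as in the last bracket of the example, would be a small extension.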

And step A20, performing One-hot coding on the behavior coding sequence to obtain a first behavior sequence coding vector.

In this embodiment, a one-hot encoding is required for the behavior encoding sequence after splicing, where the one-hot encoding is a binary representation of the classification variable, and this first maps the classification value to an integer value. Then, each integer value table is represented as a binary vector in which the behavior feature is 1 and the rest are 0. To facilitate understanding, enumerating an example, such as a total of 120 dimensions for all actions in the action dictionary of a behavior, for the user behavior sequence a: the "inquiry-credit card", "open-operation credit" and "credit card-M1" are mapped into behavior codes, that is, the composed behavior sequence is coded and converted into a vector representation of 3 x 120 dimensions. As shown in fig. 3, a first behavior sequence encoding vector obtained by One-hot encoding of the user behavior sequence a is [ 1,0,0. ] in [ 0,1,0. ] in [ 0,0, 1. ] in.

In this way, each behavior feature in the user behavior sequence is mapped to a behavior code according to the preset behavior action dictionary to obtain a behavior code sequence, and the behavior code sequence is one-hot encoded to obtain the first behavior sequence coding vector, so that the user behavior sequence (unstructured data) is effectively handled by natural language processing.

In this embodiment, after natural language processing of the user behavior sequence yields the first behavior sequence coding vector, Embedding sparse data processing is performed on the first behavior sequence coding vector to obtain a low-dimensional, dense second behavior sequence coding vector. Specifically, natural language processing converts the user behavior sequence, which is unstructured data, into the first behavior sequence coding vector, which is vector data; however, the dimensionality of the first behavior sequence coding vector is so high that the data is sparse. This embodiment therefore applies an Embedding sparse data processing technique to reduce the dimensionality of the first behavior sequence coding vector, which is high-dimensional sparse data, and obtain the low-dimensional, dense second behavior sequence coding vector. Fig. 4 is a schematic diagram of the Embedding sparse data processing operation according to an embodiment of the present application. The second behavior sequence coding vector is then used as the input of the subsequent recurrent neural network model, which reduces the computational complexity of extracting the behavior sequence representation from the behavior sequence coding vector and improves the accuracy of credit risk identification.
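The dimensionality reduction can be pictured as multiplying each one-hot row by an embedding table, which amounts to selecting one table row. The sketch below assumes a 120-entry dictionary and a target dimension of 8; the dimension, the random initialisation and all function names are illustrative assumptions, and a real table would be learned during training.

```python
import random


def build_embedding_table(vocabulary_size, embedding_dim, seed=0):
    """Create a randomly initialised table; in practice it would be learned."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
            for _ in range(vocabulary_size)]


def embed_one_hot(one_hot_rows, table):
    """Multiply each one-hot row by the table (row @ table)."""
    dim = len(table[0])
    return [[sum(row[v] * table[v][d] for v in range(len(table))) for d in range(dim)]
            for row in one_hot_rows]


def embed_indices(indices, table):
    """Equivalent shortcut: a one-hot product simply selects one table row."""
    return [list(table[i]) for i in indices]


VOCABULARY_SIZE, EMBEDDING_DIM = 120, 8   # 8 is an assumed target dimension
table = build_embedding_table(VOCABULARY_SIZE, EMBEDDING_DIM)
indices = [0, 1, 2]
one_hot_rows = [[1 if v == i else 0 for v in range(VOCABULARY_SIZE)] for i in indices]
second_vector = embed_one_hot(one_hot_rows, table)
```

Each behavior is now described by 8 dense values instead of 120 mostly zero entries, which is why the lookup form `embed_indices` is normally used in practice.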

After step S200, step S300 is executed: inputting the second behavior sequence coding vector into a pre-constructed recurrent neural network model, and decoding to obtain behavior sequence characterization data;

In this embodiment, the recurrent neural network model may be constructed using LSTM (Long Short-Term Memory), which may be replaced by a GRU (Gated Recurrent Unit), a bidirectional recurrent neural network, or another conventional recurrent neural network method. This embodiment places no particular limitation on this.

In the embodiment, the second behavior sequence coding vector is input into a pre-constructed recurrent neural network model, and behavior sequence representation data is obtained by decoding, so that behavior sequence representation of historical credit interaction behavior is extracted based on a deep learning technology.

In a possible embodiment, before the step of inputting the second behavior sequence coding vector into the pre-constructed recurrent neural network model and decoding to obtain the behavior sequence characterization data, the method further includes:

Step B10, training based on a Long Short-Term Memory (LSTM) network algorithm to obtain the recurrent neural network model.

In this embodiment, the recurrent neural network model obtained based on the long-short term memory network LSTM algorithm training is a long-short term memory network model, and key information extraction of the second behavior sequence coding vector is mainly realized through three stages:

(1) Forgetting stage: at time t of the user behavior sequence, the LSTM model selectively forgets unimportant information passed in from the previous moment;

(2) Memory stage: the LSTM model selectively memorizes important behavior information input at time t;

The LSTM then adds the results of these two stages and passes the sum on to time t+1;

(3) Output stage: the LSTM model outputs the result at the current time t.

By applying the three stages, the LSTM model memorizes all historical important information from the initial time to the final time of the user behavior sequence and outputs the historical important information, so that the behavior sequence representation data is obtained by decoding.
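The three stages can be sketched as a single LSTM time step in plain Python. This is a textbook LSTM cell with constant toy weights, not the trained model of the application; the parameter names (`W_f`, `b_f`, and so on), the hidden size and the input values are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vector):
    """Compute weights @ vector + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forget, memorise, then output."""
    z = x_t + h_prev  # concatenated input [x_t, h_{t-1}]
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]    # forgetting stage
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]    # memory stage
    g_t = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]     # sum of both stages
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]    # output stage
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def run_lstm(sequence, hidden_size, params):
    """Feed the whole sequence through the cell and collect every hidden vector."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    hidden_states = []
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
        hidden_states.append(h)
    return hidden_states


def make_params(input_size, hidden_size, value=0.1):
    """Constant toy weights; a trained model would learn these."""
    shape = [[value] * (input_size + hidden_size) for _ in range(hidden_size)]
    return {name: ([row[:] for row in shape] if name.startswith("W") else [0.0] * hidden_size)
            for name in ("W_f", "b_f", "W_i", "b_i", "W_c", "b_c", "W_o", "b_o")}


params = make_params(input_size=2, hidden_size=3)
hidden_states = run_lstm([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], hidden_size=3, params=params)
```

`run_lstm` returns one hidden vector per time step, so the memory cell carries information from the first behavior through to the last.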

In this embodiment, because lending behavior is influenced by past behavior, the LSTM model used in this application (i.e., the recurrent neural network model trained with the long short-term memory network LSTM algorithm) is a classical recurrent neural network model. It alleviates the vanishing gradient problem, has long-term memory, and can capture the influence of past behavior on current behavior and thereby predict the borrower's future default probability. The LSTM neural network is defined as follows:
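The formulas themselves are not reproduced in this text. As a reference sketch, the standard LSTM formulation that matches the symbol definitions in the following paragraph can be written as below; the split weight names (W_{xi}, W_{hi}, and so on) are assumed notation, since the text only names W and b generically.

```latex
\begin{aligned}
i_t &= \sigma_i\left(W_{xi}\,x_t + W_{hi}\,h_{t-1} + b_i\right) \\
f_t &= \sigma_f\left(W_{xf}\,x_t + W_{hf}\,h_{t-1} + b_f\right) \\
o_t &= \sigma_o\left(W_{xo}\,x_t + W_{ho}\,h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \sigma_c\left(W_{xc}\,x_t + W_{hc}\,h_{t-1} + b_c\right) \\
h_t &= o_t \odot \sigma_h\left(c_t\right)
\end{aligned}
```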

where i_t, f_t and o_t denote the input gate, forget gate and output gate at time t, respectively; x_t and h_t denote the input value and the hidden output vector; c_t is the memory cell vector; ⊙ denotes the Hadamard product; W denotes a connection weight; b denotes the corresponding bias; σ denotes an activation function, where the subscripts i, f and o indicate sigmoid activation functions and the subscripts c and h indicate tanh activation functions. During training of the LSTM model, the back-propagated error is computed at each iteration and the weights are updated accordingly. The structure of the LSTM model is shown in fig. 5.

In this embodiment, based on the second behavior sequence coding vector, an LSTM model is trained, and after iterative convergence, a final LSTM model and a prediction result of the LSTM model for each borrower are obtained (the prediction result is behavior sequence characterization data used for predicting the credit default probability).

Specifically, based on the constructed LSTM recurrent neural network model, the following ideas can be adopted to extract the structural representation:

(1) A predicted value of whether each borrower has default or not;

(2) The mean of the hidden layer vectors for each borrower LSTM model (as shown in FIG. 6);

(3) The last hidden output vector of each borrower LSTM model.
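Options (2) and (3) above can be sketched as simple reductions over the per-time-step hidden vectors of one borrower. The hidden vectors below are made-up values and the function names are illustrative assumptions.

```python
def mean_hidden_vector(hidden_states):
    """Average the hidden output vectors over all time steps (option 2)."""
    steps = len(hidden_states)
    return [sum(column) / steps for column in zip(*hidden_states)]


def last_hidden_vector(hidden_states):
    """Take the hidden output vector of the final time step (option 3)."""
    return list(hidden_states[-1])


# Hypothetical hidden vectors for one borrower (3 time steps, hidden size 2).
hidden_states = [[0.0, 0.5], [0.5, 0.5], [1.0, -1.0]]
mean_representation = mean_hidden_vector(hidden_states)   # [0.5, 0.0]
last_representation = last_hidden_vector(hidden_states)   # [1.0, -1.0]
```

The mean summarises the whole history, whereas the last vector weights the most recent behaviors more heavily; which works better is an empirical question.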

After step S300, step S400 is executed to identify the credit risk of the user according to the behavior sequence characterization data.

Illustratively, the step of identifying credit risk to the user from the behavioral sequence characterization data comprises:

Step C10, inputting the behavior sequence representation data into the trained credit risk prediction model, predicting the default probability of the user, and identifying the credit risk of the user according to the default probability.

That is, in this embodiment of the application, the original behavior sequence is extracted and, being unstructured data, is converted by technologies such as natural language processing and sparse data processing and then used as the input of the recurrent neural network model. The model extracts a behavior sequence representation of the key behavior time series and fully captures the sequential relationships among the user's past behaviors. As a result, the features representing the user's historical credit behavior have finer granularity and express the information more accurately, so the user's credit default probability is predicted more accurately, the credit risk model achieves a better prediction effect, and the accuracy of credit risk identification is improved.

Compared with the prior art, this embodiment collects the borrower's original behavior data (i.e., the historical behavior data serving as unstructured features) and converts it with technologies such as natural language processing and sparse data processing. This avoids aggregating the original behavior information in a conventional way (such as an average, ratio or standard deviation), which would lose part of the most original data information. The features representing the user's historical credit behavior therefore have finer granularity and express the information more accurately, and the user's credit default probability can be predicted more accurately. In other words, this embodiment moves from the traditional approach of extracting structured features from the behavior sequence to modeling directly on unstructured data such as action text and action logs. Because more original data is used, natural language processing technology must be combined with a specific model application scheme to handle unstructured data processing, sparse data conversion and deep sequence feature mining, so that more hidden and valuable information is captured and better credit default probability prediction performance is achieved.

In addition, in the embodiment, dimension reduction processing is performed on the first behavior sequence coding vector serving as high-dimensional sparse data through an Embedding sparse data processing technology to obtain a second behavior sequence coding vector with low and dense dimensions, and then the second behavior sequence coding vector is used as input of a subsequent recurrent neural network model, so that the data sparsity degree is effectively reduced, the operation complexity of the recurrent neural network model for extracting behavior sequence representation based on the behavior sequence coding vector is further reduced, and the accuracy of predicting the credit default probability of the user is further improved.

Furthermore, this embodiment uses the action sequence of the borrower's historical behaviors to create a complete behavior sequence. It takes the perspective of the customer as a whole rather than a single dimension and considers the order among different behaviors. The main purpose is to construct a credit default model without losing the original behavior sequence information and to fully capture the interactions among behaviors and their order, so as to achieve better credit default probability prediction performance and thereby improve the accuracy of credit risk identification.

Further, based on the above-described first embodiment of the credit risk identification method of the present application, a second embodiment of the credit risk identification method of the present application is proposed.

In the second embodiment of the credit risk identification method of the application, in the step S400, the step of identifying the credit risk of the user according to the behavior sequence characterization data includes:

Step C10, acquiring structure representation data of a user;

In this embodiment, the structure representation data is structured data representing user attribute features, such as basic age and gender information, the credit account status at the current time point, and the like.

Step C20, fusing the structural representation data and the behavior sequence representation data through a preset classification algorithm to obtain fused representation data, and training a credit risk model together;

Step C30, inputting the fusion characterization data into the trained credit risk model, predicting the default probability of the user, and identifying the credit risk of the user according to the default probability.

For example, the predetermined classification algorithm may be a logistic regression algorithm. Of course, other classification algorithms may be substituted, such as the NBC (Naive Bayesian Classification) algorithm, the ID3 (Iterative Dichotomiser 3) decision tree algorithm, the C4.5 decision tree algorithm, the C5.0 decision tree algorithm, the SVM (Support Vector Machine) algorithm, the KNN (K-Nearest Neighbor) algorithm, the ANN (Artificial Neural Network) algorithm, etc., which is not particularly limited in this embodiment.

In this embodiment, the structure representation data of the user is acquired and fused with the behavior sequence representation data through a preset classification algorithm to obtain fused representation data, on which the credit risk model is jointly trained. The fused representation data is then input into the trained credit risk model to predict the user's default probability, and the user's credit risk is identified according to that default probability. Thus, after the key behavior sequence information (i.e., the behavior sequence representation data) is extracted, it is modeled jointly with the borrower's non-time-series structured data (i.e., the structure representation data), which is expected to achieve a better model effect.
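One plausible reading of the fusion step, sketched below, is to concatenate the structured attributes with the behavior sequence representation and score the result with logistic regression. Concatenation as the fusion operator, the feature values, the weights and the 0.5 decision threshold are all assumptions made for illustration.

```python
import math


def fuse(structure_features, sequence_representation):
    """Concatenate structured attributes with the behavior sequence representation."""
    return list(structure_features) + list(sequence_representation)


def default_probability(fused_features, weights, bias):
    """Logistic regression score: sigmoid(w . x + b)."""
    score = sum(w * x for w, x in zip(weights, fused_features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


# All numbers below are made-up illustrations, not values from the application.
structure_features = [0.35, 1.0]        # e.g. scaled age, gender flag
sequence_representation = [0.5, 0.0]    # e.g. mean hidden vector from the RNN
fused = fuse(structure_features, sequence_representation)
probability = default_probability(fused, weights=[0.8, -0.2, 1.5, 0.3], bias=-1.0)
is_high_risk = probability >= 0.5       # assumed decision threshold
```

With these toy weights the linear score is negative, so the predicted default probability falls below 0.5 and the user is not flagged.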

Further, in a possible embodiment, before the step C30 of inputting the fused characterization data into the trained credit risk model and predicting the default probability of the user, the method further includes:

Step D10, acquiring behavior sequence sample data and structure sample data, and fusing the behavior sequence sample data and the structure sample data to obtain a training set and a verification set, wherein the behavior sequence sample data is sample data representing historical credit behaviors, and the structure sample data is sample data representing user attribute characteristics;

In this embodiment, the behavior sequence sample data and the structure sample data may be acquired through the institution's own platform or a third-party institution's platform.

Step D20, carrying out iterative training on the credit risk model based on the training set, and evaluating the effect of the credit risk model according to the verification set to obtain a judgment result;

Step D30, if the judgment result does not meet the preset standard, continuing to carry out iterative training on the credit risk model;

Step D40, if the judgment result meets the preset standard, finishing the iterative training to obtain the trained credit risk model.
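Steps D20 to D40 amount to a train, evaluate, stop-or-continue loop. The sketch below uses logistic regression trained by stochastic gradient descent and verification-set accuracy as the judgment result; the choice of accuracy as the metric, the 0.9 standard, the learning rate, the round limit and the toy samples are all assumptions.

```python
import math


def predict(weights, bias, features):
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def accuracy(weights, bias, samples):
    """Share of samples whose thresholded prediction matches the default label."""
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in samples)
    return hits / len(samples)


def train_until_standard_met(training_set, verification_set, standard=0.9,
                             learning_rate=0.5, max_rounds=1000):
    """Iterate, evaluate on the verification set, and stop once the standard is met."""
    weights, bias = [0.0] * len(training_set[0][0]), 0.0
    score = 0.0
    for round_number in range(1, max_rounds + 1):
        for features, label in training_set:                 # D20: one training pass
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
        score = accuracy(weights, bias, verification_set)     # D20: evaluate the effect
        if score >= standard:                                 # D40: standard met, stop
            return weights, bias, score, round_number
        # D30: standard not met, continue the iterative training
    return weights, bias, score, max_rounds


# Toy fused samples (features, default label); values are illustrative only.
training_set = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]
verification_set = [([0.1, 0.1], 0), ([0.9, 0.9], 1)]
weights, bias, score, rounds = train_until_standard_met(training_set, verification_set)
```

The `max_rounds` cap is a safeguard added for the sketch so that the loop terminates even if the preset standard is never reached.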

In the embodiment, the behavior sequence sample data and the structure sample data are obtained and fused to obtain a training set and a verification set, wherein the behavior sequence sample data is sample data representing historical credit behaviors, the structure sample data is sample data representing user attribute characteristics, iterative training is performed on the credit risk model based on the training set, and the effect of the credit risk model is evaluated according to the verification set, so that the quality of the credit risk model is effectively ensured.

Further, in a possible embodiment, in step D10 above, the step of fusing the behavior sequence sample data and the structure sample data to obtain the training set and the verification set includes:

Step E10, fusing the behavior sequence sample data and the structure sample data through a preset classification algorithm to obtain a fusion sample set, wherein the fusion sample set comprises a plurality of training samples and default labels associated with the characteristics of the training samples;

Specifically, the fusion sample set consists of training samples with default labels, where a default label records whether the sample is known to have defaulted. For example, if training sample A defaulted, the default label associated with training sample A has the value 1; if training sample A did not default, the default label has the value 0.

Step E20, dividing the fusion sample set into a training set and a verification set according to a preset proportion.

In this embodiment, the fusion sample set is divided into a training set and a verification set according to a preset ratio (for example, a ratio of 7.
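A minimal sketch of the split follows, using labelled samples in which 1 marks a default and 0 marks no default. The 8:2 ratio in the usage line is an arbitrary illustrative value rather than one taken from the application, and the function name, the shuffling and the fixed seed are likewise assumptions.

```python
import random


def split_fused_samples(fused_samples, train_parts, verification_parts, seed=0):
    """Shuffle labelled samples and split them train_parts : verification_parts."""
    if train_parts <= 0 or verification_parts <= 0:
        raise ValueError("both parts of the ratio must be positive")
    shuffled = list(fused_samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + verification_parts)
    return shuffled[:cut], shuffled[cut:]


# Each sample pairs fused features with a default label (1 = default, 0 = no default).
fused_samples = [([i / 10.0, 1.0 - i / 10.0], i % 2) for i in range(10)]
training_set, verification_set = split_fused_samples(fused_samples, 8, 2)
```

Integer ratio parts keep the cut point exact; with strongly imbalanced default labels a stratified split would be a reasonable refinement.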

Illustratively, the predetermined classification algorithm is a logistic regression algorithm. Of course, other classification algorithms may be substituted, such as the NBC (Naive Bayesian Classification) algorithm, the ID3 (Iterative Dichotomiser 3) decision tree algorithm, the C4.5 decision tree algorithm, the C5.0 decision tree algorithm, the SVM (Support Vector Machine) algorithm, the KNN (K-Nearest Neighbor) algorithm, etc.

According to the method and the device, the original behavior sequence is extracted, the unstructured data are converted by applying technologies such as natural language processing and sparse data processing on the basis of the unstructured data and serve as the input of the recurrent neural network model, the behavior sequence representation of the key behavior time sequence is extracted, the behavior sequence relation information of the user in the past time is fully captured, the feature granularity of the historical credit behavior of the user is finer, the information expression is more accurate, the credit default probability of the user is more accurately predicted, the better credit risk model prediction effect is achieved, and the accuracy of credit risk identification is further improved.

Referring to fig. 7, fig. 7 is a schematic structural diagram of the credit risk identification device in the hardware operating environment according to an embodiment of the present application.

As shown in fig. 7, the credit risk identification device may include: a processor 1001, such as a CPU, a memory 1005, and a communication bus 1002. The communication bus 1002 is used for realizing connection communication between the processor 1001 and the memory 1005. The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a memory device separate from the processor 1001 described above.

Optionally, the credit risk identification device may also include a rectangular user interface, a network interface, a camera, RF (Radio Frequency) circuitry, sensors, audio circuitry, a WiFi module, and the like. The rectangular user interface may comprise a display screen (Display) and an input sub-module such as a keyboard (Keyboard), and may optionally also comprise a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a WiFi interface).

Those skilled in the art will appreciate that the credit risk identification device configuration shown in FIG. 7 does not constitute a limitation of the credit risk identification device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

As shown in fig. 7, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, and a credit risk identification program. The operating system is a program that manages and controls the hardware and software resources of the credit risk identification device and supports the running of the credit risk identification program and other software and/or programs. The network communication module is used to implement communication among the components within the memory 1005 and with other hardware and software in the credit risk identification device.

In the credit risk identification apparatus shown in fig. 7, the processor 1001 is configured to execute the credit risk identification program stored in the memory 1005, and to execute the steps of:

and identifying the credit risk of the user according to the behavior sequence characterization data.

In some possible embodiments, the processor 1001 is further configured to execute a credit risk identification program stored in the memory 1005, and perform the following steps:

acquiring structure representation data of a user, wherein the structure representation data are structural data representing attribute characteristics of the user;

and inputting the fusion characterization data into the trained credit risk model, predicting to obtain default probability of the user, and identifying the credit risk of the user according to the default probability.

In some possible embodiments, the processor 1001 is further configured to execute a credit risk identification program stored in the memory 1005, and further performs the steps of:

and if the judgment result meets the preset standard, finishing the iterative training to obtain the trained credit risk model.

and training based on the long-short term memory network LSTM algorithm to obtain a recurrent neural network model.

The specific implementation of the credit risk identification device of the application is basically the same as that of the above embodiments of the credit risk identification method, and is not described herein again.

In addition, referring to fig. 8, fig. 8 is a schematic diagram of functional modules of the credit risk identification apparatus according to the present application, and the present application further provides a credit risk identification apparatus, including:

the behavior sequence extraction module 10 is configured to acquire historical behavior data of different time nodes of a user, sort and splice the historical behavior data of each time node according to a time sequence, and generate a user behavior sequence, where the historical behavior data is unstructured data representing historical credit behaviors, and the user behavior sequence includes behavior features and time sequence information of each behavior feature;

the unstructured data processing module 20 is configured to perform natural language processing on the user behavior sequence to obtain a first behavior sequence coding vector, and perform Embedding sparse data processing on the first behavior sequence coding vector to obtain a second behavior sequence coding vector with low dimension and density;

the behavior sequence characterization module 30 is configured to input the second behavior sequence coding vector to a pre-constructed recurrent neural network model, and decode the second behavior sequence coding vector to obtain behavior sequence characterization data;

and the credit risk identification module 40 is used for identifying the credit risk of the user according to the behavior sequence characterization data.

Optionally, the unstructured data processing module 20 is further configured to:

Optionally, the credit risk identification module 40 is further configured to:

Optionally, the credit risk identification apparatus further comprises a training module (not shown) for:

Optionally, the training module is further configured to:

The specific implementation of the credit risk identification device is basically the same as that of the above credit risk identification method, and is not described herein again.

Furthermore, the present application also proposes a computer-readable storage medium having stored thereon a program for credit risk identification, which when executed by a processor implements the steps of the method for credit risk identification of the present application as described above.

The specific embodiment of the computer storage medium of the present application is substantially the same as the embodiments of the credit risk identification method described above, and is not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method of the embodiments of the present application.

The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims

1. A credit risk identification method, characterized in that it comprises:

2. The credit risk identification method of claim 1 wherein the step of natural language processing the sequence of user actions to obtain a first sequence of actions encoding vector comprises:

3. The credit risk identification method of claim 1 wherein the step of identifying the credit risk of the user according to the behavioral sequence characterization data comprises:

4. The credit risk identification method of claim 3 wherein prior to the step of inputting the fused characterization data to the trained credit risk model, the method further comprises:

5. The credit risk identification method of claim 4, wherein fusing the behavior sequence sample data and the structure sample data to obtain a training set and a validation set comprises:

6. The credit risk identification method of any one of claims 3 to 5, wherein the pre-set classification algorithm is a logistic regression algorithm.

7. The credit risk identification method of claim 1 wherein prior to the step of inputting the second behavior sequence encoding vector into a pre-constructed recurrent neural network model, decoding to obtain behavior sequence characterization data, the method further comprises:

8. A credit risk identification arrangement, the credit risk identification arrangement comprising:

9. A credit risk identification device, characterized in that it comprises: memory, a processor and a credit risk identification program stored on the memory and executable on the processor, the credit risk identification program when executed by the processor implementing the steps of the credit risk identification method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a credit risk identification program which, when executed by a processor, implements the steps of the credit risk identification method according to any one of claims 1 to 7.