CN116050539A - Machine learning classification model interpretation method based on greedy algorithm - Google Patents


Info

Publication number
CN116050539A
CN202211687370.8A (application number) · CN116050539A (publication number)
Authority
CN
China
Prior art keywords
box
feature
characteristic
machine learning
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211687370.8A
Other languages
Chinese (zh)
Inventor
徐圣源
周婷婷
焦旭
梁变
胡汉一
刘智
许浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202211687370.8A priority Critical patent/CN116050539A/en
Publication of CN116050539A publication Critical patent/CN116050539A/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A machine learning classification model interpretation method based on a greedy algorithm: feature boxes (bins) are combined by a greedy algorithm, with each screened box taken as a starting point to obtain feature-box combinations. By matching a customer's features against these box combinations, the statistical characteristics of target customers can be understood more comprehensively and the machine learning result can be explained. The method uses a greedy strategy to screen and combine features, gives enterprises a more intuitive way to screen customers, and in practical applications mainly serves first-line business personnel by helping to explain the predictions of black-box models.

Description

Machine learning classification model interpretation method based on greedy algorithm
Technical Field
The invention belongs to the field of artificial intelligence and data processing, and particularly relates to a machine learning classification model interpretation method based on a greedy algorithm.
Background
Machine learning, one of the most critical technologies of artificial intelligence, is often regarded by the outside world as the foundation of AI applications. Algorithm engineers have brought algorithmic capabilities directly into fields with a solid digital foundation, such as finance, industry, medicine and the Internet, providing enterprises with services such as intelligent risk control, predictive maintenance, drug discovery and personalized recommendation. However, this overlooks the fact that many industries need a relatively transparent algorithmic process in order to interpret results accurately. For machine learning models in particular, even when a model's accuracy approaches 100%, it remains unclear whether the model classifies for reasonable causes; one can only choose to believe the predicted result. Under this premise, this patent describes a machine learning classification model interpretation method based on a greedy algorithm, so as to provide algorithm interpretation suitable for common first-line scenarios.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a machine learning classification model interpretation method based on a greedy algorithm, so that enterprise business personnel can give a more convincing interpretation of results when facing a target-customer prediction. The method aims to help business personnel audit rapidly, improve performance, and raise service quality and customer satisfaction.
The aim of the invention is achieved by the following technical scheme:
a machine learning classification model interpretation method based on greedy algorithm,
step one: acquiring client history feature data, behavior data and classification labels in a training set;
step two: cleaning the client history feature data and the behavior data, and then retaining the feature with highest information value in the features with correlation reaching a threshold value;
step three: carrying out box division processing on the characteristics processed in the second step to form a characteristic table which is finally input into a machine learning model;
step four: training a machine learning classification model which is input as structured data by utilizing the feature table and the classification label to obtain feature importance ranking of the machine learning classification model;
step five: calculating the information value of each box body and the number of target clients of the box body, and constructing box body indexes of each box body by combining the feature importance sequencing; reserving a box with the box index larger than the box screening threshold as a characteristic box;
step six: using each characteristic box as a starting point, and finding a series of characteristic box combinations aiming at target clients through a greedy algorithm to serve as a characteristic box combination dictionary;
step seven: the new customer characteristic data and behavior data are subjected to preprocessing in the second step and box division in the third step, and then are input into a trained machine learning classification model to obtain a new customer prediction label; if the client is judged to be the target client according to the prediction label of the new client, the characteristics of the new client are matched with the characteristic box combination dictionary, and the hit characteristic box combination is output to be used as the description of the machine learning classification model.
Further, cleaning the client history characteristic data and the behavior data specifically comprises dirty data cleaning, missing value processing and repeated value deleting.
Further, in the second step, after the client history feature data and the behavior data are cleaned, feature encoding is required for the non-digital category features.
Further, in step three, each class of a categorical feature forms one box; date features are binned by year or month; and numerical features are binned with the minimum-conditional-entropy binning method;
the method for carrying out the box division treatment by adopting the minimum conditional entropy box division method specifically comprises the following steps:
(1) Taking each value of the feature in turn as a candidate dividing point, split the feature into two parts and compute the weighted sum of the conditional entropies of the two boxes:

H(Y) = sum_i (N_i / N) * sum_j (-p_j * log2 p_j)

where H(Y) is the weighted sum of the conditional entropies, N_i / N is the sample proportion of box i, and p_j is the proportion of target customers (or non-target customers) in the single box x_i;
(2) Selecting a minimum value of the conditional entropy weighted sum as a dividing point, and dividing one box into two boxes;
(3) And (3) selecting a box body with the maximum conditional entropy in the characteristics, repeating the steps (1) - (2), and continuously dividing the box body until the number of the boxes of the characteristics reaches the set upper limit of the number of the boxes of the single characteristics.
Further, the box index of each box in step five is calculated with the following formula:

S = p_iv * (IV_i / IV_max) + p_target * (Target_i / Target_max) + p_fi * (FI_i / FI_max)

where S is the box index; p_iv, p_target and p_fi are the weights applied to the information-value ratio, the target-customer-count ratio and the feature-importance ratio, respectively; IV_i is the information value of the i-th box and IV_max the maximum information value over all boxes; Target_i is the number of target customers in the i-th box and Target_max the maximum target-customer count over all boxes; FI_i is the feature importance, derived from the machine learning model, of the feature to which the i-th box belongs, and FI_max is the largest feature importance derived from the machine learning model.
In the sixth step, a series of feature box combinations aiming at the target clients are found by using each feature box as a starting point through a greedy algorithm, and the method specifically comprises the following steps:
(1) Taking a single characteristic box body as a starting point, combining the characteristic box body with other characteristic box bodies one by one, and screening out a characteristic two-box body combination with the maximum probability of a target customer;
(2) Taking the two box combinations as a starting point, combining the two box combinations with other characteristic boxes one by one, screening out a characteristic three box combination with the maximum probability of target clients, and setting the upper limit of the number of the boxes in the characteristic box division combination to stop circulation;
(3) And (3) replacing the initial characteristic box body, repeating the steps (1) - (2), and reserving the combination of all the characteristic box bodies to form a characteristic box body combination dictionary.
Further, if the classification labels of the clients in the training set have more than two categories, the classification labels are simplified to be two categories, wherein 1 is the label of the target client and 0 is the label of the non-target client.
Further, in the second step, the pearson correlation analysis is used to calculate the correlation of the features.
Further, one-hot encoding is used to encode the non-numeric categorical features.
Further, since one-hot encoding increases the feature dimension, after the non-numeric categorical features are one-hot encoded, principal component analysis is applied, guided by expert experience, to reduce the feature dimension.
The beneficial effects of the invention are as follows:
the method adopts a minimum condition entropy box-sorting method to categorize all the features, then uses the information value and the target quantity of the features as screening indexes to screen the feature boxes with high target probability according to the feature importance given by a machine learning classification model, and then uses a greedy algorithm to combine the feature boxes. The technical means are to summarize the data set characteristics to the greatest extent and reduce the characteristic dimension so as to carry out statistical interpretation on the results of the machine learning classification model. The interpretation method is close to the result of the classification model, does not need professional knowledge to interpret, widens the use scene of the machine learning classification model, accelerates the application deployment of the model, reduces the understanding threshold of first-line business personnel on the model, helps general business personnel to manage clients, gives more real result interpretation, and reduces the problem of client complaints.
Secondly, the invention is applicable to all machine learning classification models. On the market, the feature contributions of some machine learning models are commonly explained by training a second, surrogate model, which increases the interpretation difficulty, and the results of the two models cannot completely fit each other. The present method does not touch the mathematics inside the machine learning classification model; it analyzes only the initial data set to find an interpretation that fits the result, which reduces the difficulty of interpreting machine learning classification models and gives the method a wider application range.
Drawings
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 is a machine learning classification model interpretation method based on greedy algorithm according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of the present application, a first message may also be referred to as a second message, and similarly, a second message may also be referred to as a first message. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
The embodiment aims to assist in explaining the complex machine learning model result through a greedy algorithm, and realize good application of real business, so that the risk of loan clients is estimated. The risk monitoring of the bank loan can be divided into three stages of pre-loan, mid-loan and post-loan, and the method can be implemented in the three stages, wherein the main difference is the range of client data, and provides a risk assessment basis for the bank.
The specific content of the method in this embodiment is taken as an example in the post-loan stage, and the method includes the following steps:
step one: and acquiring the historical characteristic data, the behavior data and the classification labels of the clients in the training set.
Bank data is characterized by its large volume; under a traditional management architecture the information and data of individual departments do not circulate, collaboration is inefficient, data quality is uneven, and data organization is costly, slow and of delayed practical effect. Therefore, multiple departments are required to assist in organizing the data framework and in extracting and arranging the data. Customer feature data, behavior data and five-level classification labels are acquired, where the customer feature data includes customer personal information, loan data and credit-investigation data.
The customer behavior data includes customer loan behavior, deposit behavior and transfer behavior. The five-level classification label is the five-level classification of loan quality made by a commercial bank according to the borrower's actual repayment ability: normal, concern, secondary, suspicious and loss; loans of secondary level and below are bad loans. A data set D is formed:

D = {(x_11, …, x_1m, y_1), …, (x_n1, …, x_nm, y_n)}

where x_nm is the m-th feature of the n-th customer and y_n is the label of the n-th customer, with y_i ∈ {normal, concern, secondary, suspicious, loss}.
Step two: and cleaning the historical characteristic data and the behavior data of the clients, and then retaining the characteristic with the highest information value in the characteristics with the correlation reaching the threshold value.
Specifically, cleaning the customer historical feature data and behavior data consists of dirty-data cleaning, missing-value processing and duplicate-value deletion. Dirty-data cleaning includes deleting wrong identity-card numbers, unifying the representation of values whose meaning is "unknown", and removing any other data that does not conform to the standard length or format of its field; missing values are filled with 0, the mean, the median or the mode.
Feature encoding is also needed for the non-numeric categorical features; an encoding method such as one-hot encoding is chosen, so that non-numeric categorical fields can be converted into numeric fields, which facilitates model training. However, since one-hot encoding increases the feature dimension, after the non-numeric categorical features are one-hot encoded, principal component analysis is applied, guided by expert experience, to reduce the feature dimension.
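For illustration, the one-hot encoding step can be sketched in plain Python (an illustrative sketch only, not part of the claimed method; the function name is hypothetical, and in practice the subsequent dimensionality reduction would use a principal-component-analysis routine from a numerical library):

```python
def one_hot_encode(values):
    """Map a non-numeric categorical column to 0/1 indicator columns,
    one column per distinct category (sorted for a stable order)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

# e.g. a hypothetical "product type" field with two categories
encoded, cats = one_hot_encode(["savings", "loan", "savings"])
```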
When there are many features, features with poor data quality can be removed; removal conditions include a field null rate of 90% or more, or a single value accounting for 90% or more of a field.
If the classification labels of the clients in the training set have more than two categories, the classification labels are simplified to be two categories, wherein 1 is the label of the target client, and 0 is the label of the non-target client.
In this embodiment, label preprocessing is divided into two parts. The first is to simplify the five-level classification into a binary classification: normal and concern loan customers are grouped into one class and judged good customers, while secondary, suspicious and loss loan customers are grouped into the other class and judged bad customers. Thus y_i ∈ {0, 1}, where y_i = 1 denotes a bad sample, i.e., a default sample, and y_i = 0 denotes a good sample.
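The label simplification above can be expressed as a one-line mapping (a sketch; the English level names follow the translation used in this document):

```python
def binarize_label(five_level):
    """Collapse the five-level loan classification into a binary label:
    1 = bad (default) sample, 0 = good sample."""
    return 1 if five_level in {"secondary", "suspicious", "loss"} else 0
```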
Step three: and (3) carrying out box division processing on the features processed in the step two to form a feature table finally input into the machine learning model.
The type features are classified into a box body, the date type features are classified into boxes according to year or month, and the numerical value type features are classified into boxes by adopting a minimum condition entropy classification method.
The minimum-conditional-entropy binning is a top-down binning method. It proceeds as follows:
(1) Taking each value of the feature in turn as a candidate dividing point, split the data set in two and compute the weighted sum of the conditional entropies of the two boxes:

H(Y) = sum_i (N_i / N) * sum_j (-p_j * log2 p_j)

where H(Y) is the weighted sum of the conditional entropies, N_i / N is the sample proportion of box i, and p_j is the proportion of target customers (or non-target customers) in the single box x_i.
(2) Selecting a minimum value of the conditional entropy weighted sum as a dividing point, and dividing one box into two boxes;
(3) And (3) selecting a box body with the maximum conditional entropy in the characteristics, repeating the steps (1) - (2), and continuously dividing the box body until the number of the boxes of the characteristics reaches the set upper limit of the number of the boxes of the single characteristics.
The advantage of the minimum-conditional-entropy binning method is that purer boxes are obtained, i.e., the proportion of the target-customer (or non-target-customer) label inside a box is maximized. If the classes of the dependent variable in the i-th box occur in equal proportion, the entropy value of the i-th box reaches its maximum; if the dependent variable of the i-th box takes only one value, i.e., some p_j equals 1 and the proportions of the other classes equal 0, the entropy value of the i-th box reaches its minimum.
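The binning procedure in steps (1)-(3) above can be sketched as follows (illustrative code, not from the patent; it assumes binary 0/1 labels and stops at the single-feature bin-count upper limit, as described):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def best_split(rows):
    """rows: list of (value, label). Return the cut value minimising the
    weighted entropy of the two resulting boxes, or None if no cut exists."""
    rows = sorted(rows)
    n = len(rows)
    best = (float("inf"), None)
    for k in range(1, n):
        if rows[k][0] == rows[k - 1][0]:
            continue  # cannot cut between equal values
        left = [lab for _, lab in rows[:k]]
        right = [lab for _, lab in rows[k:]]
        h = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if h < best[0]:
            best = (h, rows[k][0])
    return best[1]

def min_entropy_bins(rows, max_bins):
    """Top-down binning: repeatedly split the highest-entropy box
    until the bin-count upper limit is reached or no split is possible."""
    bins = [sorted(rows)]
    while len(bins) < max_bins:
        bins.sort(key=lambda b: entropy([lab for _, lab in b]), reverse=True)
        cut = best_split(bins[0])
        if cut is None:
            break
        worst = bins.pop(0)
        bins.append([r for r in worst if r[0] < cut])
        bins.append([r for r in worst if r[0] >= cut])
    return bins
```

For example, six customers whose labels separate cleanly at a value threshold are split into two pure boxes by `min_entropy_bins(..., 2)`.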
The classification type features are not required to be classified into boxes, one class is a box, and if the phenomenon of excessive classification exists, the small classes need to be combined according to the situation. And (3) the time-based feature is divided into boxes, whether obvious periodic features exist or not is observed according to the feature distribution diagram, and the box division processing is carried out according to the period length of the features.
Step four: and training the machine learning classification model input as the structured data by utilizing the feature table and the classification label to obtain the feature importance ranking of the machine learning classification model.
In this embodiment, a decision tree model is used to predict the customer loan default rate. The sampling strategy is random downsampling, keeping the ratio of default customers to non-default customers at 2:1; the training set, validation set and test set are split 6:2:2. The feature-importance ranking produced by the model (its feature_importances attribute) is then obtained.
Step five: calculating the information value of each box body and the number of target clients of the box body, and constructing box body indexes of each box body by combining the feature importance sequencing; and reserving the box body with the box body index larger than the box body screening threshold value as a characteristic box body.
The information value (IV) is an index characterizing the degree to which a feature contributes to predicting the target, i.e., the feature's classification predictive power. In general, the higher the information value, the stronger the feature's predictive power. The information value is applicable only when the task is supervised, i.e., there is a classification label, and the task must be binary classification, as the formula shows:

IV_i = (T_i / T_t - NT_i / NT_t) * ln( (T_i / T_t) / (NT_i / NT_t) )

where IV_i is the information value of a given box; T_i is the number of target samples, i.e., target customers, in the box; NT_i is the number of non-target samples in the box; T_t is the total number of target samples in the feature; and NT_t is the total number of non-target samples in the feature.
In this embodiment, the IV value of a box is calculated as follows:

IV_i = (Bad_i / Bad_t - Good_i / Good_t) * ln( (Bad_i / Bad_t) / (Good_i / Good_t) )

where IV_i is the IV value of a given box; Bad_i is the number of bad samples, i.e., default customers, in the box; Good_i is the number of good samples, i.e., non-default customers, in the box; Bad_t is the total number of bad samples in the feature; and Good_t is the total number of good samples in the feature.
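The per-box IV computation can be sketched directly from the formula (illustrative; the `eps` guard against zero counts is a standard numerical safeguard that the patent does not discuss):

```python
import math

def bin_iv(bad_i, good_i, bad_t, good_t, eps=1e-10):
    """Information value contributed by one box.
    bad_i/good_i: default / non-default counts in the box;
    bad_t/good_t: totals over the whole feature."""
    pb = max(bad_i / bad_t, eps)  # share of all bad samples in this box
    pg = max(good_i / good_t, eps)  # share of all good samples in this box
    return (pb - pg) * math.log(pb / pg)
```

A box whose bad-sample share equals its good-sample share contributes zero IV; the more the two shares diverge, the larger the contribution.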
The information value, the number of target customers in a box, and the feature-importance ranking are three indexes of different dimensions. To normalize them, the box index of each box is constructed with the following formula:

S = p_iv * (IV_i / IV_max) + p_target * (Target_i / Target_max) + p_fi * (FI_i / FI_max)

where S is the box index; p_iv, p_target and p_fi are the weights applied to the information-value ratio, the target-customer-count ratio and the feature-importance ratio, respectively; IV_i is the information value of the i-th box and IV_max the maximum information value over all boxes; Target_i is the number of target customers in the i-th box and Target_max the maximum target-customer count over all boxes; FI_i is the feature importance, derived from the machine learning model, of the feature to which the i-th box belongs, and FI_max is the largest feature importance derived from the machine learning model.
If the box index S is greater than the threshold, the box is retained; otherwise it is discarded. The reason is, first, that minimum-conditional-entropy binning produces some boxes in which good samples dominate, and these are not what we want; second, the customer base of most banks is huge, so a simplification step is required.
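The box index S can be sketched as a weighted sum of the three normalized ratios (illustrative; the default weight values are placeholders, since the patent leaves the weights as parameters):

```python
def box_index(iv_i, iv_max, target_i, target_max, fi_i, fi_max,
              p_iv=0.4, p_target=0.3, p_fi=0.3):
    """Box index S: weighted sum of the box's normalized information value,
    target-customer count and feature importance. Weights are placeholders."""
    return (p_iv * iv_i / iv_max
            + p_target * target_i / target_max
            + p_fi * fi_i / fi_max)

# a box maximal on all three indexes scores S = p_iv + p_target + p_fi
```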
Step six: and (3) taking each characteristic box as a starting point, and finding a series of characteristic box combinations aiming at target clients through a greedy algorithm to serve as a characteristic box combination dictionary. The method specifically comprises the following substeps:
(1) Taking a single characteristic box body as a starting point, combining the characteristic box body with other characteristic box bodies one by one, and screening out a characteristic two-box body combination with the maximum probability of a target customer;
(2) Taking the two box combinations as a starting point, combining the two box combinations with other characteristic boxes one by one, screening out a characteristic three box combination with the maximum probability of target clients, and setting the upper limit of the number of the boxes in the characteristic box division combination to stop circulation;
(3) And (3) replacing the initial characteristic box body, repeating the steps (1) - (2), and reserving the combination of all the characteristic box bodies to form a characteristic box body combination dictionary.
Examples are as follows:
(1) The probability of breach of the first feature box a is calculated.
(2) Calculate the default probability of feature box a combined with each other single box (b, c, d, …), and select the two-box combination with the largest default probability (say, box a with box b). Compare this combination's default probability with that of box a alone: if it is larger, continue searching the remaining boxes (c, d, …) to form a three-box combination; otherwise stop. Repeat this step until the default probability of the combination grown from box a no longer increases, or the number of boxes in the combination reaches the preset upper limit. The resulting feature-box combination is entered into the high-default-customer feature-box combination dictionary.
(3) And (3) replacing the next feature box b as a starting point, and repeating the steps (1) and (2) until all feature boxes are used as the starting point, so as to obtain the required high-default customer feature box combination dictionary.
Because a greedy algorithm yields locally optimal combinations, every screened feature box must be used as a starting point, so that all feature combinations with the highest default rate are searched and the trouble caused by local optima is reduced.
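The greedy construction of the combination dictionary described in this step can be sketched as follows (illustrative code, not the patent's implementation; customers are represented as the set of feature boxes they fall into, with label 1 for default):

```python
def default_rate(customers, combo):
    """Fraction of defaulting customers among those matching every box
    in combo. customers: list of (box_set, label), label 1 = default."""
    hit = [lab for boxes, lab in customers if combo <= boxes]
    return sum(hit) / len(hit) if hit else 0.0

def greedy_combos(customers, feature_boxes, max_size):
    """Grow one combination from each starting box, greedily adding the
    box that most raises the default rate; stop when no box helps or the
    combination reaches max_size boxes."""
    dictionary = []
    for start in feature_boxes:
        combo, rate = {start}, default_rate(customers, {start})
        while len(combo) < max_size:
            candidates = [(default_rate(customers, combo | {b}), b)
                          for b in feature_boxes if b not in combo]
            if not candidates:
                break
            best_rate, best_box = max(candidates)
            if best_rate <= rate:
                break  # greedy stop: no candidate raises the default rate
            combo.add(best_box)
            rate = best_rate
        dictionary.append((frozenset(combo), rate))
    return dictionary
```

In a full system, each starting box would be one of the screened feature boxes from step five, and combinations with a high default rate would form the dictionary matched against new customers in step seven.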
Step seven: the new customer characteristic data and behavior data are subjected to preprocessing in the second step and box division in the third step, and then are input into a trained machine learning classification model to obtain a new customer prediction label; if the client is judged to be the target client according to the prediction label of the new client, the characteristics of the new client are matched with the characteristic box combination dictionary, and the hit characteristic box combination is output to be used as the description of the machine learning classification model. Thereby assisting first line business personnel in determining customer risk.
It will be appreciated by persons skilled in the art that the foregoing describes preferred embodiments of the invention and is not intended to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their features. Any modifications, equivalents and improvements made within the spirit and principles of the invention are intended to be included within its scope.

Claims (10)

1. A machine learning classification model interpretation method based on a greedy algorithm, characterized by comprising:
step one: acquiring customers' historical feature data, behavior data, and classification labels in a training set;
step two: cleaning the customer historical feature data and behavior data, then, among features whose pairwise correlation reaches a threshold, retaining only the feature with the highest information value;
step three: binning the features processed in step two to form the feature table that is finally input into the machine learning model;
step four: training a machine learning classification model that takes structured data as input, using the feature table and the classification labels, to obtain the model's feature importance ranking;
step five: calculating the information value of each box and the number of target customers in each box, and constructing a box index for each box in combination with the feature importance ranking; retaining the boxes whose box index exceeds the box screening threshold as feature boxes;
step six: taking each feature box as a starting point, finding a series of feature-box combinations aimed at target customers through a greedy algorithm to serve as a feature-box combination dictionary;
step seven: subjecting a new customer's feature data and behavior data to the preprocessing of step two and the binning of step three, then inputting them into the trained machine learning classification model to obtain the new customer's predicted label; if the new customer is judged to be a target customer according to the predicted label, matching the new customer's features against the feature-box combination dictionary and outputting the hit feature-box combinations as the explanation of the machine learning classification model.
2. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 1, wherein cleaning the customer historical feature data and behavior data specifically comprises dirty-data cleaning, missing-value processing, and duplicate-value deletion.
3. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 1, wherein, after the customer historical feature data and behavior data are cleaned in step two, non-numeric categorical features must additionally be feature-encoded.
4. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 1, wherein in step three each category of a categorical feature forms one box, date-type features are binned by year or month, and numerical features are binned by a minimum conditional entropy binning method;
the minimum conditional entropy binning method specifically comprises:
(1) Taking each value of the feature as a candidate split point, dividing the feature into two parts, and calculating the weighted sum of the conditional entropies of the two boxes:

H(Y) = Σ_i (|x_i| / N) · ( −Σ_j p_j log2 p_j )

where H(Y) denotes the weighted sum of the conditional entropies, |x_i| / N is the proportion of samples falling in box x_i, and p_j is the proportion of target customers or of non-target customers within the single box x_i;
(2) Selecting the split point that minimizes the conditional entropy weighted sum and splitting the box into two boxes;
(3) Selecting the box with the largest conditional entropy among the feature's boxes and repeating steps (1)-(2) to split it further, until the number of boxes for the feature reaches the set upper limit on boxes per feature.
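The split-point search of steps (1)-(2) can be sketched as follows; the function names and the 0/1 binary-label encoding are illustrative assumptions:

```python
import math

def entropy(labels):
    """Binary entropy of a list of 0/1 labels (0 for a pure box)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(values, labels):
    """Try each distinct feature value as a split point and return the
    (split point, entropy) pair minimizing the weighted sum of the
    conditional entropies of the two resulting boxes."""
    n = len(values)
    best = (None, float("inf"))
    for v in sorted(set(values))[:-1]:        # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= v]
        right = [y for x, y in zip(values, labels) if x > v]
        h = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if h < best[1]:
            best = (v, h)
    return best
```

Step (3) would then repeatedly call `best_split` on the box with the largest conditional entropy until the per-feature box limit is reached.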
5. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 1, wherein the box index of each box in step five is calculated as:

S = p_iv · (IV_i / IV_max) + p_target · (Target_i / Target_max) + p_fi · (feature_importance_i / feature_importance_max)

where S denotes the box index; p_iv, p_target, and p_fi are the application weights of the information value ratio, the target-customer-count ratio, and the feature importance ratio, respectively; IV_i is the information value of the i-th box and IV_max the maximum information value over all boxes; Target_i is the number of target customers in the i-th box and Target_max the maximum target-customer count over all boxes of the index; feature_importance_i is the importance, derived by the machine learning model, of the feature to which the i-th box belongs, and feature_importance_max is the largest feature importance derived by the model.
6. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 1, wherein in step six, finding a series of feature-box combinations aimed at target customers through a greedy algorithm, with each feature box as a starting point, specifically comprises:
(1) Taking a single feature box as the starting point, combining it with every other feature box one by one, and selecting the two-box combination with the highest target-customer probability;
(2) Taking that two-box combination as the starting point, combining it with each remaining feature box one by one and selecting the three-box combination with the highest target-customer probability, and so on; an upper limit on the number of boxes in a combination is set to terminate the loop;
(3) Replacing the starting feature box and repeating steps (1)-(2), retaining all feature-box combinations to form the feature-box combination dictionary.
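The multi-start greedy search of claim 6 can be sketched as follows. The `target_rate` callback, which returns the target-customer ratio of a box combination, is an assumed stand-in for a lookup over the training data:

```python
# Sketch of claim 6's greedy combination search; `target_rate` is an
# assumed callback, not part of the patent's actual interface.

def greedy_combos(boxes, target_rate, max_len=3):
    """From each starting box, repeatedly add the box that most raises
    the combination's target-customer rate, stopping at max_len boxes
    or when no candidate improves the rate. Returns one combination
    per starting box, forming the combination dictionary."""
    dictionary = []
    for start in boxes:
        combo = {start}
        while len(combo) < max_len:
            candidates = [combo | {b} for b in boxes if b not in combo]
            best = max(candidates, key=lambda c: target_rate(frozenset(c)))
            if target_rate(frozenset(best)) <= target_rate(frozenset(combo)):
                break                  # local optimum reached for this start
            combo = best
        dictionary.append(frozenset(combo))
    return dictionary
```

Restarting from every box is what mitigates the local-optimum behavior noted in the description of step six.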
7. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 1, wherein, if the customers' classification labels in the training set comprise more than two categories, the labels are simplified into two categories, with 1 as the target-customer label and 0 as the non-target-customer label.
8. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 1, wherein in step two the correlation between features is calculated using Pearson correlation analysis.
9. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 3, wherein the non-numeric categorical features are feature-encoded using one-hot encoding.
10. The machine learning classification model interpretation method based on a greedy algorithm as claimed in claim 9, wherein, since one-hot encoding increases the feature dimension, after the non-numeric categorical features are one-hot encoded, the feature dimension is reduced by principal component analysis in accordance with expert experience.
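The encode-then-reduce pipeline of claims 9-10 can be sketched in plain NumPy; a real pipeline would more likely use a library encoder, and the SVD-based PCA here is one standard way to realize the reduction:

```python
# Sketch of claims 9-10: one-hot encode a categorical feature, then
# reduce the expanded dimensions with PCA computed via SVD.
import numpy as np

def one_hot(column):
    """One-hot encode a list of category labels (one column per
    category, columns in sorted order)."""
    cats = sorted(set(column))
    return np.array([[1.0 if v == c else 0.0 for c in cats] for v in column])

def pca_reduce(X, k):
    """Project mean-centered X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = components
    return Xc @ Vt[:k].T
```

For a 3-category feature over 5 customers this turns the 5x3 one-hot matrix into a 5x2 reduced matrix.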
CN202211687370.8A 2022-12-27 2022-12-27 Machine learning classification model interpretation method based on greedy algorithm Pending CN116050539A (en)

Publication: CN116050539A (2023-05-02)

Family

ID=86132371



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination