CN111738331A

CN111738331A - User classification method and device, computer-readable storage medium and electronic device

Info

Publication number: CN111738331A
Application number: CN202010568382.3A
Authority: CN
Inventors: 白云飞
Original assignee: Beijing Tongbang Zhuoyi Technology Co ltd
Current assignee: Beijing Tongbang Zhuoyi Technology Co ltd
Priority date: 2020-06-19
Filing date: 2020-06-19
Publication date: 2020-10-02

Abstract

The embodiment of the invention relates to a user classification method and device, a computer-readable storage medium, and an electronic device, in the technical field of machine learning. The method comprises the following steps: acquiring historical user data of a user to be classified, and generating user features of the user to be classified according to attribute features in the historical user data; processing the user features of the user to be classified according to the feature category to which they belong to obtain a plurality of initial features to be processed, and performing feature crossing on the initial features to be processed to generate an initial feature matrix to be processed; and inputting the initial feature matrix to be processed into a multi-granularity cascade forest model, and classifying the user to be classified according to an output result of the multi-granularity cascade forest model. The embodiment of the invention improves the accuracy of the user classification result.

Description

User classification method and device, computer-readable storage medium and electronic device

Technical Field

The embodiment of the invention relates to the technical field of machine learning, in particular to a user classification method, a user classification device, a computer readable storage medium and electronic equipment.

Background

In a credit scenario, risk control is the most important link: by establishing a credit risk score during the customer acquisition period, the probability of default risk posed by a customer can be predicted.

In existing risk probability prediction methods, an application risk scoring model is established during the customer application processing period, and the model is then used to predict the probability that a customer will default or become delinquent within a certain period after opening an account, so that customers with poor credit and non-target customers can be identified. The risk scoring model is a logistic regression model.

However, the above method has the following drawbacks: on the one hand, the input variables of the logistic regression model have high dimensionality, so that a large amount of redundant data exists and the burden on the system is heavy; on the other hand, it is difficult to make effective use of all the information in those variables, so the accuracy of user classification is low.

It is to be noted that the information disclosed in the above background section is only for enhancing the understanding of the background of the present invention, and therefore may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The present invention is directed to a user classification method, a user classification device, a computer-readable storage medium, and an electronic device, which overcome, at least to some extent, the problems of low classification efficiency and low classification accuracy caused by the limitations and defects of the related art.

According to an aspect of the present disclosure, there is provided a user classification method, including:

acquiring historical user data of a user to be classified, and generating user characteristics of the user to be classified according to attribute characteristics in the historical user data;

processing the user characteristics of the user to be classified according to the characteristic category to which the user characteristics of the user to be classified belong to obtain a plurality of initial characteristics to be processed, and performing characteristic intersection on each initial characteristic to be processed to generate an initial characteristic matrix to be processed;

and inputting the initial matrix to be processed into a multi-granularity cascading forest model, and classifying users to be classified according to an output result of the multi-granularity cascading forest model.

In an exemplary embodiment of the present disclosure, the feature categories include a plurality of a continuous type, a discrete type, a categorical type, and a time type;

the processing the user features of the user to be classified to obtain a plurality of initial features to be processed according to the feature categories to which the user features of the user to be classified belong comprises:

when the characteristic category is continuous, carrying out characteristic construction on the user characteristics of the user to be classified based on a preset customer relationship management model to obtain a plurality of first initial characteristics;

when the feature type is discrete, performing evidence weight conversion on the user features of the user to be classified to obtain a plurality of second initial features;

when the feature category is the categorical type, performing one-hot encoding on the user features of the user to be classified to obtain a plurality of third initial features;

when the feature type is a time type, calculating the time length of the user features of the user to be classified to obtain a plurality of fourth initial features; wherein the fourth initial feature is a numerically continuous feature.

In an exemplary embodiment of the present disclosure, performing feature intersection on each of the initial to-be-processed features, and generating an initial to-be-processed feature matrix includes:

performing feature intersection on the first initial feature, the second initial feature, the third initial feature and the fourth initial feature to generate an initial feature matrix to be processed;

wherein the number of rows of the initial feature matrix to be processed is the number of initial features, the number of columns is the dimension of each initial feature after feature intersection, and the intersection mode of the feature intersection comprises sequential pairwise intersection or random pairwise intersection.

In an exemplary embodiment of the disclosure, the user classification method further includes:

acquiring a data sample, and calculating a performance value of a target user included in the data sample according to the data sample;

associating the performance value with the modeling features corresponding to the target user in the data sample to obtain the user features of the target user;

generating an initial characteristic matrix to be trained according to the user characteristics of the target user, and constructing a training set and a test set according to the initial characteristic matrix to be trained;

performing feature screening on the initial feature matrix to be trained in the training set to obtain a target feature matrix to be trained, and performing machine learning on the target feature matrix and target parameters based on a multi-granularity cascade forest algorithm to obtain an initial model;

testing the initial model by using the initial characteristic matrix to be trained in the test set to obtain a plurality of test results, and calculating an evaluation index according to the test results and the performance values in the initial characteristic matrix to be trained in the test set;

and when the evaluation index is between a first index value and a second index value, taking the initial model as the multi-granularity cascade forest model.
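The acceptance condition in the last step can be illustrated with a minimal sketch. This passage does not name the evaluation index, so the sketch assumes AUC purely for illustration; the function names `auc` and `accept_model` and the bounds 0.7 and 0.9 are hypothetical and do not come from the source.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    scores: predicted probability of bad performance (label 1).
    labels: performance values, 0 = good, 1 = bad.
    """
    bad = [s for s, y in zip(scores, labels) if y == 1]
    good = [s for s, y in zip(scores, labels) if y == 0]
    # A (bad, good) pair counts 1 if the bad user scores higher, 0.5 on a tie.
    wins = sum((b > g) + 0.5 * (b == g) for b in bad for g in good)
    return wins / (len(bad) * len(good))


def accept_model(index, first_index_value, second_index_value):
    """Keep the initial model only if the index lies between the two bounds."""
    return first_index_value <= index <= second_index_value


test_scores = [0.9, 0.8, 0.3, 0.1]   # toy test-set outputs
test_labels = [1, 0, 1, 0]           # toy performance values
index = auc(test_scores, test_labels)
print(index, accept_model(index, 0.7, 0.9))
```

An upper bound as well as a lower bound is useful in practice because an implausibly high index on the test set may point to leakage of the performance value into the features; that reading is an interpretation, not a statement of the source.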

In an exemplary embodiment of the present disclosure, calculating a performance value of a target user included in the data sample from the data sample includes:

calculating the account age and the roll rate of a target user included in the data sample according to the data sample;

and analyzing the account age and the roll rate to obtain the observation period of the target user, and calculating the performance value of the target user according to the observation period.

In an exemplary embodiment of the present disclosure, the obtaining of the target feature matrix to be trained by performing feature screening on the initial feature matrix to be trained in the training set includes:

calculating the feature missing rate and the number of abnormal values in the initial feature matrix to be trained in the training set;

judging whether the characteristic missing rate is greater than a first preset threshold value and/or whether the number of the abnormal values is greater than a second preset threshold value;

when the feature missing rate is determined to be larger than a first preset threshold value and/or the number of abnormal values is determined to be larger than a second preset threshold value, filtering the initial feature matrix;

processing the residual initial characteristic matrix to be trained after filtering based on a preset characteristic processing algorithm to obtain the target characteristic matrix to be trained;

wherein the feature processing algorithm comprises at least one of an information value (IV) algorithm, a shrinkage estimation algorithm, an extreme gradient boosting (XGBoost) algorithm, and a light gradient boosting machine (LightGBM) algorithm.
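The threshold-based filtering above can be sketched as follows. This is only an illustration under stated assumptions: the filter is applied per feature column, missing values are represented as `None`, and an abnormal value is taken to be one lying more than `k` standard deviations from the column mean. The source does not define how abnormal values are detected, and all function names here are hypothetical.

```python
def missing_rate(column):
    """Share of entries in a feature column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def outlier_count(column, k=3.0):
    """Number of present values farther than k standard deviations from the mean.

    The k-sigma rule is an assumption; the source does not specify a rule.
    """
    present = [v for v in column if v is not None]
    if not present:
        return 0
    mean = sum(present) / len(present)
    std = (sum((v - mean) ** 2 for v in present) / len(present)) ** 0.5
    if std == 0:
        return 0
    return sum(abs(v - mean) > k * std for v in present)


def filter_features(matrix, first_threshold, second_threshold):
    """Drop columns whose missing rate or outlier count exceeds its threshold.

    matrix: dict mapping feature name -> list of values (one per user).
    """
    return {
        name: column
        for name, column in matrix.items()
        if missing_rate(column) <= first_threshold
        and outlier_count(column) <= second_threshold
    }


toy_matrix = {
    "mostly_missing": [1, None, None, None],
    "has_outlier": [0] * 20 + [100],
    "clean": [1, 2, 3, 4],
}
print(list(filter_features(toy_matrix, 0.5, 0)))
```

The columns that survive this step would then be passed to the chosen feature processing algorithm to obtain the target feature matrix to be trained.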

In an exemplary embodiment of the present disclosure, the user classification method further includes:

searching a parameter space based on a preset parameter searching algorithm to obtain a plurality of target parameters;

wherein the parameter search algorithm comprises at least one of a grid search, a random search, and a Bayesian search.
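As a minimal sketch of two of the named search strategies, grid search enumerates every combination in the parameter space, while random search samples a fixed number of combinations. The parameter names `n_trees` and `max_depth` and the scoring function `toy_score` are hypothetical stand-ins; in practice the score would come from training and validating a model with each candidate.

```python
import random
from itertools import product


def grid_search(space, score):
    """Evaluate every combination in the parameter space and return the best."""
    names = list(space)
    candidates = [
        dict(zip(names, combo)) for combo in product(*(space[n] for n in names))
    ]
    return max(candidates, key=score)


def random_search(space, score, n_iter=10, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=score)


def toy_score(params):
    # Hypothetical validation score that peaks at n_trees=100, max_depth=5.
    return -abs(params["n_trees"] - 100) - abs(params["max_depth"] - 5)


space = {"n_trees": [50, 100, 200], "max_depth": [3, 5]}
print(grid_search(space, toy_score))
print(random_search(space, toy_score, n_iter=4))
```

Grid search is exhaustive and therefore costly for large spaces; random search trades completeness for a fixed evaluation budget.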

According to an aspect of the present disclosure, there is provided a user classifying device including:

the first feature generation module is used for acquiring historical user data of a user to be classified and generating user features of the user to be classified according to attribute features in the historical user data;

the first feature processing module is used for processing the user features of the users to be classified according to the feature categories to which the user features of the users to be classified belong to obtain a plurality of initial features to be processed, and performing feature intersection on each initial feature to be processed to generate an initial feature matrix to be processed;

and the user classification module is used for inputting the initial matrix to be processed into the multi-granularity cascade forest model and classifying users to be classified according to the output result of the multi-granularity cascade forest model.

According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a user classification method as described in any one of the above.

According to an aspect of the present disclosure, there is provided an electronic apparatus including:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform any of the user classification methods described above via execution of the executable instructions.

In the user classification method provided by the embodiments of the invention, on the one hand, historical user data of a user to be classified is acquired, and user features of the user to be classified are generated according to attribute features in the historical user data; the user features are then processed according to the feature category to which they belong to obtain a plurality of initial features to be processed, and feature crossing is performed on the initial features to be processed to generate an initial feature matrix to be processed; finally, the initial feature matrix to be processed is input into the multi-granularity cascade forest model, and the user to be classified is classified according to the output result of the model. Because the input of the multi-granularity cascade forest model is only the initial feature matrix to be processed obtained from the attribute features, this solves the prior-art problem that the high dimensionality of the input variables of the logistic regression model produces a large amount of redundant data and a heavy system burden. On the other hand, because the initial feature matrix to be processed is obtained by feature crossing of the initial features to be processed, the prior-art problem that the accuracy of user classification is reduced because all of the information is difficult to use effectively is solved, and the accuracy of user classification is improved. In a further aspect, generating the initial feature matrix to be processed by feature crossing of the initial features to be processed avoids the low classification accuracy that results when few user features yield an overly simple feature matrix.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 schematically shows a flow chart of a user classification method according to an exemplary embodiment of the present invention.

Fig. 2 schematically shows a flowchart of a method for processing the user features of the user to be classified to obtain a plurality of initial features to be processed according to the feature class to which the user features of the user to be classified belong, according to an exemplary embodiment of the present invention.

Fig. 3 schematically shows a flow chart of another user classification method according to an exemplary embodiment of the present invention.

Fig. 4 schematically shows a flow chart of a method of calculating, from the data sample, a performance value of a target user included in the data sample according to an exemplary embodiment of the invention.

Fig. 5 schematically shows a flowchart of a method for performing feature screening on an initial feature matrix to be trained in the training set to obtain a target feature matrix to be trained according to an exemplary embodiment of the present invention.

Fig. 6 schematically shows a block diagram of a structure of a multi-granularity cascaded forest according to an exemplary embodiment of the present invention.

Fig. 7 schematically shows a block diagram of a user classifying apparatus according to an exemplary embodiment of the present invention.

Fig. 8 schematically illustrates an electronic device for implementing the user classification method according to an exemplary embodiment of the present invention.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the invention.

Furthermore, the drawings are merely schematic illustrations of the invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

In a credit scenario, risk control is one of the most important links. During the customer acquisition period, a credit risk score is established to predict the probability of default risk posed by a customer. Generally, with a conventional application scorecard, an application risk scoring model is established during the customer application processing period, and the probability that a customer will default or become delinquent within a certain period after opening an account is predicted, so that applications from customers with poor credit and from non-target customers are effectively screened out. The industry generally uses logistic regression to build the application scoring model, with the following steps:

firstly, obtaining a sample, and determining the distribution of good and bad customers;

secondly, associating the feature dimensions of the sample according to a primary key, performing data preprocessing, and dividing the data into a training set and a test set;

then, performing feature processing on some of the features, screening the features using a series of indicators, and selecting model parameters for the screened variables;

finally, training and evaluating the model, standardizing the output model file, and preparing model monitoring materials.

Furthermore, in addition to the logistic regression model, and in response to the emergence of large amounts of structured and unstructured data, machine-learning-based risk control models are increasingly being studied; for example, decision trees, random forests, GBDT, and deep networks have been studied for risk control modeling and have achieved good results.

However, with an application scorecard based on logistic regression, the modeling variables often require information value screening and monotonicity adjustment of the weights of evidence, which places high demands on the business experience of the modelers, lengthens the business cycle of iterating the model, and consumes considerable manpower. Meanwhile, the input variables accepted by a logistic regression model typically number in the dozens of dimensions, whereas with explosive data growth the available variables reach tens of thousands of dimensions, and it is difficult to make effective use of all their information.

Furthermore, when a neural network is used to build the application scorecard model on a large sample, a model with good generalization capability and high accuracy can be fitted; in a credit scenario, however, samples are extremely costly to obtain and the data volume of a single scenario is not very large, so the applicability of deep network models is limited. Meanwhile, neural network models are not very interpretable and are unsuitable for scenarios in which the application scorecard must be interpretable from a business perspective.

The exemplary embodiment first provides a user classification method, which can be used to quickly predict the creditworthiness of a user applying for credit. The method may run on a server, a server cluster, a cloud server, or the like; of course, those skilled in the art may also run the method of the present invention on other platforms as needed, and this is not particularly limited in this exemplary embodiment. Referring to fig. 1, the user classification method may include the steps of:

S110, obtaining historical user data of a user to be classified, and generating user characteristics of the user to be classified according to attribute characteristics in the historical user data;

S120, processing the user characteristics of the user to be classified according to the characteristic category to which the user characteristics of the user to be classified belong to obtain a plurality of initial characteristics to be processed, and performing characteristic intersection on each initial characteristic to be processed to generate an initial characteristic matrix to be processed;

and S130, inputting the initial matrix to be processed into a multi-granularity cascading forest model, and classifying users to be classified according to an output result of the multi-granularity cascading forest model.

In the user classification method, on the one hand, historical user data of a user to be classified is acquired, and user features of the user to be classified are generated according to attribute features in the historical user data; the user features are then processed according to the feature category to which they belong to obtain a plurality of initial features to be processed, and feature crossing is performed on the initial features to be processed to generate an initial feature matrix to be processed; finally, the initial feature matrix to be processed is input into the multi-granularity cascade forest model, and the user to be classified is classified according to the output result of the model. This solves the prior-art problems that, because the input variables of a logistic regression model usually require information value screening and monotonicity adjustment of the weights of evidence, the business cycle of iterating the model is long, model training efficiency is low, and user classification efficiency is low; the efficiency of user classification is thereby improved. On the other hand, it solves the prior-art problem that the high dimensionality of the input variables of the logistic regression model produces a large amount of redundant data and a heavy system burden, and that all the information of those variables is difficult to use effectively, which reduces the accuracy of user classification; the accuracy of user classification is thereby improved. In a further aspect, generating the initial feature matrix to be processed by feature crossing of the initial features to be processed avoids the low classification accuracy that results when few user features yield an overly simple feature matrix.

Hereinafter, each step involved in the user classification method according to the exemplary embodiment of the present invention will be explained and described in detail with reference to the drawings.

In step S110, historical user data of the user to be classified is obtained, and the user feature of the user to be classified is generated according to the attribute feature in the historical user data.

In the present exemplary embodiment, the attribute features may include, for example, basic attributes of the crowd, e-commerce statistical information, e-commerce behavior information, address information, time-type information, and the like. The basic attributes of the crowd may include basic information such as the age and gender of the user to be classified; the e-commerce behavior information may include information such as the user's browsing, shopping, searching, and purchasing on the platform; the address information may include the user's commonly used mailing addresses, the address information of the user, and the like; the time-type information may include, for example, the time at which each action of the user occurred; and the e-commerce statistical information may be, for example, aggregate user information that the platform computes from the e-commerce behavior information, address information, and time-type information. Therefore, after the historical user data is obtained, it can be classified and sorted to obtain the user features of the user to be classified. In this way, the accuracy of the user features can be improved, which in turn improves the accuracy of the final classification result.

In step S120, according to the feature category to which the user feature of the user to be classified belongs, the user feature of the user to be classified is processed to obtain a plurality of initial features to be processed, and feature intersection is performed on each of the initial features to be processed to generate an initial feature matrix to be processed.

In the present exemplary embodiment, the above-described feature categories may include a continuous type, a discrete type, a categorical type, a time type, and the like. Further, referring to fig. 2, processing the user features of the user to be classified according to the feature category to which they belong to obtain a plurality of initial features to be processed may include steps S210 to S240. Wherein:

in step S210, when the feature type is a continuous type, feature construction is performed on the user features of the user to be classified based on a preset customer relationship management model, so as to obtain a plurality of first initial features.

In step S220, when the feature category is discrete, performing evidence weight conversion on the user features of the user to be classified to obtain a plurality of second initial features.

In step S230, when the feature category is the categorical type, one-hot encoding is performed on the user features of the user to be classified to obtain a plurality of third initial features.

In step S240, when the feature category is a temporal type, calculating a time length of the user feature of the user to be classified to obtain a plurality of fourth initial features; wherein the fourth initial feature is a numerically continuous feature.

Steps S210 to S240 are explained below. First, for continuous features, features can be constructed using a customer relationship management model to obtain the first initial features; the customer relationship management model (RFM model) may include the recency of the last consumption (Recency), the consumption frequency (Frequency), and the consumption amount (Monetary). Second, for discrete features, WOE (Weight of Evidence) conversion can be performed to obtain the second initial features. Further, for categorical features, one-hot encoding can be performed to obtain the third initial features. Finally, for time-type features, a time length can be calculated to obtain the fourth initial features, which are numerical continuous features. In this way, both the accuracy and the comprehensiveness of the initial features can be improved, avoiding the low accuracy caused by too few features. It should be added that steps S210 to S240 are performed in parallel with no order of precedence; the numbering is used only for convenience of description and has no other special meaning.
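The four per-category transformations can be sketched as below. This is a minimal illustration, not the patented implementation: the function names, the smoothing constant `eps`, and the WOE sign convention (logarithm of the good share over the bad share, with 0 = good and 1 = bad) are assumptions introduced here.

```python
import math
from datetime import date


def rfm(orders, today):
    """Continuous features: (recency in days, frequency, monetary amount).

    orders: list of (order_date, amount) tuples for one user.
    """
    last_order = max(d for d, _ in orders)
    return ((today - last_order).days, len(orders), sum(a for _, a in orders))


def woe_encode(values, labels, eps=0.5):
    """Discrete features: map each value to ln(share of good / share of bad).

    eps is an assumed smoothing constant that avoids log(0) for pure bins.
    """
    good_total = sum(1 for y in labels if y == 0)
    bad_total = sum(1 for y in labels if y == 1)
    table = {}
    for v in set(values):
        good = sum(1 for x, y in zip(values, labels) if x == v and y == 0)
        bad = sum(1 for x, y in zip(values, labels) if x == v and y == 1)
        good_share = (good + eps) / (good_total + eps)
        bad_share = (bad + eps) / (bad_total + eps)
        table[v] = math.log(good_share / bad_share)
    return table


def one_hot(value, categories):
    """Categorical features: indicator vector over a fixed category list."""
    return [1 if value == c else 0 for c in categories]


def duration_days(start, end):
    """Time-type features: a time length, which is a numerical continuous feature."""
    return (end - start).days


print(rfm([(date(2020, 1, 1), 10.0), (date(2020, 1, 11), 5.0)], date(2020, 1, 21)))
print(woe_encode(["a", "a", "b", "b"], [0, 1, 0, 0]))
print(one_hot("b", ["a", "b", "c"]))
print(duration_days(date(2020, 1, 1), date(2020, 1, 31)))
```

Because the four transformations do not depend on one another, they can run in parallel, which matches the note that steps S210 to S240 have no order of precedence.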

Further, in this exemplary embodiment, after the first initial feature, the second initial feature, the third initial feature, and the fourth initial feature are obtained, feature intersection may be performed on them to generate the initial feature matrix to be processed; the number of rows of the initial feature matrix to be processed is the number of initial features, and the number of columns is the dimension of each initial feature after feature intersection. The intersection may be, for example, sequential pairwise intersection or random pairwise intersection, or another intersection manner, which is not limited in this example. However, to keep the dimension controllable, the dimension must be held within a certain range during the intersection.
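A minimal sketch of the two intersection modes follows. Several choices here are assumptions: "sequential pairwise" is read as crossing each feature with the next one in order, the crossing operator is an element-wise product, and the dimension is capped by a hypothetical `max_dim` parameter. The source fixes none of these details.

```python
import random
from itertools import combinations


def _cross(features, a, b):
    # Element-wise product of two feature columns (assumed crossing operator).
    return [x * y for x, y in zip(features[a], features[b])]


def cross_sequential(features, max_dim):
    """Cross each feature with the next one in order, up to max_dim new columns."""
    names = list(features)
    crossed = {}
    for a, b in zip(names, names[1:]):
        if len(crossed) >= max_dim:
            break
        crossed[f"{a}_x_{b}"] = _cross(features, a, b)
    return crossed


def cross_random(features, max_dim, seed=0):
    """Cross randomly chosen feature pairs, up to max_dim new columns."""
    pairs = list(combinations(features, 2))
    random.Random(seed).shuffle(pairs)
    return {f"{a}_x_{b}": _cross(features, a, b) for a, b in pairs[:max_dim]}


toy_features = {"f1": [1, 2], "f2": [3, 4], "f3": [5, 6]}
print(cross_sequential(toy_features, max_dim=2))
print(cross_random(toy_features, max_dim=2))
```

Capping the number of crossed columns is one simple way to keep the dimension within a fixed range while crossing.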

In step S130, the initial to-be-processed matrix is input into a multi-granularity cascaded forest model, and the to-be-classified users are classified according to an output result of the multi-granularity cascaded forest model.

In this example embodiment, after the initial feature matrix to be processed is obtained, it may be input into the multi-granularity cascade forest model, and the user to be classified is classified according to the output result of the multi-granularity cascade forest model. Because the embodiment of the invention adopts the multi-granularity cascade forest model, it solves the prior-art problems that the input variables of a logistic regression model often require information value screening and monotonicity adjustment of the weights of evidence, so that the business cycle of iterating the model is long and user classification efficiency is low; the efficiency of user classification is thereby improved. It also solves the prior-art problem that the high dimensionality of the input variables of the logistic regression model produces a large amount of redundant data and a heavy system burden, and that all the information of those variables is difficult to use effectively, which reduces the accuracy of user classification; the accuracy of user classification is thereby improved.
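As a hedged illustration of how a class could be read from the model output: in the published description of multi-grained cascade forests (gcForest), each forest in the last cascade level emits a class-probability vector, the vectors are averaged, and the class with the largest averaged value is taken. The sketch below assumes that scheme and the label names "good" and "bad"; this passage of the source does not spell out the output handling.

```python
def classify_from_class_vectors(class_vectors, labels=("good", "bad")):
    """Average the class vectors of the last cascade level and take the argmax.

    class_vectors: one probability vector per forest, e.g. [p_good, p_bad].
    """
    n = len(class_vectors)
    averaged = [sum(v[i] for v in class_vectors) / n for i in range(len(labels))]
    best = max(range(len(labels)), key=lambda i: averaged[i])
    return labels[best], averaged


# Toy output of four forests for a single user to be classified.
last_level_output = [[0.7, 0.3], [0.6, 0.4], [0.4, 0.6], [0.9, 0.1]]
print(classify_from_class_vectors(last_level_output))
```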

Fig. 3 schematically illustrates another user classification method according to an exemplary embodiment of the present invention. Referring to fig. 3, the user classification method may further include steps S310 to S360. Wherein:

in step S310, a data sample is obtained, and a performance value of a target user included in the data sample is calculated according to the data sample.

In the present exemplary embodiment, as shown with reference to fig. 4, calculating a performance value of a target user included in the data sample from the data sample may include steps S410 to S420. Wherein:

in step S410, the account age and the roll rate of a target user included in the data sample are calculated according to the data sample; in step S420, the account age and the roll rate are analyzed to obtain an observation period of the target user, and a performance value of the target user is calculated according to the observation period.

The account age (months on book, MOB) can be used to analyze the asset quality of loans of the same product disbursed in different periods, and the roll rate can be used to analyze overdue users. Further, the performance values may include good performance and bad performance, which may be represented by 0 and 1, respectively.
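A roll rate is commonly tabulated as the share of accounts in a delinquency bucket that move to a worse bucket in the following month. The sketch below assumes integer buckets (0 = current, 1 = one month overdue, and so on) and a hypothetical `bad_bucket` threshold for labelling bad performance; neither the bucket scheme nor the threshold is given in the source.

```python
from collections import Counter


def roll_rates(transitions):
    """Share of accounts in each bucket that rolled to a worse bucket.

    transitions: list of (bucket_last_month, bucket_this_month) per account.
    """
    total = Counter(prev for prev, _ in transitions)
    worse = Counter(prev for prev, cur in transitions if cur > prev)
    return {bucket: worse[bucket] / total[bucket] for bucket in total}


def performance_value(worst_bucket_in_period, bad_bucket=3):
    """Return 1 (bad) if the account reached bad_bucket in the period, else 0 (good)."""
    return 1 if worst_bucket_in_period >= bad_bucket else 0


toy_transitions = [(0, 0), (0, 1), (1, 2), (1, 0), (1, 2), (2, 3)]
print(roll_rates(toy_transitions))
print(performance_value(3), performance_value(1))
```

In practice the bucket at which the roll rate becomes high enough that accounts rarely recover is one way to choose the bad threshold, while the account age curve indicates how long the observation period must be before performance stabilizes; both uses are stated here as common practice rather than as the source's procedure.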

In step S320, the performance value and the modeling features corresponding to the target user in the data sample are associated to obtain the user features of the target user.

In this example embodiment, sample performance (performance values) and modeling characteristics may be associated according to primary keys, which may generally include: the basic attributes of the crowd, e-commerce statistical information, e-commerce behavior information, address information, time information and the like are similar to the attribute characteristics; furthermore, in order to further improve the accuracy of the multi-granularity cascading forest model, abnormal values and missing values of the user characteristics of the associated target users can be processed. The specific processing procedure may include removing an abnormal value, supplementing a missing value with a default value, and the like, and may also be performed in other manners, which is not limited in this example.
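The association and cleaning described above can be sketched as follows. This is a minimal example under stated assumptions: the primary key is a user id, missing entries are marked with `None`, and an abnormal value is one outside a configured range. All names and ranges are hypothetical.

```python
def associate(performance, features, defaults, valid_ranges):
    """Join performance values with modeling features on a shared user id.

    performance:  {user_id: 0 or 1}
    features:     {user_id: {feature_name: value or None}}
    defaults:     {feature_name: value used to fill a missing entry}
    valid_ranges: {feature_name: (low, high)}; values outside are abnormal
    """
    rows = []
    for user_id, label in performance.items():
        raw = features.get(user_id)
        if raw is None:
            continue  # no modeling features for this primary key
        row = {"user_id": user_id, "label": label}
        abnormal = False
        for name, default in defaults.items():
            value = raw.get(name)
            if value is None:
                value = default  # supplement the missing value with a default
            low, high = valid_ranges[name]
            if not (low <= value <= high):
                abnormal = True
            row[name] = value
        if not abnormal:
            rows.append(row)  # rows containing abnormal values are removed
    return rows

performance = {"u1": 0, "u2": 1, "u3": 0}
features = {
    "u1": {"age": 30, "orders": None},
    "u2": {"age": -5, "orders": 3},
    "u4": {"age": 40, "orders": 1},
}
rows = associate(
    performance,
    features,
    defaults={"age": 0, "orders": 0},
    valid_ranges={"age": (0, 120), "orders": (0, 10000)},
)
```

Here `u1` is kept with its missing `orders` filled, `u2` is dropped for an abnormal age, and `u3` is skipped because it has no features to associate.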

In step S330, an initial feature matrix to be trained is generated according to the user features of the target user, and a training set and a test set are constructed according to the initial feature matrix to be trained.

In this exemplary embodiment, the specific method for generating the initial feature matrix to be trained is similar to step S120 and is not described here again. Meanwhile, the ratio of the training set to the test set may be controlled to be 7:3 or 8:2, or may be determined according to actual needs.
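A minimal sketch of such a split is shown below. The function name, the shuffling step, and the fixed seed are assumptions for illustration; the embodiment only specifies the ratio.

```python
import random

def split_dataset(rows, train_ratio=0.7, seed=0):
    """Shuffle the rows and split them into a training set and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for rows of the initial feature matrix
train, test = split_dataset(rows, train_ratio=0.7)
```

Passing `train_ratio=0.8` gives the 8:2 split mentioned above.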

In step S340, feature screening is performed on the initial feature matrix to be trained in the training set to obtain a target feature matrix to be trained, and machine learning is performed on the target feature matrix and target parameters based on a multi-granularity cascade forest algorithm to obtain an initial model.

In this exemplary embodiment, referring to fig. 5, the obtaining of the target feature matrix to be trained by performing feature screening on the initial feature matrix to be trained in the training set may include steps S510 to S540. Wherein:

in step S510, the feature missing rate and the number of abnormal values in the initial feature matrix to be trained in the training set are calculated.

In step S520, it is determined whether the feature missing rate is greater than a first preset threshold and/or the number of abnormal values is greater than a second preset threshold.

In step S530, when it is determined that the feature missing rate is greater than a first preset threshold, and/or the number of abnormal values is greater than a second preset threshold, the initial feature matrix is filtered.

In step S540, the initial feature matrix to be trained that remains after filtering is processed based on a preset feature processing algorithm to obtain the target feature matrix to be trained; the feature processing algorithm includes the Information Value (IV) algorithm, the Least Absolute Shrinkage and Selection Operator (LASSO), the eXtreme Gradient Boosting algorithm (XGBoost), the Light Gradient Boosting Machine (LightGBM), and the like.

In the exemplary embodiment schematically illustrated in fig. 5, on the one hand, filtering out features with a high missing rate or many abnormal values improves the accuracy of the feature matrix; on the other hand, processing the remaining initial feature matrix to be trained with a preset feature processing algorithm preserves the diversity of the features, which further improves the accuracy of the target feature matrix to be trained and, in turn, the accuracy of the multi-granularity cascading forest model.
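Steps S510 to S530 can be illustrated with the following sketch. The representation of the matrix as a list of rows, the use of `None` for missing entries, the range-based definition of an abnormal value, and both threshold values are assumptions made for the example.

```python
def screen_features(matrix, names, valid_ranges, max_missing_rate, max_abnormal):
    """Keep only the columns that pass the missing-rate and abnormal-count checks.

    matrix: list of rows; None marks a missing entry.
    """
    kept = []
    n_rows = len(matrix)
    for col, name in enumerate(names):
        values = [row[col] for row in matrix]
        # Step S510: feature missing rate and number of abnormal values.
        missing_rate = sum(v is None for v in values) / n_rows
        low, high = valid_ranges[name]
        abnormal = sum(v is not None and not (low <= v <= high) for v in values)
        # Steps S520 and S530: filter the feature if either threshold is exceeded.
        if missing_rate > max_missing_rate or abnormal > max_abnormal:
            continue
        kept.append(name)
    return kept

names = ["age", "income", "orders"]
matrix = [
    [25, None, 3],
    [40, None, 900],
    [31, 5000, 2],
    [52, None, 700],
]
valid_ranges = {"age": (0, 120), "income": (0, 10**7), "orders": (0, 100)}
kept = screen_features(matrix, names, valid_ranges, max_missing_rate=0.5, max_abnormal=1)
```

In this toy data `income` is dropped for its missing rate and `orders` for its abnormal values, leaving `age` for the subsequent processing of step S540.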

Further, in order to perform machine learning, corresponding target parameters need to be selected. Specifically, a parameter space may be searched based on a preset parameter search algorithm to obtain a plurality of target parameters; the parameter search algorithm includes grid search, random search, Bayesian search, and the like, and may of course also include other search algorithms, which is not limited in this example.
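The following sketch contrasts grid search and random search over a small parameter space. The parameter names (`n_estimators`, `max_layers`, `window`) and the scoring function are hypothetical stand-ins; in practice the score would be a validation metric of the trained model.

```python
import itertools
import random

def grid_search(space, score):
    """Evaluate every combination in the parameter space and return the best."""
    names = sorted(space)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, combo))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

def random_search(space, score, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations and return the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {n: rng.choice(v) for n, v in space.items()}
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Hypothetical parameter space; the names are illustrative only.
space = {"n_estimators": [50, 100, 200], "max_layers": [2, 4, 8], "window": [3, 5]}

def toy_score(p):
    # Stand-in for a validation metric such as KS or AUC.
    return -abs(p["n_estimators"] - 100) - abs(p["max_layers"] - 4) - abs(p["window"] - 5)

best, best_score = grid_search(space, toy_score)
sampled, sampled_score = random_search(space, toy_score, n_trials=5)
```

Grid search is exhaustive and therefore grows multiplicatively with each added parameter, while random search trades completeness for a fixed evaluation budget.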

In step S350, the initial model is tested by using the initial feature matrix to be trained in the test set to obtain a plurality of test results, and an evaluation index is calculated according to the test results and the performance values in the initial feature matrix to be trained in the test set. The evaluation index may include an AUC value and a KS value.

In step S360, when the evaluation index is between the first index value and the second index value, the initial model is used as the multi-granularity cascading forest model.

Specifically, the first index value may be, for example, 0.2, and the second index value may be, for example, 0.7; both may also be set according to actual needs, which is not limited in this example. The obtained multi-granularity cascading forest model may be as shown in fig. 6, for example, and may include a multi-granularity input module 610, a plurality of cascading structures 620 each composed of a sliding window and forests, an averaging or maximum module 630, and a prediction result output module 640.
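The evaluation and acceptance logic of steps S350 and S360 might look like the following sketch. It assumes that a higher model output indicates a higher likelihood of bad performance (label 1) and that the acceptance bounds are inclusive; neither detail is fixed by the text, and the function names are illustrative.

```python
def auc(labels, scores):
    """Probability that a random bad sample scores higher than a random good one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

def ks(labels, scores):
    """Maximum gap between the cumulative good and bad score distributions."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    best = 0.0
    for t in sorted(set(scores)):
        bad_cdf = sum(1 for y, s in zip(labels, scores) if y == 1 and s <= t) / n_bad
        good_cdf = sum(1 for y, s in zip(labels, scores) if y == 0 and s <= t) / n_good
        best = max(best, abs(good_cdf - bad_cdf))
    return best

def accept_model(index, first_index_value=0.2, second_index_value=0.7):
    """Accept the initial model when the index lies between the two bounds (assumed inclusive)."""
    return first_index_value <= index <= second_index_value

# Hypothetical test-set labels and model outputs.
labels = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
auc_value = auc(labels, scores)
ks_value = ks(labels, scores)
accepted = accept_model(ks_value)
```

The quadratic loops keep the definitions explicit; for large test sets a sort-based implementation would be preferable.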

Based on this, a complete technical solution of the exemplary embodiment of the present invention can be obtained as follows:

Step (1): a modeling sample is obtained, and the observation period and the performance of the sample are determined according to account age analysis and roll rate analysis of the sample. Let Y denote the good or bad performance of the sample, where good is 0 and bad is 1.

Step (2): data preprocessing. The sample performance and the modeling features are associated according to the primary key; the modeling features generally include basic attributes of the population, e-commerce statistical information, e-commerce behavior information, address information, time information, and the like. After association, abnormal values and missing values of the overall features are processed.

Step (3): feature engineering. First, the features are classified into continuous, discrete, category, and time type variables. Second, features are constructed with an RFM model for continuous variables, weight-of-evidence (WOE) conversion is applied to discrete variables, one-hot encoding is applied to category variables, and durations are calculated for time variables, so that numerical continuous features are formed. Finally, cross features are generated from these features to form the final feature matrix.

Step (4): feature screening. First, features are screened using statistical information, according to the missing rate, abnormal values, and other distribution properties of the features. Second, features are screened by a plurality of methods, including IV value, LASSO, XGBoost, LightGBM, correlation, and other screening methods, with the aim of retaining feature diversity.

Step (5): parameter selection. The gc-Forest parameters are selected using techniques such as grid search, random search, and Bayesian search.

Step (6): model training. On the basis of the screened features, a gc-Forest model is constructed with the gc-Forest algorithm using the optimal parameters obtained from parameter selection, completing the learning of the training samples.

Step (7): model prediction. The model trained in the training stage is saved as a model file; the model file is read offline to predict the test set data, and it also serves as the model file for online deployment.

Step (8): model evaluation. AUC and KS are the most common evaluation indicators for credit risk prediction models. The gc-Forest model outputs a value in the interval (0, 1), and the KS value can be calculated by combining this output with the actual good/bad distribution of the samples. Generally, a risk prediction model with a KS above 0.2 distinguishes good and bad samples well; if the KS is below 0.2, the model distinguishes poorly and is not suitable for production. However, a KS above 0.7 is also risky, since it may indicate data problems or model overfitting, in which case steps (1) to (7) need to be re-examined.

Step (9): after the gc-Forest model is obtained, the initial feature matrix to be processed of the user to be evaluated is input into the model, and the user is classified according to the output result. For example, if the output result is between 0.2 and 0.7, the user is a good user; otherwise, the user is a bad user.
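The feature construction of step (3) above can be sketched as follows. The example is illustrative only: the WOE sign convention (logarithm of the good share over the bad share), the additive smoothing that avoids division by zero, and the use of pairwise products as cross features are all assumptions, since the text does not fix these details. The RFM construction for continuous variables is omitted.

```python
import math
from datetime import date
from itertools import combinations

def one_hot(value, categories):
    """One-hot encode a category type variable."""
    return [1.0 if value == c else 0.0 for c in categories]

def woe_table(values, labels, smoothing=0.5):
    """Weight of evidence per level: ln(share of goods / share of bads)."""
    levels = sorted(set(values))
    good = {v: smoothing for v in levels}  # smoothing is an assumption
    bad = {v: smoothing for v in levels}
    for v, y in zip(values, labels):
        if y == 0:
            good[v] += 1
        else:
            bad[v] += 1
    total_good = sum(good.values())
    total_bad = sum(bad.values())
    return {v: math.log((good[v] / total_good) / (bad[v] / total_bad)) for v in levels}

def duration_days(start, end):
    """Turn a time type variable into a numerical duration."""
    return float((end - start).days)

def cross_features(row):
    """Append pairwise products of the numerical features to the row."""
    return row + [a * b for a, b in combinations(row, 2)]

# Hypothetical discrete variable with good (0) / bad (1) performance labels.
levels = ["A", "B", "A", "B", "A", "A"]
labels = [0, 1, 0, 1, 0, 1]
woe = woe_table(levels, labels)

row = one_hot("web", ["app", "web"]) + [
    woe["A"],
    duration_days(date(2020, 1, 1), date(2020, 1, 31)),
]
crossed = cross_features(row)
```

With four base features the crossed row has ten entries: the four originals followed by their six pairwise products.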

The user classification method provided by the exemplary embodiment of the present invention has at least the following advantages:

First, the multi-granularity cascading forest model is applied to a credit risk prediction scenario for the first time. The training process of the multi-granularity cascading forest model is efficient and scalable: its training time on a single PC is comparable to that of a deep neural network running on GPU facilities, and because the model is suited to parallel deployment, its efficiency advantage is even more pronounced;

Second, compared with a neural network, its hyperparameters are much easier to tune and it is more robust, achieving good results on data from different fields with almost the same parameters; in addition, compared with the traditional tree model XGBoost, the gc-Forest model uses a cascade structure that enables representation learning and end-to-end training without heavy manual investment in feature engineering; it can also operate with small samples;

Third, the method addresses the low utilization of high-dimensional features by the logistic regression scorecard model, the long business cycle of model iteration, and the limited applicability of neural network models in small-sample scenarios; meanwhile, with large samples the model performs similarly to a deep neural network model, and the interpretability of some features is improved.

The embodiment of the invention also provides a user classification device. Referring to fig. 7, the user classifying means may include a first feature generating module 710, a first feature processing module 720, and a user classifying module 730. Wherein:

the first feature generation module 710 may be configured to obtain historical user data of a user to be classified, and generate a user feature of the user to be classified according to an attribute feature in the historical user data;

the first feature processing module 720 may be configured to process the user features of the user to be classified according to the feature category to which the user features of the user to be classified belong to obtain a plurality of initial features to be processed, and perform feature crossing on each of the initial features to be processed to generate an initial feature matrix to be processed;

the user classification module 730 may be configured to input the initial to-be-processed matrix into a multi-granularity cascaded forest model, and classify a user to be classified according to an output result of the multi-granularity cascaded forest model.
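The cooperation of the three modules can be sketched as a simple pipeline. The class name, the callable signatures, and the placeholder lambdas are hypothetical; in particular, the `classify` callable merely stands in for a trained multi-granularity cascading forest model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class UserClassifyingDevice:
    generate_features: Callable[[Dict], Dict]        # first feature generation module 710
    process_features: Callable[[Dict], List[float]]  # first feature processing module 720
    classify: Callable[[List[float]], int]           # user classification module 730

    def run(self, historical_user_data: Dict) -> int:
        user_features = self.generate_features(historical_user_data)
        matrix = self.process_features(user_features)
        return self.classify(matrix)

device = UserClassifyingDevice(
    generate_features=lambda data: {"orders": data["orders"], "age": data["age"]},
    # The third entry is a cross feature of the first two.
    process_features=lambda f: [f["orders"], f["age"], f["orders"] * f["age"]],
    # Placeholder rule standing in for the trained model.
    classify=lambda m: 1 if m[2] > 100 else 0,
)
result = device.run({"orders": 2, "age": 30})
```

Keeping each module behind a callable mirrors the note later in the description that the division into modules is not mandatory: two modules can be merged, or one split, without changing the interface of `run`.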

In an exemplary embodiment of the present disclosure, the user classifying device further includes:

the performance value calculation module is used for acquiring a data sample and calculating the performance value of a target user included in the data sample according to the data sample;

the association module may be configured to associate the performance value and the modeling features corresponding to the target user in the data sample to obtain the user features of the target user;

the data set construction module can be used for generating an initial characteristic matrix to be trained according to the user characteristics of the target user and constructing a training set and a test set according to the initial characteristic matrix to be trained;

the machine learning module can be used for performing feature screening on the initial feature matrix to be trained in the training set to obtain a target feature matrix to be trained, and performing machine learning on the target feature matrix and target parameters based on a multi-granularity cascade forest algorithm to obtain an initial model;

the model testing module can be used for testing the initial model by using the initial characteristic matrix to be trained in the test set to obtain a plurality of test results, and calculating an evaluation index according to the test results and the performance values in the initial characteristic matrix to be trained in the test set;

and the model determining module can be used for taking the initial model as the multi-granularity cascading forest model when the evaluation index is between the first index value and the second index value.

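In an exemplary embodiment of the present disclosure, calculating a performance value of a target user included in the data sample from the data sample comprises: calculating the account age and the roll rate of the target user included in the data sample according to the data sample; and analyzing the account age and the roll rate to obtain an observation period of the target user, and calculating the performance value of the target user according to the observation period.

In an exemplary embodiment of the present disclosure, performing feature screening on the initial feature matrix to be trained in the training set to obtain the target feature matrix to be trained comprises: calculating the feature missing rate and the number of abnormal values in the initial feature matrix to be trained in the training set; determining whether the feature missing rate is greater than a first preset threshold and/or the number of abnormal values is greater than a second preset threshold; filtering the initial feature matrix when it is; and processing the remaining initial feature matrix to be trained based on a preset feature processing algorithm to obtain the target feature matrix to be trained.

In an exemplary embodiment of the present disclosure, the user classifying device further includes a target parameter searching module, which may be used to search a parameter space based on a preset parameter search algorithm to obtain a plurality of target parameters.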

The specific details of each module in the user classification device have been described in detail in the corresponding user classification method, and therefore are not described herein again.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by a plurality of modules or units.

Moreover, although the steps of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

In an exemplary embodiment of the present invention, there is also provided an electronic device capable of implementing the above method.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device 800 according to this embodiment of the invention is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.

As shown in fig. 8, the electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: at least one processing unit 810, at least one storage unit 820, a bus 830 connecting various system components (including the storage unit 820 and the processing unit 810), and a display unit 840.

The storage unit 820 stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 810 may perform step S110 as shown in fig. 1: acquiring historical user data of a user to be classified, and generating user features of the user to be classified according to attribute features in the historical user data; step S120: processing the user features of the user to be classified according to the feature category to which they belong to obtain a plurality of initial features to be processed, and performing feature crossing on each initial feature to be processed to generate an initial feature matrix to be processed; and step S130: inputting the initial feature matrix to be processed into a multi-granularity cascading forest model, and classifying the user to be classified according to an output result of the multi-granularity cascading forest model.

The storage unit 820 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.

The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 800 may also communicate with one or more external devices 900 (e.g., keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiment of the present invention.

In an exemplary embodiment of the present invention, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

A program product for implementing the above method according to an embodiment of the invention may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited in this regard; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

Claims

1. A method for classifying a user, comprising:

2. The method according to claim 1, wherein the feature classes include a plurality of a continuous type, a discrete type, a class type, and a time type;

3. The user classification method according to claim 2, wherein the performing feature intersection on each of the initial features to be processed to generate an initial feature matrix to be processed comprises:

4. The user classification method according to claim 1, further comprising:

5. The user classification method according to claim 4, wherein calculating, from the data sample, a performance value of a target user included in the data sample comprises:

6. The user classification method according to claim 4, wherein the feature screening of the initial feature matrix to be trained in the training set to obtain the target feature matrix to be trained comprises:

7. The user classification method according to claim 4, further comprising:

8. A user classifying apparatus, comprising:

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the user classification method of any one of claims 1 to 7.

10. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the user classification method of any of claims 1-7 via execution of the executable instructions.