WO2021129368A1 - Method and device for determining a customer type - Google Patents

Method and device for determining a customer type

Info

Publication number
WO2021129368A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
sample
training
customer
evaluation value
Prior art date
Application number
PCT/CN2020/134357
Other languages
English (en)
French (fr)
Inventor
赖�良
Original Assignee
深圳前海微众银行股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2021129368A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/01 Customer relationship services
    • G06Q30/015 Providing customer assistance, e.g. assisting a customer within a business location or via helpdesk
    • G06Q30/016 After-sales
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/067 Enterprise or organisation modelling

Definitions

  • the present invention relates to the field of financial technology (Fintech), and in particular to a method and device for determining a customer type.
  • the existing technical solution has the problem that the customer type to which the customer belongs cannot be accurately determined.
  • the present invention provides a method and device for determining a customer type to solve the problem that the customer type to which the customer belongs cannot be accurately determined.
  • an embodiment of the present invention provides a method for determining a customer type.
  • the method includes: obtaining attribute information of a customer; and inputting the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained in the following manner: for the model obtained in the nth training, determining through verification data whether the model is over-fitted; after determining that the model is over-fitted, obtaining the evaluation value of each sample feature used by the model in the nth training process; determining, according to the evaluation value of each sample feature, the sample features used in the (n+1)th training to obtain the (n+1)th trained model; and returning to the step of determining through the verification data whether the model is over-fitted, until the model is no longer over-fitted.
  • by inputting the customer's attribute information into the preset model, the customer type to which the customer belongs can be quickly determined, so that the customer is accurately positioned and later precision marketing is facilitated; further, the preset model is adjusted: when the model from the nth training is verified with verification data and found to be over-fitted, the evaluation value of each sample feature used by the model in the nth training process is obtained, and, according to those evaluation values, the sample features used in the (n+1)th training are determined and the (n+1)th trained model is obtained.
  • the sample features include a noise feature; determining, according to the evaluation value of each sample feature, the sample features used in the (n+1)th training includes: deleting the sample features whose evaluation value is lower than the evaluation value of the noise feature.
  • the importance of the characteristics of the sample to the trained model can be expressed in the form of evaluation values: the more important the characteristics of the sample, the higher the corresponding evaluation value.
  • since the noise feature is itself a meaningless feature, a sample feature whose evaluation value is lower than that of the noise feature is not sufficiently meaningful for training the model.
  • the sample features whose evaluation value is lower than the evaluation value of the noise feature can be deleted.
  • the evaluation value is determined at least according to the number of times the sample feature is used in the training process or the information gain when the sample feature is split; determining, according to the evaluation value of each sample feature, the sample features used in the (n+1)th training includes: sorting the evaluation values of the sample features; and if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, deleting the first sample feature, where the first sample feature and the second sample feature are adjacent in the ranking and k ≥ 3.
  • the importance of the characteristics of the sample to the trained model can be expressed in the form of evaluation values: the more important the characteristics of the sample, the higher the corresponding evaluation value.
  • the verification data includes multiple verification samples; determining through the verification data whether the model is over-fitted includes: inputting the multiple verification samples into the model to obtain multiple verification results; determining the precision rate and the recall rate of the model according to the multiple verification results and the true values of the multiple verification samples; and when the precision rate is greater than a first threshold and the recall rate is greater than a second threshold, determining that the model is over-fitted.
  • the multiple verification samples are input into the model to obtain the corresponding verification results; the verification results are then compared with the true values of the corresponding verification samples to determine the precision rate and the recall rate of the model; when the precision rate is greater than a first threshold and the recall rate is greater than a second threshold, the model is determined to be over-fitted. Verifying the model with the verification data in this way makes it possible to determine accurately whether the model is over-fitted.
  • before determining through the verification data whether the model is over-fitted, the method further includes: dividing the sample data into M sample sets, where every sample set includes the same positive samples and the negative samples included in each sample set are different; for each sample set, determining, according to the sample features of the nth training, the samples used in the nth training from the sample set, and training to obtain the sub-model corresponding to the sample set; and obtaining the nth trained model according to the M sub-models.
  • the sample data is divided into multiple sample sets, where every sample set includes the same positive samples but different negative samples; that is, the negative samples of each sample set are drawn without replacement. A sub-model is trained on each sample set, and the nth trained model is obtained from the multiple sub-models.
  • the nth trained model obtained in this way generalizes better and applies to a wider range of scenarios.
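The division into M sample sets described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_sample_sets`, the `seed` parameter and the round-robin slicing are assumptions, and how the M sub-models are combined is not shown because the text does not specify it.

```python
import random
from typing import List, Sequence, Tuple


def build_sample_sets(
    positives: Sequence, negatives: Sequence, m: int, seed: int = 0
) -> List[Tuple[list, list]]:
    """Return m (positives, negatives_chunk) pairs.

    Every set shares all positive samples; the negatives are shuffled once
    and dealt out without replacement, so no negative appears in two sets.
    """
    if m <= 0:
        raise ValueError("m must be positive")
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)  # hypothetical fixed seed for repeatability
    chunks = [shuffled[i::m] for i in range(m)]  # disjoint slices of the negatives
    return [(list(positives), chunk) for chunk in chunks]


sample_sets = build_sample_sets(["p1", "p2"], list(range(10)), m=3)
```

Each pair in `sample_sets` would then be used to train one sub-model.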
  • the sample data is collected in a first historical period; the verification data is collected in a second historical period; and the second historical period is later than the first historical period.
  • the data collected in the first historical period is used as sample data
  • the data collected in the second historical period is used as verification data.
  • the second historical period is later than the first historical period; that is, the full historical data covering the longer period is used as the sample data for determining the customer type, and the historical data closer to the current date is used as the verification data for the trained model, which makes the trained model more accurate and better suited to analyzing current data.
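A time-based split of this kind might look like the sketch below. The record layout, the `collected_on` field and the single `cutoff` date separating the two historical periods are illustrative assumptions; the text only requires that the verification period be later than the sample period.

```python
from datetime import date
from typing import Dict, List, Tuple


def split_by_period(records: List[Dict], cutoff: date) -> Tuple[List[Dict], List[Dict]]:
    """Records collected before `cutoff` become sample data; later ones become verification data."""
    sample = [r for r in records if r["collected_on"] < cutoff]
    verification = [r for r in records if r["collected_on"] >= cutoff]
    return sample, verification


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "collected_on": date(2019, 3, 1)},
    {"id": 2, "collected_on": date(2019, 11, 15)},
    {"id": 3, "collected_on": date(2020, 1, 10)},
]
sample, verification = split_by_period(records, cutoff=date(2020, 1, 1))
```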
  • determining the sample features used in the (n+1)th training to obtain the (n+1)th trained model includes: after determining that the model is over-fitted, adjusting the parameters of the model; and re-training the adjusted model for the (n+1)th time according to the sample features used in the (n+1)th training.
  • after the model is determined to be over-fitted, its parameters are adjusted, and the adjusted model is re-trained for the (n+1)th time according to the sample features used in the (n+1)th training, which makes the final model more accurate in data analysis and judgment.
  • an embodiment of the present invention provides an apparatus for determining a customer type.
  • the apparatus includes: an obtaining unit, configured to obtain attribute information of a customer; and a determining unit, configured to input the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained through a training unit: the training unit is configured to determine, for the model obtained in the nth training, whether the model is over-fitted through verification data; the training unit is further configured to, after determining that the model is over-fitted, obtain the evaluation value of each sample feature used by the model in the nth training process, determine according to the evaluation value of each sample feature the sample features used in the (n+1)th training to obtain the (n+1)th trained model, and return to the step of determining through the verification data whether the model is over-fitted, until the model is no longer over-fitted.
  • by inputting the customer's attribute information into the preset model, the customer type to which the customer belongs can be quickly determined, so that the customer is accurately positioned and later precision marketing is facilitated; further, the preset model is adjusted: when the model from the nth training is verified with verification data and found to be over-fitted, the evaluation value of each sample feature used by the model in the nth training process is obtained, and, according to those evaluation values, the sample features used in the (n+1)th training are determined and the (n+1)th trained model is obtained.
  • each sample feature includes a noise feature; the training unit is specifically configured to delete sample features whose evaluation value is lower than the evaluation value of the noise feature.
  • the importance of the characteristics of the sample to the trained model can be expressed in the form of evaluation values: the more important the characteristics of the sample, the higher the corresponding evaluation value.
  • since the noise feature is itself a meaningless feature, a sample feature whose evaluation value is lower than that of the noise feature is not sufficiently meaningful for training the model.
  • the sample features whose evaluation value is lower than the evaluation value of the noise feature can be deleted.
  • the evaluation value is determined at least according to the number of times the sample feature is used in the training process or the information gain when the sample feature is split; the training unit is specifically configured to sort the evaluation values of the sample features and, if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, delete the first sample feature, where the first sample feature and the second sample feature are adjacent in the ranking and k ≥ 3.
  • the importance of the characteristics of the sample to the trained model can be expressed in the form of evaluation values: the more important the characteristics of the sample, the higher the corresponding evaluation value.
  • the verification data includes multiple verification samples; the training unit is specifically configured to input the multiple verification samples into the model to obtain multiple verification results; determine the precision rate and the recall rate of the model according to the multiple verification results and the true values of the multiple verification samples; and, when the precision rate is greater than a first threshold and the recall rate is greater than a second threshold, determine that the model is over-fitted.
  • the multiple verification samples are input into the model to obtain the corresponding verification results; the verification results are then compared with the true values of the corresponding verification samples to determine the precision rate and the recall rate of the model; when the precision rate is greater than a first threshold and the recall rate is greater than a second threshold, the model is determined to be over-fitted. Verifying the model with the verification data in this way makes it possible to determine accurately whether the model is over-fitted.
  • before determining through the verification data whether the model is over-fitted, the training unit is further configured to divide the sample data into M sample sets, where every sample set includes the same positive samples and the negative samples included in each sample set are different; for each sample set, determine, according to the sample features of the nth training, the samples used in the nth training from the sample set, and train to obtain the sub-model corresponding to the sample set; and obtain the nth trained model according to the M sub-models.
  • the sample data is divided into multiple sample sets, where every sample set includes the same positive samples but different negative samples; that is, the negative samples of each sample set are drawn without replacement. A sub-model is trained on each sample set, and the nth trained model is obtained from the multiple sub-models.
  • the nth trained model obtained in this way generalizes better and applies to a wider range of scenarios.
  • the sample data is collected in a first historical period; the verification data is collected in a second historical period; and the second historical period is later than the first historical period.
  • the data collected in the first historical period is used as sample data
  • the data collected in the second historical period is used as verification data.
  • the second historical period is later than the first historical period; that is, the full historical data covering the longer period is used as the sample data for determining the customer type, and the historical data closer to the current date is used as the verification data for the trained model, which makes the trained model more accurate and better suited to analyzing current data.
  • the training unit is specifically configured to adjust the parameters of the model after determining that the model is over-fitted, and to re-train the adjusted model for the (n+1)th time according to the sample features used in the (n+1)th training.
  • after the model is determined to be over-fitted, its parameters are adjusted, and the adjusted model is re-trained for the (n+1)th time according to the sample features used in the (n+1)th training, which makes the final model more accurate in data analysis and judgment.
  • an embodiment of the present invention provides a computing device, including:
  • a memory, configured to store program instructions;
  • a processor, configured to call the program instructions stored in the memory and to execute, according to the obtained program, the method according to any implementation of the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium that stores computer-executable instructions, where the computer-executable instructions cause a computer to execute the method according to any implementation of the first aspect.
  • FIG. 1 shows a method for determining a customer type provided by an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of a confusion matrix provided by an embodiment of the present invention;
  • FIG. 3 shows an apparatus for determining a customer type provided by an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of a computing device provided by an embodiment of the present invention.
  • as shown in FIG. 1, a method for determining a customer type provided by an embodiment of the present invention includes:
  • Step 101: obtain the attribute information of the customer.
  • Step 102: input the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained in the following manner: for the model obtained in the nth training, determine through verification data whether the model is over-fitted; after determining that the model is over-fitted, obtain the evaluation value of each sample feature used by the model in the nth training process; determine, according to the evaluation value of each sample feature, the sample features used in the (n+1)th training to obtain the (n+1)th trained model; and return to the step of determining through the verification data whether the model is over-fitted, until the model is no longer over-fitted.
  • by inputting the customer's attribute information into the preset model, the customer type to which the customer belongs can be quickly determined, so that the customer is accurately positioned and later precision marketing is facilitated; further, the preset model is adjusted: when the model from the nth training is verified with verification data and found to be over-fitted, the evaluation value of each sample feature used by the model in the nth training process is obtained, and, according to those evaluation values, the sample features used in the (n+1)th training are determined and the (n+1)th trained model is obtained.
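The train, verify and re-select cycle of Step 102 can be outlined as a loop. This is only a sketch under stated assumptions: the four callables (`train`, `is_overfit`, `evaluate_features`, `select_features`) are hypothetical placeholders for steps the text describes in prose, and `max_rounds` is an added safety cap that the text does not mention.

```python
from typing import Callable, Dict, List, Tuple


def train_until_not_overfit(
    features: List[str],
    train: Callable[[List[str]], object],
    is_overfit: Callable[[object], bool],
    evaluate_features: Callable[[object, List[str]], Dict[str, float]],
    select_features: Callable[[Dict[str, float]], List[str]],
    max_rounds: int = 10,
) -> Tuple[object, List[str]]:
    """Repeat train -> verify -> re-select features until the model is not over-fitted."""
    model = train(features)  # n-th training
    for _ in range(max_rounds):
        if not is_overfit(model):  # checked with verification data
            return model, features
        scores = evaluate_features(model, features)  # evaluation value per sample feature
        features = select_features(scores)  # features for the (n+1)-th training
        model = train(features)  # (n+1)-th training
    # Cap reached: the last model is returned even though it was not re-verified.
    return model, features
```

The feature-selection callable is where rules such as the noise-feature comparison or the adjacent-ratio rule would plug in.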
  • the problem raised in the background is how to accurately locate the customer information of small, medium and micro business owners within a massive amount of information, so that advertisements can be targeted at those business owners to achieve precision marketing.
  • the solution provided by the embodiment of the present invention is as follows:
  • in step 101, the attribute information of the customer is obtained.
  • for example, if customer Xiaohong has only handled business A, then Xiaohong has left more personal information at the business A level of the application software and relatively little at the other business levels; if customer Xiaolan has handled business B, business C and business D, then Xiaolan has left more personal information at the business B, C and D levels of the application software and relatively little at the other business levels; the business handled by other customers may be more complicated and needs to be analyzed case by case.
  • the acquisition of the customer's attribute information can be achieved through a unified identification label, such as the customer's ID number.
  • the ID card number is recorded when the customer first registers in the application software; owing to the design of the software, the customer's ID card number can then be associated with all of the bank's businesses.
  • data collectors can obtain attribute information of all customers who have transactions with the bank from the software.
  • the acquired customer attribute information includes various types of customer tag information, such as population tags, device tags, geographic tags, channel tags, behavior tags, account tags and product tags; specifically, the tag information can be expressed as follows:
  • Device tags: device type, device brand, device model, brand launch date, operator name, etc.;
  • Geographic tags: registered province, registered city, province of the mobile phone number, city of the mobile phone number, active cities, etc.;
  • Channel tags: source business channel;
  • Behavior tags: login-related fields, activity-related fields, transaction-related fields, fields related to access to other platforms, etc.;
  • Account tags: account-opening-related fields, account-related fields, other account fields, etc.;
  • Product tags: tags related to historical purchases, tags related to changes in various products, etc.
  • the statistical indicators for customer attribute information can include the following:
  • Quantile statistics: minimum, Q1 (first quartile), median, Q3 (third quartile), maximum, range, interquartile range;
  • Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.
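Most of the indicators listed above can be computed with Python's standard `statistics` module, as in the sketch below. The function name `summarize` is hypothetical, the "inclusive" quartile method and the sample standard deviation are choices made here rather than by the text, and kurtosis and skewness are left out to keep the example short.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Compute quantile and descriptive statistics for one numeric attribute."""
    # Quartile convention ("inclusive") is an assumption; other conventions give other Q1/Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,  # interquartile range
        "mean": mean,
        "mode": statistics.mode(values),
        "std": std,
        "sum": sum(values),
        "mad": statistics.median([abs(v - median) for v in values]),  # median absolute deviation
        "cv": std / mean if mean else float("nan"),  # coefficient of variation
    }


summary = summarize([1, 2, 3, 4, 5])
```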
  • the customer's attribute information can be further analyzed and sorted, mainly including the following aspects:
  • abnormal features such as the maximum value, minimum value, and coefficient of variation
  • abnormal values are handled in the subsequent feature engineering stage, for example by deleting records with abnormal data such as age or date; for abnormal amounts, attributes such as the number of products held are used as a reference to adjust the amount back to a reasonable range.
  • the aforementioned customer attribute information includes a "channel tag", whose value can be set to 1, 2, 3, 4, and so on; for data collectors, the numbers 1, 2, 3 and 4 are just a concise representation of the real channels, but for the later prediction model they carry only numeric magnitude, that is, 4 > 3 > 2 > 1. So that the model can accurately identify the real information represented by the numbers 1/2/3/4, and to keep the data set extensible, the field can be widened into several fields through one-hot encoding, such as Channel 1, Channel 2, Channel 3 and Channel 4, where the value under each field is only 0 or 1; one-hot encoding the "channel tag" in this way helps ensure that the data set remains usable if the model is switched in the future (for example, to logistic regression).
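A minimal sketch of the one-hot widening described above, written with plain Python dictionaries; the function name `one_hot_channel` and the `channel_` column prefix are hypothetical.

```python
from typing import Dict, Iterable, List


def one_hot_channel(channel_values: Iterable[int], prefix: str = "channel_") -> List[Dict[str, int]]:
    """Widen a numeric channel code into one 0/1 field per distinct channel."""
    values = list(channel_values)
    categories = sorted(set(values))
    return [{f"{prefix}{c}": int(v == c) for c in categories} for v in values]


rows = one_hot_channel([1, 3, 2, 1])
```

Each resulting row has exactly one field set to 1, so the model no longer sees an ordering such as 4 > 3 > 2 > 1 between channels.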
  • the datetime handling is specifically: calculating the number of days between the time of the timestamp and the current date, that is, converting the timestamp into a number of days up to today.
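The conversion of a timestamp into a day count can be sketched as follows; the function name `days_since` is hypothetical, and `today` is passed in explicitly so the result is reproducible.

```python
from datetime import date, datetime


def days_since(timestamp: datetime, today: date) -> int:
    """Number of whole days between the timestamp's date and `today`."""
    return (today - timestamp.date()).days


elapsed = days_since(datetime(2020, 1, 1, 12, 0), date(2020, 1, 31))
```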
  • the customer's attribute information is obtained, but also the customer's attribute information is processed and organized.
  • the customer's attribute information can be input into the preset model to determine the customer type to which the customer belongs.
  • the attribute information of the customer is input into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained in the following manner: for the model obtained in the nth training, determining through verification data whether the model is over-fitted; after determining that the model is over-fitted, obtaining the evaluation value of each sample feature used by the model in the nth training process; determining, according to the evaluation value of each sample feature, the sample features used in the (n+1)th training to obtain the (n+1)th trained model; and returning to the step of determining through the verification data whether the model is over-fitted, until the model is no longer over-fitted.
  • the attribute information of the customer is the customer attribute information obtained after processing and sorting in the foregoing step 101.
  • the preset model is a decision tree model.
  • common decision tree models include the LightGBM model, GBDT (Gradient Boosting Decision Tree) and the XGBoost model. For example, the LightGBM model can be selected as the preset model in this embodiment: from customer Xiaohong's attribute information, the LightGBM model calculates that the probability of Xiaohong being a small, medium and micro enterprise owner customer is 67%; since the preset probability threshold for being such a customer is 80%, the LightGBM model determines that Xiaohong is not a small, medium and micro enterprise owner customer; similarly, the collected attribute information of customer Xiaolan is input into the LightGBM model, and through the model's calculation on that attribute information, the
  • the LightGBM model used in this embodiment is a preset model that has been determined to be free of over-fitting; that is, the customer type to which the customer belongs is determined by using a preset model without over-fitting.
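The thresholding step in the Xiaohong example reduces to a single comparison. In the sketch below the function name `is_sme_owner` is hypothetical, and treating a probability exactly equal to the threshold as a positive is an assumption, since the text does not cover that case.

```python
def is_sme_owner(probability: float, threshold: float = 0.80) -> bool:
    """Classify as a small, medium and micro enterprise owner when the probability reaches the threshold."""
    return probability >= threshold  # boundary handling is an assumption


xiaohong_is_owner = is_sme_owner(0.67)  # 67% is below the 80% threshold
```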
  • an embodiment of the present invention provides a way to generate a preset model, which is expressed as follows:
  • n can take the values 1, 2, 3, and so on. Assuming further that n is 2, the model obtained after the second training is verified, and the verification data is used to determine whether the model after the second training is over-fitted.
  • Table 1 is a table of the corresponding relationship between each sample feature and its corresponding evaluation value in the process of determining the customer type provided by the embodiment of the present invention.
  • the left side represents the evaluation value
  • the right side represents the sample characteristics.
  • the evaluation value is 10630
  • the corresponding evaluation values are 10336, 5876, 4633... etc. respectively.
  • a sample feature can be any of various attributes of the customer, such as age, gender, marital status or occupation; for example, one possibility is that sample feature A is the customer's age, sample feature B is the customer's gender, sample feature C is the customer's marital status, and sample feature D is the customer's occupation.
  • the evaluation value used in Table 1 is the Split score.
  • the selection of the evaluation value can also be other types of scores, such as the Gain score.
  • in the following, the Split score is used as the evaluation value for the description.
  • the sample features include a noise feature; determining, according to the evaluation value of each sample feature, the sample features used in the (n+1)th training includes: deleting the sample features whose evaluation value is lower than the evaluation value of the noise feature.
  • the importance of the characteristics of the sample to the trained model can be expressed in the form of an evaluation value: the more important the characteristics of the sample, the higher the corresponding evaluation value.
  • since the noise feature is itself a meaningless feature, a sample feature whose evaluation value is lower than that of the noise feature is not sufficiently meaningful for training the model.
  • the sample features whose evaluation value is lower than the evaluation value of the noise feature can be deleted.
  • for the "Noise" feature, the Split score is calculated to be 2206 points, while the Split scores of sample feature O, sample feature X, sample feature Y and sample feature Z are 1944 points, 1866 points, 1659 points and 1406 points respectively; the Split scores of these four features are all lower than that of the "Noise" feature, so these four features do not have sufficient training significance for the subsequent determination of customer types.
  • the sample features whose Split score is lower than the Split score of "Noise" can therefore be deleted; that is, sample feature O, sample feature X, sample feature Y and sample feature Z are not used for training in the subsequent third customer type determination process.
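The noise-feature comparison can be sketched with the Split scores quoted above. The function name `features_below_noise` is hypothetical, and the entry `kept_example` is an invented feature name (its score, 4633, appears in Table 1, but the feature it belongs to is not stated here) added only to show a feature that survives.

```python
from typing import Dict, List


def features_below_noise(scores: Dict[str, float], noise_name: str = "Noise") -> List[str]:
    """Return the features whose evaluation value is lower than that of the noise feature."""
    noise_score = scores[noise_name]
    return [name for name, score in scores.items() if name != noise_name and score < noise_score]


split_scores = {
    "Noise": 2206,
    "O": 1944,
    "X": 1866,
    "Y": 1659,
    "Z": 1406,
    "kept_example": 4633,  # hypothetical feature name, for illustration only
}
to_delete = features_below_noise(split_scores)
```

With these scores, `to_delete` contains O, X, Y and Z, matching the features excluded from the third training.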
  • the evaluation value is determined at least according to the number of times the sample feature is used in the training process or the information gain when the sample feature is split; determining, according to the evaluation value of each sample feature, the sample features used in the (n+1)th training includes: sorting the evaluation values of the sample features; and if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, deleting the first sample feature, where the first sample feature and the second sample feature are adjacent in the ranking and k ≥ 3.
  • the Split score is an evaluation value defined according to the number of times the sample feature is used in the training process
  • the Gain score is an evaluation value defined according to the information gain when the sample feature is split.
  • the sample feature B is deleted in the third customer type determination process.
  • the sample feature B is the first sample feature
  • the sample feature C is the second sample feature.
  • the Split score relationship between sample feature B and sample feature C in Table 1 does not meet the requirement for deleting sample feature B; likewise, the Split score relationship between each other sample feature and the next sample feature in the ranking does not meet the requirement for deletion; therefore, neither sample feature B nor any other sample feature needs to be deleted under this rule in the third customer type determination process.
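The adjacent-ratio rule can be sketched as below. Several points are assumptions rather than statements from the text: "k times" is read as "at least k times", k is set to 3, zero scores are skipped, and the mapping of the scores 10630, 10336, 5876 and 4633 to features A to D is assumed for illustration.

```python
from typing import Dict, List


def features_to_delete_by_ratio(scores: Dict[str, float], k: float = 3.0) -> List[str]:
    """Sort by evaluation value and flag a feature that is at least k times its next neighbour."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    to_delete = []
    for (first, first_score), (_, second_score) in zip(ranked, ranked[1:]):
        # Zero scores are skipped to avoid a degenerate comparison (an assumption).
        if second_score > 0 and first_score >= k * second_score:
            to_delete.append(first)
    return to_delete


# Assumed feature-to-score mapping, for illustration only.
table_scores = {"A": 10630, "B": 10336, "C": 5876, "D": 4633}
flagged = features_to_delete_by_ratio(table_scores)
```

No adjacent pair in `table_scores` reaches a ratio of 3, so `flagged` is empty, which agrees with the statement that no feature needs to be deleted under this rule.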
  • the verification data includes multiple verification samples; determining through the verification data whether the model is over-fitted includes: inputting the multiple verification samples into the model to obtain multiple verification results; determining the precision rate and the recall rate of the model according to the multiple verification results and the true values of the multiple verification samples; and when the precision rate is greater than a first threshold and the recall rate is greater than a second threshold, determining that the model is over-fitted.
  • the verification data used to verify the model after the second training includes the information of 10,000 new customers; inputting the information of these 10,000 new customers into the model after the second training yields the verification results produced by that model for these customers.
  • the number of customers is 150; that is, the model after the second training, having learned from and processed the characteristics of these 200 real small, medium and micro business owner customers, predicts that 150 of them are small, medium and micro business owner customers;
  • the number of customers is 50; that is, the model after the second training predicts that 50 of these 200 real small, medium and micro business owner customers are not small, medium and micro business owner customers;
  • the number of customers is 9,700; that is, the model after the second training predicts that 9,700 of the 9,800 customers who are not really small, medium and micro business owners are not small, medium and micro business owner customers;
  • the number of customers is 100; that is, the model after the second training predicts that 100 of the 9,800 customers who are not really small, medium and micro business owners are small, medium and micro business owner customers.
  • FIG. 2 is a schematic diagram of a confusion matrix provided by an embodiment of the present invention.
  • TP (True Positive) is the number of positive samples predicted as positive: the sample is truly 1 and the model also predicts 1; when the model is used to predict small, medium and micro business owners, this corresponds to case 1 above, so TP = 150;
  • FN (False Negative) is the number of positive samples predicted as negative: the sample is truly 1 and the model predicts 0; this corresponds to case 2 above, so FN = 50;
  • FP (False Positive) is the number of negative samples predicted as positive: the sample is truly 0 and the model predicts 1; this corresponds to the case in which 100 customers who are not real owners are predicted to be owners (case 4 above), so FP = 100;
  • TN (True Negative) is the number of negative samples predicted as negative: the sample is truly 0 and the model also predicts 0; this corresponds to the case in which 9,700 customers who are not real owners are predicted to be non-owners (case 3 above), so TN = 9700.
  • the number “1” denotes a real small, medium and micro business owner customer, and the number “0” denotes a customer who is not a real small, medium and micro business owner.
  • according to the confusion matrix, the precision and recall of the model can be determined.
  • the precision (Precision) can be calculated as: Precision = TP / (TP + FP)
  • the recall (Recall) can be calculated as: Recall = TP / (TP + FN)
  • for the 10,000 new customer records, Precision = 150 / (150 + 100) = 60% and Recall = 150 / (150 + 50) = 75%.
  • if 80% is set as the precision threshold and 80% as the recall threshold for judging over-fitting, then, because the Precision of 60% and the Recall of 75% above do not exceed these thresholds, the model after the second training is determined not to be over-fitting; here the first threshold is 80%, the second threshold is 80%, and the first threshold is equal to the second threshold.
  • before determining whether the model is over-fitting through the verification data, the method further includes: dividing the sample data into M sample sets, where every sample set includes the same positive samples and the negative samples included in the sample sets are all different; for each sample set, determining the samples used in the n-th training from the sample set according to the sample features of the n-th training, and obtaining the sub-model corresponding to the sample set through training; and obtaining the n-th trained model according to the M sub-models.
  • the collected sample data contains 20.5 million records, of which 500,000 are small, medium and micro business owners, denoted as the positive sample set, and 20 million are not small, medium and micro business owner customers, that is, ordinary customers, denoted as the negative sample set.
  • the collected sample data is divided into 4 sample sets, where the positive samples in every sample set are the same 500,000 small, medium and micro business owners and the negative samples in each sample set are different. That is, 4 parts are drawn from the negative sample set without replacement. For example, the negative sample set can be divided equally into 4 parts, so that the negative samples in each sample set are 5 million ordinary customers; each of the 4 sample sets then consists of 500,000 small, medium and micro business owners and 5 million ordinary customers.
  • the 4 sample sets are sample set a, sample set b, sample set c and sample set d; model_10 is trained on sample set a, model_20 on sample set b, model_30 on sample set c, and model_40 on sample set d; the initial parameters of model_10, model_20, model_30 and model_40 are the same, that is, these four are essentially the same model, and they are named model_10, model_20, model_30 and model_40 only to distinguish them.
  • any one of sample sets b, c and d continues training on model_11, which has already been trained on sample set a; for example, sample set b is put into model_11 to continue training. Then either of sample sets c and d continues training on model_12, which has already been trained on sample set b; for example, sample set c is put into model_12 to continue training. Finally, the remaining sample set d continues training on model_13, which was obtained by training on sample set c.
  • any one of sample sets a, c and d continues training on model_21, which has already been trained on sample set b; for example, sample set a is put into model_21 to continue training. Then either of sample sets c and d continues training on model_22, which has already been trained on sample set a; for example, sample set c is put into model_22 to continue training. Finally, the remaining sample set d continues training on model_23, which was obtained by training on sample set c.
  • any one of sample sets a, b and d continues training on model_31, which has already been trained on sample set c; for example, sample set a is put into model_31 to continue training. Then either of sample sets b and d continues training on model_32, which has already been trained on sample set a; for example, sample set b is put into model_32 to continue training. Finally, the remaining sample set d continues training on model_33, which was obtained by training on sample set b.
  • any one of sample sets a, b and c continues training on model_41, which has already been trained on sample set d; for example, sample set a is put into model_41 to continue training. Then either of sample sets b and c continues training on model_42, which has already been trained on sample set a; for example, sample set b is put into model_42 to continue training. Finally, the remaining sample set c continues training on model_43, which was obtained by training on sample set b.
  • each model calculates a probability value for each customer, and the results of all models are then combined by taking their average.
  • for example, Ms. Grace is a customer who is not a small, medium and micro business owner and has been assigned to sample set b; her probability of being a small, medium and micro business owner is calculated separately by model_20, by any one of the other models corresponding to model_10 (model_11, model_12 and model_13), by any one of the other models corresponding to model_30 (model_31, model_32 and model_33), and by any one of the other models corresponding to model_40 (model_41, model_42 and model_43), giving 30%, 35%, 40% and 25% respectively. The average of these 4 probability values is 32.5%; that is, the trained model estimates that Ms. Grace has a 32.5% probability of being a small, medium and micro business owner customer.
  • the sample data is collected in a first historical period; the verification data is collected in a second historical period; and the second historical period is later than the first historical period.
  • the data used for the first and second training of the model is called sample data; if the model after the second training is over-fitting, the data used for the third trained model, which is obtained by tuning the model after the second training, is called verification data.
  • determining the sample features used in the (n+1)-th training to obtain the (n+1)-th trained model includes: after determining that the model is over-fitting, adjusting the parameters of the model; and re-training the adjusted model for the (n+1)-th time according to the sample features used in the (n+1)-th training.
  • the decision tree model of lightgbm is used to predict small, medium and micro business owners; when the model after the second training is determined to be over-fitting, the parameters of the lightgbm decision tree model can be adjusted in order to obtain a better model in the third customer type determination process.
  • the maximum depth (max_depth) can be adjusted: when the model is confirmed to be over-fitting, the max_depth is adjusted to be smaller;
  • an embodiment of the present invention also provides a device for determining a customer type. As shown in FIG. 3, the device includes:
  • the obtaining unit 301 is configured to obtain the attribute information of the customer.
  • the determining unit 302 is configured to input the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained through the training unit 303:
  • the training unit 303 is configured to determine whether the model is over-fitting according to the verification data for the model obtained through the nth training;
  • the training unit 303 is configured to, after determining that the model is over-fitting, obtain the evaluation value of each sample feature used by the model in the n-th training; determine, according to the evaluation values of the sample features, the sample features used in the (n+1)-th training to obtain the (n+1)-th trained model; and return to the step of determining whether the model is over-fitting through the verification data, until the model is not over-fitting.
  • each sample feature includes a noise feature; the training unit 303 is specifically configured to delete sample features whose evaluation value is lower than the evaluation value of the noise feature.
  • the evaluation value is determined at least according to the number of times the sample feature is used in the training process or the information gain when the sample feature is split; the training unit 303 is specifically configured to sort the evaluation values of the sample features and, if the evaluation value of the first sample feature is k times the evaluation value of the second sample feature, delete the first sample feature; the first sample feature and the second sample feature are adjacent in the ranking, and k ≥ 3.
  • the verification data includes multiple verification samples; the training unit 303 is specifically configured to input the multiple verification samples into the model respectively to obtain multiple verification results; determine the precision and recall of the model according to the multiple verification results and the true values of the multiple verification samples; and, when the precision is greater than the first threshold and the recall is greater than the second threshold, determine that the model is over-fitting.
  • the training unit 303 is further configured to divide the sample data into M sample sets, where every sample set includes the same positive samples and the negative samples included in the sample sets are all different; for each sample set, determine the samples used in the n-th training from the sample set according to the sample features of the n-th training, and obtain the sub-model corresponding to the sample set through training; and obtain the n-th trained model according to the M sub-models.
  • the sample data is collected in a first historical period; the verification data is collected in a second historical period; and the second historical period is later than the first historical period.
  • the training unit 303 is specifically configured to adjust the parameters of the model after determining that the model is over-fitting, and to re-train the adjusted model for the (n+1)-th time according to the sample features used in the (n+1)-th training.
  • the embodiment of the present invention provides a computing device, and the computing device may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), etc.
  • the computing device may include a central processing unit (CPU), a memory, an input/output device, etc.
  • the input device may include a keyboard, a mouse, a touch screen, etc.
  • an output device may include a display device, such as a liquid crystal display (Liquid Crystal Display, LCD), Cathode Ray Tube (CRT), etc.
  • the memory may include read-only memory (ROM) and random access memory (RAM), and provides the processor with program instructions and data stored in the memory.
  • the memory may be used to store the program instructions of the method for determining the customer type.
  • the processor is configured to call the program instructions stored in the memory and to execute the method for determining the customer type according to the obtained program.
  • FIG. 4 is a schematic diagram of a computing device provided by an embodiment of this application; the computing device includes:
  • the processor 401 is configured to read the program in the memory 402 and execute the method for determining the customer type described above;
  • the processor 401 may be a central processing unit (central processing unit, CPU for short), a network processor (NP for short), or a combination of CPU and NP. It can also be a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (ASIC for short), a programmable logic device (PLD for short), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the memory 402 is configured to store one or more executable programs, and can store data used by the processor 401 when performing operations.
  • the program may include program code, and the program code includes computer operation instructions.
  • the memory 402 may include a volatile memory, such as random-access memory (RAM for short); the memory 402 may also include a non-volatile memory, such as flash memory, a hard disk drive (HDD for short) or a solid-state drive (SSD for short); the memory 402 may also include a combination of the foregoing types of memories.
  • the memory 402 stores the following elements, executable modules or data structures, or their subsets, or their extended sets:
  • Operating instructions: including various operating instructions, used to implement various operations.
  • Operating system: including various system programs, used to implement various basic services and process hardware-based tasks.
  • the bus 405 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 4 to represent, but it does not mean that there is only one bus or one type of bus.
  • the bus interface 404 may be a wired communication access port, a wireless bus interface or a combination thereof, where the wired bus interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the wireless bus interface may be a WLAN interface.
  • the embodiment of the present invention provides a computer-readable storage medium, the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions are used to make a computer execute a method for determining a customer type.
  • the embodiments of the present invention can be provided as a method or a computer program product. Therefore, the present invention may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may be in the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Educational Administration (AREA)
  • Medical Informatics (AREA)
  • Finance (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

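A method and device for determining a customer type: attribute information of a customer is obtained (101); the attribute information of the customer is input into a preset model to obtain the customer type to which the customer belongs (102). The preset model is adjusted as follows: for the model obtained from the n-th training, verification data is used to determine whether the model is over-fitting; after the model is determined to be over-fitting, the evaluation value of each sample feature used by the model in the n-th training is obtained; according to the evaluation values of the sample features, the sample features to be used in the (n+1)-th training are determined so as to obtain the (n+1)-th trained model, and the step of determining through the verification data whether the model is over-fitting is repeated until the model is not over-fitting. By inputting customer information into the preset model for processing, the solution quickly determines the customer type to which a customer belongs, which enables accurate customer targeting and facilitates precision marketing.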

Description

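Method and device for determining a customer type
Cross-reference to related applications
This application claims priority to the Chinese patent application No. 201911363412.0, filed with the Chinese Patent Office on December 26, 2019 and entitled "Method and device for determining a customer type", the entire contents of which are incorporated herein by reference.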
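Technical Field
The present invention relates to the field of financial technology (Fintech), and in particular to a method and device for determining a customer type.
Background
With the development of computer technology, more and more technologies (for example, distributed architecture, cloud computing, or big data) are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology; big data technology is no exception. However, the security and real-time requirements of the financial and payment industries also place higher demands on big data technology.
Consider the problem of determining, from a massive number of customers, which customers belong to one or more specific customer types, such as the customer type of small, medium and micro enterprise owners. At present, many Internet companies and data collection vendors process the operation behaviors collected from mobile users' terminals (such as mobile phones), for example downloading apps (applications), operations within apps, operation frequency, time, and location, into user labels, such as demographic attributes, social attributes, frequently visited locations or areas, app preferences (banking apps, wealth management apps), interest preferences (games, live streaming, music, reading, etc.), and active duration. The data department then computes experience-based weights for the labels and combines them into behavior labels or attribute labels that may match small, medium and micro enterprise owners.
The disadvantages of the above technique are as follows:
(1) It involves a large degree of subjective judgment that may not reflect the facts, and precision and recall are generally low;
(2) It depends heavily on the quality of the collected data, such as data completeness, timeliness, and authenticity;
(3) It is easily constrained by the fixed label set; a slight change in the number of labels may have a large impact on prediction performance.
In summary, the prior-art solutions cannot accurately determine the customer type to which a customer belongs.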
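Summary of the Invention
The present invention provides a method and device for determining a customer type, to solve the problem that the customer type to which a customer belongs cannot be accurately determined.
In a first aspect, an embodiment of the present invention provides a method for determining a customer type, the method including: obtaining attribute information of a customer; and inputting the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained as follows: for the model obtained from the n-th training, determining through verification data whether the model is over-fitting; after determining that the model is over-fitting, obtaining the evaluation value of each sample feature used by the model in the n-th training; and, according to the evaluation values of the sample features, determining the sample features to be used in the (n+1)-th training so as to obtain the (n+1)-th trained model, and returning to the step of determining through the verification data whether the model is over-fitting, until the model is not over-fitting.
Based on this solution, the obtained customer information is input into the preset model, and after processing by the preset model the customer type to which the customer belongs can be determined quickly, so that customers can be targeted accurately and later marketed to precisely. Further, the preset model is adjusted: when the n-th trained model is verified with the verification data and found to be over-fitting, the evaluation values of the sample features used by the model in the n-th training are obtained, and the sample features to be used in the (n+1)-th training are determined from these evaluation values to obtain the (n+1)-th trained model. In this way the trained model is tuned step by step, so that the final model analyzes and judges customer data more accurately.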
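As a possible implementation, the sample features include a noise feature; determining the sample features used in the (n+1)-th training according to the evaluation values of the sample features includes: deleting the sample features whose evaluation value is lower than the evaluation value of the noise feature.
Based on this solution, in the process of determining the customer type, the importance of a sample feature to the trained model can be expressed as an evaluation value: the more important the feature, the higher its evaluation value. Because the noise feature is by construction a meaningless feature, a sample feature whose evaluation value is lower than that of the noise feature contributes little to training the model. Therefore, in order to simplify the model effectively and speed up the determination of the customer type, the sample features whose evaluation value is lower than that of the noise feature can be deleted.
As a possible implementation, the evaluation value is determined at least according to the number of times a sample feature is used in the training process or the information gain when the sample feature is split; determining the sample features used in the (n+1)-th training according to the evaluation values of the sample features includes: sorting the evaluation values of the sample features; and, if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, deleting the first sample feature; wherein the first sample feature and the second sample feature are adjacent in the sorted order, and k ≥ 3.
Based on this solution, in the process of determining the customer type, the importance of a sample feature to the trained model can be expressed as an evaluation value: the more important the feature, the higher its evaluation value. The evaluation values of the sample features are sorted (for example, in descending order); when the evaluation value of the first sample feature is found to be k times that of the second sample feature, the model may have one-sidedly treated the first sample feature as overly important during training and thereby "cheated", so the first sample feature can be deleted. The first sample feature and the second sample feature are adjacent in the sorted order, and k ≥ 3.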
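As a possible implementation, the verification data includes multiple verification samples; determining through the verification data whether the model is over-fitting includes: inputting the multiple verification samples into the model respectively to obtain multiple verification results; determining the precision and recall of the model according to the multiple verification results and the true values of the multiple verification samples; and determining that the model is over-fitting when the precision is greater than a first threshold and the recall is greater than a second threshold.
Based on this solution, the verification data is input into the model, that is, the multiple verification samples are input into the model respectively to obtain their corresponding verification results; the verification results are then compared with the true values of the corresponding verification samples to determine the precision and recall of the model; when the precision is greater than the first threshold and the recall is greater than the second threshold, the model is determined to be over-fitting. By verifying the model, the data obtained from verification is used to judge accurately whether the model is over-fitting.
As a possible implementation, before determining through the verification data whether the model is over-fitting, the method further includes: dividing the sample data into M sample sets, wherein every sample set includes the same positive samples and the negative samples included in the sample sets are all different; for each sample set, determining, according to the sample features of the n-th training, the samples used in the n-th training from the sample set, and obtaining through training a sub-model corresponding to the sample set; and obtaining the n-th trained model according to the M sub-models.
Based on this solution, the sample data is divided into multiple sample sets, each containing the same positive samples but different negative samples, that is, the negative samples of each sample set are drawn without replacement; a sub-model corresponding to each sample set is obtained by training on that sample set, and the n-th trained model is obtained from the multiple sub-models. Obtaining the n-th trained model from multiple sub-models, while fully taking the sample features of each sample into account, makes the resulting model more general and applicable to a wider range of scenarios.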
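The division into M sample sets can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_into_sample_sets`, the `seed` parameter, and the toy identifiers are assumptions, and shuffling before partitioning is one of several ways to draw the negatives without replacement.

```python
import random

def split_into_sample_sets(positives, negatives, m, seed=0):
    """Build m sample sets: every set gets all positives, negatives are partitioned."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)  # hypothetical choice: shuffle, then cut
    # Strided slices never overlap, so each negative lands in exactly one set
    # (sampling without replacement).
    negative_parts = [shuffled[i::m] for i in range(m)]
    return [
        {"positives": list(positives), "negatives": part}
        for part in negative_parts
    ]

# Toy data standing in for the real customer records.
positives = [f"pos_{i}" for i in range(5)]
negatives = [f"neg_{i}" for i in range(20)]
sample_sets = split_into_sample_sets(positives, negatives, m=4)
```

Each of the four sets can then be used to train one sub-model; how the sub-models are combined into the n-th trained model is left to the embodiment.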
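As a possible implementation, the sample data is collected in a first historical period; the verification data is collected in a second historical period; and the second historical period is later than the first historical period.
Based on this solution, the data collected in the first historical period is used as the sample data and the data collected in the second historical period is used as the verification data, the second historical period being later than the first. In other words, the older full historical data serves as the sample data for determining the customer type, and the historical data closer to the current date serves as the verification data for the resulting model, which makes the trained model more accurate and better suited to analyzing current data.
As a possible implementation, determining the sample features used in the (n+1)-th training so as to obtain the (n+1)-th trained model includes: after determining that the model is over-fitting, adjusting the parameters of the model; and re-training the adjusted model for the (n+1)-th time according to the sample features used in the (n+1)-th training.
Based on this solution, after the n-th trained model is determined to be over-fitting, the parameters of the model are adjusted, and the adjusted model is re-trained for the (n+1)-th time according to the sample features used in the (n+1)-th training, so that the final model analyzes and judges data more accurately.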
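In a second aspect, an embodiment of the present invention provides a device for determining a customer type, the device including: an obtaining unit, configured to obtain attribute information of a customer; and a determining unit, configured to input the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained through a training unit: the training unit is configured to determine, for the model obtained from the n-th training, through verification data whether the model is over-fitting; and the training unit is configured to, after determining that the model is over-fitting, obtain the evaluation value of each sample feature used by the model in the n-th training; determine, according to the evaluation values of the sample features, the sample features to be used in the (n+1)-th training so as to obtain the (n+1)-th trained model; and return to the step of determining through the verification data whether the model is over-fitting, until the model is not over-fitting.
Based on this solution, the obtained customer information is input into the preset model, and after processing by the preset model the customer type to which the customer belongs can be determined quickly, so that customers can be targeted accurately and later marketed to precisely. Further, the preset model is adjusted: when the n-th trained model is verified with the verification data and found to be over-fitting, the evaluation values of the sample features used by the model in the n-th training are obtained, and the sample features to be used in the (n+1)-th training are determined from these evaluation values to obtain the (n+1)-th trained model. In this way the trained model is tuned step by step, so that the final model analyzes and judges customer data more accurately.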
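As a possible implementation, the sample features include a noise feature; the training unit is specifically configured to delete the sample features whose evaluation value is lower than the evaluation value of the noise feature.
Based on this solution, in the process of determining the customer type, the importance of a sample feature to the trained model can be expressed as an evaluation value: the more important the feature, the higher its evaluation value. Because the noise feature is by construction a meaningless feature, a sample feature whose evaluation value is lower than that of the noise feature contributes little to training the model. Therefore, in order to simplify the model effectively and speed up the determination of the customer type, the sample features whose evaluation value is lower than that of the noise feature can be deleted.
As a possible implementation, the evaluation value is determined at least according to the number of times a sample feature is used in the training process or the information gain when the sample feature is split; the training unit is specifically configured to sort the evaluation values of the sample features and, if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, delete the first sample feature; the first sample feature and the second sample feature are adjacent in the sorted order, and k ≥ 3.
Based on this solution, in the process of determining the customer type, the importance of a sample feature to the trained model can be expressed as an evaluation value: the more important the feature, the higher its evaluation value. The evaluation values of the sample features are sorted (for example, in descending order); when the evaluation value of the first sample feature is found to be k times that of the second sample feature, the model may have one-sidedly treated the first sample feature as overly important during training and thereby "cheated", so the first sample feature can be deleted. The first sample feature and the second sample feature are adjacent in the sorted order, and k ≥ 3.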
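As a possible implementation, the verification data includes multiple verification samples; the training unit is specifically configured to input the multiple verification samples into the model respectively to obtain multiple verification results; determine the precision and recall of the model according to the multiple verification results and the true values of the multiple verification samples; and determine that the model is over-fitting when the precision is greater than a first threshold and the recall is greater than a second threshold.
Based on this solution, the verification data is input into the model, that is, the multiple verification samples are input into the model respectively to obtain their corresponding verification results; the verification results are then compared with the true values of the corresponding verification samples to determine the precision and recall of the model; when the precision is greater than the first threshold and the recall is greater than the second threshold, the model is determined to be over-fitting. By verifying the model, the data obtained from verification is used to judge accurately whether the model is over-fitting.
As a possible implementation, before determining through the verification data whether the model is over-fitting, the training unit is further configured to divide the sample data into M sample sets, wherein every sample set includes the same positive samples and the negative samples included in the sample sets are all different; for each sample set, determine, according to the sample features of the n-th training, the samples used in the n-th training from the sample set, and obtain through training a sub-model corresponding to the sample set; and obtain the n-th trained model according to the M sub-models.
Based on this solution, the sample data is divided into multiple sample sets, each containing the same positive samples but different negative samples, that is, the negative samples of each sample set are drawn without replacement; a sub-model corresponding to each sample set is obtained by training on that sample set, and the n-th trained model is obtained from the multiple sub-models. Obtaining the n-th trained model from multiple sub-models, while fully taking the sample features of each sample into account, makes the resulting model more general and applicable to a wider range of scenarios.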
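As a possible implementation, the sample data is collected in a first historical period; the verification data is collected in a second historical period; and the second historical period is later than the first historical period.
Based on this solution, the data collected in the first historical period is used as the sample data and the data collected in the second historical period is used as the verification data, the second historical period being later than the first. In other words, the older full historical data serves as the sample data for determining the customer type, and the historical data closer to the current date serves as the verification data for the resulting model, which makes the trained model more accurate and better suited to analyzing current data.
As a possible implementation, the training unit is specifically configured to adjust the parameters of the model after determining that the model is over-fitting, and to re-train the adjusted model for the (n+1)-th time according to the sample features used in the (n+1)-th training.
Based on this solution, after the n-th trained model is determined to be over-fitting, the parameters of the model are adjusted, and the adjusted model is re-trained for the (n+1)-th time according to the sample features used in the (n+1)-th training, so that the final model analyzes and judges data more accurately.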
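In a third aspect, an embodiment of the present invention provides a computing device, including:
a memory, configured to store program instructions;
a processor, configured to call the program instructions stored in the memory and to execute, according to the obtained program, the method according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the method according to any implementation of the first aspect.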
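Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 shows a method for determining a customer type provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a confusion matrix provided by an embodiment of the present invention;
FIG. 3 shows a device for determining a customer type provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a computing device provided by an embodiment of the present invention.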
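Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.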
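As shown in FIG. 1, a method for determining a customer type provided by an embodiment of the present invention includes:
Step 101: obtain attribute information of a customer.
Step 102: input the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained as follows: for the model obtained from the n-th training, determine through verification data whether the model is over-fitting; after determining that the model is over-fitting, obtain the evaluation value of each sample feature used by the model in the n-th training; according to the evaluation values of the sample features, determine the sample features to be used in the (n+1)-th training so as to obtain the (n+1)-th trained model, and return to the step of determining through the verification data whether the model is over-fitting, until the model is not over-fitting.
Based on this solution, the obtained customer information is input into the preset model, and after processing by the preset model the customer type to which the customer belongs can be determined quickly, so that customers can be targeted accurately and later marketed to precisely. Further, the preset model is adjusted: when the n-th trained model is verified with the verification data and found to be over-fitting, the evaluation values of the sample features used by the model in the n-th training are obtained, and the sample features to be used in the (n+1)-th training are determined from these evaluation values to obtain the (n+1)-th trained model. In this way the trained model is tuned step by step, so that the final model analyzes and judges customer data more accurately.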
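Regarding the problem raised in the background, namely how to accurately locate the customer information of small, medium and micro enterprise owners within massive amounts of information, so that advertisements can be delivered to the located customers for precision marketing, the embodiments of the present invention provide the following solution:
In step 101 above, the attribute information of the customer is obtained.
Suppose a large bank has paid close attention to collecting customer data since its founding and has developed a dedicated application for this purpose, whose main use is that every customer who has transacted with the bank registers personal information, such as an ID card number, in the application.
Customers differ in the business they handle at the bank. For example, customer Xiaohong has only handled business A, so Xiaohong has left relatively more personal information in the business A layer of the application and relatively less in the other business layers; customer Xiaolan has handled business B, business C, and business D, so Xiaolan has left relatively more personal information in the business B, C, and D layers and relatively less in the others; other customers' situations may be more complex and need case-by-case analysis. Therefore, the attribute information of a customer can be obtained through a unified identifier, such as the customer's ID card number. A simple implementation is as follows: the ID card number is registered when the customer first registers in the application, and the application is designed so that the customer's ID card number is associated with all of the bank's businesses. Data collection personnel can thus obtain from the application the attribute information of all customers who have transacted with the bank. The obtained attribute information includes various types of label information, such as demographic labels, device labels, geographic labels, channel labels, behavior labels, account labels, and product labels. Specifically, the label information may be as follows: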
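Demographic labels: age, gender, marital status, occupation, home ownership, car ownership, whether the customer has children, etc.;
Device labels: device type, device brand, device model, brand launch date, carrier name, etc.;
Geographic labels: registered province, registered city, province of the mobile number, city of the mobile number, active city, etc.;
Channel labels: source business channel;
Behavior labels: login-related fields, activity-related fields, transaction-related fields, fields related to visits to other platforms, etc.;
Account labels: account-opening-related fields, account-movement-related fields, other account fields, etc.;
Product labels: labels for historically purchased products, labels for changes in various products, etc.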
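Further, after the attribute information of the customer is obtained, it needs to be summarized, analyzed, and organized, so that the attribute information later used for model prediction is correct and valid.
The statistical indicators for the customer attribute information may include the following:
Key points: type, unique values, missing values, skew, distribution, etc.;
Quantile statistics: minimum, Q1 (first quartile), median, Q3 (third quartile), maximum, range, interquartile range;
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.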
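Several of the indicators listed above can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the function name `summarize`, the use of `None` for missing values, and the sample `ages` list are assumptions, and kurtosis, skewness, and median absolute deviation are omitted for brevity.

```python
import statistics

def summarize(values):
    """Summarize one numeric field; None marks a missing value."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the three quartile cut points (Q1, median, Q3).
    q1, median, q3 = statistics.quantiles(present, n=4)
    mean = statistics.mean(present)
    stdev = statistics.stdev(present)
    return {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(present),
        "range": max(present) - min(present),
        "iqr": q3 - q1,
        "mean": mean,
        "mode": statistics.mode(present),
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation
    }

ages = [23, 35, 35, None, 41, 52, 29, 35, 60]  # toy field with one missing value
stats = summarize(ages)
```

A field whose `missing` count is high or whose `cv` is close to zero would be a candidate for the low-quality flag described in the analysis step.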
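After the statistics are computed, the customer attribute information can be further analyzed and organized, mainly in the following respects:
1. Fields with low data quality or value are flagged; if the model later performs poorly, they can be excluded before re-testing, to reduce their interference with the model. Low data quality or value can be understood as data that is highly sparse, has low variance, is severely skewed, and so on;
2. Features with abnormal indicators such as maximum, minimum, or coefficient of variation also need to be recorded, and their outliers handled in the subsequent feature engineering stage (for example, deleting abnormal ages or dates; for abnormal amounts, adjusting the amount back to a reasonable range with reference to attributes such as the number of products held);
3. For labels with low data quality but important business meaning (such as basic demographic attributes: age, gender, region; device attributes: brand, model, valuation, etc.), the other business layers of the application are consulted again in the hope of re-obtaining correct data.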
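After the customer attribute information has been obtained, summarized, analyzed, and organized, individual attributes can be processed further. Note that the customer attribute information here may also be called features, and individual attributes are individual features. Two examples of such further processing, that is, feature engineering, follow.
1. One-hot encoding of categorical labels
For example, the customer attribute information above includes a "channel label", whose value may be 1, 2, 3, 4, and so on. To data collection personnel, the numbers 1, 2, 3, 4 are merely shorthand for the real channels, but to the prediction model they differ only in magnitude, that is, 4 > 3 > 2 > 1. Therefore, so that the model can correctly recognize what the numbers 1, 2, 3, 4 really represent, and to keep the data set extensible, one-hot encoding can be used to widen the label into several fields, such as channel 1, channel 2, channel 3, channel 4, each holding only the value 0 or 1. One-hot encoding the "channel label" in this way also helps keep the data set compatible if the model is later switched (for example, to logistic regression).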
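The widening of the channel label into 0/1 fields can be sketched in plain Python. The helper name `one_hot`, the record layout, and the generated column names (`channel_1`, `channel_2`, ...) are assumptions made for illustration; a data-frame library would normally be used for this in practice.

```python
def one_hot(records, field):
    """Replace a categorical field with one 0/1 column per observed value."""
    categories = sorted({r[field] for r in records})
    encoded = []
    for r in records:
        # Copy every other field unchanged, then add the indicator columns.
        row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            row[f"{field}_{c}"] = 1 if r[field] == c else 0
        encoded.append(row)
    return encoded

customers = [
    {"id": "c1", "channel": 1},
    {"id": "c2", "channel": 3},
    {"id": "c3", "channel": 2},
]
encoded = one_hot(customers, "channel")
```

After encoding, the model no longer sees an ordering among channels, only which channel each customer came from.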
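2. Processing of date fields
A date field (such as 2019-12-25; a timestamp may be longer) is not easy for the model to interpret or compare, so it can be converted into a number that is easier for the model to use, the number of days up to today, ensuring that the date field contributes value to modeling. One possible implementation is as follows:
First, import the datetime package in the modeling code; then use the datetime class in the datetime package to convert the timestamp into a number of days to date, that is, compute the number of days between the time in the timestamp and the current date.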
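A minimal sketch of this conversion with the standard `datetime` module follows. The function name `days_since` and the optional `today` parameter (added so the result is reproducible) are assumptions; the input is assumed to be an ISO-formatted date or date-time string.

```python
from datetime import date, datetime

def days_since(timestamp, today=None):
    """Whole days between an ISO date/time string and `today`."""
    today = today or date.today()
    then = datetime.fromisoformat(timestamp).date()  # drop the time of day
    return (today - then).days

# Reference date pinned for reproducibility; omit `today` to use the current date.
print(days_since("2019-12-25", today=date(2020, 1, 1)))  # 7
```

Longer timestamps such as "2019-12-25 08:30:00" are handled the same way, since only the date part is kept.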
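Through the above steps, the customer attribute information has not only been obtained but also processed and organized; it can now be input into the preset model to determine the customer type to which the customer belongs.
In step 102 above, the attribute information of the customer is input into the preset model to obtain the customer type to which the customer belongs; wherein the preset model is obtained as follows: for the model obtained from the n-th training, determine through verification data whether the model is over-fitting; after determining that the model is over-fitting, obtain the evaluation value of each sample feature used by the model in the n-th training; according to the evaluation values of the sample features, determine the sample features to be used in the (n+1)-th training so as to obtain the (n+1)-th trained model, and return to the step of determining through the verification data whether the model is over-fitting, until the model is not over-fitting.
Note that in step 102 above, the attribute information of the customer is the customer attribute information obtained after the processing and organizing in step 101.
For example, the collected attribute information of customer Xiaohong is input into the preset model. The preset model may be assumed to be a decision tree model; common decision tree models include the lightgbm model, GBDT (Gradient Boosting Decision Tree), and the xgboost model, and in the embodiments of the present invention the lightgbm model may be chosen as the preset model. The lightgbm model computes from Xiaohong's attribute information that the probability of Xiaohong being a small, medium and micro enterprise owner customer is 67%; given a preset probability threshold of 80% for belonging to this customer type, the lightgbm model determines that Xiaohong is not a small, medium and micro enterprise owner customer. Similarly, the collected attribute information of customer Xiaolan is input into the lightgbm model, which computes that the probability of Xiaolan being a small, medium and micro enterprise owner customer is 92%; given the same 80% threshold, the lightgbm model determines that Xiaolan is a small, medium and micro enterprise owner customer.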
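The final thresholding step in this example can be sketched as below. The probabilities and the 80% threshold come from the example; the names `SME_THRESHOLD`, `customer_type`, and the label strings are hypothetical, and treating a probability exactly equal to the threshold as a match is an assumption the text does not settle.

```python
SME_THRESHOLD = 0.80  # preset probability threshold from the example

def customer_type(probability, threshold=SME_THRESHOLD):
    """Map a predicted probability to a customer type label."""
    # Assumption: a probability equal to the threshold counts as a match.
    return "sme_owner" if probability >= threshold else "other"

predicted = {"Xiaohong": 0.67, "Xiaolan": 0.92}  # model outputs from the example
labels = {name: customer_type(p) for name, p in predicted.items()}
```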
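Note that the lightgbm model used in this embodiment is a finalized preset model that exhibits no over-fitting; that is, the customer type to which a customer belongs is determined using a preset model that is not over-fitting.
Over-fitting often occurs when a model is used to simulate and compute on its inputs; for this reason, the embodiments of the present invention provide a way of generating the preset model, as follows:
The massive collected customer information is fed into an initial model for training. Suppose the n-th training has finished, where n may take values such as 1, 2, or 3; suppose further that n = 2 here. The model obtained after the 2nd training can then be verified, using verification data to determine whether it is over-fitting.
Suppose that, after the model obtained after the 2nd training is verified with the verification data, that model is determined to be over-fitting; the evaluation values of the sample features used in the 2nd training are then obtained. As an example, Table 1 shows the correspondence between the sample features and their evaluation values in a customer type determination process provided by an embodiment of the present invention.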
Table 1
Split Value (Split score)    Feature
10630                        A
10336                        B
5876                         C
4633                         D
4434                         E
3922                         F
3655                         M
2545                         N
2206                         Noise
1944                         O
1866                         X
1659                         Y
1406                         Z
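Referring to Table 1, the left column is the evaluation value and the right column is the sample feature. For example, the evaluation value of sample feature A is 10630; similarly, the evaluation values of sample features B, C, D, and so on are 10336, 5876, 4633, and so on.
The sample features may be various attributes of the customer, such as age, gender, marital status, and occupation; for example, one possibility is that sample feature A is the customer's age, sample feature B the customer's gender, sample feature C the customer's marital status, and sample feature D the customer's occupation.
The evaluation value used in Table 1 is the Split score; other kinds of score, such as the Gain score, can of course be chosen as the evaluation value. The embodiments of the present invention are described using the Split score as the evaluation value.
The evaluation values of the sample features in Table 1 are examined to decide which sample features to use when training the model for the 3rd time in order to tune it, and which features can be discarded in the customer type determination process. The 3rd trained model is then verified with the verification data: if it is not over-fitting, it is taken as the preset model; if it is still over-fitting, the process returns to the step of determining through the verification data whether the model is over-fitting, until the model is not over-fitting. That is, the features are selected again and a 4th model is trained and verified; if the 4th trained model is not over-fitting, it is taken as the preset model; if it is still over-fitting, verification and re-training continue in the same way until the model is not over-fitting.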
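The train, verify, and re-select cycle described above can be sketched as a small loop. Everything named here is hypothetical: `train`, `is_overfit`, `score_features`, and `select_features` stand in for the real training, verification, evaluation-value, and feature-selection steps, and `max_rounds` is a safety cap added for the sketch (the text itself simply repeats until the model is not over-fitting).

```python
def tune_until_not_overfit(features, train, is_overfit, score_features,
                           select_features, max_rounds=10):
    """Repeat train -> verify -> re-select features until no over-fitting."""
    model = train(features)                  # n-th training
    for _ in range(max_rounds):
        if not is_overfit(model):
            return model, features           # this model becomes the preset model
        scores = score_features(model)       # evaluation value of each feature
        features = select_features(scores)   # features for the (n+1)-th training
        model = train(features)
    raise RuntimeError("model still over-fitting after max_rounds")

# Toy stand-ins: a "model" is just its feature tuple, and it counts as
# over-fitting while it uses more than two features.
toy_scores = {"A": 5, "B": 4, "C": 3, "D": 1}
model, kept = tune_until_not_overfit(
    features=list(toy_scores),
    train=lambda feats: tuple(feats),
    is_overfit=lambda m: len(m) > 2,
    score_features=lambda m: {f: toy_scores[f] for f in m},
    select_features=lambda s: sorted(s, key=s.get, reverse=True)[:-1],
)
```

In the toy run the lowest-scoring feature is dropped each round, which is only a placeholder for the selection rules the embodiment describes.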
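As a possible implementation, the sample features include a noise feature; determining the sample features used in the (n+1)-th training according to the evaluation values of the sample features includes: deleting the sample features whose evaluation value is lower than the evaluation value of the noise feature.
In the process of determining the customer type, the importance of a sample feature to the trained model can be expressed as an evaluation value: the more important the feature, the higher its evaluation value. Because the noise feature is by construction a meaningless feature, a sample feature whose evaluation value is lower than that of the noise feature contributes little to training the model. Therefore, in order to simplify the model effectively and speed up the determination of the customer type, the sample features whose evaluation value is lower than that of the noise feature can be deleted.
Referring to Table 1, 13 sample features are listed. Of these, the 12 features A, B, C, D, E, F, M, N, O, X, Y, and Z are real customer features, and "Noise" is a meaningless feature used in the customer type determination process. The Split score computed for "Noise" is 2206, while the Split scores of sample features O, X, Y, and Z are 1944, 1866, 1659, and 1406 respectively, all lower than that of "Noise". Sample features O, X, Y, and Z are therefore considered to contribute little to training in the subsequent customer type determination, and, in order to simplify the model effectively and speed up the determination of the customer type, the sample features whose Split score is lower than that of "Noise" can be deleted; that is, in the subsequent 3rd round of the customer type determination process, sample features O, X, Y, and Z are not used for training.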
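Applied to the Split scores of Table 1, the noise-feature rule can be sketched as follows. The names `SPLIT_SCORES` and `drop_below_noise` are hypothetical, and leaving the noise feature itself out of the returned list is an assumption; the text does not say whether it is carried into the next round.

```python
# Split scores copied from Table 1.
SPLIT_SCORES = {
    "A": 10630, "B": 10336, "C": 5876, "D": 4633, "E": 4434, "F": 3922,
    "M": 3655, "N": 2545, "Noise": 2206, "O": 1944, "X": 1866, "Y": 1659,
    "Z": 1406,
}

def drop_below_noise(scores, noise_name="Noise"):
    """Keep the real features that score at least as high as the noise feature."""
    noise_score = scores[noise_name]
    return [f for f, s in scores.items() if f != noise_name and s >= noise_score]

kept = drop_below_noise(SPLIT_SCORES)
dropped = [f for f in SPLIT_SCORES if f not in kept and f != "Noise"]
```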
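As one possible implementation, the evaluation value is determined at least according to the number of times a sample feature is used during training or the information gain when a sample feature is split; determining, according to the evaluation values of the sample features, the sample features used in the (n+1)-th training includes: sorting the evaluation values of the sample features; if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, deleting the first sample feature; the first sample feature and the second sample feature are adjacent sample features in the sorted order, and k ≥ 3.
As in the foregoing example, the Split score is an evaluation value defined by the number of times a sample feature is used during training, and the Gain score is an evaluation value defined by the information gain when a sample feature is split.
This embodiment of the present invention is described using only the Split score; the Gain score can be handled in the same way as the Split score and is not described again here.
Referring to Table 1, the 12 features A, B, C, D, E, F, M, N, O, X, Y and Z together with the "Noise" feature are sorted in descending order of Split score, so the Split score decreases from the top of the table to the bottom.
For instance, if the Split score of sample feature B were 3 or more times the Split score of the next feature, sample feature C, then sample feature B would be deleted in the 3rd customer type determination process. Here sample feature B is the first sample feature and sample feature C is the second sample feature.
In Table 1 of this embodiment, however, the Split scores of sample features B and C do not satisfy the condition for deleting sample feature B; nor do the Split scores of any other sample feature and the feature immediately below it satisfy the deletion condition. Consequently, neither sample feature B nor any other sample feature needs to be deleted in the 3rd customer type determination process.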
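The adjacent-ratio rule can be checked against Table 1 with a short sketch. The function name `drop_dominant_features` is hypothetical; following the worked example, "k times" is read as "at least k times", with k = 3 as the default.

```python
from typing import Dict, List

# Split scores copied from Table 1.
SPLIT_SCORES: Dict[str, int] = {
    "A": 10630, "B": 10336, "C": 5876, "D": 4633, "E": 4434, "F": 3922,
    "M": 3655, "N": 2545, "Noise": 2206, "O": 1944, "X": 1866,
    "Y": 1659, "Z": 1406,
}


def drop_dominant_features(scores: Dict[str, float], k: float = 3.0) -> List[str]:
    """Return features whose score is at least k times the next-ranked score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    dropped = []
    # Compare each feature with the one ranked immediately below it.
    for (first, first_score), (_second, second_score) in zip(ranked, ranked[1:]):
        if second_score > 0 and first_score >= k * second_score:
            dropped.append(first)
    return dropped


table1_dropped = drop_dominant_features(SPLIT_SCORES)
```

No adjacent pair in Table 1 reaches a ratio of 3 (the largest is B to C, about 1.76), so nothing is deleted, which matches the conclusion in the text.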
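As one possible implementation, the validation data includes a plurality of validation samples; determining through the validation data whether the model is overfitted includes: inputting the plurality of validation samples into the model respectively to obtain a plurality of validation results; determining the precision and the recall of the model according to the plurality of validation results and the true values of the plurality of validation samples; and determining that the model is overfitted when the precision is greater than a first threshold and the recall is greater than a second threshold.
Suppose the validation data used to validate the model after the 2nd training includes 10,000 new customer records. These 10,000 records are input into the model after the 2nd training, which produces a validation result for each of them.
When the 10,000 new customer records are processed by the model after the 2nd training, the following cases may occur:
Case 1: the model's result for an actual MSME owner customer is correct; that is, an actual MSME owner customer is predicted to be an MSME owner customer.
Case 2: the model's result for an actual MSME owner customer is incorrect; that is, an actual MSME owner customer is predicted to be a non-MSME owner customer.
Case 3: the model's result for a customer who is not actually an MSME owner is correct; that is, a customer who is not an MSME owner is predicted to be a non-MSME owner customer.
Case 4: the model's result for a customer who is not actually an MSME owner is incorrect; that is, a customer who is not an MSME owner is predicted to be an MSME owner customer.
For example, suppose that 200 of the 10,000 new customer records are actual MSME owner customers and the remaining 9,800 are not. Comparing the validation results of the 10,000 records with their true values gives the following:
For case 1, the number of customers is 150; that is, of the 200 actual MSME owner customers, the model after the 2nd training predicts 150 to be MSME owner customers.
For case 2, the number of customers is 50; that is, of the 200 actual MSME owner customers, the model after the 2nd training predicts 50 to be non-MSME owner customers.
For case 3, the number of customers is 9,700; that is, of the 9,800 customers who are not MSME owners, the model after the 2nd training predicts 9,700 to be non-MSME owner customers.
For case 4, the number of customers is 100; that is, of the 9,800 customers who are not MSME owners, the model after the 2nd training predicts 100 to be MSME owner customers.
From the above data, the confusion matrix of the model after the 2nd training can be obtained. FIG. 2 is a schematic diagram of a confusion matrix provided by an embodiment of the present invention. Referring to FIG. 2: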
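TP (True Positive) is the number of positive samples predicted as positive, e.g. the true label is 1 and the model also predicts 1. When the model is used to predict MSME owners, this corresponds to case 1 above, so TP is 150.
FN (False Negative) is the number of positive samples predicted as negative, e.g. the true label is 1 and the model predicts 0. When the model is used to predict MSME owners, this corresponds to case 2 above, so FN is 50.
FP (False Positive) is the number of negative samples predicted as positive, e.g. the true label is 0 and the model predicts 1. When the model is used to predict MSME owners, this corresponds to case 4 above, so FP is 100.
TN (True Negative) is the number of negative samples predicted as negative, e.g. the true label is 0 and the model also predicts 0. When the model is used to predict MSME owners, this corresponds to case 3 above, so TN is 9,700.
Here the number "1" denotes an actual MSME owner customer and the number "0" denotes a customer who is not an MSME owner.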
From the confusion matrix, the precision and the recall of the model can be determined. The precision can be calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
The recall can be calculated as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
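For the above embodiment, the precision and the recall on the 10,000 new customer records can be calculated: Precision is 150 / (150 + 100) = 60%, and Recall is 150 / (150 + 50) = 75%.
If 50% is set as the precision threshold and 70% as the recall threshold for judging the model to be overfitted, then with a Precision of 60% and a Recall of 75% the model after the 2nd training is determined to be overfitted; here 50% is the first threshold and 70% is the second threshold.
If 80% is set as the precision threshold and 80% as the recall threshold for judging the model to be overfitted, then with a Precision of 60% and a Recall of 75% the model after the 2nd training is determined not to be overfitted; here 80% is the first threshold and 80% is the second threshold, so the two thresholds are equal.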
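The arithmetic of the worked example can be reproduced directly from the confusion-matrix counts. The function names `precision_recall` and `is_overfit` are hypothetical; the decision rule is the one stated in the text (overfitted when precision exceeds the first threshold and recall exceeds the second threshold).

```python
from typing import Tuple


def precision_recall(tp: int, fn: int, fp: int) -> Tuple[float, float]:
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def is_overfit(
    precision: float, recall: float, first_threshold: float, second_threshold: float
) -> bool:
    """Apply the rule from the text: both metrics must exceed their thresholds."""
    return precision > first_threshold and recall > second_threshold


# Counts from the example: TP = 150, FN = 50, FP = 100 (TN = 9700 is not needed).
precision, recall = precision_recall(tp=150, fn=50, fp=100)
overfit_loose = is_overfit(precision, recall, 0.50, 0.70)
overfit_strict = is_overfit(precision, recall, 0.80, 0.80)
```

With thresholds of 50% and 70% the rule reports overfitting; with 80% and 80% it does not, as in the two scenarios above.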
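As one possible implementation, before determining through the validation data whether the model is overfitted, the method further includes: dividing the sample data into M sample sets, where every set contains the same positive samples and the negative samples contained in each set are all different; for each sample set, determining, according to the sample features of the n-th training, the samples used in the n-th training from that sample set, and obtaining through training a sub-model corresponding to that sample set; and obtaining the model of the n-th training from the M sub-models.
Suppose 20.5 million sample records are collected, of which 500,000 are MSME owner customers, forming the positive sample set, and 20 million are not MSME owner customers, i.e. ordinary customers, forming the negative sample set.
Suppose the collected sample data is divided into 4 sample sets. The positive samples in every sample set are the same 500,000 MSME owner customers. The negative samples in each sample set are all different; that is, 4 portions are drawn from the negative sample set without replacement. For example, the negative sample set may be divided equally into 4 portions, so that the negative samples in each sample set are 5 million ordinary customers. Each of the resulting 4 sample sets thus contains 500,000 MSME owner customers and 5 million ordinary customers.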
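The partitioning scheme (shared positives, disjoint negatives drawn without replacement) can be sketched on a toy scale. The function name `split_sample_sets` and the fixed shuffle seed are illustrative assumptions; the sketch also assumes an equal split and silently leaves out any remainder when the negatives do not divide evenly by M.

```python
import random
from typing import List, Sequence


def split_sample_sets(
    positives: Sequence[str], negatives: Sequence[str], m: int, seed: int = 0
) -> List[List[str]]:
    """Build m sample sets: every set gets all positives and its own negatives."""
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)  # draw negatives without replacement
    share = len(shuffled) // m             # equal split; remainder is unused
    return [
        list(positives) + shuffled[i * share:(i + 1) * share]
        for i in range(m)
    ]


# Toy data standing in for 500,000 positives and 20 million negatives.
positives = ["p0", "p1"]
negatives = [f"n{i}" for i in range(8)]
sample_sets = split_sample_sets(positives, negatives, m=4)
```

Repeating the full positive set in every subset raises the share of positives in each subset well above their share in the overall data (500,000 of 5.5 million instead of 500,000 of 20.5 million in the example).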
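Let the 4 sample sets be sample set a, sample set b, sample set c and sample set d. The preconfigured model_10 is trained on sample set a, model_20 on sample set b, model_30 on sample set c, and model_40 on sample set d. The initial parameters of model_10, model_20, model_30 and model_40 are identical, i.e. the four are essentially the same model; they are named model_10, model_20, model_30 and model_40 only to distinguish them for ease of description.
Then, any one of sample sets b, c and d is used to continue training model_11, which has been obtained by training on sample set a; for example, sample set b is fed into model_11 for further training. Next, either of sample sets c and d is used to continue training model_12, which has been obtained by training on sample set b; for example, sample set c is fed into model_12 for further training. Finally, the remaining sample set d is used to continue training model_13, which has been obtained by training on sample set c.
Likewise, any one of sample sets a, c and d is used to continue training model_21, which has been obtained by training on sample set b; for example, sample set a is fed into model_21 for further training. Next, either of sample sets c and d is used to continue training model_22, which has been obtained by training on sample set a; for example, sample set c is fed into model_22 for further training. Finally, the remaining sample set d is used to continue training model_23, which has been obtained by training on sample set c.
Likewise, any one of sample sets a, b and d is used to continue training model_31, which has been obtained by training on sample set c; for example, sample set a is fed into model_31 for further training. Next, either of sample sets b and d is used to continue training model_32, which has been obtained by training on sample set a; for example, sample set b is fed into model_32 for further training. Finally, the remaining sample set d is used to continue training model_33, which has been obtained by training on sample set b.
Likewise, any one of sample sets a, b and c is used to continue training model_41, which has been obtained by training on sample set d; for example, sample set a is fed into model_41 for further training. Next, either of sample sets b and c is used to continue training model_42, which has been obtained by training on sample set a; for example, sample set b is fed into model_42 for further training. Finally, the remaining sample set c is used to continue training model_43, which has been obtained by training on sample set b.
After each step of customer type determination, a probability value is calculated for every customer, and the results of all the models are then averaged. Take Ms. Grace, a customer who is not an MSME owner and who has been assigned to sample set b. She is scored separately by model_20, by any one of the models derived from model_10 (model_11, model_12 and model_13), by any one of the models derived from model_30 (model_31, model_32 and model_33), and by any one of the models derived from model_40 (model_41, model_42 and model_43). The resulting probabilities of being predicted an MSME owner customer are 30%, 35%, 40% and 25%. Averaging these 4 probabilities gives 32.5%; that is, through the training of the models, Ms. Grace is considered to have a 32.5% likelihood of being an MSME owner customer.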
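The averaging step in the Grace example is a plain arithmetic mean over the per-model probabilities. A minimal sketch, with the hypothetical helper name `average_probability`:

```python
from statistics import mean
from typing import Sequence


def average_probability(probabilities: Sequence[float]) -> float:
    """Combine per-model probabilities by taking their arithmetic mean."""
    return mean(probabilities)


# Probabilities from the example, one per model chain.
grace_probabilities = [0.30, 0.35, 0.40, 0.25]
grace_score = average_probability(grace_probabilities)
```

The mean of the four values is 0.325, i.e. the 32.5% stated in the text.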
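As one possible implementation, the sample data is collected in a first historical period; the validation data is collected in a second historical period; and the second historical period is later than the first historical period.
For example, the data used for the 1st training and the 2nd training of the model described above is called sample data; assuming the model after the 2nd training is overfitted, the data used for the 3rd training, which is performed to tune the model after the 2nd training, is called validation data.
For example, if the current date is December 21, 2019, customer data dated October 31, 2019 or earlier may be used as the sample data, and customer data for the whole month from November 1, 2019 to November 30, 2019 may be used as the validation data. Here, October 31, 2019 and earlier is the first historical period, and the whole month from November 1, 2019 to November 30, 2019 is the second historical period.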
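The time-based split can be expressed with the dates from this example. The function name `assign_period` and the "unused" label for records after the validation window are illustrative assumptions; the text does not say what happens to data from December 2019.

```python
from datetime import date

SAMPLE_END = date(2019, 10, 31)        # end of the first historical period
VALIDATION_START = date(2019, 11, 1)   # start of the second historical period
VALIDATION_END = date(2019, 11, 30)    # end of the second historical period


def assign_period(record_date: date) -> str:
    """Classify a record by collection date into sample or validation data."""
    if record_date <= SAMPLE_END:
        return "sample"
    if VALIDATION_START <= record_date <= VALIDATION_END:
        return "validation"
    return "unused"  # assumption: later records are in neither set
```

For instance, `assign_period(date(2019, 11, 15))` falls in the validation window, while any date up to October 31, 2019 is treated as sample data.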
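As one possible implementation, determining the sample features used in the (n+1)-th training so as to obtain the model of the (n+1)-th training includes: after determining that the model is overfitted, adjusting the parameters of the model; and performing the (n+1)-th training again on the adjusted model according to the sample features used in the (n+1)-th training.
In this embodiment of the present invention, the lightgbm decision tree model is used to predict MSME owners. When the model after the 2nd training is determined to be overfitted, the parameters of the lightgbm model itself can be adjusted so that a better model is obtained in the 3rd customer type determination process.
The maximum depth (max_depth) can be adjusted: when the model is confirmed to be overfitted, max_depth is reduced.
The number of leaves (num_leaves) can be adjusted: because the lightgbm decision tree model grows trees leaf-wise, the number of leaves must be less than 2^max_depth (2 to the power of max_depth).
The minimum number of samples in a leaf (min_data_in_leaf) can be adjusted: the minimum number of samples in a leaf is increased.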
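The three adjustments above can be sketched as an update to a parameter dictionary. `max_depth`, `num_leaves` and `min_data_in_leaf` are LightGBM parameter names; the helper `tighten_params` and its step sizes (depth reduced by 1, leaf minimum doubled) are assumptions chosen for illustration, since the text gives only the direction of each change. The sketch assumes a positive `max_depth` and does not import LightGBM itself.

```python
from typing import Dict


def tighten_params(params: Dict[str, int]) -> Dict[str, int]:
    """Return a copy of the parameters adjusted to reduce overfitting.

    Assumptions (not from the source): max_depth is positive, it is lowered
    by one level, and min_data_in_leaf is doubled.
    """
    new_depth = max(1, params["max_depth"] - 1)
    # Keep num_leaves strictly below 2 ** max_depth, as required above.
    max_leaves = 2 ** new_depth - 1
    return {
        **params,
        "max_depth": new_depth,
        "num_leaves": min(params["num_leaves"], max_leaves),
        "min_data_in_leaf": params["min_data_in_leaf"] * 2,
    }


round2_params = {"max_depth": 7, "num_leaves": 100, "min_data_in_leaf": 20}
round3_params = tighten_params(round2_params)
```

The resulting dictionary could then be passed as the parameter set for the next training round.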
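Based on the same concept, an embodiment of the present invention further provides an apparatus for determining a customer type. As shown in FIG. 3, the apparatus includes:
an acquisition unit 301, configured to acquire attribute information of a customer;
a determination unit 302, configured to input the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs, where the preset model is obtained by a training unit 303:
the training unit 303 is configured to determine, for the model obtained from the n-th training, whether the model is overfitted through validation data;
the training unit 303 is configured to, after determining that the model is overfitted, obtain the evaluation values of the sample features used by the model in the n-th training; determine, according to the evaluation values of the sample features, the sample features used in the (n+1)-th training so as to obtain the model of the (n+1)-th training; and return to the step of determining through the validation data whether the model is overfitted, until the model is no longer overfitted.
Further, for the apparatus, the sample features include a noise feature; the training unit 303 is specifically configured to delete the sample features whose evaluation values are lower than the evaluation value of the noise feature.
Further, for the apparatus, the evaluation value is determined at least according to the number of times a sample feature is used during training or the information gain when a sample feature is split; the training unit 303 is specifically configured to sort the evaluation values of the sample features and, if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, delete the first sample feature; the first sample feature and the second sample feature are adjacent sample features in the sorted order, and k ≥ 3.
Further, for the apparatus, the validation data includes a plurality of validation samples; the training unit 303 is specifically configured to input the plurality of validation samples into the model respectively to obtain a plurality of validation results; determine the precision and the recall of the model according to the plurality of validation results and the true values of the plurality of validation samples; and determine that the model is overfitted when the precision is greater than a first threshold and the recall is greater than a second threshold.
Further, for the apparatus, before determining through the validation data whether the model is overfitted, the training unit 303 is further configured to divide the sample data into M sample sets, where every set contains the same positive samples and the negative samples contained in each set are all different; for each sample set, determine, according to the sample features of the n-th training, the samples used in the n-th training from that sample set, and obtain through training a sub-model corresponding to that sample set; and obtain the model of the n-th training from the M sub-models.
Further, for the apparatus, the sample data is collected in a first historical period; the validation data is collected in a second historical period; and the second historical period is later than the first historical period.
Further, for the apparatus, the training unit 303 is specifically configured to adjust the parameters of the model after determining that the model is overfitted, and perform the (n+1)-th training again on the adjusted model according to the sample features used in the (n+1)-th training.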
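An embodiment of the present invention provides a computing device, which may specifically be a desktop computer, a portable computer, a smartphone, a tablet computer, a personal digital assistant (PDA) or the like. The computing device may include a central processing unit (CPU), a memory, input/output devices and so on. The input devices may include a keyboard, a mouse, a touch screen and the like, and the output devices may include a display device such as a liquid crystal display (LCD) or a cathode ray tube (CRT).
The memory may include a read-only memory (ROM) and a random access memory (RAM), and provides the processor with the program instructions and data stored in the memory. In this embodiment of the present invention, the memory may be used to store the program instructions of the method for determining a customer type.
The processor is configured to call the program instructions stored in the memory and execute the method for determining a customer type according to the obtained program.
FIG. 4 is a schematic diagram of a computing device provided by an embodiment of the present application. The computing device includes:
a processor 401, a memory 402, a transceiver 403 and a bus interface 404, where the processor 401, the memory 402 and the transceiver 403 are connected through a bus 405.
The processor 401 is configured to read the program in the memory 402 and execute the above method for determining a customer type.
The processor 401 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP. It may also be a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
The memory 402 is configured to store one or more executable programs and may store data used by the processor 401 when performing operations.
Specifically, a program may include program code, and the program code includes computer operation instructions. The memory 402 may include a volatile memory, such as a random-access memory (RAM); the memory 402 may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 402 may also include a combination of the above types of memory.
The memory 402 stores the following elements, executable modules or data structures, or a subset or an extended set thereof:
Operation instructions: include various operation instructions for implementing various operations.
Operating system: includes various system programs for implementing various basic services and processing hardware-based tasks.
The bus 405 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus or the like. The bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is drawn in FIG. 4, but this does not mean that there is only one bus or one type of bus.
The bus interface 404 may be a wired bus interface, a wireless bus interface or a combination thereof. The wired bus interface may be, for example, an Ethernet interface, and the Ethernet interface may be an optical interface, an electrical interface or a combination thereof. The wireless bus interface may be a WLAN interface.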
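An embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the method for determining a customer type.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to cover them.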

Claims (10)

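  1. A method for determining a customer type, characterized by comprising:
    acquiring attribute information of a customer;
    inputting the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs, wherein the preset model is obtained in the following manner:
    for the model obtained from the n-th training, determining through validation data whether the model is overfitted;
    after determining that the model is overfitted, obtaining evaluation values of the sample features used by the model in the n-th training; determining, according to the evaluation values of the sample features, the sample features used in the (n+1)-th training so as to obtain the model of the (n+1)-th training; and returning to the step of determining through the validation data whether the model is overfitted, until the model is no longer overfitted.
  2. The method according to claim 1, characterized in that the sample features include a noise feature;
    determining, according to the evaluation values of the sample features, the sample features used in the (n+1)-th training comprises:
    deleting the sample features whose evaluation values are lower than the evaluation value of the noise feature.
  3. The method according to claim 1, characterized in that the evaluation value is determined at least according to the number of times a sample feature is used during training or the information gain when a sample feature is split;
    determining, according to the evaluation values of the sample features, the sample features used in the (n+1)-th training comprises:
    sorting the evaluation values of the sample features;
    if the evaluation value of a first sample feature is k times the evaluation value of a second sample feature, deleting the first sample feature, wherein the first sample feature and the second sample feature are adjacent sample features in the sorted order, and k ≥ 3.
  4. The method according to claim 1, characterized in that the validation data includes a plurality of validation samples;
    determining through the validation data whether the model is overfitted comprises:
    inputting the plurality of validation samples into the model respectively to obtain a plurality of validation results;
    determining the precision and the recall of the model according to the plurality of validation results and the true values of the plurality of validation samples;
    determining that the model is overfitted when the precision is greater than a first threshold and the recall is greater than a second threshold.
  5. The method according to claim 1, characterized in that,
    before determining through the validation data whether the model is overfitted, the method further comprises:
    dividing sample data into M sample sets, wherein every set contains the same positive samples and the negative samples contained in each set are all different;
    for each sample set, determining, according to the sample features of the n-th training, the samples used in the n-th training from the sample set, and obtaining through training a sub-model corresponding to the sample set;
    obtaining the model of the n-th training from the M sub-models.
  6. The method according to claim 5, characterized in that the sample data is collected in a first historical period; the validation data is collected in a second historical period; and the second historical period is later than the first historical period.
  7. The method according to any one of claims 1-6, characterized in that,
    determining the sample features used in the (n+1)-th training so as to obtain the model of the (n+1)-th training comprises:
    after determining that the model is overfitted, adjusting the parameters of the model;
    performing the (n+1)-th training again on the adjusted model according to the sample features used in the (n+1)-th training.
  8. An apparatus for determining a customer type, characterized by comprising:
    an acquisition unit, configured to acquire attribute information of a customer;
    a determination unit, configured to input the attribute information of the customer into a preset model to obtain the customer type to which the customer belongs, wherein the preset model is obtained by a training unit:
    the training unit is configured to determine, for the model obtained from the n-th training, whether the model is overfitted through validation data;
    the training unit is configured to, after determining that the model is overfitted, obtain evaluation values of the sample features used by the model in the n-th training; determine, according to the evaluation values of the sample features, the sample features used in the (n+1)-th training so as to obtain the model of the (n+1)-th training; and return to the step of determining through the validation data whether the model is overfitted, until the model is no longer overfitted.
  9. A computing device, characterized by comprising:
    a memory, configured to store program instructions;
    a processor, configured to call the program instructions stored in the memory and execute the method according to any one of claims 1-7 according to the obtained program.
  10. A computer-readable storage medium, characterized in that the storage medium stores computer-executable instructions, the computer-executable instructions being used to cause a computer to execute the method according to any one of claims 1-7.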
PCT/CN2020/134357 2019-12-26 2020-12-07 Method and apparatus for determining customer type WO2021129368A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911363412.0A CN111160929B (zh) 2019-12-26 2019-12-26 Method and apparatus for determining customer type
CN201911363412.0 2019-12-26

Publications (1)

Publication Number Publication Date
WO2021129368A1 true WO2021129368A1 (zh) 2021-07-01

Family

ID=70558060

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134357 WO2021129368A1 (zh) 2019-12-26 2020-12-07 Method and apparatus for determining customer type

Country Status (2)

Country Link
CN (1) CN111160929B (zh)
WO (1) WO2021129368A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160929B (zh) * 2019-12-26 2024-02-09 深圳前海微众银行股份有限公司 一种客户类型的确定方法及装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108320171A (zh) * 2017-01-17 2018-07-24 北京京东尚科信息技术有限公司 热销商品预测方法、系统及装置
CN108492200A (zh) * 2018-02-07 2018-09-04 中国科学院信息工程研究所 一种基于卷积神经网络的用户属性推断方法和装置
CN108829763A (zh) * 2018-05-28 2018-11-16 电子科技大学 一种基于深度神经网络的影评网站用户的属性预测方法
CN109684933A (zh) * 2018-11-30 2019-04-26 广州大学 一种前方行人窜出马路的预警方法
CN110060068A (zh) * 2019-02-14 2019-07-26 阿里巴巴集团控股有限公司 商户评估方法、装置、电子设备及可读存储介质
CN110414580A (zh) * 2019-07-19 2019-11-05 东南大学 基于随机森林算法的钢筋混凝土深梁承载力评估方法
CN111160929A (zh) * 2019-12-26 2020-05-15 深圳前海微众银行股份有限公司 一种客户类型的确定方法及装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108337316A (zh) * 2018-02-08 2018-07-27 平安科技(深圳)有限公司 信息推送方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Understanding Cross-Validation and Overfitting", CNBLOGS.COM, 6 August 2018 (2018-08-06), XP055824077, Retrieved from the Internet <URL:https://www.cnblogs.com/solong1989/p/9415606.html> *

Also Published As

Publication number Publication date
CN111160929B (zh) 2024-02-09
CN111160929A (zh) 2020-05-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20905267

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 20905267

Country of ref document: EP

Kind code of ref document: A1