CN113570469B

CN113570469B - Intelligent vehicle change prediction method for vehicle insurance user

Info

Publication number: CN113570469B
Application number: CN202110851738.9A
Authority: CN
Inventors: 邱卫东; 黄征; 崔海名; 来春蕾; 代德发; 鲁静文; 唐鹏; 徐源; 李昕朋; 陆尔东; 徐春雷
Original assignee: Shanghai Jiaotong University; China Pacific Insurance Group Co Ltd CPIC
Current assignee: Shanghai Jiaotong University; China Pacific Insurance Group Co Ltd CPIC
Priority date: 2021-07-27
Filing date: 2021-07-27
Publication date: 2024-05-28
Anticipated expiration: 2041-07-27
Also published as: CN113570469A

Abstract

An intelligent car change prediction system and method for car insurance users. The system comprises a data processing module, an offline training module and an online prediction module. The data processing module performs data screening and data labeling on user insurance policy information and outputs whether each user changed vehicles and the vehicle model changed to. The offline training module trains machine learning models on the user policy data and the label information and outputs a prediction model. The online prediction module uses new user policy information and the prediction model to predict whether the user will change vehicles and whether the user will change to a specified vehicle model, and outputs the result. Whether a user changed vehicles, and the vehicle model changed to, is labeled according to whether the insured vehicle is the same in the policies of consecutive years in the historical user policy data; relevant user feature sets are then screened to train machine learning and deep learning models, which accurately predict whether a user will change vehicles and whether the user will change to a specified vehicle model.

Description

Intelligent vehicle change prediction method for vehicle insurance user

Technical Field

The invention relates to a technology in the field of neural network application, in particular to a vehicle change prediction method based on machine learning for a vehicle insurance user.

Background

A survey of the prior art shows that the industry has made some progress in the field of precision marketing. User profiling is the most commonly used technique in precision marketing: it uses modern computer technology to collect and analyze user information, classifies and screens user characteristics through techniques such as machine learning and deep learning, builds user profiles, and supports functions such as mining potential user value, segmenting users by value and managing users. Based on the user profile, enterprises pursue their business objectives and profit growth through personalized marketing strategies.

Disclosure of Invention

Aiming at the defects existing in the prior art, the invention provides an intelligent car changing prediction system and method for car insurance users.

The invention is realized by the following technical scheme:

The invention relates to an intelligent vehicle change prediction method for car insurance users. According to whether the insured vehicle is the same in the policies of consecutive years in the historical user car insurance policy data, the method labels whether a user changed vehicles and the vehicle model changed to, screens relevant user feature sets to train machine learning and deep learning models, and thereby accurately predicts whether a user will change vehicles and whether the user will change to a specified vehicle model.

The invention also relates to a machine-learning-based vehicle change prediction system for car insurance users that implements the method. The system comprises a data processing module, an offline training module and an online prediction module, wherein: the data processing module performs data screening and data labeling on user insurance policy information and outputs whether the user changed vehicles and the vehicle model changed to; the offline training module trains machine learning models on the user policy data and the label information and outputs a prediction model; and the online prediction module uses new user policy information and the prediction model to predict whether the user will change vehicles and whether the user will change to a specified vehicle model, and outputs the result.

The data processing module comprises a data screening unit and a data labeling unit, wherein: the data screening unit screens valid samples from the user policy data, cleans the data according to the policy insurance date and the applicant's ID number field, obtains the policy data of the same user in different years, and extracts relevant features such as insured user information, insured vehicle information and insurance information from the policy data; the data labeling unit, for the screened policy data of the same user in different years, judges whether the vehicles insured in different years are the same according to whether the VINs of the insured vehicles in the policies match, and thereby labels whether the user changed vehicles and the vehicle model changed to.

The offline training module comprises a feature engineering unit and a model training unit, wherein: the feature engineering unit cleans, organizes and standardizes the relevant features extracted by the data processing module, converts the policy information of a single user into a set of feature values by encoding string data numerically, and screens important features with XGBoost; the model training unit divides the data into a training set and a test set, trains machine learning models on the training set, evaluates them on the test set, saves the best-performing model and provides it to the online prediction module for prediction.

The online prediction module comprises a feature extraction unit and a vehicle change prediction unit, wherein: the feature extraction unit takes the imported policy information of the user to be predicted as the initial feature input for online prediction, extracts and screens the relevant fields, and standardizes them by the method used by the feature engineering unit in the offline training module; the vehicle change prediction unit inputs the processed features into the corresponding trained model, and the model outputs the vehicle change prediction result.

Technical effects

Overall, the invention solves the problem of predicting, from a user's car insurance policy, whether the user will change vehicles. Compared with the prior art, the method predicts both the vehicle change and the changed-to vehicle model from car insurance policy data alone; it does not require a large amount of personal user information and is therefore more practical. The trained models can predict whether any insured user will change vehicles, and the method can be extended to predicting changes to various vehicle models, which makes it highly flexible.

Drawings

FIG. 1 is a block diagram of a system according to the present invention.

Detailed Description

As shown in FIG. 1, the intelligent car change prediction system for car insurance users of this embodiment comprises a data processing module, an offline training module and an online prediction module, wherein: the data processing module performs data screening and data labeling on user insurance policy information and outputs whether the user changed vehicles and the vehicle model changed to; the offline training module trains machine learning models on the user policy data and the label information and outputs a prediction model; and the online prediction module uses new user policy information and the prediction model to predict whether the user will change vehicles and whether the user will change to a specified vehicle model, and outputs the result.

The data processing module comprises a data screening unit and a data labeling unit, wherein: the data screening unit collects and screens the data, and the data labeling unit uses the applicant's ID number field to find, in the screened data, the same user's policy data for the following year and labels the data.

The data collection refers to: user policy data provided by an insurance company is collected, data formats of all fields are standardized, 50 relevant fields such as user information, vehicle information and insurance information in the policy data are extracted as features, and a user policy database is established.

Data screening means: using the policy insurance date and the applicant's ID number field, the records of vehicles insured by the same user in different years are retrieved from the user policy database; using the VIN field of the policy, records in which the user insured more than one vehicle (more than one distinct VIN) in a year are deleted, and records in which the user insured exactly one vehicle in each year are retained. The result is, for each user, the record of the single vehicle insured in each year.
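The screening rule can be illustrated with a minimal sketch. The record layout (`applicant_id`, `year`, `vin`) and the function name `screen_policies` are hypothetical, and the requirement that a user appear in at least two years is an assumption made here so that a year-over-year comparison is possible; the source does not give field names or code.

```python
from collections import defaultdict

def screen_policies(policies):
    """Keep applicants who insured exactly one vehicle in every year on record."""
    vins_by_year = defaultdict(set)  # (applicant_id, year) -> set of VINs
    for p in policies:
        vins_by_year[(p["applicant_id"], p["year"])].add(p["vin"])

    # Applicants with more than one distinct VIN in any single year are dropped.
    dropped = {aid for (aid, _year), vins in vins_by_year.items() if len(vins) > 1}

    history = defaultdict(dict)  # applicant_id -> {year: vin}
    for (aid, year), vins in vins_by_year.items():
        if aid not in dropped:
            history[aid][year] = next(iter(vins))

    # Assumption: at least two years are needed to compare consecutive policies.
    return {aid: dict(sorted(years.items()))
            for aid, years in history.items() if len(years) > 1}

policies = [
    {"applicant_id": "A", "year": 2019, "vin": "VIN001"},
    {"applicant_id": "A", "year": 2020, "vin": "VIN002"},
    {"applicant_id": "B", "year": 2019, "vin": "VIN003"},
    {"applicant_id": "B", "year": 2019, "vin": "VIN004"},  # two vehicles in one year
    {"applicant_id": "B", "year": 2020, "vin": "VIN003"},
    {"applicant_id": "C", "year": 2019, "vin": "VIN005"},  # only one year on record
]
screened = screen_policies(policies)
```

In this toy input only applicant A survives: B insured two vehicles in 2019, and C has a single year of data.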

Data labeling specifically means: when the VIN of the vehicle insured in the current year differs from the VIN field value of the vehicle insured in the following year, the record is labeled as a vehicle change, and the vehicle model in the following year's policy data is used to label the model the user changed to; when the VIN of the vehicle insured in the current year is the same as the VIN field value of the vehicle insured in the following year, the record is labeled as no vehicle change.
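As a rough illustration of the labeling rule, the sketch below compares the VINs of consecutive years for one user. The tuple layout `(year, vin, model)` and the name `label_changes` are hypothetical, and skipping non-consecutive years is an assumption; the source only describes comparing the current year with the following year.

```python
def label_changes(history):
    """Label each year by comparing its VIN with the following year's VIN.

    history: list of (year, vin, model) tuples sorted by year,
             one insured vehicle per year (hypothetical layout).
    """
    samples = []
    for (y0, vin0, _model0), (y1, vin1, model1) in zip(history, history[1:]):
        if y1 != y0 + 1:
            continue  # assumption: only directly consecutive years are compared
        changed = vin0 != vin1
        samples.append({
            "year": y0,
            "changed": int(changed),                     # 1 = vehicle change
            "new_model": model1 if changed else None,    # model from next year's policy
        })
    return samples

history = [(2018, "V1", "ModelX"), (2019, "V1", "ModelX"), (2020, "V2", "ModelY")]
labels = label_changes(history)
```

Each sample keeps the features of the current year and takes its label from the following year, which matches the direction of prediction described in the text.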

The offline training module comprises a feature engineering unit and a model training unit, wherein: the feature engineering unit performs outlier handling, data standardization and feature screening on the data obtained by the data processing module, and the model training unit trains the MLP model and the GBDT model on the screened features.

Outlier handling means: missing values or outliers in the features obtained by the data processing module are handled. The features processed include: region, third-party liability coverage, policy premium, traffic violation coefficient, expected loss ratio, vehicle series, agreed actual value, vehicle age, vehicle model classification, vehicle model risk level, vehicle type, displacement, number of claims returned by the platform, NCD coefficient returned by the platform, total number of vehicle claim cases, vehicle claim payout amount, sex of the insured, whether the applicant is a life insurance customer, whether the applicant holds an in-force long-term life insurance policy, total life insurance premium purchased by the applicant, age, vehicle model, new vehicle purchase price, fuel type and the like. Because the volume of non-change data is large, records of non-change users that contain missing values are removed directly, whereas anomalies in the records of vehicle-change users are handled by mean filling, hot-deck filling, manual filling and similar methods.

Manual filling is suitable for missing values that can be inferred from the rest of the data; for example, gender can be inferred from the ID card number.

Hot-deck filling means: for a record that contains a missing value, the hot-deck method finds the most similar record among the complete data and fills the gap with that record's value.
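A minimal sketch of hot-deck filling follows. The source does not say how similarity is measured; squared Euclidean distance over a few numeric fields is an assumption here, and the field names and the function `hot_deck_fill` are hypothetical.

```python
def hot_deck_fill(record, donors, field, match_fields):
    """Fill record[field] with the value from the most similar complete donor.

    Similarity is squared Euclidean distance over match_fields
    (an assumption; the source does not define the measure).
    """
    def distance(donor):
        return sum((record[f] - donor[f]) ** 2 for f in match_fields)

    complete = [d for d in donors if d.get(field) is not None]
    nearest = min(complete, key=distance)
    filled = dict(record)          # leave the input record untouched
    filled[field] = nearest[field]
    return filled

record = {"vehicle_age": 3, "displacement": 1.6, "new_car_price": None}
donors = [
    {"vehicle_age": 10, "displacement": 3.0, "new_car_price": 500000},
    {"vehicle_age": 4, "displacement": 1.5, "new_car_price": 120000},
]
filled = hot_deck_fill(record, donors, "new_car_price", ["vehicle_age", "displacement"])
```

In practice the matching fields would usually be scaled first so that one large-valued field does not dominate the distance.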

Data standardization means: numerical features whose values follow a normal distribution, such as vehicle age, new vehicle purchase price and agreed actual value, are standardized directly to a standard normal distribution. The remaining numerical features are made dimensionless by interval scaling, using the formula $x' = \frac{x - \min}{\max - \min}$, where x is the original value of the feature, min is the minimum of all values of the feature, max is the maximum of all values of the feature, and x' is the standardized value. Features whose values are strings are converted to numerical values by one-hot encoding.
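The three transformations described above can be sketched in a few lines of standard-library Python. The function names and sample values are illustrative only, and the use of the population standard deviation in `z_score` is an assumption.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize to zero mean and unit variance (population std, an assumption)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Interval scaling: x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode string categories as 0/1 vectors, one column per sorted category."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

standardized = z_score([1.0, 2.0, 3.0])
scaled = min_max([2, 4, 6])
encoded = one_hot(["petrol", "diesel", "petrol"])
```

Both scalers divide by zero when a feature is constant, so such features would need to be dropped or special-cased before scaling.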

Feature screening means: the 50 standardized features are screened with XGBoost to select those of higher importance to the classification model. The main XGBoost parameters are set as follows: the input data has length 50, the booster is tree-based (gbtree), the objective function is multi:softmax, the maximum tree depth is 6, and gamma is 0.1. Training runs for 100 rounds. Using ten-fold cross-validation, 5000 vehicle-change records and 5000 non-change records are selected from the data set and fed into the XGBoost model for learning; the feature importance of the 50 features is output, and the features ranking highest in importance across the ten folds are tallied. According to the tally, 28 features of high importance are selected from the 50, including: region, third-party liability coverage, policy premium, traffic violation coefficient, expected loss ratio, final loss ratio, vehicle series, agreed actual value, vehicle age, vehicle model classification, vehicle model risk level, vehicle type, displacement, number of claims returned by the platform, NCD coefficient returned by the platform, total number of vehicle claim cases, vehicle claim payout amount, sex of the insured, whether the applicant is a life insurance customer, whether the applicant holds an in-force long-term life insurance policy, total life insurance premium purchased by the applicant, age, vehicle model, risk, new vehicle purchase price, fuel type, total premium paid by the applicant, and the like.
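The tallying step across folds can be sketched as follows. The per-fold importance scores are taken as given (in the source they come from XGBoost); the scores below are invented toy numbers, and the cut-offs `top_k` and `min_folds` are hypothetical parameters, since the source does not state how "ranking highest" was decided.

```python
from collections import Counter

def tally_top_features(fold_importances, top_k):
    """Count how often each feature ranks in the top_k of a fold."""
    counts = Counter()
    for importances in fold_importances:
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:top_k])
    return counts

def select_features(fold_importances, top_k, min_folds):
    """Keep features that reach the top_k in at least min_folds folds."""
    counts = tally_top_features(fold_importances, top_k)
    return sorted(f for f, c in counts.items() if c >= min_folds)

# Toy importance scores for three folds (illustrative numbers only).
folds = [
    {"vehicle_age": 0.5, "displacement": 0.3, "region": 0.2},
    {"vehicle_age": 0.4, "displacement": 0.1, "region": 0.5},
    {"vehicle_age": 0.6, "displacement": 0.3, "region": 0.1},
]
selected = select_features(folds, top_k=2, min_folds=2)
```

Counting rank frequency rather than averaging raw scores makes the selection less sensitive to a single fold with unusual importance values.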

The MLP model means: a multi-layer perceptron (MLP) is selected as the classification model. Its network structure comprises an input layer, three hidden layers and an output layer. The three hidden layers have 128, 256 and 64 nodes respectively, use the LeakyReLU activation function, and have dropout set to 0.2. The activation function of the output layer is Sigmoid.
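To make the layer layout concrete, here is a small forward-pass sketch in plain Python. It is not the trained model: the weights are random, the input width of 28 (the number of screened features) and the LeakyReLU slope of 0.01 are assumptions, and dropout is omitted because it is normally active only during training.

```python
import math
import random

def leaky_relu(x, alpha=0.01):
    # alpha is an assumed slope; the source does not state it
    return x if x > 0 else alpha * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, activation):
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Hidden layers use LeakyReLU; the single output node uses sigmoid."""
    for layer in layers[:-1]:
        x = dense(x, layer, leaky_relu)
    return dense(x, layers[-1], sigmoid)[0]

rng = random.Random(0)
sizes = [28, 128, 256, 64, 1]  # input (assumed 28), three hidden layers, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
prob = forward([0.5] * 28, layers)
```

The sigmoid output lies between 0 and 1 and can be read as the predicted probability of the positive class.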

The data obtained by the feature engineering unit are used to train separate models one by one. The models fall into two kinds of binary classifier: one predicting whether a user changes vehicles, and one predicting whether the user, after changing vehicles, changes to a given target vehicle model, for several target vehicle models such as BMW, gekko Swinhonis, Lexus, Volkswagen and Mercedes-Benz. For the vehicle change model, the label is 1 when the user changes vehicles and 0 when the user does not. For a target vehicle model, records in which the user changes to the target vehicle model are labeled 1 and records in which the user changes to any other vehicle model are labeled 0. All data are split 4:1, with 75% of the data used for training and 25% used as the test set.
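The one-versus-rest labeling for a target vehicle model and the train/test split can be sketched as below. The function names are hypothetical. The source states both a 4:1 ratio and a 75%/25% division; this sketch takes the test fraction as a parameter and uses 0.25 purely as an assumption.

```python
import random

def target_labels(new_models, target):
    """1 if the user changed to the target vehicle model, else 0."""
    return [int(m == target) for m in new_models]

def split(samples, test_fraction, seed=0):
    """Shuffle and split into (train, test); the seed is fixed for repeatability."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

y_bmw = target_labels(["BMW", "Lexus", "BMW", "Volkswagen"], "BMW")
train, test = split(list(range(100)), test_fraction=0.25)
```

One classifier would be trained per target vehicle model, each with its own label vector built this way.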

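The LeakyReLU formula is $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ for $x \le 0$, where $\alpha$ is a small positive slope coefficient. The Sigmoid formula is $\sigma(x) = \frac{1}{1 + e^{-x}}$.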

The model effect of the MLP is shown in the following table.

The GBDT model means: the GradientBoostingClassifier model in the sklearn library is selected. In the experiment the number of trees is set to 500, the maximum tree depth to 4, the learning rate to 0.1, and the minimum number of samples required to split an internal node to 100. The loss function is the logarithmic loss $L(Y, P(Y|X)) = -\log P(Y|X)$.
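The logarithmic loss above can be computed directly for binary labels. The sketch below averages $-\log P(Y|X)$ over samples; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero and is not part of the source.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean of -log P(y | x), where p_pred is the predicted probability of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)            # clip to keep log() finite
        total += -math.log(p if y == 1 else 1 - p)
    return total / len(y_true)

uninformed = log_loss([1, 0], [0.5, 0.5])   # equals log(2)
confident = log_loss([1, 0], [0.9, 0.1])    # lower loss for better predictions
```

In scikit-learn's GradientBoostingClassifier, the settings listed in the text would most plausibly correspond to the `n_estimators`, `max_depth`, `learning_rate` and `min_samples_split` parameters; that mapping is an interpretation of the translated text rather than something the source states.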

The data obtained by the feature engineering unit are used to train separate models one by one. The training data are the same as for the MLP model and are split 4:1, with 75% of the data used for training and 25% used as the test set.

The GBDT model effects are shown in the following table.

Accuracy is the proportion of all samples whose prediction is correct.

Precision is the proportion of samples predicted positive that are truly positive.

Recall is the proportion of truly positive samples that are predicted positive.
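The three metrics can be computed from the confusion counts, as in the following sketch. The label vectors are invented for illustration and do not reproduce any result from the source.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, positive class = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions -> 0
    recall = tp / (tp + fn) if tp + fn else 0.0     # no actual positives -> 0
    return accuracy, precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
```

Because non-change records far outnumber change records, accuracy alone can look good for a model that rarely predicts a change, which is why precision and recall are reported alongside it.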

The trained model with the best performance is saved locally for use by the online prediction module.

The online prediction module specifically comprises a feature extraction unit and a vehicle change prediction unit, wherein: the feature extraction unit extracts the multi-dimensional user feature information required for prediction by the method of the feature engineering unit in the offline training module, and the vehicle change prediction unit inputs the resulting multi-dimensional user features into the saved model in batches to obtain predictions of whether the user will change vehicles and whether the user will change to a target vehicle model.
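One way to read "the method of the feature engineering unit" is that the scaling parameters fitted offline are stored and reused unchanged at prediction time. The sketch below illustrates that idea under stated assumptions: the JSON storage, the field name `vehicle_age`, the 0.5 decision threshold and the stand-in `toy_model` are all hypothetical and not taken from the source.

```python
import json

def fit_min_max(rows, fields):
    """Record the (min, max) of each field on the training data."""
    return {f: (min(r[f] for r in rows), max(r[f] for r in rows)) for f in fields}

def transform(row, params):
    """Apply the stored interval scaling to one record."""
    return {f: (row[f] - lo) / (hi - lo) for f, (lo, hi) in params.items()}

def predict_batch(rows, params, model, threshold=0.5):
    """Scale each record with the stored parameters, then threshold the model score."""
    return [int(model(transform(r, params)) >= threshold) for r in rows]

# Offline: fit scaling parameters and store them next to the model.
train_rows = [{"vehicle_age": 1}, {"vehicle_age": 11}]
stored = json.dumps(fit_min_max(train_rows, ["vehicle_age"]))

# Online: reload the parameters and score a batch of new policies.
params = json.loads(stored)
toy_model = lambda feats: feats["vehicle_age"]  # stand-in for the saved classifier
predictions = predict_batch([{"vehicle_age": 3}, {"vehicle_age": 9}], params, toy_model)
```

Refitting the scaler on the incoming batch instead would shift the feature values relative to what the model saw during training.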

The multi-dimensional feature information comprises: region, third-party liability coverage, policy premium, traffic violation coefficient, expected loss ratio, final loss ratio, vehicle series, agreed actual value, vehicle age, vehicle model classification, vehicle model risk level, vehicle type, displacement, number of claims returned by the platform, NCD coefficient returned by the platform, total number of vehicle claim cases, vehicle claim payout amount, sex of the insured, whether the applicant is a life insurance customer, whether the applicant holds an in-force long-term life insurance policy, total life insurance premium purchased by the applicant, age, vehicle model, risk, new vehicle purchase price, fuel type, total premium paid by the applicant, and the like.

In practical experiments, the model was run under a Linux operating system with a Python environment and started by shell command. On the test set, the accuracy of the vehicle change model reaches 70.2%, and the accuracy of the vehicle model change model reaches 74.8%. The experimental results show that the method is effective and practical for predicting vehicle changes and changed-to vehicle models from policy data.

The foregoing embodiments may be partially modified in numerous ways by those skilled in the art without departing from the principles and spirit of the invention, the scope of which is defined in the claims and not by the foregoing embodiments, and all such implementations are within the scope of the invention.

Claims

1. An intelligent car change prediction system for car insurance users, characterized by comprising: a data processing module, an offline training module and an online prediction module, wherein: the data processing module performs data screening and data labeling on user insurance policy information and outputs whether the user changed vehicles and the vehicle model changed to; the offline training module trains machine learning models on the user policy data and the label information and outputs a prediction model; and the online prediction module uses new user policy information and the prediction model to predict whether the user will change vehicles and whether the user will change to a specified vehicle model, and outputs the result;

the data processing module comprises: a data screening unit and a data labeling unit, wherein: the data screening unit collects and screens the data, and the data labeling unit uses the applicant's ID number field to find, in the screened data, the same user's policy data for the following year and labels the data;

the data collection refers to: collecting user policy data provided by an insurance company, standardizing data formats of all fields, extracting user information, vehicle information and insurance information in the policy data as features, and establishing a user policy database;

the data screening refers to: using the policy insurance date and the applicant's ID number field, retrieving from the user policy database the records of vehicles insured by the same user in different years; using the VIN field of the policy, deleting records in which the user insured more than one vehicle (more than one distinct VIN) in a year and retaining records in which the user insured exactly one vehicle in each year, thereby obtaining, for each user, the record of the single vehicle insured in each year;

the data labeling specifically refers to: when the VIN of the vehicle insured in the current year differs from the VIN field value of the vehicle insured in the following year, labeling the record as a vehicle change and using the vehicle model in the following year's policy data to label the model the user changed to; when the VIN of the vehicle insured in the current year is the same as the VIN field value of the vehicle insured in the following year, labeling the record as no vehicle change;

the offline training module comprises: a feature engineering unit and a model training unit, wherein: the feature engineering unit performs outlier handling, data standardization and feature screening on the data obtained by the data processing module, and the model training unit trains the MLP model and the GBDT model on the screened features;

the outlier handling refers to: handling missing values or outliers in the features obtained by the data processing module, wherein the features processed comprise: region, third-party liability coverage, policy premium, traffic violation coefficient, expected loss ratio, vehicle series, agreed actual value, vehicle age, vehicle model classification, vehicle model risk level, vehicle type, displacement, number of claims returned by the platform, NCD coefficient returned by the platform, total number of vehicle claim cases, vehicle claim payout amount, sex of the insured, whether the applicant is a life insurance customer, whether the applicant holds an in-force long-term life insurance policy, total life insurance premium purchased by the applicant, age, vehicle model, new vehicle purchase price, fuel type;

records of non-change users that contain missing values are removed directly, and anomalies in the records of vehicle-change users are handled by mean filling, hot-deck filling and manual filling;

the manual filling is applied to missing values that can be inferred from the rest of the data;

the hot-deck filling refers to: for a record that contains a missing value, finding the most similar record among the complete data and filling the gap with that record's value;

the data standardization refers to: numerical features whose values follow a normal distribution are standardized to a standard normal distribution; the remaining numerical features are made dimensionless by interval scaling, using the formula $x' = \frac{x - \min}{\max - \min}$, where x is the original value of the feature, min is the minimum of all values of the feature, max is the maximum of all values of the feature, and x' is the standardized value; features whose values are strings are converted to numerical values by one-hot encoding;

the feature screening refers to: screening the 50 standardized features with XGBoost to select those of higher importance to the classification model;

the main parameters of XGBoost are set as follows: the input data has length 50, the booster is tree-based, the objective function is multi:softmax, the maximum tree depth is 6, gamma is 0.1, and training runs for 100 rounds; using ten-fold cross-validation, 5000 vehicle-change records and 5000 non-change records are selected from the data set and fed into the XGBoost model for learning, the feature importance of the 50 features is output, the features ranking highest in importance across the ten folds are tallied, and according to the tally 28 features of high importance are selected from the 50, including: region, third-party liability coverage, policy premium, traffic violation coefficient, expected loss ratio, final loss ratio, vehicle series, agreed actual value, vehicle age, vehicle model classification, vehicle model risk level, vehicle type, displacement, number of claims returned by the platform, NCD coefficient returned by the platform, total number of vehicle claim cases, vehicle claim payout amount, sex of the insured, whether the applicant is a life insurance customer, whether the applicant holds an in-force long-term life insurance policy, total life insurance premium purchased by the applicant, age, vehicle model, risk, new vehicle purchase price, fuel type, total premium paid by the applicant;

the MLP model refers to: selecting a multi-layer perceptron (MLP) as the classification model, wherein the network structure of the MLP comprises an input layer, three hidden layers and an output layer, the three hidden layers have 128, 256 and 64 nodes respectively, the hidden layers use the LeakyReLU activation function with dropout set to 0.2, and the activation function of the output layer is Sigmoid;

the data obtained by the feature engineering unit are used to train separate models one by one, the models being binary classifiers of two kinds: whether a user changes vehicles, and whether the user, after changing vehicles, changes to a target vehicle model; for the vehicle change model, the label is 1 when the user changes vehicles and 0 when the user does not; for a target vehicle model, records in which the user changes to the target vehicle model are labeled 1 and records in which the user changes to any other vehicle model are labeled 0; all data are split 4:1, with 75% of the data used for training and 25% used as the test set;

the LeakyReLU formula is $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ for $x \le 0$, where $\alpha$ is a small positive slope coefficient, and the Sigmoid formula is $\sigma(x) = \frac{1}{1 + e^{-x}}$;

the GBDT model refers to: selecting the GradientBoostingClassifier model in the sklearn library, setting the number of trees to 500, the maximum tree depth to 4, the learning rate to 0.1, and the minimum number of samples required to split an internal node to 100, with the loss function being the logarithmic loss $L(Y, P(Y|X)) = -\log P(Y|X)$;

the data obtained by the feature engineering unit are used to train separate models one by one, the training data being the same as for the MLP model and split 4:1, with 75% of the data used for training and 25% used as the test set;

the online prediction module specifically comprises: a feature extraction unit and a vehicle change prediction unit, wherein: the feature extraction unit extracts the multi-dimensional user feature information required for prediction by the method of the feature engineering unit in the offline training module, and the vehicle change prediction unit inputs the resulting multi-dimensional user features into the saved model in batches to obtain predictions of whether the user will change vehicles and whether the user will change to a target vehicle model;

the multi-dimensional feature information comprises: region, third-party liability coverage, policy premium, traffic violation coefficient, expected loss ratio, final loss ratio, vehicle series, agreed actual value, vehicle age, vehicle model classification, vehicle model risk level, vehicle type, displacement, number of claims returned by the platform, NCD coefficient returned by the platform, total number of vehicle claim cases, vehicle claim payout amount, sex of the insured, whether the applicant is a life insurance customer, whether the applicant holds an in-force long-term life insurance policy, total life insurance premium purchased by the applicant, age, vehicle model, risk, new vehicle purchase price, fuel type, and total premium paid by the applicant.