CN113177642A - Automatic modeling system for data imbalance - Google Patents

Automatic modeling system for data imbalance

Publication number
CN113177642A
Authority
CN
China
Prior art keywords
module
data
sample
samples
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110563919.1A
Other languages
Chinese (zh)
Inventor
时玥
谭俊
黎婧璇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rongqiniu Information Technology Co ltd
Original Assignee
Beijing Rongqiniu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rongqiniu Information Technology Co ltd
Priority to CN202110563919.1A
Publication of CN113177642A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an automatic modeling system for data imbalance, comprising a data reading module, a data preprocessing module, a feature analysis module and a model training module. The data preprocessing module includes a sample sampling module, a sample balancing module, a sample partitioning module, a feature screening module, a missing value filling module and a feature value mapping module. The model training module trains on the training set data to obtain model prediction results for the training set and the verification set. The automatic modeling system is suitable for multi-scenario modeling processes, helps business personnel complete modeling through simple operations, and addresses the model inaccuracy caused by data imbalance.

Description

Automatic modeling system for data imbalance
Technical Field
The invention relates to the technical field of machine learning, and in particular to a big data automatic modeling system for data imbalance.
Background
Traditional manual modeling requires professional modelers and developers to invest a great deal of working time in data extraction, model algorithm selection, model parameter configuration and subsequent online optimization. When facing high-dimensional features and massive data, data sampling and feature screening are needed, which reduces how much of the data is used, and the modeling cost is very high.
When data is imbalanced, the numbers of samples with different class labels in a data set differ greatly, and the small number of minority-class samples can cause low training efficiency and reduced overall model performance. Common methods such as oversampling do not make use of the distribution information of the samples.
Disclosure of Invention
To address the problems in the prior art, the invention provides an automatic modeling system for data imbalance, comprising the following. The data reading module reads data from the big data cluster; the read data is a wide table that contains feature columns, and the feature columns are labeled. The data preprocessing module processes the read data and comprises a sample sampling module, a sample balancing module, a sample partitioning module, a feature analysis module, a feature screening module, a missing value filling module and a feature value mapping module. The sample sampling module performs random sampling or stratified sampling on the samples according to a set sampling ratio; the sample balancing module augments the minority-class samples after sample sampling; the sample partitioning module divides the samples into a training set and a verification set according to a set partition ratio; the feature analysis module performs statistical analysis on each feature of the training set; the feature screening module screens the samples in the verification set and the training set to select the features that enter the model; the missing value filling module fills missing values in the data; and the feature value mapping module performs feature value mapping on the training set. The model training module trains on the training set data to obtain model prediction results for the training set and the verification set.
Optionally, the sample balancing module augments the minority-class samples after sample sampling based on the ACGAN algorithm or the BAGAN algorithm, and the sample balancing module includes: a label processing module, which processes the labels to be augmented into the form required by the model; a discriminator, which discriminates whether data is input data with a label of a certain class or generated data; and a generator, which generates samples for a given class label.
Optionally, when ACGAN is used, an auxiliary label column is added during label processing to distinguish real samples from fake samples, and when BAGAN is used, a fake-sample label with the value N is added, where N is the number of sample classes.
Optionally, the sample sampling module performs random sampling, stratified sampling, or negative sampling with a custom positive-to-negative label ratio on the samples according to a set sampling ratio.
Optionally, the feature analysis module uses the sample labels to compute the KS index, IV index and PSI index of each feature dimension.
Optionally, the model training module further performs the following operation: before model training, an isolation forest algorithm is used to remove abnormal samples; its input is a wide table containing feature columns, and its output is the probability of abnormality.
Optionally, the operation of the data preprocessing module further comprises: performing data deduplication and format conversion on the input data for the market-basket FP-Growth algorithm, wherein the input data is the commodity purchase records of users, and the deduplication ensures that a record of a user and the same commodity appears only once.
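The deduplication and format conversion described above can be sketched as follows. This is a minimal illustration only: the function name `to_baskets` and the (user, commodity) tuple layout are assumptions, and the source does not specify the exact basket format.

```python
def to_baskets(purchases):
    """Collapse (user, commodity) purchase records into one basket per user.

    Each commodity appears at most once in a user's basket, which is the
    shape frequent-itemset miners such as FP-Growth usually expect.
    """
    baskets = {}
    for user, item in purchases:
        basket = baskets.setdefault(user, [])
        if item not in basket:  # keep each (user, commodity) pair only once
            basket.append(item)
    return baskets


# Hypothetical purchase log with one repeated record.
purchases = [("u1", "milk"), ("u1", "bread"), ("u1", "milk"), ("u2", "bread")]
baskets = to_baskets(purchases)
```

The repeated ("u1", "milk") record is dropped, so each basket can be passed to a frequent-itemset miner as a set-like transaction.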
Optionally, the operation of the data preprocessing module further comprises: performing data deduplication and information mapping on the input data for the collaborative filtering ALS algorithm, wherein the input data is the interaction behaviors between users and commodities, the deduplication ensures that each interaction between a user and a commodity appears only once, and the information mapping maps users and commodities to label indices.
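A minimal sketch of the deduplication and index mapping described above, assuming interactions arrive as (user, commodity) tuples. The helper `build_index` and the first-seen ordering of indices are illustrative choices, not taken from the source.

```python
def build_index(values):
    """Map each distinct value to a 0-based integer index in first-seen order."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = len(index)
    return index


# Hypothetical interaction log with one repeated interaction.
interactions = [("u1", "milk"), ("u2", "bread"), ("u1", "milk"), ("u1", "bread")]

# De-duplicate while keeping the original order.
unique = list(dict.fromkeys(interactions))

user_index = build_index(user for user, _ in unique)
item_index = build_index(item for _, item in unique)

# Integer pairs of the kind a matrix-factorization routine can consume.
encoded = [(user_index[u], item_index[i]) for u, i in unique]
```

Keeping the two dictionaries allows predictions expressed in indices to be mapped back to the original user and commodity identifiers.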
Optionally, the indices output by the model evaluation module for evaluating the model effect include: accuracy, recall, precision, Gini coefficient, F1 statistic, confusion matrix, ROC graph, AUC, KS graph, lift curve, recall graph and response curve; MSE, RMSE, R2, adjusted R2, SMAPE, EVS, median absolute error, MAE, residual plot of each feature, comparison plot of predicted and actual values, quantile-quantile plot and residual distribution plot of predicted values; the sum of squared distances from the points within a cluster to its center point; and accuracy, confusion matrix, precision and recall.
Optionally, the data preprocessing module further includes a sample proportioning module and a probability correction module. The sample proportioning module can sample the negative samples to adjust the proportion of positive samples in the total sample, and the probability correction module adjusts the probabilities of the modeling result, correcting them with the prior probability.
The automatic modeling system provided by the invention is suitable for multi-scenario modeling processes, uses the ACGAN and BAGAN methods to generate minority-class samples, helps business personnel complete modeling through simple operations, supports big data computation, frees up manpower, and improves both efficiency and effect.
The invention addresses a problem common to many fields. The solution has also been used in the credit field and in the recommendation and marketing field, where the effect can be improved by 5 percent compared with not using it. In the credit field, the invention effectively helps the relevant personnel model automatically and reduces the average model development time by 80 percent.
Drawings
In order that the invention may be more readily understood, it will be described in more detail with reference to specific embodiments thereof that are illustrated in the accompanying drawings. These drawings depict only typical embodiments of the invention and are not therefore to be considered to limit the scope of the invention.
FIG. 1 is a flow chart of the operation of the system of the present invention.
FIG. 2 is a schematic diagram of one embodiment of a system of the present invention.
FIG. 3 is a schematic view of yet another embodiment of the system of the present invention.
Detailed Description
Embodiments of the present invention will now be described with reference to the drawings, wherein like parts are designated by like reference numerals. The embodiments described below and the technical features of the embodiments may be combined with each other without conflict.
The system of the invention provides a standardized solution for the field of credit risk control, which involves large data volumes and many processing operations. The system enables full-lifecycle management of models and includes a data reading module, a data preprocessing module, a feature analysis module, a model training module, a model evaluation module and a model online (deployment) module, which reduces the chance of manual errors across the many steps of the process. A flow log is also provided for each module so that business personnel can follow the progress.
As shown in fig. 1, the working principle of the system of the present invention includes:
1) reading data from the big data cluster through a data reading node;
2) performing random or stratified sampling on the samples according to a sampling ratio (which may be set by a user);
3) dividing the sample into a training set and a verification set according to a sample division ratio (which can be set by a user);
4) carrying out feature type analysis on all features of the training set, dividing category type features and numerical type features, and carrying out feature analysis;
5) screening the training set characteristics and the verification set characteristics by using the results of the training set characteristic analysis;
6) filling missing values in the training set and the verification set by using the training set statistic value or the fixed value;
7) performing characteristic value mapping on the training set, and processing the verification set by using the same mapping dictionary;
8) training the training set data by using different models to obtain model prediction results of the training set and the verification set;
9) evaluating the effect of the trained model by using the verification set, and outputting a model report;
10) repeating on the test set the operations applied to the verification set in steps 5) to 7), and inputting the processed test data into the model to obtain the prediction result of the model;
11) according to the model evaluation result, selecting the model that performs best on the data set for online prediction, and predicting online data in real time;
12) if the effect of the model from step 11) degrades after it has run for a period of time, taking it offline and repeating steps 1) to 11) on the latest data to obtain an updated model.
The data reading module reads data from the big data cluster for training and testing. The read data is a wide table containing various feature columns. By judging the type of each column in the input data and combining it with user input, the module identifies the index information ID column (used to distinguish samples and not used for model training), the label column and the date column so that they can be used by the feature analysis module. The table can come from a variety of file formats and databases, including uploaded files in common formats such as txt and csv, or connections to common databases such as Greenplum and MySQL.
The data preprocessing module processes the data read by the data reading module and provides the data that enters the model. The data preprocessing module comprises: a sample sampling module, a sample partitioning module, a feature screening module, a missing value filling module and a feature value mapping module.
The sample sampling module performs random sampling or stratified sampling on the samples according to a set sampling ratio. Random and stratified sampling are carried out in the database to reduce the size of the modeling sample, which lowers cost and speeds up subsequent model training. In addition, for data sets with imbalanced labels, the module provides a sampling method in which the user customizes the ratio of positive to negative labels, in order to improve the subsequent model training effect. In one embodiment, the custom positive-to-negative ratio sampling method is used to sample the data set to a positive-to-negative label ratio of 1:10, which prevents the subsequent model from learning in a biased way.
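The custom-ratio sampling in this embodiment can be sketched as below. The sketch assumes rows are dictionaries with a 0/1 `label` field and that all positives are kept while negatives are randomly downsampled; the function name `sample_to_ratio` and its parameters are hypothetical.

```python
import random


def sample_to_ratio(rows, label_key="label", neg_per_pos=10, seed=0):
    """Keep every positive row and randomly keep up to neg_per_pos negatives per positive."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n_neg = min(len(negatives), neg_per_pos * len(positives))
    sampled = positives + rng.sample(negatives, n_neg)
    rng.shuffle(sampled)
    return sampled


# Hypothetical data set: 20 positives and 980 negatives.
rows = [{"id": i, "label": 1 if i < 20 else 0} for i in range(1000)]
balanced = sample_to_ratio(rows)  # 20 positives and 200 negatives, i.e. 1:10
```

Because downsampling negatives changes the positive rate seen by the model, predicted probabilities are usually corrected afterwards, which is the role the source assigns to the probability correction module.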
The sample partition module divides the samples into a training set and a verification set according to the set partition ratio. The verification set can be used to tune the hyperparameters of the model and to make a preliminary assessment of the capability of the model built on a training set with the same distribution. In one embodiment, the division is either random according to the input ratio or sequential by time.
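The two partition modes in this embodiment might look like the following sketch. The `date` field name, the 0.8 ratio and both function names are assumptions for illustration.

```python
import random


def split_random(rows, train_ratio=0.8, seed=0):
    """Shuffle a copy of the rows and cut it at the requested ratio."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def split_by_time(rows, date_key="date", train_ratio=0.8):
    """Sort by date so that the verification set is strictly the most recent part."""
    ordered = sorted(rows, key=lambda r: r[date_key])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


# Hypothetical rows with ISO-formatted dates, which sort correctly as strings.
rows = [{"id": i, "date": "2021-01-%02d" % (i + 1)} for i in range(10)]
train_random, valid_random = split_random(rows)
train_time, valid_time = split_by_time(rows)
```

The time-ordered split mimics deployment, where a model is always scored on data newer than what it was trained on.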
The feature analysis module performs statistical analysis on each feature of the data. The module can analyze the wide table of features at any point in the data flow, but to balance the influence of the analyzed data on the model against execution time, feature analysis is performed between sample partition and feature screening. In particular, the feature analysis module automatically judges whether a feature is a continuous feature, a categorical feature, a character-type feature or a date-type feature according to the ratio of the number of unique values of each feature dimension to the total number of samples and according to the value type of the feature. Different indices are computed for numerical continuous features and for numerical categorical features and character-type features (the latter two are hereinafter collectively referred to as categorical features), and the results can be visualized so that business personnel can understand the data distribution.
Specifically, the statistical analysis comprises the following. For numerical continuous features: the null rate, zero rate, maximum, minimum, mean, median, first quartile, third quartile, standard deviation, variance, skewness and kurtosis, together with a histogram, box plot and density curve of the feature distribution. For categorical features: the number of categories, the null rate, the category with the highest share, the category with the lowest share, and a pie chart of the feature distribution, so that a user can analyze individual features.
For applications in the credit risk control field, the feature analysis module combines the sample labels to compute the KS (Kolmogorov-Smirnov) and IV (Information Value) indices of each feature dimension. These two indices are essential in credit risk control modeling: KS reflects how well a feature separates positive and negative samples, and the larger the KS, the better the separation; IV reflects the correlation between the feature and the label, and the larger the IV, the stronger the relationship.
In addition, the PSI (Population Stability Index) of each feature dimension can be computed. PSI is also commonly used for feature screening in credit risk control modeling and reflects the stability of a feature: the larger the PSI, the less stable the feature, and features with a PSI greater than 0.2 are deleted in the modeling process.
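The three indices can be computed along the following lines. This is a sketch under stated assumptions: the source does not specify the binning scheme, so the bin edges are supplied by the caller, and a small epsilon is added inside the logarithm to avoid division by zero for empty bins. All function names are hypothetical.

```python
import math


def ks_statistic(values, labels):
    """Largest gap between the cumulative distributions of positives and negatives."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(values, labels))
    cum_pos = cum_neg = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Consume all ties at this value before measuring the gap.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            i += 1
        ks = max(ks, abs(cum_pos / n_pos - cum_neg / n_neg))
    return ks


def bin_shares(values, edges):
    """Share of values per bin; a value goes to the bin after every edge it exceeds."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1
    return [c / len(values) for c in counts]


def divergence(p, q, eps=1e-6):
    """Sum of (p_i - q_i) * ln(p_i / q_i), the form shared by IV and PSI."""
    return sum((a - b) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def information_value(values, labels, edges):
    """IV: divergence between the binned distributions of positives and negatives."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return divergence(bin_shares(pos, edges), bin_shares(neg, edges))


def psi(expected, actual, edges):
    """PSI: divergence between a reference sample and a newer sample of one feature."""
    return divergence(bin_shares(actual, edges), bin_shares(expected, edges))


# Hypothetical feature that separates the two classes perfectly.
feature = [1, 2, 3, 4, 5, 6]
target = [0, 0, 0, 1, 1, 1]
ks_value = ks_statistic(feature, target)
iv_value = information_value(feature, target, edges=[3])

# Hypothetical drift check: the newer sample is shifted upwards.
psi_same = psi([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 5, 6, 7, 8], edges=[4, 8])
psi_shifted = psi([1, 2, 3, 4, 5, 6, 7, 8], [5, 6, 7, 8, 9, 10, 11, 12], edges=[4, 8])
```

With the 0.2 threshold mentioned in the text, the shifted feature would be dropped while the unchanged one would be kept.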
The feature screening module screens the features of the samples in the verification set and the training set to select the features that enter the model. Specifically, a numerical feature list and a categorical feature list are compiled according to the data types of the feature columns, and the feature lists are then screened. Preferably, in addition to letting business personnel interactively and manually screen the feature list using the feature analysis results, the invention allows screening by the missing rate, the number of discrete feature value levels, the feature KS value, the feature IV value and the PSI value provided by the feature analysis module, with set thresholds for quick screening. For example, when the number of value levels of a categorical feature exceeds a set number (e.g., 100) or when the missing rate of a variable reaches a set value (e.g., 0.8), the feature is considered invalid and deleted.
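Threshold-based quick screening could be sketched as follows. The per-feature statistics dictionary, its key names and the function `screen_features` are assumptions; the default thresholds are the example values given in the text (100 levels, 0.8 missing rate, PSI 0.2).

```python
def screen_features(stats, max_missing=0.8, max_levels=100, max_psi=0.2):
    """Return the names of features that pass all threshold checks."""
    kept = []
    for name, s in stats.items():
        if s.get("missing_rate", 0.0) >= max_missing:
            continue  # missing rate reaches the set value
        if s.get("type") == "category" and s.get("levels", 0) > max_levels:
            continue  # too many distinct category levels
        if s.get("psi", 0.0) > max_psi:
            continue  # feature distribution is unstable
        kept.append(name)
    return kept


# Hypothetical output of the feature analysis step.
stats = {
    "age": {"type": "numeric", "missing_rate": 0.1, "psi": 0.05},
    "device_id": {"type": "category", "levels": 5000, "missing_rate": 0.0},
    "income": {"type": "numeric", "missing_rate": 0.85},
    "city": {"type": "category", "levels": 30, "missing_rate": 0.2, "psi": 0.3},
}
selected = screen_features(stats)
```

The same `selected` list would then be applied to both the training set and the verification set so that they keep identical columns.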
The missing value filling module processes missing values in the data so that models that do not accept missing values can run normally. It fills the missing values in the training set and the verification set with statistics of the training set or with fixed values. Numerical features can be filled with the mean, the median or a user-specified value, and categorical variables are filled with a user-specified value or a default value.
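The key point, that fill values are derived from the training set and then reused on the verification set, can be sketched as below. The function names, the use of `None` to mark missing values and the `"missing"` default category are assumptions.

```python
import statistics


def fit_fill_values(train_rows, numeric_cols, strategy="mean",
                    category_cols=(), category_default="missing"):
    """Compute one fill value per column from the training set only."""
    fills = {}
    for col in numeric_cols:
        observed = [r[col] for r in train_rows if r[col] is not None]
        if strategy == "mean":
            fills[col] = statistics.mean(observed)
        else:
            fills[col] = statistics.median(observed)
    for col in category_cols:
        fills[col] = category_default
    return fills


def apply_fill(rows, fills):
    """Replace None with the fitted fill value; other values are left unchanged."""
    return [
        {k: (fills[k] if v is None and k in fills else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical training and verification rows.
train = [{"age": 20, "city": "A"}, {"age": None, "city": None}, {"age": 40, "city": "B"}]
valid = [{"age": None, "city": "A"}]

fills = fit_fill_values(train, ["age"], "mean", ["city"])
train_filled = apply_fill(train, fills)
valid_filled = apply_fill(valid, fills)
```

Fitting on the training set alone keeps information from the verification set out of preprocessing.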
The feature value mapping module performs feature value mapping on the training set and processes the verification set with the same mapping dictionary. Optionally, the feature value mapping module encodes the categorical features. Considering that one-hot encoding (OneHotEncoder) increases tree depth for the many tree-based methods on the platform, label encoding (LabelEncoder) is preferably used to map categorical features to ordinal values starting from 0. For categorical variables with a large number of values, a WOE (Weight of Evidence) derived method can be used to convert them into continuous variables.
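Both mappings can be sketched in a few lines. The sketch makes several assumptions that the source does not state: unseen categories in the verification set map to -1, WOE is defined as the log of the positive share over the negative share (sign conventions vary), and additive smoothing with `eps` is used to avoid taking the log of zero.

```python
import math


def fit_label_mapping(values):
    """Build the mapping dictionary on the training set: category -> 0, 1, 2, ..."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def apply_mapping(values, mapping, unseen=-1):
    """Encode values with a fitted mapping; unseen categories get a sentinel."""
    return [mapping.get(v, unseen) for v in values]


def fit_woe_mapping(values, labels, eps=0.5):
    """Weight of Evidence per category with additive smoothing."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    counts = {}
    for v, y in zip(values, labels):
        pos, neg = counts.get(v, (0, 0))
        counts[v] = (pos + y, neg + 1 - y)
    k = len(counts)
    return {
        v: math.log(((pos + eps) / (n_pos + eps * k)) / ((neg + eps) / (n_neg + eps * k)))
        for v, (pos, neg) in counts.items()
    }


# Hypothetical categorical column from the training and verification sets.
train_city = ["a", "b", "a", "c"]
valid_city = ["c", "d"]
mapping = fit_label_mapping(train_city)
train_encoded = apply_mapping(train_city, mapping)
valid_encoded = apply_mapping(valid_city, mapping)

woe = fit_woe_mapping(["a", "a", "b", "b"], [1, 1, 0, 0])
```

Here category "a" contains only positives and gets a positive WOE, while "b" contains only negatives and gets a negative WOE.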
Preferably, for binary classification problems in credit risk control, the data preprocessing module may further include a sample proportioning module and a probability correction module. In binary classification problems, the proportion of positive samples is often small in real scenarios; the sample proportioning module can sample the negative samples to adjust the proportion of positive samples in the total sample and reduce the influence of an unbalanced sample ratio on the risk control model. The probability correction module adjusts the probabilities of the modeling result using the prior probability: because the samples have been sampled, the correction ensures that the lift and response rate evaluation indices are not affected by the changed ratio of positive to negative samples, and the corrected prediction probability is added to the model training result. A sample proportioning step is added before the sample partition of step 3), and a probability correction step is added to the model prediction results of steps 8) and 10), yielding both a prediction probability and a corrected prediction probability.
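The source does not give the correction formula. One common form of prior correction, shown here as an assumption rather than as the patent's method, rescales the predicted odds by the ratio of the original class prior to the prior in the resampled training data. The function name `correct_probability` is hypothetical.

```python
def correct_probability(p_sampled, prior_original, prior_sampled):
    """Map a probability learned on resampled data back to the original prior.

    p_sampled      -- model output trained on the resampled data
    prior_original -- positive rate before resampling
    prior_sampled  -- positive rate after resampling
    """
    pos = p_sampled * prior_original / prior_sampled
    neg = (1.0 - p_sampled) * (1.0 - prior_original) / (1.0 - prior_sampled)
    return pos / (pos + neg)


# Hypothetical case: the original positive rate of 4% was resampled to 50%.
corrected = correct_probability(0.5, prior_original=0.04, prior_sampled=0.5)

# When the priors are equal, the probability is left unchanged.
unchanged = correct_probability(0.3, prior_original=0.1, prior_sampled=0.1)
```

A score of 0.5 on the balanced data maps back to the base rate of 0.04. The ranking of samples is preserved, so rank-based metrics such as KS and AUC are unaffected, while probability-based measures reflect the original positive rate.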
Referring again to FIG. 2, the system of the present invention includes a model training module. After data preprocessing, the model-input data is obtained, and the model training module trains on the training set data with different models to obtain model prediction results for the training set and the verification set.
The model training module can model data for different problems such as binary classification, regression, clustering and multi-classification. For binary classification problems, the model training module uses the Logistic Regression, GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting) and LightGBM (Light Gradient Boosting Machine) algorithms. For regression problems, it uses the Linear Regression, GBDT, XGBoost and LightGBM algorithms. For clustering problems, it uses the KMeans (K-means) algorithm. For multi-classification problems, it uses the XGBoost and LightGBM algorithms. Algorithm parameters can be configured as needed for each of these algorithms.
Preferably, for binary classification and regression scenarios, the model training module also provides a one-key automatic modeling function. The function is mainly aimed at business personnel in the credit risk control field who are not familiar with modeling algorithms but often need to model quickly in order to judge the value of data in their work. With the one-key modeling function, a user only needs to upload data and click a modeling button. The algorithm of the invention then automatically performs feature analysis and screening, selects among 3 common model algorithms (GBDT, XGBoost and LightGBM) according to the data scenario, automatically searches the parameter space of each for the optimal model parameters, and finally outputs the best model by comparison and evaluates its effect, so that business personnel can quickly obtain the modeling effect of the data. In particular, for the parameter space search, the invention draws on the years of technical accumulation of the 360-degree model group in the credit risk control field and embeds several groups of empirical parameters that are effective in credit risk control scenarios, which can markedly improve the efficiency and effect of the parameter optimization stage.
After model training, the model training module outputs a training result consisting of the predicted values and actual values of the model on the training set and the verification set. For algorithms other than clustering, the feature importance of the model is also provided, so that business personnel can understand the model and use the key features for subsequent analysis.
In addition, the operations of feature screening, feature analysis, missing value filling and feature value mapping are repeated on the test set, and the processed test data is input into the model training module to obtain the prediction result of the model.
Referring again to FIG. 2, the system of the present invention includes a model evaluation module. The model evaluation module computes evaluation indices that reflect the effect of the model. It evaluates the effect of the trained model using the prediction results on the verification set and outputs a model report.
For binary classification problems, the model evaluation module outputs the following indices: accuracy, recall, precision, Gini coefficient, F1 statistic, confusion matrix, ROC (Receiver Operating Characteristic) graph and AUC (Area Under the ROC Curve), KS graph, lift graph, recall graph and response graph. For regression problems, the model evaluation module outputs the following indices: MSE (Mean Squared Error), RMSE (Root Mean Squared Error), R2 (coefficient of determination), adjusted R2, SMAPE (Symmetric Mean Absolute Percentage Error), EVS (Explained Variance Score), Median Absolute Error, MAE (Mean Absolute Error), a residual plot of each feature, a comparison plot of predicted and actual values, a quantile-quantile plot, and a residual distribution plot of predicted values. For clustering problems, the invention lets the user try several numbers of clusters: the system runs the clustering algorithm multiple times over the range of cluster numbers entered by the user and generates the corresponding results. The model evaluation module outputs the following indices: the sum of squared distances from the points in each cluster to its center point, a two-dimensional projection of the data as divided by the model, a distribution plot of each feature, and an elbow plot showing how the within-cluster sum of squares changes with the number of clusters. For multi-classification problems, the model evaluation module outputs accuracy, the confusion matrix, and the precision and recall of each category as evaluation indices. In addition to these common analysis indices, the evaluation indices also include those commonly used in the credit risk control field, such as KS graphs.
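Several of the regression indices listed above can be computed as in the sketch below. SMAPE has more than one definition in the literature; the variant used here, 2|p - t| / (|t| + |p|) averaged over samples and reported as a fraction, is an assumption, as are the function name and the returned key names.

```python
import math
import statistics


def regression_report(y_true, y_pred, n_features):
    """Compute a handful of regression metrics from actual and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    smape_terms = [
        2.0 * abs(p - t) / (abs(t) + abs(p)) if abs(t) + abs(p) > 0 else 0.0
        for t, p in zip(y_true, y_pred)
    ]
    return {
        "mse": ss_res / n,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "median_ae": statistics.median(abs(e) for e in errors),
        "r2": r2,
        # Adjusted R2 penalizes R2 for the number of features used.
        "adjusted_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1),
        "smape": sum(smape_terms) / n,
    }


# Hypothetical actual and predicted values for a single-feature model.
report = regression_report([1, 2, 3, 4, 5], [2, 2, 3, 4, 4], n_features=1)
```

In practice, the same report would be produced for both the training set and the verification set, so that a large gap between them can flag overfitting.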
Preferably, for classification problems, the feature analysis module provides the KS and IV indices (to observe the discriminating power of features) and PSI (to observe the stability of features). Feature screening is performed before modeling to reduce modeling complexity. In traditional fields, accuracy, recall, F1, AUC and the like are generally used as model evaluation indices, whereas in the credit field KS and lift are used to evaluate the overall effect of the model. The evaluation indices provided by existing general-purpose modeling platforms are mostly accuracy, recall, F1, AUC and the like, which cannot meet the model evaluation needs of users in the credit risk control field.
According to another aspect of the invention, the system provides a solution to the problem of data imbalance. The system is suitable for multi-scenario modeling processes, uses the ACGAN and BAGAN methods to generate minority-class samples, helps business personnel complete modeling through simple operations, supports big data computation, frees up manpower, and improves both efficiency and effect.
Therefore, the system comprises a data balance module, in which a GAN-based data generation algorithm is implemented to address the sample imbalance found in most fields. When facing an unbalanced data set, the data balancing module augments the minority-class samples after sample sampling, balancing the data of real interest. Oversampling each class to approximately the same proportion is a straightforward method, but it does not take advantage of the information contained in the majority-class samples. The invention selects ACGAN (Auxiliary Classifier GAN), which uses an auxiliary classifier to give a GAN (Generative Adversarial Network) a classification function, and BAGAN (Balancing GAN), which improves on ACGAN with a classifier of N (number of sample classes) + 1 classes, to generate samples for minority-class labels while distinguishing between data of different classes. A GAN model consists of a generator and a discriminator that learn adversarially; by fitting loss functions over real samples, fake samples and classes, the model reaches Nash equilibrium and converges. First, because the discriminator's task of recognizing the object in a picture (for example, judging whether it is a cat) is easier than the generator's task of producing a picture (imitating a picture of a cat), the discriminator learns faster; in the implementation, the generator is therefore trained M times for each single training of the discriminator. Second, because sample imbalance slows optimization and convergence, the gradient contribution of minority-class samples is scaled up by a factor of N during learning.
The loss functions finally used for ACGAN are therefore as follows:
$$L_S = E[\log P(S = \mathrm{real} \mid X_{\mathrm{real}})] + E[\log P(S = \mathrm{fake} \mid X_{\mathrm{fake}})]$$
$$L_C = E[\log P(C = c \mid X_{\mathrm{real}})] + E[\log P(C = c \mid X_{\mathrm{fake}})]$$
Wherein
Figure BDA0003080022390000101
The discriminator is trained once to maximize $L_S + L_C$, and the generator is trained M times to maximize $L_C - L_S$; the same applies to BAGAN. For a trained GAN model, samples can be generated given the required class and number of samples. The generated samples can be put into model training together with the original samples, and verification shows that this improves the model effect.
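To make the two objectives concrete, the sketch below evaluates $L_S$ and $L_C$ from discriminator output probabilities and lays out the "1 discriminator step, M generator steps" schedule. It is a numerical illustration only, not a GAN implementation: the probability lists stand in for network outputs, and all function names are hypothetical.

```python
import math


def mean_log(probs):
    """Empirical estimate of E[log p] over a batch of probabilities."""
    return sum(math.log(p) for p in probs) / len(probs)


def acgan_objectives(p_real_is_real, p_fake_is_fake, p_class_real, p_class_fake):
    """Evaluate L_S, L_C and the two quantities that each network maximizes."""
    l_s = mean_log(p_real_is_real) + mean_log(p_fake_is_fake)
    l_c = mean_log(p_class_real) + mean_log(p_class_fake)
    return {
        "L_S": l_s,
        "L_C": l_c,
        "discriminator": l_s + l_c,  # maximized by the discriminator
        "generator": l_c - l_s,      # maximized by the generator
    }


def training_schedule(rounds, m):
    """One discriminator update followed by m generator updates per round."""
    steps = []
    for _ in range(rounds):
        steps.append("D")
        steps.extend(["G"] * m)
    return steps


# Hypothetical batch in which the discriminator is maximally unsure about everything.
objectives = acgan_objectives([0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
schedule = training_schedule(rounds=2, m=3)
```

Both networks want the class term $L_C$ to be high, while only the source term $L_S$ is contested, which is why the generator objective carries $L_S$ with a negative sign.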
Data set | Original effect | Effect after adding samples to balance | Increment
Risk control, multi-source data, binary classification | KS: 17.06 | KS: 19.44 | KS: 2.38
Risk control, image classification | KS: 10.33 | KS: 12.19 | KS: 1.86
Marketing, binary classification | AUC: 0.62 | AUC: 0.64 | AUC: 0.02
To use the key data balancing algorithms, a label processing step is required: the label processing module processes the labels into the format required by the GAN. ACGAN uses an auxiliary classifier, so an auxiliary label column is added during label processing to distinguish real samples from fake samples. BAGAN, according to the assumptions of the algorithm, requires adding a fake-sample label with the value N.
In one embodiment, take the imbalance of positive and negative samples that is most common in risk control as an example. The platform needs to train a binary classification model, with user risk as the label, on business data in which positive samples are very rare: 400 positive samples (label 1) and 9600 negative samples (label 0), so that positive samples account for 4%. Using ACGAN, a positive sample label is (1 (original label), 1 (real or not: 1 real, 0 fake)), a negative positive sample label is (1, 0), and the generator is prepared to generate the sample labels (0, 1) and (0, 0). Using BAGAN, the generator is prepared to generate the sample label 2. The hyperparameters M and N need to be tuned with the training speed in mind. After GAN training converges, the sample balance module generates a balanced data set: for ACGAN, 9200 samples with label (1, 1) are generated; for BAGAN, 9200 samples with label 1 are generated. The generated samples are then recorded with label 1 and added to the sample set for subsequent operations.
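The label processing and the count of samples to generate in this embodiment can be sketched as follows. The tuple layout follows the "(original label, real-or-not)" description in the text; the function names and the rule of topping every class up to the size of the largest class are assumptions.

```python
def acgan_labels(class_labels, is_real=True):
    """Attach the auxiliary real/fake column: (original label, 1 if real else 0)."""
    flag = 1 if is_real else 0
    return [(c, flag) for c in class_labels]


def bagan_fake_label(n_classes):
    """BAGAN adds one extra label for fake samples, with the value N (the class count)."""
    return n_classes


def samples_to_generate(class_counts):
    """Number of samples to generate per class so every class matches the largest one."""
    target = max(class_counts.values())
    return {c: target - n for c, n in class_counts.items() if n < target}


# Figures from the embodiment: 400 positives (label 1) and 9600 negatives (label 0).
counts = {1: 400, 0: 9600}
needed = samples_to_generate(counts)

real_labels = acgan_labels([1, 0, 0])
fake_label = bagan_fake_label(n_classes=len(counts))
```

For the embodiment's figures this gives 9200 additional positive samples, which matches the number stated in the text, and a BAGAN fake-sample label of 2 for the two-class case.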
Preferably, classes that appear less frequently in the samples are oversampled. The algorithms in the model training module cover binary classification, regression, clustering, multi-class classification and other machine learning tasks for modeling credit data. Preferably, the binary classification algorithms include Logistic Regression, GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting) and LightGBM (Light Gradient Boosting Machine). The regression algorithms include Linear Regression, GBDT, XGBoost and LightGBM. The clustering algorithms include KMeans (k-means). The multi-class classification algorithms include XGBoost and LightGBM. When modeling with the above algorithms, configuration parameters can optionally be set according to actual requirements.
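One way to picture the relationship between task types, algorithms and optional configuration parameters is a small registry. This is only a sketch: the dictionary keys, the `build_model_config` helper and the example parameter are hypothetical and are not taken from the described system or from any library API.

```python
# Task type -> algorithms listed in the text (names used as plain strings).
SUPPORTED_ALGORITHMS = {
    "binary_classification": ["LogisticRegression", "GBDT", "XGBoost", "LightGBM"],
    "regression": ["LinearRegression", "GBDT", "XGBoost", "LightGBM"],
    "clustering": ["KMeans"],
    "multiclass_classification": ["XGBoost", "LightGBM"],
}


def build_model_config(task, algorithm, **params):
    """Validate a task/algorithm pair and bundle it with optional parameters."""
    if task not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"unknown task: {task}")
    if algorithm not in SUPPORTED_ALGORITHMS[task]:
        raise ValueError(f"{algorithm} is not listed for task {task}")
    return {"task": task, "algorithm": algorithm, "params": dict(params)}


if __name__ == "__main__":
    # 'max_depth' is an illustrative parameter name only.
    print(build_model_config("regression", "XGBoost", max_depth=4))
```

Leaving `params` empty corresponds to using an algorithm with its default configuration.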
The embodiments described above are merely preferred embodiments of the present invention; ordinary changes and substitutions made by those skilled in the art within the technical scope of the present invention fall within its scope of protection.

Claims (10)

1. An automated modeling system for data imbalance, comprising: a data reading module, a data preprocessing module, a feature analysis module and a model training module, wherein
the data reading module reads data from a big data cluster, the read data is a wide table that includes feature columns, and the feature columns are labeled;
the data preprocessing module processes the read data and comprises: a sample sampling module, a sample balancing module, a sample partitioning module, a feature analysis module, a feature screening module, a missing value filling module and a feature value mapping module, wherein the sample sampling module performs random sampling or stratified sampling of the samples according to a set sampling ratio, the sample balancing module augments the minority-class samples after sampling, the sample partitioning module divides the samples into a training set and a validation set according to a set partitioning ratio, the feature analysis module performs statistical analysis on each feature of the training set, the feature screening module screens the samples in the validation set and the training set to select the features that enter the model, the missing value filling module fills missing values in the data, and the feature value mapping module performs feature value mapping on the training set;
and the model training module trains on the training set data to obtain model prediction results for the training set and the validation set.
2. The automated modeling system of claim 1, wherein the sample balancing module augments the minority-class samples after sampling based on an ACGAN algorithm or a BAGAN algorithm, the sample balancing module comprising:
a label processing module that processes the labels to be augmented into the form required by the model;
a discriminator that discriminates whether data is input labeled data of a given class or generated data;
a generator that generates samples for a given class label.
3. The automated modeling system of claim 2,
when ACGAN is used, an auxiliary label column is added during label processing to distinguish real samples from fake samples, and when BAGAN is used, a fake-sample label with value N is added, where N is the number of sample classes.
4. The automated modeling system of claim 1, wherein the sample sampling module performs random sampling, stratified sampling, or sampling with a custom ratio of positive to negative labels, according to a set sampling ratio.
5. The automated modeling system of claim 1, wherein the feature analysis module uses the sample labels to compute the KS, IV and PSI metrics for each feature dimension.
6. The automated modeling system of claim 1, wherein the model training module further operates to: before model training, remove abnormal samples using an isolation forest algorithm, which takes a wide table containing feature columns as input and outputs an anomaly probability.
7. The automated modeling system of claim 6, wherein the operations of the data preprocessing module further comprise: performing data deduplication and format conversion on input data based on a market-basket FP-Growth algorithm, wherein the input data is users' commodity purchase records, and the data deduplication ensures that each record of a user and the same commodity appears only once.
8. The automated modeling system of claim 7, wherein the operations of the data preprocessing module further comprise: performing data deduplication and information mapping on input data based on a collaborative filtering ALS algorithm, wherein the input data is the interactions between users and commodities, the data deduplication ensures that each interaction between a user and a commodity appears only once, and the information mapping maps users and commodities to label indices.
9. The automated modeling system of claim 8, wherein the metrics for evaluating the effect of the model output by the model evaluation module comprise: accuracy, recall, precision, Gini coefficient, F1 score, confusion matrix, ROC curve, AUC, KS curve, lift curve, recall curve and response curve; MSE, RMSE, R2, adjusted R2, SMAPE, EVS, median absolute error, MAE, feature residual plot, plot comparing predicted and actual values, quantile-quantile plot and residual distribution plot of predicted values; the sum of squared distances from the points within a cluster to the cluster center; accuracy, confusion matrix, precision, and recall.
10. The automated modeling system of claim 9, wherein the data preprocessing module further comprises a sample proportioning module and a probability correction module, wherein the sample proportioning module is capable of sampling the negative samples to adjust the proportion of positive samples in the total samples, and the probability correction module is capable of adjusting the probabilities output by the model and correcting them using prior probabilities.
CN202110563919.1A 2021-05-24 2021-05-24 Automatic modeling system for data imbalance Pending CN113177642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110563919.1A CN113177642A (en) 2021-05-24 2021-05-24 Automatic modeling system for data imbalance

Publications (1)

Publication Number Publication Date
CN113177642A true CN113177642A (en) 2021-07-27

Family

ID=76929717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110563919.1A Pending CN113177642A (en) 2021-05-24 2021-05-24 Automatic modeling system for data imbalance

Country Status (1)

Country Link
CN (1) CN113177642A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104718547A (en) * 2013-10-11 2015-06-17 文化便利俱乐部株式会社 Customer data analysis system
CN108363714A (en) * 2017-12-21 2018-08-03 北京至信普林科技有限公司 A kind of method and system for the ensemble machine learning for facilitating data analyst to use
CN108470187A (en) * 2018-02-26 2018-08-31 华南理工大学 A kind of class imbalance question classification method based on expansion training dataset
CN109670892A (en) * 2017-10-17 2019-04-23 Tcl集团股份有限公司 A kind of collaborative filtering recommending method and system, terminal device
CN110414780A (en) * 2019-06-18 2019-11-05 东华大学 A kind of financial transaction negative sample generation method based on generation confrontation network
CN111311400A (en) * 2020-03-30 2020-06-19 百维金科(上海)信息科技有限公司 Modeling method and system of grading card model based on GBDT algorithm
CN111582651A (en) * 2020-04-09 2020-08-25 上海淇毓信息科技有限公司 User risk analysis model training method and device and electronic equipment
CN112527851A (en) * 2021-02-05 2021-03-19 北京淇瑀信息科技有限公司 User characteristic data screening method and device and electronic equipment
CN113177643A (en) * 2021-05-24 2021-07-27 北京融七牛信息技术有限公司 Automatic modeling system based on big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MARIANI G et al.: "BAGAN: Data Augmentation with Balancing GAN", 《INTERNATIONAL CONFERENCE ON MACHINE LEARNING》, 26 March 2018 (2018-03-26), pages 1 - 9 *
ZHOU Qi (周琪): "Research on personal credit risk assessment algorithms for class-imbalanced data", 《China Master's Theses Full-text Database (Information Science and Technology)》, no. 08, 15 August 2020 (2020-08-15), pages 140 - 120 *

Similar Documents

Publication Publication Date Title
CN111882446B (en) Abnormal account detection method based on graph convolution network
CN110968069B (en) Fault prediction method of wind generating set, corresponding device and electronic equipment
CN106228389A (en) Network potential usage mining method and system based on random forests algorithm
CN111079941B (en) Credit information processing method, credit information processing system, terminal and storage medium
CN114722746B (en) Chip aided design method, device and equipment and readable medium
CN112396428B (en) User portrait data-based customer group classification management method and device
CN110866832A (en) Risk control method, system, storage medium and computing device
CN117828539B (en) Intelligent data fusion analysis system and method
CN113590396A (en) Method and system for diagnosing defect of primary device, electronic device and storage medium
CN113177643A (en) Automatic modeling system based on big data
CN113450009A (en) Method and system for evaluating enterprise growth
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
CN113177644A (en) Automatic modeling system based on word embedding and depth time sequence model
CN114037018A (en) Medical data classification method and device, storage medium and electronic equipment
CN116911994B (en) External trade risk early warning system
KR102406375B1 (en) An electronic device including evaluation operation of originated technology
CN115952426B (en) Distributed noise data clustering method based on random sampling and user classification method
CN117372144A (en) Wind control strategy intelligent method and system applied to small sample scene
CN116719714A (en) Training method and corresponding device for screening model of test case
CN116611911A (en) Credit risk prediction method and device based on support vector machine
CN113177642A (en) Automatic modeling system for data imbalance
CN116091206A (en) Credit evaluation method, credit evaluation device, electronic equipment and storage medium
CN114186644A (en) Defect report severity prediction method based on optimized random forest
CN112258235A (en) Method and system for discovering new service of electric power marketing audit
TWI759785B (en) System and method for recommending audit criteria based on integration of qualitative data and quantitative data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination