CN113177644A - Automatic modeling system based on word embedding and depth time sequence model - Google Patents

Automatic modeling system based on word embedding and depth time sequence model

Info

Publication number
CN113177644A
Authority
CN
China
Prior art keywords
module
data
model
sample
mining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110564485.7A
Other languages
Chinese (zh)
Inventor
黎婧璇
时玥
谭俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rongqiniu Information Technology Co ltd
Original Assignee
Beijing Rongqiniu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rongqiniu Information Technology Co ltd filed Critical Beijing Rongqiniu Information Technology Co ltd
Priority to CN202110564485.7A priority Critical patent/CN113177644A/en
Publication of CN113177644A publication Critical patent/CN113177644A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an automatic modeling system based on word embedding and a deep time-series model, comprising a mining module, a data reading module, a data preprocessing module and a model training module. The mining module performs mining based on word embedding and the deep time-series model and generates the data that is read from the big data cluster; the data reading module reads data from the big data cluster; the data preprocessing module processes the read data and performs sample sampling, sample partitioning, feature screening, missing-value filling and feature-value mapping; and the model training module trains on the training-set data with different models to obtain model prediction results for the training set and the validation set. The vectors produced by the system are low-dimensional and efficient at expressing information, and similarity between them can be computed quickly.

Description

Automatic modeling system based on word embedding and depth time sequence model
Technical Field
The invention relates to the technical field of machine learning, and in particular to an automatic modeling system based on word embedding and a deep time-series model.
Background
Traditional manual modeling requires modeling specialists and developers to invest a great deal of time in data extraction, model algorithm selection, model parameter configuration and subsequent online optimization. When high-dimensional features and massive data are involved, data sampling and feature screening become necessary, which reduces the amount of data actually used, and the modeling cost is very high.
In addition, when handling buried-point behavior sequences, the technical route of the prior art is to represent each user behavior as a vector and then describe the whole behavior sequence with these vectors. The most direct way to encode each behavior is one-hot encoding, but the resulting vectors are high-dimensional and inefficient at expressing information, and their similarity is difficult to compute.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an automatic modeling system based on word embedding and a deep time-series model, which comprises a buried-point mining module, a data reading module, a data preprocessing module and a model training module. The mining module performs mining based on word embedding and the deep time-series model and generates the data that is read from the big data cluster; the data reading module reads data from the big data cluster; the data preprocessing module processes the read data and performs the following operations: sample sampling, sample partitioning, feature screening, missing-value filling and feature-value mapping; and the model training module trains on the training-set data with different models to obtain model prediction results for the training set and the validation set.
Optionally, the buried-point mining module performs the following operations: format conversion, which organizes the user behaviors collected at the buried points by user to obtain a behavior sequence for each user; information mapping, which uses word-embedding techniques to turn each click behavior in a user's behavior sequence into a vector and represents the whole behavior sequence as a time-series feature wide table of fixed-dimension vectors; and sequence mining, which processes sequence information that changes dynamically over time and predicts future behavior from the current trend.
Optionally, the mining module performs information mapping based on GloVe and performs sequence mining based on the LSTM and GRU algorithms.
Optionally, the hidden-layer vectors and output-layer probabilities produced by sequence mining are stored in a wide table and spliced with other feature tables.
Optionally, the input data of the mining module are the user behaviors and user labels collected at the buried points, and the outputs are the vectors mined for each user's behavior sequence and the probabilities of the labels predicted from those behaviors.
Optionally, the system comprises a model evaluation module, and the indicators output by the model evaluation module to evaluate model performance include: accuracy, recall, precision, Gini coefficient, F1 score, confusion matrix, ROC curve, AUC, KS curve, lift curve, recall curve and response curve; MSE, RMSE, R², adjusted R², SMAPE, EVS, median absolute error, MAE, residual plots of the features, plots comparing predicted and actual values, quantile-quantile plots and residual distribution plots of the predicted values; the sum of squared distances from the points within a cluster to the cluster center; and accuracy, confusion matrix, precision and recall.
Optionally, the data preprocessing module, which processes the read data, comprises a sample sampling module and a sample partitioning module; the sample sampling module performs random or stratified sampling of the samples according to a set sampling ratio, and the sample partitioning module divides the samples into a training set and a validation set according to a set split ratio.
Optionally, the data preprocessing module further includes a sample proportioning module and a probability correction module; the sample proportioning module can down-sample the negative samples to adjust the proportion of positive samples in the whole sample, and the probability correction module adjusts the probabilities of the modeling result, correcting them with the prior probability.
Optionally, the data preprocessing module further includes a feature analysis module, which uses the sample labels to compute the KS, IV and PSI indicators for each feature dimension.
Optionally, the data preprocessing module further includes a missing-value filling module that handles missing values in the data, and the model training module further includes a feature-value mapping module that performs feature-value mapping on the training set.
The vectors produced by the system of the invention are low-dimensional and efficient at expressing information, and similarity between them can be computed quickly. Experiments show that a model trained to recognize the labels from these vectors performs better than one trained by mixing the single prediction probability with the other feature modules.
Drawings
In order that the invention may be more readily understood, it will be described in more detail with reference to specific embodiments thereof that are illustrated in the accompanying drawings. These drawings depict only typical embodiments of the invention and are not therefore to be considered to limit the scope of the invention.
FIG. 1 is a schematic diagram of one embodiment of a system of the present invention.
FIG. 2 is a schematic diagram of one embodiment of a system of the present invention.
Detailed Description
Embodiments of the present invention will now be described with reference to the drawings, wherein like parts are designated by like reference numerals. The embodiments described below and the technical features of the embodiments may be combined with each other without conflict.
The model algorithms of the invention can be divided into two parts: traditional machine learning and deep learning. For traditional machine learning, Spark is used as the computation engine; through its RDD (Resilient Distributed Dataset) distributed-memory design, computing resources can be allocated according to the size of the data set, so that massive data can be processed easily and models such as logistic regression, random forest, XGBoost and LightGBM can be built. Deep learning uses the TensorFlow framework, which updates the model structure in batches over large-scale data to complete model training, for example FM and DeepFM, as well as the GloVe, LSTM and GRU models used by buried-point mining. Finally, Python scripts are called to evaluate the results.
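As a rough illustration of the traditional machine-learning path, the sketch below fits a logistic regression model with Spark's distributed ML API; the input path, column names and parameters are illustrative assumptions, not the system's actual configuration.

# Minimal sketch: Spark as the compute engine for a classical model.
# The table path and column names (f1..f3, label) are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("auto-modeling-demo").getOrCreate()
df = spark.read.parquet("/data/sample_table")          # wide table read from the big data cluster

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

lr = LogisticRegression(maxIter=50, regParam=0.01)      # optimization is distributed over RDD partitions
model = lr.fit(train)
print(model.summary.areaUnderROC)                       # quick sanity check on the training set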
As shown in fig. 1, the system of the present invention comprises a buried-point mining module, a data reading module, a data preprocessing module, a feature analysis module, a model training module and a model online module, which reduces the possibility of manual errors across the large number of process steps. A flow log is also provided for each module so that business personnel can easily follow the progress.
The working principle of the present invention is described below with reference to fig. 1.
As shown in fig. 2, the buried-point mining module performs mining based on word embedding and the deep time-series model, and addresses the problem that buried-point sequence data is insufficiently mined in many fields. Mining the buried-point data requires format conversion, information mapping and sequence mining, and finally produces a user feature wide table.
Format conversion organizes the collected user behaviors by user to obtain a behavior sequence for each user.
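A minimal sketch of this format-conversion step, assuming the buried-point log sits in a pandas DataFrame with hypothetical columns event_time, user_id and behavior:

import pandas as pd

log = pd.DataFrame({
    "event_time": ["2021-04-25 18:00", "2021-04-25 18:02", "2021-04-25 18:20"],
    "user_id":    ["A", "A", "A"],
    "behavior":   ["click page 1", "click page 2", "click page 3"],
})

# Sort each user's events by time and collapse them into one ordered behavior sequence.
sequences = (
    log.sort_values("event_time")
       .groupby("user_id")["behavior"]
       .apply(list)
       .rename("behavior_sequence")
       .reset_index()
)
# user A -> ["click page 1", "click page 2", "click page 3"]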
Information mapping uses word-embedding techniques to turn each click behavior in the user behavior sequence into a vector, and then represents the whole behavior sequence as a time-series feature wide table of fixed-dimension vectors. In one embodiment, the information mapping module adopts Word2Vec word vectors, which achieve the expected effect with high practicability using only local context windows, and GloVe (Global Vectors for Word Representation), which assumes consistency between the word vectors and the co-occurrence matrix and makes full use of global statistical information, to complete the training of the model that maps click behaviors into fixed-dimension vectors.
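The mapping itself can be sketched as follows. GloVe is not a built-in of the common Python stacks, so this minimal sketch uses gensim's Word2Vec as a stand-in to show how behavior tokens become fixed-dimension vectors; the vector size and other hyper-parameters are illustrative assumptions.

from gensim.models import Word2Vec

# Each "sentence" is one user's ordered behavior sequence (see the format-conversion sketch).
behavior_sequences = [
    ["click page 1", "click page 2", "click page 3"],
    ["click page 2", "click page 3"],
]

# Skip-gram model; 16-dimensional vectors are an assumption, not the patent's setting.
w2v = Word2Vec(sentences=behavior_sequences, vector_size=16, window=5,
               min_count=1, sg=1, epochs=50)

vec = w2v.wv["click page 1"]                               # fixed-dimension vector for one behavior
sim = w2v.wv.similarity("click page 1", "click page 2")    # similarity is now cheap to compute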
For sequence mining, because buried-point behaviors have strong sequential correlation, obtaining a sentence embedding by unsupervised matrix operations that combine the word embeddings is not suitable. Sequence mining therefore uses supervised deep time-series models: LSTM (Long Short-Term Memory), which accounts for how the data changes along the time dimension, and GRU (Gated Recurrent Unit), a variant with fewer parameters and comparable effect, to process sequence information that changes dynamically over time and to predict future behavior from the current trend. Both LSTM and GRU consist of an input layer, hidden layers and an output layer; the output of each layer is the input of the next, and the result of the final output layer is the predicted probability of the label, so the model result can be evaluated directly. In addition, the input to the output layer, i.e. the output of the hidden layer, is an embedding vector of fixed length that reflects the information mined from the user's sequence. This vector can be used as a feature to train a new model that recognizes the label; the evaluation results show that the information mined from user sequences has a certain ability to discriminate user labels, and that a model trained to recognize the labels from this vector performs better than one trained by mixing the single prediction probability with the other feature modules.
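A minimal TensorFlow/Keras sketch of such a supervised sequence model, assuming the behaviors have already been mapped to integer ids and padded to a fixed length; the layer sizes and dummy data are illustrative assumptions, and the second Model shows one way to read out the hidden-layer vector described above.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

max_len, vocab_size = 20, 100                                  # illustrative assumptions
seq_in = layers.Input(shape=(max_len,), dtype="int32")
emb    = layers.Embedding(input_dim=vocab_size, output_dim=16, mask_zero=True)(seq_in)
hidden = layers.LSTM(32)(emb)                                  # layers.GRU(32) is the lighter-weight alternative
prob   = layers.Dense(1, activation="sigmoid")(hidden)         # output layer: predicted label probability

clf = Model(seq_in, prob)
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])

X = np.random.randint(1, vocab_size, size=(256, max_len))      # padded behavior-id sequences (dummy data)
y = np.random.randint(0, 2, size=(256,))
clf.fit(X, y, epochs=2, batch_size=32, verbose=0)

# The hidden-layer output is the fixed-length embedding of the whole behavior sequence;
# it can be stored in a wide table and spliced with other feature tables.
encoder = Model(seq_in, hidden)
sequence_vectors = encoder.predict(X, verbose=0)               # shape (256, 32)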
In one embodiment, taking the buried points of an app as an example, the input is a table of (time, user, user's behavior in the app) records, for example:
2021-04-25 18:00, user A, click page 1
2021-04-25 18:02, user A, click page 2
2021-04-25 18:20, user A, click page 3
together with a label table for the users, for example (user A, 1 (purchased item B)). First, before the data reading module reads the data, format conversion is applied to the buried-point records to obtain a behavior sequence for each user, such as user A: (click page 1, click page 2, click page 3). Then each user behavior unit is turned into a vector with GloVe, for example the vector [0.1 0.1 0.1] corresponding to clicking page 1, which gives the user behavior sequence and the vector sequence it maps to. This is spliced with the users' label table into a wide table containing the user behavior sequence features and the labels, which is used as the input of the deep time-series model to train the model. The hidden-layer vectors and output-layer probabilities of the deep time-series model can be stored in a table and optionally spliced with other feature tables, after which subsequent processes such as risk-control modeling can be carried out; the result can be compared with the results obtained from the buried points alone or from the original feature table alone, which makes it convenient to check the gain from mining.
Then, as shown in fig. 1, the data reading module reads data from the big data cluster for training and testing. The data read in is a wide table containing various feature columns; by judging the type of each column in the input data and combining it with the user's input, the index (ID) column (used to distinguish samples, not used to train the model), the label column and the date column are identified for later use in the feature analysis module. The table can come in several file formats and from several databases, including uploaded files in common formats such as txt and csv, or connections to common databases such as Greenplum and MySQL.
The data preprocessing module processes the data read by the data reading module and prepares it for the modules that follow. The data preprocessing module comprises: a sample sampling module, a sample partitioning module, a feature screening module, a missing-value filling module and a feature-value mapping module.
Preferably, for binary classification, sample proportioning and probability correction can also be performed. In binary classification problems in credit risk control, the proportion of positive samples in real scenarios is often small; sample proportioning supports down-sampling the negative samples to adjust the proportion of positive samples in the whole sample and reduce the influence of the unbalanced sample ratio on the risk-control model. Probability correction adjusts the probabilities of the modeling result: because the samples have been re-sampled, the prior probability is used to correct the predicted probabilities so that evaluation indicators such as lift and response rate are not affected by the changed positive-to-negative ratio, and the corrected predicted probabilities are added to the model training results described below.
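A minimal sketch of one common form of prior-probability correction, assuming the negative samples were down-sampled at a known keep-rate beta before training; the exact correction used by the system is not specified here.

import numpy as np

def correct_probability(p_sampled: np.ndarray, beta: float) -> np.ndarray:
    """Map probabilities predicted on a negative-down-sampled training set back to the
    original prior. beta is the fraction of negative samples that were kept; the formula
    rescales the odds by beta: odds_true = beta * odds_sampled."""
    return beta * p_sampled / (beta * p_sampled + 1.0 - p_sampled)

# Example: negatives kept at 10%, model predicts 0.40 on the re-balanced data.
print(correct_probability(np.array([0.40]), beta=0.10))   # ~0.0625 after correction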
The sample sampling module performs random or stratified sampling of the samples according to the set sampling ratio. Random and stratified sampling are performed in the database to reduce the size of the modeling sample, which lowers cost and speeds up subsequent model training. In addition, for data sets with unbalanced labels, the module also provides a sampling method in which the user customizes the ratio of positive to negative labels, in order to improve the subsequent model training effect. In one embodiment, the user-defined positive/negative-label-ratio sampling method is used to sample the data set to a positive-to-negative ratio of 1:10, so that the subsequent model does not learn in a biased way.
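A minimal pandas sketch of the user-defined positive/negative-ratio sampling, assuming a binary column named label with 1 as the positive class; the 1:10 ratio follows the embodiment above.

import pandas as pd

def sample_by_ratio(df: pd.DataFrame, neg_per_pos: int = 10,
                    label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Keep all positive samples and randomly down-sample negatives to a
    positive:negative ratio of 1:neg_per_pos."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_neg = min(len(neg), len(pos) * neg_per_pos)
    neg_sampled = neg.sample(n=n_neg, random_state=seed)
    return pd.concat([pos, neg_sampled]).sample(frac=1.0, random_state=seed)  # shuffle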
The sample partitioning module divides the samples into a training set and a validation set according to the set split ratio. The validation set can be used to tune the model hyper-parameters and to make a preliminary assessment of the ability of the model built on the identically distributed training set. In one embodiment, the split is made either at random in the given proportion or sequentially by time.
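A minimal sketch of the two split strategies, assuming pandas data with a hypothetical date column apply_date and a binary label column:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({                         # tiny dummy wide table for illustration
    "apply_date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "feature_1":  range(10),
    "label":      [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# 1) Random split in a given proportion (stratified on the label to keep the class ratio).
train_df, valid_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df["label"])

# 2) Sequential split by time: the earliest 70% train, the most recent 30% validate.
df_sorted = df.sort_values("apply_date")
cut = int(len(df_sorted) * 0.7)
train_df, valid_df = df_sorted.iloc[:cut], df_sorted.iloc[cut:]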
The feature analysis module performs statistical analysis on each feature of the data. The module can analyze any feature wide table in the data flow, but to balance the value of the analysis for the model against the execution time, it is placed between sample partitioning and feature screening. Specifically, the feature analysis module automatically determines whether a feature is numerical-continuous, numerical-categorical, character-type or date-type according to the ratio of the number of distinct values of each feature dimension to the total number of samples and the value type of the feature. Different statistics are computed for numerical-continuous features and for numerical-categorical and character-type features (hereinafter jointly referred to as categorical features), and they can be visualized so that business personnel can understand the data distribution. Specifically, the statistical analysis includes: for numerical-continuous features, the null rate, zero rate, maximum, minimum, mean, median, first quartile, third quartile, standard deviation, variance, skewness and kurtosis, together with a histogram, box plot and density curve of the feature distribution; for categorical features, the number of categories, the null rate, the categories with the highest and lowest proportions, and a pie chart of the feature distribution, so that the user can analyze each individual feature. In particular, for applications in the credit risk-control field, the feature analysis module uses the sample labels to compute the KS (Kolmogorov-Smirnov) and IV (Information Value) indicators for each feature dimension. These two indicators are essential in credit risk-control modeling: KS reflects how well the feature separates positive and negative samples, and the larger the KS the better the separation; IV reflects the correlation between the feature and the label, and the larger the IV the stronger the relationship. In addition, the PSI (Population Stability Index) of each feature dimension can be computed; this is also a common indicator in feature screening for credit risk-control modeling and reflects the stability of the feature: the larger the PSI, the less stable the feature, and features with a PSI greater than 0.2 are deleted during modeling.
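A minimal sketch of the KS, IV and PSI computations for one numerical feature, assuming simple equal-frequency binning (the binning scheme is an assumption of this sketch; the patent does not specify it):

import numpy as np
import pandas as pd

def ks_iv_psi(feature: pd.Series, label: pd.Series, expected: pd.Series, bins: int = 10):
    """KS and IV of `feature` against a binary `label`, and PSI of `feature` against a
    reference sample `expected`, all approximated on equal-frequency bins."""
    cuts = pd.qcut(feature, q=bins, duplicates="drop")
    grp = pd.crosstab(cuts, label)                               # counts of label 0/1 per bin
    bad_rate  = (grp[1] / grp[1].sum()).clip(lower=1e-6)
    good_rate = (grp[0] / grp[0].sum()).clip(lower=1e-6)

    ks = float((bad_rate.cumsum() - good_rate.cumsum()).abs().max())
    iv = float(((bad_rate - good_rate) * np.log(bad_rate / good_rate)).sum())

    # PSI: compare the bin distribution of the current sample with the reference sample.
    edges = pd.qcut(expected, q=bins, retbins=True, duplicates="drop")[1]
    exp_pct = pd.cut(expected, edges, include_lowest=True).value_counts(normalize=True).sort_index().clip(lower=1e-6)
    act_pct = pd.cut(feature,  edges, include_lowest=True).value_counts(normalize=True).sort_index().clip(lower=1e-6)
    psi = float(((act_pct - exp_pct) * np.log(act_pct / exp_pct)).sum())
    return ks, iv, psi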
The data preprocessing module may also comprise a feature screening module, which screens the features on the training and validation sets to select the features used in the model. Specifically, the numerical and categorical feature lists are counted separately according to the data type of each feature column, the feature lists are then screened, and a feature is considered invalid and deleted if the number of distinct values of a categorical feature exceeds a set number (for example 100) or the missing rate of a numerical variable reaches a set value (for example 0.8). Preferably, the invention not only lets business personnel manually screen the features interactively based on the sample analysis results, but also provides screening by feature missing rate and by the number of levels of discrete features, so that routine screening can be done quickly with thresholds.
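A minimal sketch of the threshold-based screening, assuming pandas dtypes are used to tell numerical from categorical columns; the thresholds (100 levels, 0.8 missing rate) follow the example values above.

import pandas as pd

def screen_features(df: pd.DataFrame, max_levels: int = 100, max_missing: float = 0.8,
                    exclude: tuple = ("id", "label", "date")) -> list:
    """Return the columns kept for modeling: drop categorical features with too many
    distinct values and numerical features with too high a missing rate."""
    kept = []
    for col in df.columns:
        if col in exclude:
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            if df[col].isna().mean() < max_missing:
                kept.append(col)
        else:  # categorical / character-type feature
            if df[col].nunique(dropna=True) <= max_levels:
                kept.append(col)
    return kept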
The data preprocessing module may also comprise a missing-value filling module, which handles missing values in the data so that models that cannot accept missing values still run normally. For numerical features, filling with the mean, the median or a user-specified value can be chosen, and categorical variables are filled with a user-specified or default value. Optionally, missing values in the training and validation sets are filled using training-set statistics or fixed values.
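A minimal sketch of filling missing values with training-set statistics and reusing the same statistics on the validation set; deciding by column dtype is an assumption of this sketch.

import pandas as pd

def fit_fill_values(train: pd.DataFrame, numeric_strategy: str = "median",
                    category_default: str = "missing") -> dict:
    """Compute per-column fill values from the training set only."""
    fill = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill[col] = train[col].median() if numeric_strategy == "median" else train[col].mean()
        else:
            fill[col] = category_default
    return fill

# fill = fit_fill_values(train_df)
# train_df = train_df.fillna(fill)
# valid_df = valid_df.fillna(fill)     # the validation set reuses the training-set statistics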
The data preprocessing module may also include a feature-value mapping module. The feature-value mapping module performs feature-value mapping on the training set and processes the validation set with the same mapping dictionary. Optionally, the feature-value mapping module encodes the categorical features. Since OneHotEncoder (one-hot encoding) increases the depth of the trees for the platform's many tree-based methods, LabelEncoder (label encoding) is preferably used to map categorical features to ordinal values starting from 0. For categorical variables with many distinct values, WOE (Weight of Evidence) derived methods can be used to convert them into continuous variables.
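A minimal sketch of the two mappings, fitting the dictionaries on the training set and reusing them on the validation set; mapping unseen categories to a default value is an assumption of this sketch rather than documented system behavior.

import numpy as np
import pandas as pd

def fit_label_map(train_col: pd.Series) -> dict:
    """Map each category seen in training to an ordinal value starting from 0."""
    return {cat: i for i, cat in enumerate(sorted(train_col.dropna().unique()))}

def fit_woe_map(train_col: pd.Series, label: pd.Series, eps: float = 1e-6) -> dict:
    """Weight of Evidence per category: ln(share of positives / share of negatives)."""
    tab = pd.crosstab(train_col, label)
    pos = (tab[1] / tab[1].sum()).clip(lower=eps)
    neg = (tab[0] / tab[0].sum()).clip(lower=eps)
    return np.log(pos / neg).to_dict()

# mapping = fit_label_map(train_df["city"])
# train_df["city"] = train_df["city"].map(mapping)
# valid_df["city"] = valid_df["city"].map(mapping).fillna(-1)   # unseen category -> default value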
Referring to fig. 1, after data preprocessing the model training module can train different models on the training-set data and obtain model prediction results for the training set and the validation set. The model training module is responsible for training the selected algorithm.
Referring to fig. 1, the model evaluation module calculates evaluation indicators that reflect the effect of the model. Based on the model prediction results, the model evaluation module evaluates the effect of the trained model on the validation set and outputs a model report.
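A minimal sketch of a few of the binary-classification indicators listed earlier (AUC, the KS statistic, and the Gini coefficient derived from AUC), assuming arrays of validation labels and predicted probabilities:

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])                 # dummy validation labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])    # dummy predicted probabilities

auc = roc_auc_score(y_true, y_prob)
fpr, tpr, _ = roc_curve(y_true, y_prob)
ks = float(np.max(tpr - fpr))                         # KS = maximum gap between TPR and FPR
gini = 2 * auc - 1                                    # Gini coefficient derived from AUC
print(f"AUC={auc:.3f}  KS={ks:.3f}  Gini={gini:.3f}")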
During testing, the test data is processed again with the same functions as the feature screening, missing-value filling and feature-value mapping modules used in data preprocessing, the model is used for prediction to obtain the predicted and actual values on the test data, and the model evaluation is run again to obtain the evaluation result on the test set.
Referring to fig. 1, in the model online module the user can select the best model to put online according to the model evaluation results. For a model that is online, the user can send requests by calling an API interface, and the system returns the corresponding model result, which meets the business need for real-time prediction. Meanwhile, online models can be updated, taken offline and so on, to ensure that the online models remain effective.
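A minimal sketch of such a real-time prediction interface, assuming a Flask service wrapping an already trained scikit-learn-style model; the route name, model file and payload format are illustrative assumptions.

# Minimal sketch of an online prediction API (route, model file and payload are assumptions).
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("best_model.pkl")          # the model selected from the evaluation report

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()               # e.g. {"f1": 0.3, "f2": 1, ...}
    features = pd.DataFrame([payload])
    prob = float(model.predict_proba(features)[0, 1])
    return jsonify({"probability": prob})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)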
The above-described embodiments are merely preferred embodiments of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. An automatic modeling system based on word embedding and a deep time-series model, comprising: a mining module, a data reading module, a data preprocessing module and a model training module, wherein
the mining module performs mining based on word embedding and the deep time-series model and generates the data that is read from the big data cluster;
the data reading module reads data from the big data cluster;
the data preprocessing module processes the read data and performs the following operations: sample sampling, sample partitioning, feature screening, missing-value filling and feature-value mapping; and
the model training module trains on the training-set data with different models to obtain model prediction results for the training set and the validation set.
2. The automatic modeling system of claim 1, wherein the mining module performs:
format conversion, which organizes the collected user behaviors by user to obtain a behavior sequence for each user;
information mapping, which uses word-embedding techniques to turn each click behavior in a user behavior sequence into a vector and represents the whole behavior sequence as a time-series feature wide table of fixed-dimension vectors; and
sequence mining, which processes sequence information that changes dynamically over time and predicts future behavior from the current trend.
3. The automatic modeling system of claim 2, wherein the mining module performs information mapping based on GloVe and performs sequence mining based on the LSTM and GRU algorithms.
4. The automatic modeling system of claim 3, wherein the hidden-layer vectors and output-layer probabilities produced by sequence mining are stored in a wide table and spliced with other feature tables.
5. The automatic modeling system of claim 1, wherein the input data of the mining module are the user behaviors and user labels collected at the buried points, and the outputs are the vectors mined for each user's behavior sequence and the probabilities of the labels predicted from those behaviors.
6. The automatic modeling system of claim 1, further comprising a model evaluation module, wherein the indicators output by the model evaluation module to evaluate model performance comprise: accuracy, recall, precision, Gini coefficient, F1 score, confusion matrix, ROC curve, AUC, KS curve, lift curve, recall curve and response curve; MSE, RMSE, R², adjusted R², SMAPE, EVS, median absolute error, MAE, residual plots of the features, plots comparing predicted and actual values, quantile-quantile plots and residual distribution plots of the predicted values; the sum of squared distances from the points within a cluster to the cluster center; and accuracy, confusion matrix, precision and recall.
7. The automatic modeling system of claim 1, wherein the data preprocessing module, which processes the read data, comprises: a sample sampling module and a sample partitioning module, wherein the sample sampling module performs random or stratified sampling of the samples according to a set sampling ratio, and the sample partitioning module divides the samples into a training set and a validation set according to a set split ratio.
8. The automatic modeling system of claim 7, wherein the data preprocessing module further comprises a sample proportioning module and a probability correction module, wherein the sample proportioning module can down-sample the negative samples to adjust the proportion of positive samples in the whole sample, and the probability correction module adjusts the probabilities of the modeling result and corrects them using the prior probability.
9. The automatic modeling system of claim 1, wherein the data preprocessing module further comprises a feature analysis module, which uses the sample labels to compute the KS, IV and PSI indicators for each feature dimension.
10. The automatic modeling system of claim 1, wherein the data preprocessing module further comprises a missing-value filling module that handles missing values in the data, and the model training module further comprises a feature-value mapping module that performs feature-value mapping on the training set.
CN202110564485.7A 2021-05-24 2021-05-24 Automatic modeling system based on word embedding and depth time sequence model Pending CN113177644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110564485.7A CN113177644A (en) 2021-05-24 2021-05-24 Automatic modeling system based on word embedding and depth time sequence model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110564485.7A CN113177644A (en) 2021-05-24 2021-05-24 Automatic modeling system based on word embedding and depth time sequence model

Publications (1)

Publication Number Publication Date
CN113177644A true CN113177644A (en) 2021-07-27

Family

ID=76929695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110564485.7A Pending CN113177644A (en) 2021-05-24 2021-05-24 Automatic modeling system based on word embedding and depth time sequence model

Country Status (1)

Country Link
CN (1) CN113177644A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146704A (en) * 2022-05-27 2022-10-04 中睿信数字技术有限公司 Event automatic classification method and system based on distributed database and machine learning
CN115146704B (en) * 2022-05-27 2023-11-07 中睿信数字技术有限公司 Event automatic classification method and system based on distributed database and machine learning
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116580283B (en) * 2023-07-13 2023-09-26 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination