CN109582724A

CN109582724A - Distributed automated feature engineering system architecture

Info

Publication number: CN109582724A
Application number: CN201811493937.1A
Authority: CN
Inventors: 施铭铮; 刘占辉
Original assignee: Xiamen Pencil Head Information Technology Co Ltd
Current assignee: Xiamen Pencil Head Information Technology Co Ltd
Priority date: 2018-12-07
Filing date: 2018-12-07
Publication date: 2019-04-05
Anticipated expiration: 2038-12-07
Also published as: CN109582724B

Abstract

The invention discloses a distributed automated feature engineering system architecture implemented in the following steps: a distributed automated feature computing cluster, a dimensionality reduction algorithm, model training, and hyperparameter search. The architecture is reasonably designed and can save an auto finance leasing company substantial labor costs: work that would otherwise be done by feature engineers and experienced domain risk-control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report.

Description

Distributed automated feature engineering system architecture

Technical field

The present invention relates to a distributed automated feature engineering system architecture and belongs to the technical field of auto finance risk control.

Background art

In traditional auto finance risk control, domain experts formulate a series of risk-control rules. Each rule may contain calculation formulas that are evaluated on the loan applicant's data, and each rule may need one or more items of customer data (in this document, one item of customer data is defined as a feature). If the score computed from the applicant's data after all rules are applied is lower than the risk-control passing score, the loan application is rejected.

A characteristic of feature engineering is that a useful feature is often computed from several raw features (i.e., raw data) by simple arithmetic. In other words, after the customer provides the loan application data, there is a further process of extracting and generating features, and this process is collectively called feature engineering. After new features are generated by feature engineering, all features are merged and fed into a machine learning algorithm.

The first problem is that experts who have risk-control experience and can formulate risk-control rules are a very scarce resource. The second problem is that even when an expert formulates the rules, the rules are distilled from that expert's personal experience and cannot represent the overall needs of the entire auto finance industry. The third problem is that manual extraction of new features by experts or feature engineers is very time-consuming: a dataset generally comes from several data sources, and a single data source often contains multiple data tables, so computing the permutations and combinations of so much raw data may take a feature engineer several weeks.

It is therefore of great significance to invent an architecture that automatically completes the whole process of feature engineering and model training and outputs the final risk-control report. To this end, the present invention proposes a distributed automated feature engineering system architecture.

Summary of the invention

In view of the deficiencies of the prior art, the object of the present invention is to provide a distributed automated feature engineering system architecture that solves the problems raised in the Background section. The invention is reasonably designed and can save an auto finance leasing company substantial labor costs: work that would otherwise be done by feature engineers and experienced domain risk-control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report.

To achieve the above object, the invention provides the following technical solution: a distributed automated feature engineering system architecture, implemented in the following steps:

Step 1: distributed automated feature computing cluster. A computer cluster is required, and the first part is the distributed automated feature computing cluster. Assume the raw data consists of one main table and n secondary tables, which is the usual form of data in auto finance risk control. n computers are allocated in the cluster, each deployed with HBase and Python. Every computer has two roles: distributed data storage and distributed computation.

Each computer is assigned one of the n secondary tables, and the main table is copied to every computer; for example, computer n merges the main table with secondary table n and generates new features. The generation of new features is controlled by a set of parameters, and different secondary tables generally have different numbers of parameters. Suppose secondary table 1 has x₁ parameters, secondary table 2 has x₂ parameters, and so on; then the n secondary tables have x₁ + x₂ + … + xₙ parameters in total. The main table and all secondary tables can be partitioned vertically so that they are distributed more evenly across the computers in the cluster.

All features generated by the first part are merged into one large table, which serves as the input data of the second part. Feature extraction in the first part generates a large number of features; feeding all of them into the model would require a large number of model parameters, so dimensionality reduction is applied to all features before the data enters the model.
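The work done on one computer in Step 1 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each parameter names an aggregation applied to one column of a secondary table; the source does not say what the parameters are. The function name `generate_features`, the table contents, and the column names are hypothetical, and plain Python lists stand in for HBase storage.

```python
from collections import defaultdict
from statistics import mean

# Aggregations a worker may apply. Which ones run is the (assumed)
# parameter set for a given secondary table.
AGGREGATIONS = {"count": len, "sum": sum, "mean": mean, "max": max}

def generate_features(main_rows, secondary_rows, key, column, params):
    """Merge one secondary table into the main table by aggregating `column` per `key`."""
    grouped = defaultdict(list)
    for row in secondary_rows:
        grouped[row[key]].append(row[column])

    merged = []
    for row in main_rows:
        values = grouped.get(row[key], [])
        new_row = dict(row)  # keep the main-table columns
        for name in params:
            # One new feature per parameter, e.g. "amount_mean".
            new_row[f"{column}_{name}"] = AGGREGATIONS[name](values) if values else 0
        merged.append(new_row)
    return merged

# Hypothetical data: one main-table row per customer,
# several secondary-table (payment) rows per customer.
main_table = [{"customer_id": 1, "age": 30}, {"customer_id": 2, "age": 45}]
payments = [
    {"customer_id": 1, "amount": 100.0},
    {"customer_id": 1, "amount": 300.0},
    {"customer_id": 2, "amount": 50.0},
]

features = generate_features(main_table, payments, "customer_id", "amount", ["count", "mean"])
```

In a cluster, each computer would run this kind of merge on its own secondary table, and the resulting rows would then be combined into the single large table described above.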

Step 2: dimensionality reduction algorithm. The specific steps of the dimensionality reduction algorithm are as follows:

(1) Rank all features using a decision tree algorithm, or another algorithm that can rank feature importance, to obtain a feature importance list; select x% (for example, 10%) of the features in the list as the base feature list, denoted L;

(2) Take one feature f from the remaining features and add it to L;

(3) Run ridge regression, or a similar regression, on L;

(4) Compute the AUC;

(5) If the AUC improves (i.e., accuracy improves), keep feature f; if the AUC does not improve, remove feature f;

(6) Repeat (2) through (5) until all features have been processed. When the loop completes, the dimensionality reduction algorithm yields the reduced feature set.
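The selection loop in Step 2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the base list is taken to be the top of the importance ranking, the remaining features are tried in ranked order, and the ridge regression fit is replaced by a pluggable `score_fn` so the example stays self-contained. The toy scorer, the feature names, and the data are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve: chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def reduce_features(ranked, base_fraction, score_fn):
    """Greedy loop: start from the base list L, keep a feature only if the AUC improves."""
    n_base = max(1, int(len(ranked) * base_fraction))
    selected = list(ranked[:n_base])          # base feature list L
    best = score_fn(selected)
    for f in ranked[n_base:]:                 # take one remaining feature f
        candidate = score_fn(selected + [f])  # fit on L + f and compute the AUC
        if candidate > best:                  # AUC improved: keep f
            selected.append(f)
            best = candidate
    return selected, best

# Hypothetical toy data: four samples, three features already ranked by importance.
labels = [1, 1, 0, 0]
columns = {"a": [1, 0, 0, 0], "noise": [0, 0, 1, 0], "b": [0, 1, 0, 0]}

def toy_score(feature_names):
    # Stand-in for "fit a ridge regression and score it": sum the chosen columns.
    scores = [sum(columns[f][i] for f in feature_names) for i in range(len(labels))]
    return auc(labels, scores)

selected, best = reduce_features(["a", "noise", "b"], 0.1, toy_score)
```

In this toy run, `noise` lowers the AUC and is discarded, while `b` raises it and is kept. With real data, `score_fn` would fit a ridge regression (or a similar model) on the candidate feature list and return its AUC.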

Step 3: model training. The reduced feature set is fed into the models as input data for training. Model training uses an ensemble algorithm. The algorithms included are deep neural networks (using Python packages such as TensorFlow and Keras), gradient boosting machines (using Python packages such as LightGBM, xgboost, and catboost), and random forests (using the scikit-learn Python package), among others. The main steps of model training are as follows:

(1) Determine the algorithm set, for example deep neural network, LightGBM, xgboost, catboost, random forest, and so on;

(2) For each algorithm in the algorithm set, input the reduced feature set, evaluate the model with k-fold cross-validation, and finally output the AUC; the output of each algorithm is a single column, namely the AUC values;

(3) Merge the output columns (the AUC columns) of all algorithms into one table A; for example, if the algorithm set contains 20 algorithms and each outputs one column, the merged table A has 20 columns;

(4) Determine the ensemble algorithm; common choices include logistic regression and neural networks. If logistic regression is used as the ensemble algorithm, table A is input to the logistic regression algorithm and the final ensemble AUC is computed. It can be verified that the AUC after ensembling is higher than the AUC computed by any single algorithm.

If there are 20 algorithms, they are deployed to 20 computers in the cluster, one algorithm per computer, and the feature set output by the dimensionality reduction algorithm of the second part is copied to all of these computers. In other words, the 20 computers receive the same input data but run different algorithms. After all algorithms finish, the resulting AUC columns are gathered on one host, where the ensemble algorithm is computed. The AUC produced by the ensemble algorithm is a single column; averaging this column gives the mean AUC (a single number). This mean AUC, together with the hyperparameters used in the model training of the third part and the x₁ + x₂ + … + xₙ parameters of the feature engineering in the first part, forms the input data of the fourth part.
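The data flow of Step 3, from k-fold cross-validation to table A, can be sketched as follows. The source describes each algorithm's output column as "AUC"; this sketch assumes, as one possible reading, that each column holds per-sample out-of-fold scores, which is what a stacking ensemble normally feeds to its final model. The toy algorithm `mean_gap_model`, the helper names, and the data are hypothetical, and the final logistic regression step is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; sample i falls in test fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test

def out_of_fold_column(fit_predict, X, y, k):
    """Score every sample with a model that never saw it during training."""
    column = [None] * len(X)
    for train, test in k_fold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        for i, p in zip(test, preds):
            column[i] = p
    return column

def build_table_a(algorithms, X, y, k):
    """One column per algorithm, one row per sample."""
    columns = [out_of_fold_column(alg, X, y, k) for alg in algorithms]
    return [list(row) for row in zip(*columns)]

def mean_gap_model(feature):
    """Hypothetical toy algorithm: distance of one feature from the midpoint of the class means."""
    def fit_predict(train_X, train_y, test_X):
        pos = [x[feature] for x, t in zip(train_X, train_y) if t == 1]
        neg = [x[feature] for x, t in zip(train_X, train_y) if t == 0]
        midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return [x[feature] - midpoint for x in test_X]
    return fit_predict

# Hypothetical reduced feature set: six samples, two features.
X = [[2.0, 1.0], [0.5, 0.2], [1.8, 0.9], [0.4, 0.1], [2.2, 1.2], [0.6, 0.3]]
y = [1, 0, 1, 0, 1, 0]

table_a = build_table_a([mean_gap_model(0), mean_gap_model(1)], X, y, k=3)
```

With 20 algorithms, `table_a` would have 20 columns, and each column could be computed on a different computer before being gathered on one host for the ensemble step.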

Step 4: hyperparameter search. Hyperparameter search is treated as another model-training task. The input of this model is all the hyperparameters of the first and third parts, together with the mean AUC. All the hyperparameters form a hyperparameter space, and the goal of training is to find the point in this space that maximizes the AUC. Bayesian optimization is chosen as the hyperparameter search algorithm. Bayesian optimization is an iterative loop: each time a computation completes, it produces new hyperparameter values as feedback; these new values are fed into the first part and start a new cycle. Each cycle (from the first part to the fourth part) adds one point to the hyperparameter space. As the number of cycles grows, the value obtained by Bayesian optimization converges, and the loop stops once it has converged to within a preset threshold.
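The outer feedback loop of Step 4 can be sketched as below. Only the loop structure is shown: the `propose` function here samples at random and is a placeholder for the Bayesian optimizer, which would instead use `history` to choose the next point. The stopping rule (no gain above `threshold` for `patience` consecutive rounds), the toy pipeline, and all names are assumptions made for illustration.

```python
import random

def search_hyperparameters(evaluate, propose, threshold=1e-3, patience=5, max_rounds=100):
    """Repeat: propose parameters, run the pipeline, record the mean AUC, stop on convergence."""
    history = []  # one (params, mean AUC) point per cycle
    best_params, best_auc, stale = None, float("-inf"), 0
    for _ in range(max_rounds):
        params = propose(history)        # new hyperparameter values fed back to part 1
        auc_mean = evaluate(params)      # parts 1 to 3 run and return the mean AUC
        history.append((params, auc_mean))
        improvement = auc_mean - best_auc
        if improvement > 0:
            best_params, best_auc = params, auc_mean
        # Count consecutive rounds whose gain stays within the threshold.
        stale = 0 if improvement > threshold else stale + 1
        if stale >= patience:
            break
    return best_params, best_auc, history

def make_random_proposer(space, seed=0):
    """Placeholder proposer: uniform random sampling instead of Bayesian optimization."""
    rng = random.Random(seed)
    def propose(history):
        return {name: rng.uniform(low, high) for name, (low, high) in space.items()}
    return propose

def toy_pipeline(params):
    # Hypothetical stand-in for the full pipeline; its mean AUC peaks at x = 0.3.
    return 0.9 - (params["x"] - 0.3) ** 2

space = {"x": (0.0, 1.0)}
best_params, best_auc, history = search_hyperparameters(
    toy_pipeline, make_random_proposer(space))
```

Swapping in a real Bayesian optimizer changes only `propose`; the loop, the history of evaluated points, and the stopping rule stay the same.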

In one embodiment, the specific implementation steps of the distributed automated feature engineering system architecture are as follows:

(1) First, a computer cluster is required, for example on Alibaba Cloud;

(2) Install the cluster editions of Apache Hadoop and Apache HBase, together with Apache Phoenix and Apache Thrift, on every computer in the cluster, so that the whole cluster functions as a distributed database;

(3) On one of the computers, install MySQL with the InnoDB database engine and create a project database in MySQL. This project database manages the whole workflow: all computers in the cluster access it concurrently to obtain the workflow's state data, which keeps the whole workflow consistent;

(4) Install the required Python programs and Python packages on every computer in the cluster; the Python code deployed on every computer is identical;

(5) The Python programs on all computers start at the same time; parallel computation and concurrency control in the cluster are managed by the project database in MySQL;

(6) Order query results are returned in JSON format, and the risk-control report is presented through the "looking for real" big data system.
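How a shared project database can coordinate identical programs on many computers is sketched below. The source specifies MySQL with InnoDB; this example uses Python's built-in `sqlite3` module as a stand-in so that it runs without a server, and the `tasks` table, its columns, and the function names are hypothetical.

```python
import sqlite3

def create_project_db():
    """Create an in-memory stand-in for the project database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, name TEXT, status TEXT, worker TEXT)")
    return db

def add_task(db, name):
    db.execute("INSERT INTO tasks (name, status) VALUES (?, 'pending')", (name,))
    db.commit()

def claim_task(db, worker):
    """Claim one pending task inside a transaction; return (id, name) or None."""
    with db:  # commits on success, rolls back on error
        row = db.execute(
            "SELECT id, name FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE ensures a task is handed out only once.
        cur = db.execute(
            "UPDATE tasks SET status = 'running', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]))
        return row if cur.rowcount == 1 else None

db = create_project_db()
add_task(db, "secondary_table_1")
add_task(db, "secondary_table_2")

first = claim_task(db, "computer-1")
second = claim_task(db, "computer-2")
third = claim_task(db, "computer-3")  # nothing left to claim
```

Because every computer runs the same code, the status column is what tells each one which work remains; a MySQL deployment would rely on InnoDB transactions to provide the same guarantee across machines.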

In one embodiment, the AUC in Step 2 is an accuracy metric commonly used in the risk-control field.

In one embodiment, the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, and others.

With the above technical solution, on the one hand, an auto finance leasing company can save substantial labor costs: work that would otherwise be done by feature engineers and experienced domain risk-control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report. The company can also choose to be billed per order, which offers small and medium-sized enterprises with low order volumes a pay-on-demand option that is much more flexible than hiring experts.

On the other hand, compared with manual work, the efficiency of automated feature engineering and model training is greatly improved. For a data volume of 100,000 orders, the cluster needs about 500 machine-hours (that is, about 25 hours if 20 computers run in parallel). Work that originally took a feature engineer plus a risk-control expert several weeks can now be completed in about one day once it is distributed across the computers of the cluster. After the full data analysis is complete, each subsequent order risk-control report query returns its result in less than 1 second.

In addition, in terms of accuracy, because the architecture uses Bayesian optimization to search over all possible feature combinations, it can find new useful features that manual work did not discover. Automatic feature extraction by computer achieves the same effect as manual feature extraction by risk-control experts. Moreover, the computation is data-driven and objective: the model does not depend on any subjective, human-made rules, which avoids model inaccuracy caused by gaps in the experience or subjective errors of feature engineers and risk-control experts.

Brief description of the drawings

Fig. 1 is a functional block diagram of the distributed automated feature engineering system architecture of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

Referring to Fig. 1, the present invention provides a distributed automated feature engineering system architecture, implemented in the following steps:

(2) Take one feature f from the remaining features and add it to L;

(4) Compute the AUC;

In this embodiment, the specific implementation steps of the distributed automated feature engineering system architecture are as follows:

(1) First, a computer cluster is required, for example on Alibaba Cloud;

Further, the AUC in Step 2 is an accuracy metric commonly used in the risk-control field.

With the above structure, applying the distributed automated feature engineering system architecture brings the following benefits. On the one hand, an auto finance leasing company can save substantial labor costs: work that would otherwise be done by feature engineers and experienced domain risk-control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report. The company can also choose to be billed per order, which offers small and medium-sized enterprises with low order volumes a pay-on-demand option that is much more flexible than hiring experts. On the other hand, compared with manual work, the efficiency of automated feature engineering and model training is greatly improved. For a data volume of 100,000 orders, the cluster needs about 500 machine-hours (that is, about 25 hours if 20 computers run in parallel). Work that originally took a feature engineer plus a risk-control expert several weeks can now be completed in about one day once it is distributed across the computers of the cluster. After the full data analysis is complete, each subsequent order risk-control report query returns its result in less than 1 second.

Preferably, this embodiment also has the following configuration: the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, and others.

In addition, in terms of accuracy, because the architecture uses Bayesian optimization to search over all possible feature combinations, it can find new useful features that manual work did not discover. Automatic feature extraction by computer achieves the same effect as manual feature extraction by risk-control experts. Moreover, the computation is data-driven and objective: the model does not depend on any subjective, human-made rules, which avoids model inaccuracy caused by gaps in the experience or subjective errors of feature engineers and risk-control experts.

In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity. Those skilled in the art should treat the specification as a whole; the technical solutions in the various embodiments may also be suitably combined to form other embodiments that those skilled in the art can understand.

Claims

1. A distributed automated feature engineering system architecture, characterized in that the distributed automated feature engineering system architecture is implemented in the following steps:

(2) Take one feature f from the remaining features and add it to L;

(4) Compute the AUC;

2. The distributed automated feature engineering system architecture according to claim 1, characterized in that its specific implementation steps are as follows:

(1) First, a computer cluster is required, for example on Alibaba Cloud;

3. The distributed automated feature engineering system architecture according to claim 1, characterized in that the AUC in Step 2 is an accuracy metric commonly used in the risk-control field.

4. The distributed automated feature engineering system architecture according to claim 2, characterized in that the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, and others.