CN109582724A - Distributed automated feature engineering system framework

Info

Publication number
CN109582724A
Authority
CN
China
Prior art keywords
algorithm
feature
auc
cluster
computer
Legal status
Granted
Application number
CN201811493937.1A
Other languages
Chinese (zh)
Other versions
CN109582724B (en)
Inventor
施铭铮
刘占辉
Current Assignee
Xiamen Pencil Head Information Technology Co Ltd
Original Assignee
Xiamen Pencil Head Information Technology Co Ltd
Application filed by Xiamen Pencil Head Information Technology Co Ltd
Priority to CN201811493937.1A
Publication of CN109582724A
Application granted
Publication of CN109582724B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Finance (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Complex Calculations (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a distributed automated feature engineering system framework. The framework is implemented in four steps: distributed automated feature computation on a cluster, dimensionality reduction, model training, and hyperparameter search. The framework is reasonably designed and can save auto finance leasing companies substantial labor costs: work that would otherwise be done by feature engineers and experienced risk-control experts in the field is completed automatically by the process of the invention. The auto finance leasing company only needs to provide its raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report.

Description

Distributed automated feature engineering system framework
Technical field
The present invention relates to a distributed automated feature engineering system framework and belongs to the technical field of auto finance risk control.
Background art
In traditional auto finance risk control, experts in the field formulate a series of risk-control rules. Each rule may contain calculation formulas that are evaluated on the data supplied by the customer applying for a loan, and each rule may require one or more items of customer data (in this document, one item of customer data is defined as a feature). If the score obtained by running the customer's data through all the rules is below the risk-control passing score, the loan application is rejected.
A characteristic of feature engineering is that a useful feature is often computed from several original features (i.e. raw data) by some simple arithmetic. In other words, after the customer provides the loan application data, there is a further process of feature extraction and generation, collectively called feature engineering. After the new features have been generated, all features are merged and fed into a machine learning algorithm.
The first problem is that experts who have risk-control experience and can formulate risk-control rules are a very scarce resource. The second problem is that even when an expert is available, the rules are distilled from that expert's personal experience and cannot represent the needs of the auto finance industry as a whole. The third problem is that manual extraction of new features by experts or feature engineers is very time-consuming: a data set usually comes from several data sources, a single data source often contains multiple tables, and working through the permutations and combinations of so much raw data can take a feature engineer several weeks.
It is therefore of great significance to devise a framework that automatically completes the whole process of feature engineering and model training and outputs the final risk-control report. To this end, the present invention proposes a distributed automated feature engineering system framework.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a distributed automated feature engineering system framework that solves the problems raised in the background art above. The invention is reasonably designed and can save auto finance leasing companies substantial labor costs: work that would otherwise be done by feature engineers and experienced risk-control experts in the field is completed automatically by the process of the invention. The auto finance leasing company only needs to provide its raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report.
To achieve the above object, the invention provides the following technical solution: a distributed automated feature engineering system framework, implemented in the following steps:
Step 1: distributed automated feature computation cluster. A computer cluster is required, and the first part is the distributed automated feature computation cluster. Assume the raw data consists of one master table and n slave tables, which is the usual data layout in auto finance risk control. Allocate n computers in the cluster, each with HBase and Python deployed. Each computer has two roles: distributed data storage and distributed computation. Each computer is assigned one of the n slave tables, and the master table is copied to every computer; for example, computer n merges the master table with slave table n and generates new features. The generation of new features is controlled by a set of parameters, and different slave tables generally have different numbers of parameters. Suppose slave table 1 has x₁ parameters, slave table 2 has x₂ parameters, and so on; the n slave tables then have x₁ + x₂ + … + xₙ parameters in total. The master table and all slave tables can be split vertically so that they are distributed more evenly across the computers of the cluster. All features generated by the first part are merged into one large table, which serves as the input data of the second part. Feature extraction in the first part produces a large number of features; feeding all of them into the model would require a large number of model parameters, so dimensionality reduction is applied to the features before the data enter the model.
Step 2: dimensionality reduction algorithm. The specific steps of the dimensionality reduction algorithm are as follows:
(1) Rank all features with a decision tree algorithm, or any other algorithm that can rank features by importance, to obtain a feature importance list, and select the top x% (for example 10%) of the list as the base feature list, denoted L;
(2) Take one feature f from the remaining features and add it to L;
(3) Run ridge regression, or a similar regression, on L;
(4) Compute the AUC;
(5) If the AUC improves (i.e. accuracy improves), keep feature f; if the AUC does not improve, remove feature f;
(6) Repeat (2) to (5) until all features have been processed. When it finishes, the dimensionality reduction algorithm yields the reduced feature set.
Step 3: model training. The reduced feature set is fed into the models as input data for training. Model training uses an ensemble algorithm whose member algorithms include deep neural networks (using Python packages such as TensorFlow and Keras), gradient boosting machines (using Python packages such as LightGBM, xgboost and catboost), and random forests (using the scikit-learn Python package), among others. The main steps of model training are as follows:
(1) Determine the algorithm set, for example deep neural network, LightGBM, xgboost, catboost, random forest, and so on;
(2) For each algorithm in the set, input the reduced feature set, evaluate the model with k-fold cross-validation, and output the AUC; the output of each algorithm is a single column, the AUC values;
(3) Merge the output columns (the AUC columns) of all algorithms into one table A; for example, with 20 algorithms in the set, each producing one column, the merged table A has 20 columns;
(4) Determine the ensemble algorithm; common choices are logistic regression, neural networks, and so on. If logistic regression is used as the ensemble algorithm, table A is fed into the logistic regression algorithm and the final ensemble AUC is computed. It can be verified that the ensemble AUC is higher than the AUC computed by any single algorithm. If there are 20 algorithms, they are deployed on 20 separate computers in the cluster, and the feature set output by the dimensionality reduction algorithm of the second part is copied to all of them; that is, the 20 computers receive the same input data but run different algorithms. When all algorithms have finished, the resulting AUC columns are gathered on one host for the ensemble computation. The AUC computed by the ensemble algorithm is a column; averaging this column gives the AUC mean (a single number). This AUC mean, the hyperparameters used in the model training of the third part, and the x₁ + x₂ + … + xₙ parameters of the feature engineering in the first part together form the input data of the fourth part.
Step 4: hyperparameter search. Hyperparameter search can be regarded as another model training task. The inputs of this model are all the hyperparameters of the first and third parts, together with the AUC mean. All the hyperparameters form a hyperparameter space, and the goal of training is to find the point in this space at which the AUC reaches its maximum. Bayesian optimization is chosen as the hyperparameter search algorithm. Bayesian optimization is an iterative loop: each completed computation produces new hyperparameter values as feedback, and these values are fed back into the first part to start a new cycle. Each cycle (from the first part through the fourth part) adds one evaluated point to the hyperparameter space. As the number of cycles grows, the value obtained by the Bayesian optimization algorithm keeps converging, and the loop stops once it has converged to within a preset threshold.
In one embodiment, the specific implementation steps of the distributed automated feature engineering system framework are as follows:
(1) First, a computer cluster is needed, for example on Alibaba Cloud;
(2) Install the cluster editions of Apache Hadoop and Apache HBase, together with Apache Phoenix and Apache Thrift, on every computer in the cluster, so that the cluster as a whole functions as a distributed database;
(3) Install MySQL with the InnoDB storage engine on one of the computers and create a project database on it. The project database manages the whole workflow: all computers in the cluster access it concurrently to obtain the workflow status data, which keeps the whole workflow consistent;
(4) Install the required Python programs and Python packages on every computer in the cluster; the Python code deployed on every computer is identical;
(5) The Python programs on all computers start at the same time; parallel computation and concurrency control in the cluster are managed by the project database on MySQL;
(6) Order query results are returned in JSON format, and the risk-control report is presented by the "Looking for Real" big data system.
In one embodiment, the AUC in Step 2 is an accuracy metric commonly used in the risk-control field.
In one embodiment, the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, and others.
With the above technical solution, on the one hand, auto finance leasing companies can save substantial labor costs: work that would otherwise be done by feature engineers and experienced risk-control experts in the field is completed automatically by the process of the invention. The company only needs to provide its raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report. The company can also choose to be charged per order, which gives small and medium-sized enterprises with low order volumes a pay-as-you-go option that is far more flexible than hiring experts.
On the other hand, compared with manual work, automated feature engineering and model training are far more efficient. For a data volume of 100,000 orders, the cluster needs roughly 500 hours of machine time (with 20 computers running in parallel, about 25 hours). Work that used to take one feature engineer plus one risk-control expert several weeks can now be finished in about one day once it is distributed across the computers of the cluster. After the full data set has been analyzed, each subsequent order risk-control report query returns its result in less than 1 second.
In addition, in terms of accuracy, because the framework uses the Bayesian optimization algorithm to search all possible feature combinations, it can find new useful features that were not found manually. Automatic feature extraction by computer achieves the same effect as manual feature extraction by a risk-control expert. Moreover, the computation is data-driven and objective: the model does not depend on any subjectively defined rules, which avoids model inaccuracy caused by limitations in the experience of feature engineers and risk-control experts or by their subjective errors.
Brief description of the drawings
Fig. 1 is a functional block diagram of the distributed automated feature engineering system framework of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, the present invention provides a distributed automated feature engineering system framework, implemented in the following steps:
Step 1: distributed automated feature computation cluster. A computer cluster is required, and the first part is the distributed automated feature computation cluster. Assume the raw data consists of one master table and n slave tables, which is the usual data layout in auto finance risk control. Allocate n computers in the cluster, each with HBase and Python deployed. Each computer has two roles: distributed data storage and distributed computation. Each computer is assigned one of the n slave tables, and the master table is copied to every computer; for example, computer n merges the master table with slave table n and generates new features. The generation of new features is controlled by a set of parameters, and different slave tables generally have different numbers of parameters. Suppose slave table 1 has x₁ parameters, slave table 2 has x₂ parameters, and so on; the n slave tables then have x₁ + x₂ + … + xₙ parameters in total. The master table and all slave tables can be split vertically so that they are distributed more evenly across the computers of the cluster. All features generated by the first part are merged into one large table, which serves as the input data of the second part. Feature extraction in the first part produces a large number of features; feeding all of them into the model would require a large number of model parameters, so dimensionality reduction is applied to the features before the data enter the model.
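As an illustration of how one computer might merge the master table with its slave table and generate parameter-controlled features, the following is a minimal sketch using only the Python standard library. The table contents, the column names (`order_id`, `days_before_application`, `amount`), and the two parameters (`window_days`, `aggregations`) are all hypothetical; the patent does not specify which parameters control feature generation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical master table: one row per loan application.
master = [
    {"order_id": 1, "loan_amount": 80000},
    {"order_id": 2, "loan_amount": 50000},
]

# Hypothetical slave table: many rows per application (e.g. past repayments).
repayments = [
    {"order_id": 1, "days_before_application": 10, "amount": 2000},
    {"order_id": 1, "days_before_application": 40, "amount": 3000},
    {"order_id": 1, "days_before_application": 200, "amount": 1000},
    {"order_id": 2, "days_before_application": 5, "amount": 1500},
]

AGGREGATIONS = {"sum": sum, "mean": mean, "max": max, "count": len}


def generate_features(master_rows, slave_rows, params, prefix):
    """Join one slave table to the master table and derive new features.

    params holds the (assumed) generation parameters for this slave table:
    a look-back window in days and the list of aggregations to apply.
    """
    grouped = defaultdict(list)
    for row in slave_rows:
        if row["days_before_application"] <= params["window_days"]:
            grouped[row["order_id"]].append(row["amount"])

    result = []
    for m in master_rows:
        values = grouped.get(m["order_id"], [])
        features = dict(m)  # keep the original master columns
        for name in params["aggregations"]:
            key = f"{prefix}_{name}_{params['window_days']}d"
            # Orders with no slave rows in the window get 0 for every aggregate.
            features[key] = AGGREGATIONS[name](values) if values else 0
        result.append(features)
    return result


# Parameters for slave table 1; each slave table would have its own set.
params_table_1 = {"window_days": 90, "aggregations": ["sum", "count"]}
features = generate_features(master, repayments, params_table_1, "repay")
```

In a cluster each computer would run the same function on its own slave table, and the per-table outputs would then be joined on the master key into the single large table that Step 2 consumes.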
Step 2: dimensionality reduction algorithm. The specific steps of the dimensionality reduction algorithm are as follows:
(1) Rank all features with a decision tree algorithm, or any other algorithm that can rank features by importance, to obtain a feature importance list, and select the top x% (for example 10%) of the list as the base feature list, denoted L;
(2) Take one feature f from the remaining features and add it to L;
(3) Run ridge regression, or a similar regression, on L;
(4) Compute the AUC;
(5) If the AUC improves (i.e. accuracy improves), keep feature f; if the AUC does not improve, remove feature f;
(6) Repeat (2) to (5) until all features have been processed. When it finishes, the dimensionality reduction algorithm yields the reduced feature set.
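The greedy loop in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it uses only the standard library, fits ridge regression by plain gradient descent, ranks features by univariate AUC as a stand-in for the decision tree ranking, scores on a simple hold-out split, and runs on synthetic data. With scikit-learn installed, its `Ridge` and `roc_auc_score` would be the more natural choices.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ridge_fit(X, y, alpha=1.0, lr=0.05, epochs=300):
    """Ridge regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [sum(wj * xj for wj, xj in zip(w, row)) + b - yi
                  for row, yi in zip(X, y)]
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n
            w[j] -= lr * (grad + alpha * w[j] / n)
        b -= lr * sum(errors) / n
    return w, b


def score_subset(cols, train, valid):
    """Fit ridge on the chosen columns and return the validation AUC."""
    w, b = ridge_fit([[x[c] for c in cols] for x, _ in train],
                     [y for _, y in train])
    preds = [sum(wj * x[c] for wj, c in zip(w, cols)) + b for x, _ in valid]
    return auc([y for _, y in valid], preds)


def forward_select(ranked, base_fraction, train, valid):
    """Seed L with the top-ranked features, then try the rest one by one."""
    n_base = max(1, int(len(ranked) * base_fraction))
    selected = list(ranked[:n_base])
    best = score_subset(selected, train, valid)
    for f in ranked[n_base:]:
        candidate = score_subset(selected + [f], train, valid)
        if candidate > best:  # keep f only if the AUC improves
            selected.append(f)
            best = candidate
    return selected, best


# Synthetic data: features 0 and 1 drive the label, 2 and 3 are noise.
random.seed(0)
rows = []
for _ in range(200):
    x = [random.gauss(0, 1) for _ in range(4)]
    rows.append((x, 1 if x[0] + x[1] + random.gauss(0, 0.5) > 0 else 0))
train, valid = rows[:140], rows[140:]

# Stand-in importance ranking: distance of each feature's univariate AUC from 0.5.
labels = [y for _, y in train]
importance = {c: abs(auc(labels, [x[c] for x, _ in train]) - 0.5) for c in range(4)}
ranked = sorted(importance, key=importance.get, reverse=True)

base_auc = score_subset(ranked[:1], train, valid)
selected, final_auc = forward_select(ranked, 0.25, train, valid)
```

Because a feature is kept only when the validation AUC strictly improves, `final_auc` can never fall below the AUC of the base list alone.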
Step 3: model training. The reduced feature set is fed into the models as input data for training. Model training uses an ensemble algorithm whose member algorithms include deep neural networks (using Python packages such as TensorFlow and Keras), gradient boosting machines (using Python packages such as LightGBM, xgboost and catboost), and random forests (using the scikit-learn Python package), among others. The main steps of model training are as follows:
(1) Determine the algorithm set, for example deep neural network, LightGBM, xgboost, catboost, random forest, and so on;
(2) For each algorithm in the set, input the reduced feature set, evaluate the model with k-fold cross-validation, and output the AUC; the output of each algorithm is a single column, the AUC values;
(3) Merge the output columns (the AUC columns) of all algorithms into one table A; for example, with 20 algorithms in the set, each producing one column, the merged table A has 20 columns;
(4) Determine the ensemble algorithm; common choices are logistic regression, neural networks, and so on. If logistic regression is used as the ensemble algorithm, table A is fed into the logistic regression algorithm and the final ensemble AUC is computed. It can be verified that the ensemble AUC is higher than the AUC computed by any single algorithm. If there are 20 algorithms, they are deployed on 20 separate computers in the cluster, and the feature set output by the dimensionality reduction algorithm of the second part is copied to all of them; that is, the 20 computers receive the same input data but run different algorithms. When all algorithms have finished, the resulting AUC columns are gathered on one host for the ensemble computation. The AUC computed by the ensemble algorithm is a column; averaging this column gives the AUC mean (a single number). This AUC mean, the hyperparameters used in the model training of the third part, and the x₁ + x₂ + … + xₙ parameters of the feature engineering in the first part together form the input data of the fourth part.
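The merge-then-ensemble procedure above resembles stacking, and a minimal standard-library sketch of that reading follows. Two assumptions are made and should be treated as such: each algorithm's single output column is taken to hold its k-fold out-of-fold score for every sample (the text calls this the AUC column), and three trivial stand-in learners replace the deep neural network, gradient boosting and random forest models so that the example runs without extra packages. The logistic regression ensemble is fitted by gradient descent on synthetic data.

```python
import math
import random


def auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_mean_diff(X, y):
    """Stand-in learner: score along the difference of the class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    w = [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
         for j in range(len(X[0]))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def fit_single_feature(j):
    """Stand-in learner: use raw feature j as the score (nothing to train)."""
    return lambda X, y: (lambda x: x[j])


def out_of_fold(fit, X, y, k=5):
    """k-fold cross-validation: one held-out score per row, as one column."""
    column = [0.0] * len(X)
    for fold in range(k):
        train_idx = [i for i in range(len(X)) if i % k != fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in range(fold, len(X), k):
            column[i] = model(X[i])
    return column


def fit_logistic(A, y, lr=0.1, epochs=500):
    """Logistic regression ensemble fitted by batch gradient descent."""
    w, b = [0.0] * len(A[0]), 0.0
    for _ in range(epochs):
        probs = [1.0 / (1.0 + math.exp(-(sum(wj * a for wj, a in zip(w, row)) + b)))
                 for row in A]
        for j in range(len(w)):
            w[j] -= lr * sum((p - t) * row[j] for p, t, row in zip(probs, y, A)) / len(A)
        b -= lr * sum(p - t for p, t in zip(probs, y)) / len(A)
    return lambda row: sum(wj * a for wj, a in zip(w, row)) + b


random.seed(1)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if x[0] + x[1] + random.gauss(0, 0.5) > 0 else 0 for x in X]

algorithms = {
    "mean_diff": fit_mean_diff,
    "feature_0": fit_single_feature(0),
    "feature_1": fit_single_feature(1),
}
# One column per algorithm; in the cluster each column comes from its own machine.
columns = {name: out_of_fold(fit, X, y) for name, fit in algorithms.items()}
table_a = [list(row) for row in zip(*columns.values())]  # table A: 200 rows x 3 columns

ensemble = fit_logistic(table_a, y)
ensemble_auc = auc(y, [ensemble(row) for row in table_a])
single_aucs = {name: auc(y, col) for name, col in columns.items()}
```

Comparing `ensemble_auc` with the values in `single_aucs` is the check the text describes; the sketch scores the ensemble on the same table A it was fitted on, so a held-out split would be needed for an unbiased comparison.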
Step 4: hyperparameter search. Hyperparameter search can be regarded as another model training task. The inputs of this model are all the hyperparameters of the first and third parts, together with the AUC mean. All the hyperparameters form a hyperparameter space, and the goal of training is to find the point in this space at which the AUC reaches its maximum. Bayesian optimization is chosen as the hyperparameter search algorithm. Bayesian optimization is an iterative loop: each completed computation produces new hyperparameter values as feedback, and these values are fed back into the first part to start a new cycle. Each cycle (from the first part through the fourth part) adds one evaluated point to the hyperparameter space. As the number of cycles grows, the value obtained by the Bayesian optimization algorithm keeps converging, and the loop stops once it has converged to within a preset threshold.
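The feedback loop of Step 4 (propose hyperparameters, rerun Parts 1 to 3, record the AUC mean, stop on convergence) can be sketched as below. To stay within the standard library, the proposal step here is plain random search, a swapped-in stand-in and not Bayesian optimization; a Bayesian optimizer would use the recorded history to choose the next point. The objective `run_pipeline`, the two hyperparameters, and the stopping rule (no improvement above a threshold for `patience` consecutive cycles) are hypothetical.

```python
import random


def run_pipeline(params):
    """Hypothetical stand-in for Parts 1 to 3: returns the AUC mean for params.

    A real cycle would regenerate features, reduce dimensions and retrain.
    """
    window, depth = params["window_days"], params["max_depth"]
    return 0.9 - 0.00001 * (window - 90) ** 2 - 0.005 * (depth - 6) ** 2


def propose(history, rng):
    """Stand-in proposer: random search that ignores history.

    A Bayesian optimizer would fit a surrogate model to `history` here.
    """
    return {"window_days": rng.randint(7, 365), "max_depth": rng.randint(2, 12)}


def search(max_iters=200, patience=30, threshold=1e-4, seed=0):
    rng = random.Random(seed)
    history = []  # one (params, auc_mean) point per cycle
    best_params, best_auc, stale = None, float("-inf"), 0
    for _ in range(max_iters):
        params = propose(history, rng)
        auc_mean = run_pipeline(params)  # Part 1 through Part 3
        history.append((params, auc_mean))
        improvement = auc_mean - best_auc
        if improvement > 0:
            best_params, best_auc = params, auc_mean
        # Count cycles whose improvement stays within the preset threshold.
        stale = 0 if improvement > threshold else stale + 1
        if stale >= patience:  # treated as converged
            break
    return best_params, best_auc, history


best_params, best_auc, history = search()
```

Each entry in `history` corresponds to one added point in the hyperparameter space, so its length equals the number of full cycles that were run before the loop stopped.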
In the present embodiment, the specific implementation steps of the distributed automated feature engineering system framework are as follows:
(1) First, a computer cluster is needed, for example on Alibaba Cloud;
(2) Install the cluster editions of Apache Hadoop and Apache HBase, together with Apache Phoenix and Apache Thrift, on every computer in the cluster, so that the cluster as a whole functions as a distributed database;
(3) Install MySQL with the InnoDB storage engine on one of the computers and create a project database on it. The project database manages the whole workflow: all computers in the cluster access it concurrently to obtain the workflow status data, which keeps the whole workflow consistent;
(4) Install the required Python programs and Python packages on every computer in the cluster; the Python code deployed on every computer is identical;
(5) The Python programs on all computers start at the same time; parallel computation and concurrency control in the cluster are managed by the project database on MySQL;
(6) Order query results are returned in JSON format, and the risk-control report is presented by the "Looking for Real" big data system.
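One way identical Python programs could coordinate through a shared project database is for each worker to claim a pending task with a guarded update, as in the sketch below. It uses the standard-library `sqlite3` module with an in-memory database purely as a runnable stand-in for the MySQL/InnoDB project database; the `task` table, its columns, and the task and worker names are hypothetical and do not come from the patent.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL project database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task ("
    "id INTEGER PRIMARY KEY, name TEXT, "
    "status TEXT DEFAULT 'pending', worker TEXT)"
)
conn.executemany(
    "INSERT INTO task (name) VALUES (?)",
    [("features_slave_1",), ("features_slave_2",)],
)
conn.commit()


def claim_task(conn, worker):
    """Claim one pending task for this worker; return its name or None."""
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT id, name FROM task WHERE status = 'pending' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        updated = conn.execute(
            "UPDATE task SET status = 'running', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        # rowcount is 0 if another worker changed the row first.
        return row[1] if updated.rowcount == 1 else None


first = claim_task(conn, "node-1")
second = claim_task(conn, "node-2")
third = claim_task(conn, "node-3")
```

The `status = 'pending'` condition in the UPDATE is what keeps two workers from running the same task; on MySQL with InnoDB the same guarded UPDATE applies, and a locking read such as `SELECT ... FOR UPDATE` is another common option.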
Further, the AUC in Step 2 is an accuracy metric commonly used in the risk-control field.
With the above structure, when the distributed automated feature engineering system framework is applied, on the one hand, auto finance leasing companies can save substantial labor costs: work that would otherwise be done by feature engineers and experienced risk-control experts in the field is completed automatically by the process of the invention. The company only needs to provide its raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk-control report. The company can also choose to be charged per order, which gives small and medium-sized enterprises with low order volumes a pay-as-you-go option that is far more flexible than hiring experts. On the other hand, compared with manual work, automated feature engineering and model training are far more efficient. For a data volume of 100,000 orders, the cluster needs roughly 500 hours of machine time (with 20 computers running in parallel, about 25 hours). Work that used to take one feature engineer plus one risk-control expert several weeks can now be finished in about one day once it is distributed across the computers of the cluster. After the full data set has been analyzed, each subsequent order risk-control report query returns its result in less than 1 second.
Preferably, in the present embodiment, the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, and others.
In addition, in terms of accuracy, because the framework uses the Bayesian optimization algorithm to search all possible feature combinations, it can find new useful features that were not found manually. Automatic feature extraction by computer achieves the same effect as manual feature extraction by a risk-control expert. Moreover, the computation is data-driven and objective: the model does not depend on any subjectively defined rules, which avoids model inaccuracy caused by limitations in the experience of feature engineers and risk-control experts or by their subjective errors.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.

Claims (4)

1. distributed automated characterization engineering system framework, which is characterized in that the distribution automated characterization engineering system framework Realize that steps are as follows:
Step 1: distributed automated characterization computing cluster;A computer cluster is needed, first part is distributed automatic special Levy computing cluster, it is assumed that initial data is made of a main table and n from table, this is also that auto metal halide lamp air control field is common Data mode distributes n platform computer in the cluster, and every computer is all deployed with hbase and python, and every computer has Two roles, a role are distributed storage data, another role is distributed computing, and each computer will be divided With n from one in table from table, and main table will be copied to each computer, for example computer n will merge main table With from table n and generate new feature, and the generation of new feature be by one group of state modulator, it is different from table under normal circumstances Number of parameters is different, it is now assumed that having x from table 11A parameter has x from table 22A parameter, and so on, then all n X is shared from table one1+x2+…+xnA parameter, main table and it is all can be longitudinally cutting from table, so as to being assigned to more evenly In each computer of cluster, all features that first part generates will be merged into a big table, this table will conduct The feature extraction of the input data of second part, first part will generate a large amount of feature, if all features are all inputted Model will need a large amount of model parameter, so carrying out dimension-reduction treatment to all features before data enter model;
Step 2: dimension-reduction algorithm;Specific step is as follows for dimension-reduction algorithm:
1.: with decision Tree algorithms or other can with the algorithm of arrayed feature importance give all feature orderings, it is important to obtain feature Property list, select the feature of x%(such as 10%) to be labeled as L as basic feature list in lists;
2.: a feature f is taken out from remaining feature, and L is added;
3.: ridge regression (ridge regression) is carried out to L or similar recurrence calculates;
4.: calculate AUC;
5.: if AUC improve (i.e. accuracy rate raising) with regard to keeping characteristics f, remove feature f if AUC is not improved;
6.: 2. circulation arrives 5. until all features all handle completion, dimension-reduction algorithm will obtain the feature set after dimensionality reduction after the completion;
Step 3: model training. The reduced feature set is fed into the models as input data for training. Model training uses an ensemble of algorithms, including deep neural networks (using Python packages such as TensorFlow and Keras), gradient boosting machines (using Python packages such as LightGBM, xgboost and catboost) and random forests (using the scikit-learn Python package). The main steps of model training are as follows:
(1) Determine the algorithm set, for example deep neural network, LightGBM, xgboost, catboost, random forest, etc.;
(2) For each algorithm in the set, input the reduced feature set, evaluate the model with k-fold cross-validation, and output the AUC; the output of each algorithm is a single column, namely its AUC values;
(3) Merge the output columns (the AUC columns) of all algorithms into one table A; for example, if the algorithm set contains 20 algorithms and each outputs one column, the merged table A has 20 columns;
(4) Determine the ensemble algorithm; common choices include logistic regression and neural networks. If logistic regression is used as the ensemble algorithm, table A is fed into the logistic regression algorithm and the final ensemble AUC values are computed. It can be verified that the ensemble AUC is higher than the AUC computed by any single algorithm. If there are 20 algorithms, they are deployed on 20 computers of the cluster, one per computer, and the feature set output by the dimensionality-reduction algorithm of the second part is copied to all of them; that is, the 20 computers receive the same input data but run different algorithms. After all algorithms finish, the resulting AUC columns are gathered on one host for the ensemble calculation. The AUC computed by the ensemble algorithm is a column; averaging this column gives the AUC mean (a single number). This AUC mean, the hyperparameters of the model training in the third part, and the x_1 + x_2 + … + x_n feature-engineering parameters of the first part are the input data of the fourth part;
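The ensemble procedure of Step 3 can be sketched with scikit-learn as follows. Two points here are assumptions rather than statements of the claim: each algorithm's single output column is taken to be its out-of-fold predicted probability (the usual input for stacking), and the ensemble's "AUC column" is taken to be one AUC per cross-validation fold. The three algorithms and the synthetic data are placeholders for the larger algorithm set and the reduced feature set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

# Synthetic stand-in for the reduced feature set produced by Step 2.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The algorithm set; in the cluster each entry would run on its own computer.
algorithms = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "logistic": LogisticRegression(max_iter=1000),
}

columns, single_auc = [], {}
for name, model in algorithms.items():
    # k-fold cross-validation: every row is scored by a model that never saw it.
    oof = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    columns.append(oof)
    single_auc[name] = roc_auc_score(y, oof)

# Table A: one column per algorithm, gathered on a single host.
table_a = np.column_stack(columns)

# Logistic regression as the ensemble algorithm, scored per fold with AUC.
fold_aucs = cross_val_score(
    LogisticRegression(max_iter=1000), table_a, y, cv=cv, scoring="roc_auc"
)
auc_mean = float(fold_aucs.mean())  # the single number passed on to Step 4
```

Comparing `auc_mean` with the entries of `single_auc` is one way to check, on a given data set, whether the ensemble actually outperforms its members.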
Step 4: hyperparameter search. Hyperparameter search is treated as another model-training problem. The inputs of this model are all hyperparameters of the first and third parts, together with the AUC mean. All hyperparameters form a hyperparameter space, and the goal of training is to find the point in this space that maximizes the AUC. Bayesian optimization is chosen as the search algorithm. Bayesian optimization is an iterative loop: each completed calculation produces new hyperparameter values as feedback, which are fed back into the first part to start a new cycle. Each cycle (from the first part to the fourth part) adds one point to the hyperparameter space. As the number of cycles grows, the value obtained by the Bayesian optimization algorithm gradually converges, and the loop stops once it has converged to a preset threshold.
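The loop of Step 4 can be illustrated with a small Bayesian optimization routine built from a Gaussian-process surrogate and the expected-improvement criterion. Everything specific here is an assumption for the sake of a runnable sketch: the one-dimensional hyperparameter, the toy function `pipeline_auc` standing in for a full pass through parts one to three, the candidate grid, and the stopping tolerance `tol` playing the role of the preset threshold.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def pipeline_auc(h):
    """Hypothetical stand-in for one full cycle that returns the AUC mean."""
    return 0.9 - (h - 0.3) ** 2

def expected_improvement(candidates, gp, best):
    """Expected improvement over the best observed value (maximization)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero at sampled points
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_search(objective, low, high, n_init=3, max_iter=25, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    xs = [float(x) for x in rng.uniform(low, high, n_init)]
    ys = [objective(x) for x in xs]
    grid = np.linspace(low, high, 201).reshape(-1, 1)
    for _ in range(max_iter):
        # Refit the surrogate on every point observed so far.
        gp = GaussianProcessRegressor(
            kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True, random_state=seed
        )
        gp.fit(np.array(xs).reshape(-1, 1), np.array(ys))
        ei = expected_improvement(grid, gp, max(ys))
        if ei.max() < tol:  # converged: no candidate promises a real gain
            break
        # Each cycle adds one new point to the hyperparameter space.
        x_next = float(grid[int(np.argmax(ei)), 0])
        xs.append(x_next)
        ys.append(objective(x_next))
    i = int(np.argmax(ys))
    return xs[i], ys[i], len(xs)

best_h, best_auc, n_evaluations = bayes_search(pipeline_auc, 0.0, 1.0)
```

In a real deployment each call to the objective would be an expensive cluster-wide run, which is why a sample-efficient search of this kind is preferred over a plain grid search.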
2. The distributed automatic feature engineering system architecture according to claim 1, characterized in that the specific implementation steps of the architecture are as follows:
(1) First, a computer cluster is required, for example on Alibaba Cloud;
(2) On every computer of the cluster, install the cluster editions of Apache Hadoop and Apache HBase, together with Apache Phoenix and Apache Thrift, so that the whole cluster functions as a distributed database;
(3) On one of the computers, install MySQL with the InnoDB database engine and create a project database on MySQL. This project database manages the whole workflow: all computers in the cluster access it concurrently to obtain workflow status data, which keeps the whole workflow consistent;
(4) Install the required Python programs and Python packages on every computer of the cluster; the Python code deployed on every computer is identical;
(5) The Python programs on all computers start simultaneously; parallel computation and concurrency control in the cluster are managed by the project database on MySQL;
(6) Order query results are returned in JSON format, and the risk-control report is presented by the "looking for real" big data system.
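How a shared project database can coordinate identical programs running on many computers is sketched below. The claim specifies MySQL with InnoDB; this sketch substitutes Python's built-in `sqlite3` purely so that it runs without a database server, and the `tasks` table, its columns and the helper functions are hypothetical.

```python
import sqlite3

def init_db(conn, task_names):
    """Create the hypothetical task table and register all pending tasks."""
    conn.execute(
        "CREATE TABLE tasks (name TEXT PRIMARY KEY, status TEXT NOT NULL, worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (name, status) VALUES (?, 'pending')",
        [(n,) for n in task_names],
    )
    conn.commit()

def claim_task(conn, worker):
    """Claim one pending task for `worker`; return its name, or None if none remain."""
    while True:
        row = conn.execute(
            "SELECT name FROM tasks WHERE status = 'pending' ORDER BY name LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status check in WHERE makes the claim safe if another worker got there first.
        cur = conn.execute(
            "UPDATE tasks SET status = 'running', worker = ? "
            "WHERE name = ? AND status = 'pending'",
            (worker, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]

def finish_task(conn, name):
    """Mark a task as done so other nodes can see the workflow state."""
    conn.execute("UPDATE tasks SET status = 'done' WHERE name = ?", (name,))
    conn.commit()

conn = sqlite3.connect(":memory:")
init_db(conn, ["slave_table_1", "slave_table_2"])
first = claim_task(conn, "node-1")
second = claim_task(conn, "node-2")
third = claim_task(conn, "node-3")  # nothing left to claim
finish_task(conn, first)
```

Every node runs the same code and simply asks the database for the next unclaimed task, which matches the requirement that the deployed Python code be identical on all computers.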
3. The distributed automatic feature engineering system architecture according to claim 1, characterized in that the AUC in Step 2 is a commonly used accuracy metric in the risk-control field.
4. The distributed automatic feature engineering system architecture according to claim 2, characterized in that the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, etc.
CN201811493937.1A 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture Active CN109582724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811493937.1A CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Publications (2)

Publication Number Publication Date
CN109582724A (en) 2019-04-05
CN109582724B (en) 2022-04-08

Family

ID=65929000

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811493937.1A Active CN109582724B (en) 2018-12-07 2018-12-07 Distributed automatic feature engineering system architecture

Country Status (1)

Country Link
CN (1) CN109582724B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140310221A1 (en) * 2013-04-12 2014-10-16 Nec Laboratories America, Inc. Interpretable sparse high-order boltzmann machines
US20160232540A1 (en) * 2015-02-10 2016-08-11 EverString Innovation Technology Predictive analytics for leads generation and engagement recommendations
CN106339608A (en) * 2016-11-09 2017-01-18 中国科学院软件研究所 Traffic accident rate predicting system based on online variational Bayesian support vector regression
CN107103332A (en) * 2017-04-07 2017-08-29 武汉理工大学 A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset
CN107516135A (en) * 2017-07-14 2017-12-26 浙江大学 A kind of automation monitoring learning method for supporting multi-source data
CN108154430A (en) * 2017-12-28 2018-06-12 上海氪信信息技术有限公司 A kind of credit scoring construction method based on machine learning and big data technology
CN108304941A (en) * 2017-12-18 2018-07-20 中国软件与技术服务股份有限公司 A kind of failure prediction method based on machine learning
CN108566364A (en) * 2018-01-15 2018-09-21 中国人民解放军国防科技大学 Intrusion detection method based on neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BAHAREH NAKISA等: "Long Short Term Memory Hyperparameter Optimization for a Neural Network Based Emotion Recognition Framework", 《IEEE ACCESS》 *
张浩: "自动化特征工程与参数调整算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王嘉卿: "欺诈网页挖掘中特征优选及检测性能研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
陈侨安等: "基于运行数据分析的Spark任务参数优化", 《计算机工程与科学》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861047A (en) * 2019-04-08 2020-10-30 阿里巴巴集团控股有限公司 Data processing control method and computing device
CN112380205A (en) * 2020-11-17 2021-02-19 北京融七牛信息技术有限公司 Method and system for automatically generating characteristics of distributed architecture
CN112380205B (en) * 2020-11-17 2024-04-02 北京融七牛信息技术有限公司 Automatic feature generation method and system of distributed architecture
WO2022193408A1 (en) * 2021-03-17 2022-09-22 中奥智能工业研究院(南京)有限公司 Automatic data analysis and modeling process

Also Published As

Publication number Publication date
CN109582724B (en) 2022-04-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant