CN109582724A - Distributed automated feature engineering system framework - Google Patents
Distributed automated feature engineering system framework
- Publication number
- CN109582724A; CN201811493937.1A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Accounting & Taxation (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Finance (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- General Business, Economics & Management (AREA)
- Complex Calculations (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses a distributed automated feature engineering system framework, implemented in the following steps: distributed automated feature computing cluster, dimensionality reduction algorithm, model training, and hyperparameter search. The framework is rationally designed and can save a large amount of labor cost for auto finance leasing companies: work that would otherwise be done by feature engineers and experienced domain risk control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk control report.
Description
Technical field
The invention is a distributed automated feature engineering system framework and belongs to the technical field of auto finance risk control.
Background art
In traditional auto finance risk control, domain experts formulate a series of risk control rules. Each rule may contain calculation formulas and is evaluated on the relevant data of the loan applicant; each rule may need one or more items of customer data (in this document, one item of customer data is defined as one feature). If the score obtained from the applicant's data after all rules are evaluated is lower than the risk control passing score, the loan application is rejected.
A characteristic of feature engineering is that a useful feature is often computed from several original features (that is, raw data) by some simple arithmetic. In other words, after the customer provides the data for the loan application, there is a further process of extracting and generating features, collectively called feature engineering. After new features are generated by feature engineering, all features are merged and fed into a machine learning algorithm.
The first problem is that experts who have risk control experience and can formulate risk control rules are a very scarce resource. The second problem is that even when experts do formulate risk control rules, the rules are distilled from the personal experience of those experts and cannot represent the overall needs of the whole auto finance industry. The third problem is that manual extraction of new features by experts or feature engineers is very time-consuming: a data set usually comes from several data sources, and a single data source often contains several data tables, so computing permutations and combinations over that much raw data may take a feature engineer several weeks.
It is therefore of great significance to invent a framework that can automatically complete the whole process of feature engineering and model training and output the final risk control report. To this end, the invention proposes a distributed automated feature engineering system framework.
Summary of the invention
In view of the shortcomings of the prior art, the object of the invention is to provide a distributed automated feature engineering system framework that solves the problems raised in the background art. The invention is rationally designed and can save a large amount of labor cost for auto finance leasing companies: work that would otherwise be done by feature engineers and experienced domain risk control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk control report.
To achieve the above object, the invention provides the following technical solution: a distributed automated feature engineering system framework, implemented in the following steps:
Step 1: Distributed automated feature computing cluster. A computer cluster is required. The first part is the distributed automated feature computing cluster. Assume the raw data consist of one master table and n secondary tables, which is the common data layout in auto finance risk control. Allocate n computers in the cluster, each with HBase and Python deployed. Each computer has two roles: distributed data storage and distributed computing. Each computer is assigned one of the n secondary tables, and the master table is copied to every computer; for example, computer n merges the master table with secondary table n and generates new features. The generation of new features is controlled by a set of parameters, and in general different secondary tables have different numbers of parameters. Suppose secondary table 1 has x₁ parameters, secondary table 2 has x₂ parameters, and so on; the n secondary tables then have x₁ + x₂ + … + xₙ parameters in total. The master table and all secondary tables can be split longitudinally so that they are distributed more evenly across the computers in the cluster. All features generated by the first part are merged into one large table, which serves as the input data of the second part. Feature extraction in the first part generates a large number of features; feeding all of them into the model would require a large number of model parameters, so dimensionality reduction is applied to all features before the data enter the model;
Step 2: Dimensionality reduction algorithm. The specific steps of the dimensionality reduction algorithm are as follows:
(1) Rank all features with a decision tree algorithm or another algorithm that can rank feature importance, to obtain a feature importance list; select x% (for example, 10%) of the features in the list as the basic feature list, denoted L;
(2) Take one feature f from the remaining features and add it to L;
(3) Run ridge regression, or a similar regression, on L;
(4) Compute the AUC;
(5) If the AUC improves (that is, accuracy improves), keep feature f; if the AUC does not improve, remove feature f;
(6) Repeat (2) to (5) until all features have been processed; when the loop completes, the dimensionality reduction algorithm yields the reduced feature set;
Step 3: Model training. The reduced feature set is fed into the model as input data for training. Model training uses an ensemble algorithm; the algorithms included are deep neural networks (using Python packages such as TensorFlow and Keras), gradient boosting machines (using Python packages such as LightGBM, xgboost and catboost) and random forests (using the scikit-learn Python package), among others. The main steps of model training are as follows:
(1) Determine the algorithm set, for example deep neural network, LightGBM, xgboost, catboost, random forest, and so on;
(2) For each algorithm in the algorithm set, input the reduced feature set, evaluate the model with k-fold cross-validation, and finally output the AUC; the output of each algorithm has only one column, namely the AUC values;
(3) Merge the output columns of all algorithms, namely the AUC columns, into one table A; for example, if the algorithm set contains 20 algorithms and each outputs one column, the merged table A has 20 columns;
(4) Determine the ensemble algorithm; commonly used ones include logistic regression and neural networks. If logistic regression is used as the ensemble algorithm, table A is input to the logistic regression algorithm and the final ensemble AUC value is computed. It can be verified that the ensemble AUC value is higher than the AUC value computed by any single algorithm. If there are 20 algorithms, they are deployed to 20 computers in the cluster, one algorithm per computer, and the feature set output by the dimensionality reduction algorithm of the second part is copied to all of these computers; that is, the 20 computers receive the same input data but run different algorithms. When all algorithms finish, the resulting AUC columns are gathered on one host for the ensemble computation. The AUC computed by the ensemble algorithm is one column; averaging this column gives the AUC mean (a single number). This AUC mean, the hyperparameters used in the model training of the third part, and the x₁ + x₂ + … + xₙ parameters of the feature engineering of the first part serve as the input data of the fourth part;
Step 4: Hyperparameter search. The hyperparameter search is treated as another model training task. The input of this model is all hyperparameters of the first and third parts, together with the AUC mean. All hyperparameters form a hyperparameter space, and the goal of the training is to find the point in this space at which the AUC reaches its maximum. Bayesian optimization is chosen as the hyperparameter search algorithm. Bayesian optimization is an iterative loop: each computation produces new hyperparameter values as feedback, these values are fed into the first part, and a new cycle begins. Each cycle (from the first part to the fourth part) adds one point to the hyperparameter space. As the number of cycles grows, the value obtained by Bayesian optimization gradually converges, and the loop stops once it has converged to a preset threshold.
In one embodiment, the specific implementation steps of the distributed automated feature engineering system framework are as follows:
(1) First, a computer cluster is required, for example on Alibaba Cloud;
(2) Install the cluster editions of Apache Hadoop and Apache HBase, together with Apache Phoenix and Apache Thrift, on every computer in the cluster, so that the whole cluster functions as a distributed database;
(3) Install MySQL on one of the computers, using the InnoDB database engine, and create a project database on MySQL. This project database manages the whole workflow: all computers in the cluster access it concurrently to obtain the status data of the workflow, which guarantees the consistency of the whole workflow;
(4) Install the required Python programs and Python packages on every computer in the cluster; the Python code deployed on every computer is identical;
(5) The Python programs on all computers start at the same time; parallel computation and concurrency control in the cluster are managed by the project database on MySQL;
(6) Order query results are returned in JSON format, and the risk control report is presented by the "looking for real" big data system.
In one embodiment, the AUC in Step 2 is an accuracy metric commonly used in the risk control field.
In one embodiment, the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, and so on.
With the above technical solution, on the one hand, a large amount of labor cost can be saved for auto finance leasing companies: work that would otherwise be done by feature engineers and experienced domain risk control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk control report. The company can also choose per-order billing, which offers small and medium-sized enterprises with low order volumes a pay-as-you-go model that is much more flexible than hiring experts;
On the other hand, compared with manual operation, automated feature engineering and model training are far more efficient. For a data volume of 100,000 orders, completing the work on the cluster takes about 500 hours of machine time (that is, about 25 hours if 20 computers run in parallel). Work that originally took one feature engineer plus one risk control expert several weeks can now be completed in about one day once it is distributed across the computers of the cluster. After the full data analysis is complete, each subsequent order risk control report query returns its result in less than 1 second;
In addition, in terms of accuracy, because the framework uses the Bayesian optimization algorithm to search all possible feature combinations, it can find useful new features that were not found manually. Automatic feature extraction by computer achieves the same effect as manual feature extraction by risk control experts. Moreover, the computation is based on data and is objective: the model does not depend on any subjective, manually created rules, which avoids model inaccuracy caused by lapses in the experience or judgment of feature engineers and risk control experts.
Description of the drawings
Fig. 1 is a functional block diagram of the distributed automated feature engineering system framework of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, the invention provides a distributed automated feature engineering system framework, implemented in the following steps:
Step 1: Distributed automated feature computing cluster. A computer cluster is required. The first part is the distributed automated feature computing cluster. Assume the raw data consist of one master table and n secondary tables, which is the common data layout in auto finance risk control. Allocate n computers in the cluster, each with HBase and Python deployed. Each computer has two roles: distributed data storage and distributed computing. Each computer is assigned one of the n secondary tables, and the master table is copied to every computer; for example, computer n merges the master table with secondary table n and generates new features. The generation of new features is controlled by a set of parameters, and in general different secondary tables have different numbers of parameters. Suppose secondary table 1 has x₁ parameters, secondary table 2 has x₂ parameters, and so on; the n secondary tables then have x₁ + x₂ + … + xₙ parameters in total. The master table and all secondary tables can be split longitudinally so that they are distributed more evenly across the computers in the cluster. All features generated by the first part are merged into one large table, which serves as the input data of the second part. Feature extraction in the first part generates a large number of features; feeding all of them into the model would require a large number of model parameters, so dimensionality reduction is applied to all features before the data enter the model;
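As a rough illustration of Step 1, the sketch below joins a master table with each secondary table and merges the generated features into one wide table. It is only a single-machine sketch under stated assumptions: the table names, column names and the use of aggregation functions as the "parameters" that control feature generation are all hypothetical, and pandas stands in for the HBase-backed cluster described in the text.

```python
import pandas as pd


def generate_features(master, secondary, key, value_col, aggs):
    """Aggregate one secondary table per key and align the result to the master rows.

    `aggs` plays the role of the parameter group that controls feature
    generation for this secondary table (a hypothetical choice).
    """
    agg = secondary.groupby(key)[value_col].agg(aggs)
    agg.columns = [f"{value_col}_{a}" for a in aggs]
    # Keys absent from the secondary table become NaN rows.
    return agg.reindex(master[key]).reset_index(drop=True)


# Hypothetical master table (one row per order) and two secondary tables.
master = pd.DataFrame({"order_id": [1, 2, 3], "loan_amount": [100.0, 200.0, 150.0]})
payments = pd.DataFrame({"order_id": [1, 1, 2], "amount": [10.0, 30.0, 50.0]})
calls = pd.DataFrame({"order_id": [1, 2, 2, 3], "duration": [5.0, 7.0, 9.0, 4.0]})

# In the cluster, each computer would process one secondary table;
# here the per-table jobs simply run one after another.
parts = [
    generate_features(master, payments, "order_id", "amount", ["sum", "mean"]),
    generate_features(master, calls, "order_id", "duration", ["max"]),
]

# Merge every generated feature into one wide table, the input of Step 2.
wide = pd.concat([master.reset_index(drop=True)] + parts, axis=1)
print(wide)
```

Each call to `generate_features` depends only on the master table and one secondary table, which is what allows the work to be assigned to separate computers.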
Step 2: Dimensionality reduction algorithm. The specific steps of the dimensionality reduction algorithm are as follows:
(1) Rank all features with a decision tree algorithm or another algorithm that can rank feature importance, to obtain a feature importance list; select x% (for example, 10%) of the features in the list as the basic feature list, denoted L;
(2) Take one feature f from the remaining features and add it to L;
(3) Run ridge regression, or a similar regression, on L;
(4) Compute the AUC;
(5) If the AUC improves (that is, accuracy improves), keep feature f; if the AUC does not improve, remove feature f;
(6) Repeat (2) to (5) until all features have been processed; when the loop completes, the dimensionality reduction algorithm yields the reduced feature set;
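The six sub-steps of the dimensionality reduction algorithm might be sketched as follows with scikit-learn. Several details are assumptions not stated in the text: the basic list L is taken from the top of the importance ranking, the AUC is measured on a held-out validation split, candidates are tried in importance order, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def ridge_auc(X_tr, y_tr, X_va, y_va, cols):
    """Fit ridge regression on the chosen columns and return the validation AUC."""
    model = Ridge(alpha=1.0).fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_va, model.predict(X_va[:, cols]))


def reduce_features(X, y, base_fraction=0.10, seed=0):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)

    # (1) Rank features by decision-tree importance; the top x% form L.
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    ranked = [int(i) for i in np.argsort(tree.feature_importances_)[::-1]]
    n_base = max(1, round(base_fraction * X.shape[1]))
    L = ranked[:n_base]
    best_auc = ridge_auc(X_tr, y_tr, X_va, y_va, L)

    # (2)-(6) Try each remaining feature; keep it only if the AUC improves.
    for f in ranked[n_base:]:
        auc = ridge_auc(X_tr, y_tr, X_va, y_va, L + [f])
        if auc > best_auc:
            L, best_auc = L + [f], auc
    return L, best_auc


# Synthetic stand-in for the wide feature table produced by Step 1.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)
selected, auc = reduce_features(X, y)
print(len(selected), "features kept, validation AUC:", auc)
```

Because a candidate is kept only when it raises the AUC, the result depends on the order in which candidates are tried; the text does not fix that order, so importance order is used here.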
Step 3: Model training. The reduced feature set is fed into the model as input data for training. Model training uses an ensemble algorithm; the algorithms included are deep neural networks (using Python packages such as TensorFlow and Keras), gradient boosting machines (using Python packages such as LightGBM, xgboost and catboost) and random forests (using the scikit-learn Python package), among others. The main steps of model training are as follows:
(1) Determine the algorithm set, for example deep neural network, LightGBM, xgboost, catboost, random forest, and so on;
(2) For each algorithm in the algorithm set, input the reduced feature set, evaluate the model with k-fold cross-validation, and finally output the AUC; the output of each algorithm has only one column, namely the AUC values;
(3) Merge the output columns of all algorithms, namely the AUC columns, into one table A; for example, if the algorithm set contains 20 algorithms and each outputs one column, the merged table A has 20 columns;
(4) Determine the ensemble algorithm; commonly used ones include logistic regression and neural networks. If logistic regression is used as the ensemble algorithm, table A is input to the logistic regression algorithm and the final ensemble AUC value is computed. It can be verified that the ensemble AUC value is higher than the AUC value computed by any single algorithm. If there are 20 algorithms, they are deployed to 20 computers in the cluster, one algorithm per computer, and the feature set output by the dimensionality reduction algorithm of the second part is copied to all of these computers; that is, the 20 computers receive the same input data but run different algorithms. When all algorithms finish, the resulting AUC columns are gathered on one host for the ensemble computation. The AUC computed by the ensemble algorithm is one column; averaging this column gives the AUC mean (a single number). This AUC mean, the hyperparameters used in the model training of the third part, and the x₁ + x₂ + … + xₙ parameters of the feature engineering of the first part serve as the input data of the fourth part;
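One possible reading of the model training step is a stacking ensemble, sketched below with scikit-learn only. The interpretation is an assumption: each algorithm's single output column is taken to be its out-of-fold predicted score from k-fold cross-validation, table A holds those columns, and the ensemble's "AUC column" is taken to be the per-fold AUC of the logistic regression, which is then averaged. The three base learners are stand-ins for the larger algorithm set named in the text, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def train_ensemble(X, y, algorithms, k=5, seed=0):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    columns, single_auc = [], {}
    # (2) One out-of-fold score column per algorithm, from k-fold cross-validation.
    for name, model in algorithms.items():
        oof = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
        columns.append(oof)
        single_auc[name] = roc_auc_score(y, oof)
    # (3) Merge the columns into table A: one column per algorithm.
    table_a = np.column_stack(columns)
    # (4) Logistic regression as the ensemble algorithm, scored per fold.
    meta_cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed + 1)
    fold_auc = cross_val_score(LogisticRegression(), table_a, y,
                               cv=meta_cv, scoring="roc_auc")
    return table_a, single_auc, fold_auc, float(fold_auc.mean())


# Hypothetical algorithm set; in the cluster each entry would run on its own computer.
algorithms = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=15),
}
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
table_a, single_auc, fold_auc, auc_mean = train_ensemble(X, y, algorithms)
print(single_auc)
print("ensemble AUC mean:", auc_mean)
```

The value `auc_mean` corresponds to the single number that the text passes on to the hyperparameter search.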
Step 4: Hyperparameter search. The hyperparameter search is treated as another model training task. The input of this model is all hyperparameters of the first and third parts, together with the AUC mean. All hyperparameters form a hyperparameter space, and the goal of the training is to find the point in this space at which the AUC reaches its maximum. Bayesian optimization is chosen as the hyperparameter search algorithm. Bayesian optimization is an iterative loop: each computation produces new hyperparameter values as feedback, these values are fed into the first part, and a new cycle begins. Each cycle (from the first part to the fourth part) adds one point to the hyperparameter space. As the number of cycles grows, the value obtained by Bayesian optimization gradually converges, and the loop stops once it has converged to a preset threshold.
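The loop in Step 4 can be illustrated with a small Bayesian optimization sketch built from a Gaussian process surrogate and an expected-improvement rule. Everything specific here is an assumption for illustration: the hyperparameter space is one-dimensional, `pipeline_auc` is a hypothetical toy function standing in for a full pass through parts one to three, and the stopping rule (expected improvement below a preset threshold) is one way to realize the convergence threshold mentioned in the text.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def pipeline_auc(h):
    """Hypothetical stand-in: AUC mean returned by one full cycle for hyperparameter h."""
    return 0.9 - 0.4 * (h - 0.3) ** 2


def expected_improvement(candidates, gp, best):
    """Expected improvement over the best AUC seen so far (maximization)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)


def bayes_search(objective, low=0.0, high=1.0, n_init=3, max_iter=25,
                 tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    xs = [float(x) for x in rng.uniform(low, high, n_init)]
    ys = [objective(x) for x in xs]
    grid = np.linspace(low, high, 201).reshape(-1, 1)
    for _ in range(max_iter):
        # Refit the surrogate on every point observed so far.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                      normalize_y=True, random_state=seed)
        gp.fit(np.array(xs).reshape(-1, 1), np.array(ys))
        ei = expected_improvement(grid, gp, max(ys))
        if ei.max() < tol:  # converged to the preset threshold
            break
        # New hyperparameter value fed back into the pipeline; each cycle adds one point.
        x_next = float(grid[int(np.argmax(ei)), 0])
        xs.append(x_next)
        ys.append(objective(x_next))
    best = int(np.argmax(ys))
    return xs[best], ys[best], len(xs)


best_h, best_auc, n_evals = bayes_search(pipeline_auc)
print(best_h, best_auc, n_evals)
```

In the framework described above, each call to the objective would be a complete cycle through feature generation, dimensionality reduction and model training, so keeping the number of cycles small is the main reason to prefer this kind of guided search over an exhaustive one.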
In the present embodiment, the specific implementation steps of the distributed automated feature engineering system framework are as follows:
(1) First, a computer cluster is required, for example on Alibaba Cloud;
(2) Install the cluster editions of Apache Hadoop and Apache HBase, together with Apache Phoenix and Apache Thrift, on every computer in the cluster, so that the whole cluster functions as a distributed database;
(3) Install MySQL on one of the computers, using the InnoDB database engine, and create a project database on MySQL. This project database manages the whole workflow: all computers in the cluster access it concurrently to obtain the status data of the workflow, which guarantees the consistency of the whole workflow;
(4) Install the required Python programs and Python packages on every computer in the cluster; the Python code deployed on every computer is identical;
(5) The Python programs on all computers start at the same time; parallel computation and concurrency control in the cluster are managed by the project database on MySQL;
(6) Order query results are returned in JSON format, and the risk control report is presented by the "looking for real" big data system.
Further, the AUC in Step 2 is an accuracy metric commonly used in the risk control field.
With the above structure, when the distributed automated feature engineering system framework is applied, on the one hand, a large amount of labor cost can be saved for auto finance leasing companies: work that would otherwise be done by feature engineers and experienced domain risk control experts is completed automatically by the process of the invention. The auto finance leasing company only needs to provide raw, unprocessed business data, and the invention automatically completes the whole process of feature engineering and model training and outputs the final risk control report. The company can also choose per-order billing, which offers small and medium-sized enterprises with low order volumes a pay-as-you-go model that is much more flexible than hiring experts. On the other hand, compared with manual operation, automated feature engineering and model training are far more efficient. For a data volume of 100,000 orders, completing the work on the cluster takes about 500 hours of machine time (that is, about 25 hours if 20 computers run in parallel). Work that originally took one feature engineer plus one risk control expert several weeks can now be completed in about one day once it is distributed across the computers of the cluster. After the full data analysis is complete, each subsequent order risk control report query returns its result in less than 1 second.
Preferably, the present embodiment also has the following configuration: the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, and so on.
In addition, in terms of accuracy, because the framework uses the Bayesian optimization algorithm to search all possible feature combinations, it can find useful new features that were not found manually. Automatic feature extraction by computer achieves the same effect as manual feature extraction by risk control experts. Moreover, the computation is based on data and is objective: the model does not depend on any subjective, manually created rules, which avoids model inaccuracy caused by lapses in the experience or judgment of feature engineers and risk control experts.
In addition, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way merely for clarity; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be suitably combined to form other embodiments that those skilled in the art can understand.
Claims (4)
1. A distributed automated feature engineering system framework, characterized in that the framework is implemented in the following steps:
Step 1: Distributed automated feature computing cluster. A computer cluster is required. The first part is the distributed automated feature computing cluster. Assume the raw data consist of one master table and n secondary tables, which is the common data layout in auto finance risk control. Allocate n computers in the cluster, each with HBase and Python deployed. Each computer has two roles: distributed data storage and distributed computing. Each computer is assigned one of the n secondary tables, and the master table is copied to every computer; for example, computer n merges the master table with secondary table n and generates new features. The generation of new features is controlled by a set of parameters, and in general different secondary tables have different numbers of parameters. Suppose secondary table 1 has x₁ parameters, secondary table 2 has x₂ parameters, and so on; the n secondary tables then have x₁ + x₂ + … + xₙ parameters in total. The master table and all secondary tables can be split longitudinally so that they are distributed more evenly across the computers in the cluster. All features generated by the first part are merged into one large table, which serves as the input data of the second part. Feature extraction in the first part generates a large number of features; feeding all of them into the model would require a large number of model parameters, so dimensionality reduction is applied to all features before the data enter the model;
Step 2: Dimensionality reduction algorithm. The specific steps of the dimensionality reduction algorithm are as follows:
(1) Rank all features with a decision tree algorithm or another algorithm that can rank feature importance, to obtain a feature importance list; select x% (for example, 10%) of the features in the list as the basic feature list, denoted L;
(2) Take one feature f from the remaining features and add it to L;
(3) Run ridge regression, or a similar regression, on L;
(4) Compute the AUC;
(5) If the AUC improves (that is, accuracy improves), keep feature f; if the AUC does not improve, remove feature f;
(6) Repeat (2) to (5) until all features have been processed; when the loop completes, the dimensionality reduction algorithm yields the reduced feature set;
Step 3: model training;Feature set after dimensionality reduction will be put into model as input data and be trained, and model training uses
Integrated Algorithm, the algorithm for including have deep neural network (using python packets such as TensorFlow, Keras), gradient elevator
Device (using the python packets such as LightGBM, xgboost, catboost) and random forest (use scikit-learnpython
Packet) scheduling algorithm, the key step of model training is as follows:
1.: determine the algorithm set, for example deep neural network, LightGBM, xgboost, catboost, random forest, etc.;
2.: for each algorithm in the algorithm set, input the reduced feature set, evaluate the model with k-fold cross-validation, and finally output the AUC; the output of each algorithm is a single column, namely the AUC values;
3.: merge the output columns (i.e. the AUC) of all algorithms into one table A; for example, if the algorithm set contains 20 algorithms and each algorithm outputs one column, the merged table A will have 20 columns;
4.: determine the ensemble algorithm; commonly used choices include logistic regression and neural networks. If logistic regression is used as the ensemble algorithm, table A is input to the logistic regression algorithm and the final ensemble AUC value is calculated. It can be verified that the AUC value after ensembling is higher than the AUC value calculated by any single algorithm. If there are 20 algorithms, these 20 algorithms are deployed on 20 separate computers in the cluster, and the feature set output by the dimension-reduction algorithm of Step 2 is copied to all of these computers; that is, the 20 computers receive the same input data but run different algorithms. After all algorithms have finished, the resulting AUC columns are gathered on one host, where the ensemble algorithm is computed. The AUC calculated by the ensemble algorithm is one column; averaging the values of this column gives the AUC mean (a single number). This AUC mean, together with the hyperparameters used in the model training of Step 3 and the x1+x2+…+xn parameters of the feature engineering of Step 1, serves as the input data of Step 4;
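The four training steps above can be sketched as a small stacking pipeline. The text calls each algorithm's output column "AUC"; this sketch reads that column as the algorithm's out-of-fold scores from k-fold cross-validation, and reads the ensemble's "AUC column" as one AUC per fold, which is then averaged. Both readings are assumptions. The base learners here (`ridge_scorer`, `centroid_scorer`) are deliberately simple stand-ins for the deep neural network, gradient boosting and random forest models named in the text, and every function name is hypothetical.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def ridge_scorer(alpha):
    """Stand-in base algorithm: closed-form ridge regression scores."""
    def fit_predict(X_tr, y_tr, X_te):
        mu, y_mu = X_tr.mean(axis=0), y_tr.mean()
        Xc = X_tr - mu
        w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]),
                            Xc.T @ (y_tr - y_mu))
        return (X_te - mu) @ w + y_mu
    return fit_predict

def centroid_scorer(X_tr, y_tr, X_te):
    """Stand-in base algorithm: closer to the positive centroid scores higher."""
    c1, c0 = X_tr[y_tr == 1].mean(axis=0), X_tr[y_tr == 0].mean(axis=0)
    return np.linalg.norm(X_te - c0, axis=1) - np.linalg.norm(X_te - c1, axis=1)

def logistic_fit_predict(A_tr, y_tr, A_te, lr=0.5, steps=500):
    """Ensemble algorithm: logistic regression fitted by gradient descent."""
    Xb = np.hstack([A_tr, np.ones((len(A_tr), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y_tr) / len(y_tr)
    Xt = np.hstack([A_te, np.ones((len(A_te), 1))])
    return 1.0 / (1.0 + np.exp(-Xt @ w))

def train_indices(folds, i):
    """All sample indices except those of fold i."""
    return np.concatenate([f for j, f in enumerate(folds) if j != i])

def stacked_mean_auc(X, y, algorithms, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    # Steps 2 and 3: one out-of-fold column per algorithm, merged into table A.
    table_a = np.empty((len(y), len(algorithms)))
    for col, algo in enumerate(algorithms):
        for i, te in enumerate(folds):
            tr = train_indices(folds, i)
            table_a[te, col] = algo(X[tr], y[tr], X[te])
    # Step 4: feed table A to the ensemble algorithm, one AUC per fold.
    fold_aucs = []
    for i, te in enumerate(folds):
        tr = train_indices(folds, i)
        p = logistic_fit_predict(table_a[tr], y[tr], table_a[te])
        fold_aucs.append(auc(y[te], p))
    return table_a, float(np.mean(fold_aucs))
```

In the distributed setting described in step 4., each column of `table_a` would be produced on a different computer and the final loop would run on the single host that gathers the columns.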
Step 4: hyperparameter search; the hyperparameter search is treated as another model-training problem. The input of this model is all the hyperparameters of Step 1 and Step 3 together with the AUC mean. All the hyperparameters form a hyperparameter space, and the goal of training is to find the point in this space at which the AUC reaches its maximum. Bayesian optimization is chosen as the hyperparameter search algorithm. Bayesian optimization is an iterative loop: each calculation produces new hyperparameter values as feedback, these new values are fed back into Step 1, and a new cycle begins. Each cycle (i.e. from Step 1 through Step 4) adds one point to the hyperparameter space. As the number of cycles grows, the values obtained by the Bayesian optimization algorithm gradually converge, and the loop stops once they have converged to within a preset threshold.
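The loop described in Step 4 can be illustrated with a small one-dimensional Bayesian optimization sketch: a Gaussian-process surrogate with an RBF kernel and an expected-improvement acquisition function. Several things here are assumptions rather than details from the text: there is a single hyperparameter scaled to [0, 1], `objective` is a hypothetical callable that runs Steps 1 to 3 and returns the AUC mean, and "converged to a preset threshold" is read as "no improvement larger than `tol` for `patience` consecutive cycles".

```python
import math
import numpy as np

def rbf(a, b, length=0.2):
    """RBF kernel matrix between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a unit-variance GP."""
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_obs, x_new)
    mu0 = y_obs.mean()
    mean = mu0 + Ks.T @ np.linalg.solve(K, y_obs - mu0)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best):
    """Expected improvement over the best observed value (maximization)."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf

def bayes_optimize(objective, low, high, n_init=3, max_iter=30,
                   tol=1e-4, patience=3):
    grid = np.linspace(low, high, 201)        # candidate hyperparameter values
    x_obs = np.linspace(low, high, n_init)    # initial design
    y_obs = np.array([objective(x) for x in x_obs])
    stall = 0
    for _ in range(max_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        ei = expected_improvement(mean, std, y_obs.max())
        x_next = grid[int(np.argmax(ei))]     # new hyperparameter value
        y_next = objective(x_next)            # one full cycle, Steps 1 to 3
        improvement = y_next - y_obs.max()
        x_obs = np.append(x_obs, x_next)      # one more point in the space
        y_obs = np.append(y_obs, y_next)
        stall = stall + 1 if improvement < tol else 0
        if stall >= patience:                 # assumed convergence rule
            break
    best = int(np.argmax(y_obs))
    return float(x_obs[best]), float(y_obs[best]), len(x_obs)
```

For a quick check, a synthetic surface such as `lambda h: 0.9 - (h - 0.3) ** 2` can stand in for the real pipeline; the function returns the best hyperparameter value found, its AUC mean, and the number of evaluations performed.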
2. The distributed automatic feature engineering system architecture according to claim 1, characterized in that the specific implementation steps of the distributed automatic feature engineering system architecture are as follows:
1.: first, a computer cluster is required, for example on Alibaba Cloud;
2.: on every computer in the cluster, install the cluster versions of Apache Hadoop and Apache HBase, together with Apache Phoenix and Apache Thrift, so that the whole cluster functions as a distributed database;
3.: on one of the computers, install MySQL with the InnoDB database engine and create a project database on MySQL; this project database manages the entire workflow: all computers in the cluster access it concurrently to obtain the workflow status data, which keeps the whole workflow consistent;
4.: install the required Python program and Python packages on every computer in the cluster; the Python code deployed on every computer is identical;
5.: the Python programs on all computers start simultaneously; parallel computation and concurrency control within the cluster are managed by the project database on MySQL;
6.: order query results are returned in JSON format, and the risk-control report is presented by the "looking for real" big data system.
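The role of the project database in steps 3. and 5. can be sketched as a shared task table from which each computer claims work atomically. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for MySQL with InnoDB, so that it runs without a server; the table name `task`, its columns, and all function names are hypothetical. On InnoDB, row-level locking (for example `SELECT ... FOR UPDATE`) would likely serve the same purpose as the conditional `UPDATE` used here.

```python
import sqlite3

def init_project_db(conn, algorithms):
    """Create the status table and register one pending task per algorithm."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task ("
        "algorithm TEXT PRIMARY KEY, status TEXT NOT NULL, worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO task (algorithm, status) VALUES (?, 'pending')",
        [(a,) for a in algorithms],
    )
    conn.commit()

def claim_task(conn, worker):
    """Claim one pending task for this worker; return its name or None."""
    while True:
        row = conn.execute(
            "SELECT algorithm FROM task WHERE status = 'pending' "
            "ORDER BY algorithm LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to do
        # The status check in WHERE makes the claim safe if another
        # worker took the same row between the SELECT and the UPDATE.
        cur = conn.execute(
            "UPDATE task SET status = 'running', worker = ? "
            "WHERE algorithm = ? AND status = 'pending'",
            (worker, row[0]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]

def finish_task(conn, algorithm):
    """Mark a task as done so other computers can see the workflow state."""
    conn.execute("UPDATE task SET status = 'done' WHERE algorithm = ?", (algorithm,))
    conn.commit()

def task_status(conn):
    """Return the current workflow state as {algorithm: status}."""
    return dict(conn.execute("SELECT algorithm, status FROM task"))
```

Because every computer runs the same code (step 4.), each one would simply call `claim_task` in a loop with its own worker name, run the claimed algorithm, and then call `finish_task`.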
3. The distributed automatic feature engineering system architecture according to claim 1, characterized in that the AUC in Step 2 is an accuracy metric commonly used in the risk-control field.
4. The distributed automatic feature engineering system architecture according to claim 2, characterized in that the Python packages include numpy, pandas, scipy, scikit-learn, TensorFlow, Keras, LightGBM, xgboost, catboost, etc.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811493937.1A CN109582724B (en) | 2018-12-07 | 2018-12-07 | Distributed automatic feature engineering system architecture |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811493937.1A CN109582724B (en) | 2018-12-07 | 2018-12-07 | Distributed automatic feature engineering system architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109582724A (en) | 2019-04-05 |
CN109582724B (en) | 2022-04-08 |
Family
ID=65929000
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811493937.1A Active CN109582724B (en) | 2018-12-07 | 2018-12-07 | Distributed automatic feature engineering system architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109582724B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140310221A1 (en) * | 2013-04-12 | 2014-10-16 | Nec Laboratories America, Inc. | Interpretable sparse high-order boltzmann machines |
US20160232540A1 (en) * | 2015-02-10 | 2016-08-11 | EverString Innovation Technology | Predictive analytics for leads generation and engagement recommendations |
CN106339608A (en) * | 2016-11-09 | 2017-01-18 | 中国科学院软件研究所 | Traffic accident rate predicting system based on online variational Bayesian support vector regression |
CN107103332A (en) * | 2017-04-07 | 2017-08-29 | 武汉理工大学 | A kind of Method Using Relevance Vector Machine sorting technique towards large-scale dataset |
CN107516135A (en) * | 2017-07-14 | 2017-12-26 | 浙江大学 | A kind of automation monitoring learning method for supporting multi-source data |
CN108154430A (en) * | 2017-12-28 | 2018-06-12 | 上海氪信信息技术有限公司 | A kind of credit scoring construction method based on machine learning and big data technology |
CN108304941A (en) * | 2017-12-18 | 2018-07-20 | 中国软件与技术服务股份有限公司 | A kind of failure prediction method based on machine learning |
CN108566364A (en) * | 2018-01-15 | 2018-09-21 | 中国人民解放军国防科技大学 | Intrusion detection method based on neural network |
Non-Patent Citations (4)
Title |
---|
BAHAREH NAKISA等: "Long Short Term Memory Hyperparameter Optimization for a Neural Network Based Emotion Recognition Framework", 《IEEE ACCESS》 * |
张浩: "自动化特征工程与参数调整算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王嘉卿: "欺诈网页挖掘中特征优选及检测性能研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
陈侨安等: "基于运行数据分析的Spark任务参数优化", 《计算机工程与科学》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111861047A (en) * | 2019-04-08 | 2020-10-30 | 阿里巴巴集团控股有限公司 | Data processing control method and computing device |
CN112380205A (en) * | 2020-11-17 | 2021-02-19 | 北京融七牛信息技术有限公司 | Method and system for automatically generating characteristics of distributed architecture |
CN112380205B (en) * | 2020-11-17 | 2024-04-02 | 北京融七牛信息技术有限公司 | Automatic feature generation method and system of distributed architecture |
WO2022193408A1 (en) * | 2021-03-17 | 2022-09-22 | 中奥智能工业研究院(南京)有限公司 | Automatic data analysis and modeling process |
Also Published As
Publication number | Publication date |
---|---|
CN109582724B (en) | 2022-04-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8943059B2 (en) | Systems and methods for merging source records in accordance with survivorship rules | |
WO2019147851A2 (en) | Systems and methods for generating machine learning applications | |
CN109582724A (en) | Distributed automated characterization engineering system framework | |
US20200042659A1 (en) | Autonomous surrogate model creation platform | |
CN108052394A (en) | The method and computer equipment of resource allocation based on SQL statement run time | |
CN102760143A (en) | Method and device for dynamically integrating executing structures in database system | |
Akopov | Parallel genetic algorithm with fading selection | |
Xu et al. | Adaptive surrogate-based design optimization with expected improvement used as infill criterion | |
CN103745319B (en) | A kind of data provenance traceability system based on multi-state scientific workflow and method | |
CN111967971A (en) | Bank client data processing method and device | |
Wu et al. | Two layered approaches integrating harmony search with genetic algorithm for the integrated process planning and scheduling problem | |
CN115034409A (en) | Vehicle maintenance scheme determination method, device, equipment and storage medium | |
CN106228263A (en) | Materials stream informationization methods based on big data | |
CN112507098B (en) | Question processing method, question processing device, electronic equipment, storage medium and program product | |
CN111768096A (en) | Rating method and device based on algorithm model, electronic equipment and storage medium | |
GB2599334A (en) | Feature engineering in neural networks optimization | |
CN115965154A (en) | Knowledge graph-based digital twin machining process scheduling method | |
WO2024007604A1 (en) | Mathematical model solving method and apparatus, and computing device and computing device cluster | |
CN116384606A (en) | Scheduling optimization method and system based on cooperative distribution of vehicle unmanned aerial vehicle | |
CN109711558A (en) | For the method and device of the machine learning model of feature construction containing latent instability | |
Sun et al. | [Retracted] Impact of Financial R&D Resource Allocation Efficiency Based on VR Technology and Machine Learning in Complex Systems on Total Factor Productivity | |
EP4246375A1 (en) | Model processing method and related device | |
CN104778253B (en) | A kind of method and apparatus that data are provided | |
Chen et al. | An Attribute Reduction Algorithm Based on Rough Set Theory and an Improved Genetic Algorithm. | |
Gao et al. | Research on product sales forecasting based on multi-value chain collaborative data management system in manufacturing industry |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||