CN107958268A

CN107958268A - Training method and device for a data model

Info

Publication number: CN107958268A
Application number: CN201711175464.6A
Authority: CN
Inventors: 王雪洁; 李长山
Original assignee: Uf Financial Information Technology Ltd By Share Ltd
Current assignee: Uf Financial Information Technology Ltd By Share Ltd
Priority date: 2017-11-22
Filing date: 2017-11-22
Publication date: 2018-04-24

Abstract

The present invention provides a training method and device for a data model. The training method includes: obtaining a modeling problem type and sample data, and identifying the sample data type; determining sample parameters and an issuable index according to the modeling problem type and the sample data; determining a modeling algorithm according to the modeling problem type, the sample parameters and a preset model selection strategy; training a data model according to the modeling algorithm, and inputting the sample data into the data model to obtain an output result; scoring the output result to obtain a scoring result; judging whether the scoring result meets the issuable index; and, when the scoring result does not meet the issuable index, optimizing the preset model selection strategy and returning to the step of determining a modeling algorithm according to the modeling problem type, the sample parameters and the preset model selection strategy. The invention introduces an automated scoring mechanism into the modeling process to optimize both the data model and the model selection strategy, which reduces manual intervention and improves modeling efficiency.

Description

Training method and device for data model

Technical Field

The invention relates to the technical field of data mining, in particular to a training method and device of a data model.

Background

Using mining analysis based on big data to support enterprise decision making requires, on the premise that the data is of accurate quality, a good business understanding of the data, so that a targeted analysis and prediction model can be trained from massive data with a suitable mining algorithm and then deployed to production. Fig. 1 shows a flow diagram of classical data mining in the background of the invention. The classical data mining process (CRISP-DM: Cross-Industry Standard Process for Data Mining) is shown in fig. 1. When analyzing and modeling business data, business modeling staff generally explore, process and model the data with analysis and mining tools such as SPSS, SAS and R: they convert the business problem into a data problem, and model the data once data analysis and processing are complete. During data modeling, the analysis and prediction model trained on the sample data must be evaluated (for accuracy, error and the like) to judge whether it can be put into a production environment and deployed to solve the business problem.

FIG. 2 shows a flow diagram of classical data modeling in the background of the invention. As shown in fig. 2, business modeling personnel take the preprocessed (filtered, converted, combined, etc.) data and, guided by experience, the business problem, statistical analysis and visual exploration, select different mining algorithms (classification, clustering, association, etc.) for training and evaluation. Training on the input sample data yields the parameter values of the corresponding algorithm model, and the accuracy of the model is evaluated on a verification data set to determine whether the model can be put into a production environment. In the production environment, the generated business data is fed to the model as input, and the model computes an analysis and prediction result that serves as a reference for production decisions.

In the whole modeling analysis process, the flow in the dotted-frame part and the production environment deployment process require modeling personnel to select a corresponding mining algorithm for training according to their own business domain knowledge. When the training result does not meet the requirement (for example, the error is large), the algorithm or its parameters must be readjusted, and a large number of attempts are often needed to find a relatively optimized model result. This step generally takes up a large portion of the entire analysis and mining project.

Disclosure of Invention

The present invention has been made to solve at least one of the problems occurring in the prior art or the related art.

To this end, a first aspect of the present invention is directed to a method for training a data model.

A second aspect of the present invention is to provide a training apparatus for data models.

In view of this, according to a first aspect of the present invention, a method for training a data model is provided, including: obtaining a modeling problem type and sample data, and identifying the type of the sample data; determining sample parameters and issuable indexes according to the modeling problem type and the sample data; determining a modeling algorithm according to the modeling problem type, the sample parameters and a preset model selection strategy; training a data model according to the modeling algorithm, and inputting the sample data into the data model to obtain an output result; scoring the output result to obtain a scoring result; judging whether the scoring result meets the issuable indexes; and, when the scoring result does not meet the issuable indexes, optimizing the preset model selection strategy and returning to the step of determining a modeling algorithm according to the modeling problem type, the sample parameters and the preset model selection strategy.

The invention provides a training method of a data model. The method first identifies the type of the obtained sample data (for example, whether the sample data is numeric or character type, continuous or discrete). According to the sample data type and the obtained modeling problem type (such as a classification, clustering or association problem), it determines sample parameters (such as classification indexes for classification problems or mean values for clustering problems) and issuable indexes (such as an accuracy above 95%). It then selects one or more modeling algorithms from a modeling algorithm cluster according to the modeling problem type, the sample parameters and a preset model selection strategy, and trains the data model. Finally, it scores the data model using the sample data and judges whether the scoring result meets the issuable indexes; if not, it optimizes the preset model selection strategy and returns to re-determine the modeling algorithm. In this way, the corresponding mining algorithm is selected automatically through the preset model selection strategy to model the sample data, and the strategy itself is optimized automatically by evaluating the data model, without manual intervention. This improves the objectivity of the model, reduces subjective omissions and errors by modeling personnel, allows a deployable model that meets production requirements to be selected, lowers the threshold for business modeling personnel to apply mining algorithms, and improves modeling accuracy and efficiency.
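The type-identification step mentioned above (numeric or character, continuous or discrete) could be sketched as follows. This is only an illustration: the function name `identify_column_type` and the cutoff `max_discrete_levels` are assumptions, since the text does not say how the distinction is drawn.

```python
from typing import Sequence


def identify_column_type(values: Sequence[object], max_discrete_levels: int = 10) -> str:
    """Label a column as character, discrete numeric, or continuous numeric.

    `max_discrete_levels` is a hypothetical heuristic cutoff, not a value
    taken from the patent text.
    """
    numeric = []
    for v in values:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            # Any value that cannot be parsed as a number makes the column character type.
            return "character"
    # Few distinct numeric values are treated as discrete, many as continuous.
    if len(set(numeric)) <= max_discrete_levels:
        return "numeric-discrete"
    return "numeric-continuous"
```

A column such as `[0, 1, 1, 0]` would be labelled discrete under this heuristic, while a column with many distinct measurements would be labelled continuous.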

The training method of the data model according to the present invention may further have the following technical features:

In the above technical solution, preferably, determining a modeling algorithm according to the modeling problem type, the sample parameters and the preset model selection strategy specifically includes: determining the range of modeling algorithm types according to the modeling problem type; and determining a modeling algorithm within that range according to the sample parameters and the preset model selection strategy.

In this technical scheme, the range of modeling algorithm types is determined according to the modeling problem type. For example, if the modeling problem is a classification problem, algorithms corresponding to classification problems, such as decision trees, logistic regression and fuzzy rules, can be selected from the modeling algorithm cluster. Because the sample parameters reflect the characteristics of the sample data, the one or more algorithms finally used for modeling are then selected within that range according to the sample parameters and the preset model selection strategy, which makes the modeling more accurate and reliable and improves modeling efficiency.
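The two-stage selection described here can be sketched roughly as below. The contents of `ALGORITHM_CLUSTER` beyond the three classification algorithms named in the text (`k_means`, `apriori`), the `n_classes` sample parameter, and the toy `example_strategy` are all assumptions made for illustration; the text does not define the form of the selection strategy.

```python
from typing import Callable, Dict, List

# Hypothetical algorithm cluster keyed by modeling problem type.
ALGORITHM_CLUSTER: Dict[str, List[str]] = {
    "classification": ["decision_tree", "logistic_regression", "fuzzy_rules"],
    "clustering": ["k_means"],      # assumed entry
    "association": ["apriori"],     # assumed entry
}

# A strategy maps (algorithm name, sample parameters) to a preference value.
Strategy = Callable[[str, dict], float]


def select_algorithms(problem_type: str, sample_params: dict,
                      strategy: Strategy, top_k: int = 1) -> List[str]:
    # Stage 1: the problem type bounds the range of candidate algorithms.
    candidates = ALGORITHM_CLUSTER[problem_type]
    # Stage 2: the strategy ranks the candidates using the sample parameters.
    ranked = sorted(candidates, key=lambda name: strategy(name, sample_params),
                    reverse=True)
    return ranked[:top_k]


def example_strategy(name: str, sample_params: dict) -> float:
    # Toy preference: favour logistic regression for two-class targets.
    if name == "logistic_regression" and sample_params.get("n_classes") == 2:
        return 1.0
    return 0.5 if name == "decision_tree" else 0.0
```

Keeping the strategy as a plain callable means an optimization step can replace or re-weight it between rounds without touching the selection code.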

In any of the above technical solutions, preferably, the scoring result includes an accuracy score and at least one of, or a combination of, the following: a performance index score, a stability index score, and a custom index score.

In this technical scheme, the scoring of the data model comprises an accuracy score, a performance index score, a stability index score and a custom index score. A user can select among them according to actual needs, and considering all of these aspects together helps ensure the reliability of the data model.

In any of the above technical solutions, preferably, the calculation formula of the scoring result is:

$$
\mathrm{SCORE}_{total} = \mathrm{SCORE}_{acc} \times W_{acc} + \mathrm{SCORE}_{perf} \times W_{perf} + \mathrm{SCORE}_{robust} \times W_{robust} + \mathrm{SCORE}_{cust} \times W_{cust}
$$

wherein $\mathrm{SCORE}_{total}$ is the total score, $\mathrm{SCORE}_{acc}$ is the accuracy score, $W_{acc}$ is the preset accuracy score weight, $\mathrm{SCORE}_{perf}$ is the performance index score, $W_{perf}$ is the preset performance index score weight, $\mathrm{SCORE}_{robust}$ is the stability index score, $W_{robust}$ is the preset stability index score weight, $\mathrm{SCORE}_{cust}$ is the custom index score, and $W_{cust}$ is the preset custom index score weight.

In this technical scheme, the scoring result of the data model is the weighted sum of the accuracy score, the performance index score, the stability index score and the custom index score. A user can select one or more items according to actual needs to score the data model and adjust the weights accordingly. Generally, the accuracy score carries the highest weight, which helps ensure the reliability of the data model.
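A minimal sketch of this weighted sum follows. The dictionary keys (`acc`, `perf`, `robust`, `cust`) and the convention that an unselected item simply contributes zero are assumptions chosen for the illustration.

```python
def total_score(scores: dict, weights: dict) -> float:
    """Weighted sum over the four score components.

    A component missing from `scores` or `weights` contributes 0, which
    models a user who has not selected that optional item.
    """
    components = ("acc", "perf", "robust", "cust")
    return sum(scores.get(c, 0.0) * weights.get(c, 0.0) for c in components)
```

For instance, `total_score({"acc": 0.03, "robust": 1}, {"acc": 0.8, "robust": 0.2})` combines only the accuracy and stability items.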

In any of the above technical solutions, preferably, the accuracy score formula is:

$$
\mathrm{SCORE}_{acc} =
\begin{cases}
0, & acc < acc_{threshold} \\
acc - acc_{threshold}, & acc \ge acc_{threshold}
\end{cases}
$$

wherein $acc$ is the accuracy of the data model, namely the ratio of the number of correct results output by the data model to the number of sample data, and $acc_{threshold}$ is the preset accuracy threshold.

In this technical scheme, when the accuracy of the data model is below the preset accuracy threshold, the data model cannot meet the production requirement and the accuracy score is zero; when the accuracy of the data model is greater than or equal to the accuracy threshold, the accuracy score is the difference between the accuracy of the data model and the accuracy threshold, so the higher the accuracy of the data model, the higher the accuracy score.
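The piecewise rule just described can be written as a short function. The name `accuracy_score_component` and its parameter names are illustrative choices, not identifiers from the text.

```python
def accuracy_score_component(n_correct: int, n_samples: int, acc_threshold: float) -> float:
    """Score 0 below the threshold, otherwise the margin above the threshold."""
    # Accuracy is the ratio of correct outputs to the number of sample data.
    acc = n_correct / n_samples
    return acc - acc_threshold if acc >= acc_threshold else 0.0
```

Note that under this rule a model exactly at the threshold also scores zero, so the score only separates models that exceed the threshold.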

In any of the above technical solutions, preferably, the performance index scoring formula is:

$$
\mathrm{SCORE}_{perf} = T_{min} - T_{i}
$$

wherein $\mathrm{SCORE}_{perf}$ is the performance index score, $T_{min}$ is the minimum time spent training the data model, and $T_{i}$ is the time actually spent training the data model.

In this technical scheme, the performance index score reflects the time cost of obtaining an output result for the same sample data. The time spent in each iteration of the data model training process is recorded, and the shortest of these is taken as the minimum time spent training the data model. The performance index score is the difference between this minimum time and the time actually spent training the data model, so the less time actually spent, the higher the performance index score.
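As a sketch of this calculation, assuming the recorded times are available as a simple list in seconds (the function name and input format are assumptions):

```python
from typing import List


def performance_scores(training_times: List[float]) -> List[float]:
    """Return T_min - T_i for every recorded training time T_i."""
    t_min = min(training_times)
    return [t_min - t_i for t_i in training_times]
```

Read literally, the formula never yields a positive value: the fastest run scores 0 and every slower run scores below zero, so the performance item acts as a penalty in the weighted total.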

In any of the above technical solutions, preferably, if an abnormal condition occurs during the training process of the data model and the difference between the output result of the data model under the abnormal condition and its output result under the non-abnormal condition is within a preset range, the stability index score $\mathrm{SCORE}_{robust}$ is 1; otherwise, the stability index score $\mathrm{SCORE}_{robust}$ is 0.

In this technical scheme, if an abnormal condition (such as field value control or insufficient computing resources) occurs during the training process of the data model, and the output result of the data model under the abnormal condition differs little from the output result under the non-abnormal condition, the data model is relatively stable and the stability index score is 1; otherwise, the stability index score is 0. If no abnormal condition occurs, the user can set the weight of the stability index score to zero when calculating the total score, according to the actual situation.
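One way to picture the stability check is sketched below. The text does not define how the "difference" between outputs is measured; this sketch assumes numeric outputs of equal length and uses the largest absolute element-wise difference, with `tolerance` standing in for the preset range.

```python
from typing import Sequence


def stability_score(normal_output: Sequence[float],
                    abnormal_output: Sequence[float],
                    tolerance: float) -> int:
    """Return 1 if the outputs under abnormal conditions stay within tolerance, else 0."""
    # Assumed difference measure: maximum absolute element-wise deviation.
    max_diff = max(abs(a - b) for a, b in zip(normal_output, abnormal_output))
    return 1 if max_diff <= tolerance else 0
```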

In any of the above technical solutions, preferably, when the modeling algorithm is a custom algorithm, the custom index score $\mathrm{SCORE}_{cust}$ is the score given by a business expert for the effect of the data model; when the modeling algorithm is not a custom algorithm, the custom index score $\mathrm{SCORE}_{cust}$ is 0.

In the technical scheme, when a user-defined algorithm is selected for modeling, a user-defined index score needs to be set in the total score, wherein the score is the score of the data model effect given by a service expert.

In any of the above technical solutions, preferably, when the scoring result satisfies the issuable index, the data model with the highest total score is determined as the final data model.

In the technical scheme, when the scoring result meets the issuable index, the data model with the highest total score is selected as the final data model, so that the automatic screening of the model is realized, and the method is applied to the actual production environment.

In a second aspect of the present invention, an apparatus for training a data model is provided, including: an acquisition unit, used for obtaining a modeling problem type and sample data and identifying the type of the sample data; a first determining unit, used for determining sample parameters and issuable indexes according to the modeling problem type and the sample data; a second determining unit, used for determining a modeling algorithm according to the modeling problem type, the sample parameters and a preset model selection strategy; a modeling unit, used for training a data model according to the modeling algorithm and inputting the sample data into the data model to obtain an output result; a scoring unit, used for scoring the output result to obtain a scoring result; a judging unit, used for judging whether the scoring result meets the issuable indexes; and an optimizing unit, used for optimizing the preset model selection strategy when the scoring result does not meet the issuable indexes, and returning to the determination of the modeling algorithm according to the modeling problem type, the sample parameters and the preset model selection strategy.

The invention provides a training device of a data model. The device first identifies the type of the acquired sample data (for example, whether the sample data is numeric or character type, continuous or discrete). According to the sample data type and the acquired modeling problem type (such as a classification, clustering or association problem), it determines sample parameters (such as classification indexes for classification problems or mean values for clustering problems) and issuable indexes (such as an accuracy above 95%). It then selects one or more modeling algorithms from a modeling algorithm cluster according to the modeling problem type, the sample parameters and a preset model selection strategy, and trains the data model. Finally, it scores the data model using the sample data and judges whether the scoring result meets the issuable indexes; if not, it optimizes the preset model selection strategy and returns to re-determine the modeling algorithm. In this way, the corresponding mining algorithm is selected automatically through the preset model selection strategy to model the sample data, and the strategy itself is optimized automatically by evaluating the data model, without manual intervention. This improves the objectivity of the model, reduces subjective omissions and errors by modeling personnel, allows a deployable model that meets production requirements to be selected, lowers the threshold for business modeling personnel to apply mining algorithms, and improves modeling accuracy and efficiency.

The training device for the data model according to the present invention may further have the following technical features:

In the foregoing technical solution, preferably, the second determining unit specifically includes: a third determining unit, used for determining the range of modeling algorithm types according to the modeling problem type; and a selection unit, used for determining the modeling algorithm within that range according to the sample parameters and the preset model selection strategy.

In this technical scheme, the range of modeling algorithm types is determined according to the modeling problem type. For example, if the modeling problem is a classification problem, algorithms corresponding to classification problems, such as decision trees, logistic regression and fuzzy rules, can be selected from the modeling algorithm cluster. Because the sample parameters reflect the characteristics of the sample data, the one or more algorithms used for modeling are then selected within that range according to the sample parameters and the preset model selection strategy, which makes the modeling more accurate and reliable and improves modeling efficiency.

In any of the above technical solutions, preferably, the scoring result includes an accuracy score and at least one of, or a combination of, the following: a performance index score, a stability index score, and a custom index score.

In the technical scheme, the scoring of the data model comprises correct rate scoring, performance index scoring, stability index scoring and user-defined index scoring, a user can select the scoring according to actual needs, and the scoring in all aspects is comprehensively considered, so that the reliability of the data model is ensured.

In any of the above technical solutions, preferably, the calculation formula of the scoring result is:

$$
\mathrm{SCORE}_{total} = \mathrm{SCORE}_{acc} \times W_{acc} + \mathrm{SCORE}_{perf} \times W_{perf} + \mathrm{SCORE}_{robust} \times W_{robust} + \mathrm{SCORE}_{cust} \times W_{cust}
$$

wherein $\mathrm{SCORE}_{total}$ is the total score, $\mathrm{SCORE}_{acc}$ is the accuracy score, $W_{acc}$ is the preset accuracy score weight, $\mathrm{SCORE}_{perf}$ is the performance index score, $W_{perf}$ is the preset performance index score weight, $\mathrm{SCORE}_{robust}$ is the stability index score, $W_{robust}$ is the preset stability index score weight, $\mathrm{SCORE}_{cust}$ is the custom index score, and $W_{cust}$ is the preset custom index score weight, so that the reliability of the data model is ensured.

In the technical scheme, the scoring result of the data model is a weighted summation result of the correct rate scoring, the performance index scoring, the stability index scoring and the user-defined index scoring, and a user can select one or more items according to actual needs to score the data model and correspondingly adjust the weight, wherein generally, the weight of the correct rate scoring is the highest.

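In any of the above technical solutions, preferably, the accuracy score formula is:

$$
\mathrm{SCORE}_{acc} =
\begin{cases}
0, & acc < acc_{threshold} \\
acc - acc_{threshold}, & acc \ge acc_{threshold}
\end{cases}
$$

wherein $acc$ is the accuracy of the data model, namely the ratio of the number of correct results output by the data model to the number of sample data, and $acc_{threshold}$ is the preset accuracy threshold.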

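In any of the above technical solutions, preferably, the performance index scoring formula is:

$$
\mathrm{SCORE}_{perf} = T_{min} - T_{i}
$$

wherein $T_{min}$ is the minimum time spent training the data model, and $T_{i}$ is the time actually spent training the data model.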

In this technical scheme, the performance index score reflects the time cost of obtaining an output result for the same sample data. The time spent in each iteration of the data model training process is recorded, and the shortest of these is taken as the minimum time spent training the data model. The performance index score is the difference between this minimum time and the time actually spent training the data model, so the less time actually spent, the higher the performance index score.

In any of the above technical solutions, preferably, when the modeling algorithm is a custom algorithm, the custom index score $\mathrm{SCORE}_{cust}$ is the score given by a business expert for the effect of the data model; when the modeling algorithm is not a custom algorithm, the custom index score $\mathrm{SCORE}_{cust}$ is 0.

In the technical scheme, when a user-defined algorithm is selected for modeling, a user-defined index score needs to be set in the total score, and the score is the score of the data model effect given by a service expert.

In any of the above technical solutions, preferably, the optimization unit is further configured to determine the data model with the highest total score as the final data model when the scoring result satisfies the issuable index.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow diagram illustrating a classical data mining process in the background of the invention;

FIG. 2 is a flow chart illustrating a classical data modeling in the background of the invention;

FIG. 3 illustrates a flow diagram of a method of training a data model according to an embodiment of the invention;

FIG. 4 shows a flow diagram of a method of training a data model according to an embodiment of the invention;

FIG. 5 shows a schematic block diagram of a training apparatus for a data model of an embodiment of the present invention;

FIG. 6 shows a schematic block diagram of a training apparatus for a data model of an embodiment of the present invention;

FIG. 7 is a schematic flow chart diagram illustrating a mining modeling method in accordance with an exemplary embodiment of the present invention;

FIG. 8 illustrates a model diagram of an auto-training evaluation mechanism in accordance with a specific embodiment of the present invention;

FIG. 9 is a schematic diagram illustrating the effect of the mining modeling method applied to the data analysis platform according to the embodiment of the present invention.

Detailed Description

In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited to the specific embodiments disclosed below.

Embodiments of the first aspect of the present invention provide a method for training a data model, and fig. 3 shows a flowchart of the method for training a data model according to an embodiment of the present invention. The training method of the data model shown in fig. 3 includes:

step 102, obtaining a modeling problem type and sample data, and identifying the type of the sample data;

step 104, determining sample parameters and issuable indexes according to the modeling problem type and sample data;

step 106, determining a modeling algorithm according to the modeling problem type, the sample parameters and a preset model selection strategy;

step 108, training a data model according to the modeling algorithm, and inputting sample data into the data model to obtain an output result;

step 110, scoring the output result to obtain a scoring result;

step 112, judging whether the scoring result meets the issuable index;

and step 114, when the scoring result does not meet the issuable index, optimizing the preset model selection strategy, and returning to step 106.

The invention provides a training method of a data model. The method first identifies the type of the obtained sample data (for example, whether the sample data is numeric or character type, continuous or discrete). According to the sample data type and the obtained modeling problem type (such as a classification, clustering or association problem), it determines sample parameters (such as classification indexes for classification problems or mean values for clustering problems) and issuable indexes (such as an accuracy above 95%), and checks the sample parameters for required items, value ranges and the like. It then selects one or more modeling algorithms from a modeling algorithm cluster according to the modeling problem type, the sample parameters and a preset model selection strategy, and trains the data model. Finally, it scores the data model using the sample data and judges whether the scoring result meets the issuable indexes; if not, it optimizes the preset model selection strategy and returns to re-determine the modeling algorithm. In this way, the corresponding mining algorithm is selected automatically through the preset model selection strategy to model the sample data, and the strategy itself is optimized automatically by evaluating the data model, without manual intervention. This improves the objectivity of the model, reduces subjective omissions and errors by modeling personnel, allows a deployable model that meets production requirements to be selected, lowers the threshold for business modeling personnel to apply mining algorithms, and improves modeling accuracy and efficiency.
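The loop formed by steps 106 to 114 can be sketched as below. Everything concrete in the sketch is an assumption made so that it runs end to end: the callables, the dictionary form of the strategy, the toy scores, and the `max_rounds` bound (the text does not limit the number of iterations). It is not the patented implementation.

```python
from typing import Callable, List, Optional, Tuple


def train_until_publishable(
    choose_algorithm: Callable[[dict], str],
    train_and_score: Callable[[str], float],
    optimize_strategy: Callable[[dict, str, float], dict],
    strategy: dict,
    publishable_score: float,
    max_rounds: int = 10,
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    history = []
    for _ in range(max_rounds):
        algorithm = choose_algorithm(strategy)       # step 106
        score = train_and_score(algorithm)           # steps 108 and 110
        history.append((algorithm, score))
        if score >= publishable_score:               # step 112
            return algorithm, history
        # Step 114: adjust the strategy, then loop back to step 106.
        strategy = optimize_strategy(strategy, algorithm, score)
    return None, history


# Toy stand-ins: fixed scores instead of real training.
toy_scores = {"decision_tree": 0.90, "logistic_regression": 0.97}


def choose(strategy: dict) -> str:
    return strategy["ranking"][0]


def demote(strategy: dict, algorithm: str, score: float) -> dict:
    # "Optimizing" here just moves the failed algorithm to the back of the ranking.
    ranking = [a for a in strategy["ranking"] if a != algorithm] + [algorithm]
    return {"ranking": ranking}


best, history = train_until_publishable(
    choose, toy_scores.__getitem__, demote,
    {"ranking": ["decision_tree", "logistic_regression"]},
    publishable_score=0.95,
)
```

In this toy run the first candidate misses the publishable score, the strategy is adjusted, and the second candidate is accepted.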

FIG. 4 shows a flow diagram of a method for training a data model according to an embodiment of the invention. The training method of the data model shown in fig. 4 includes:

step 202, obtaining a modeling problem type and sample data, and identifying the sample data type;

step 204, determining sample parameters and issuable indexes according to the modeling problem type and sample data;

step 206, determining the range of the modeling algorithm type according to the modeling problem type;

step 208, determining a modeling algorithm within the range of modeling algorithm types according to the sample parameters and a preset model selection strategy;

step 210, training a data model according to a modeling algorithm, and inputting sample data into the data model to obtain an output result;

step 212, scoring the output result to obtain a scoring result;

step 214, judging whether the scoring result meets the issuable index;

step 216, when the scoring result does not meet the issuable index, optimizing a preset model selection strategy, and returning to step 208;

and step 218, when the grading result meets the issuable index, selecting the data model with the highest total grade as the final data model, so as to realize automatic screening of the model, and applying the model to the actual production environment.

In this embodiment, in step 206 and step 208, the range of modeling algorithm types is determined according to the modeling problem type. For example, if the modeling problem is a classification problem, algorithms corresponding to classification problems, such as decision trees, logistic regression and fuzzy rules, can be selected from the modeling algorithm cluster. Because the sample parameters reflect the characteristics of the sample data, the one or more algorithms finally used for modeling are selected within that range according to the sample parameters and the preset model selection strategy, which makes the modeling more accurate and reliable and improves modeling efficiency.

In step 218, when the scoring result meets the issuable index, the data model with the highest total score is selected as the final data model, which realizes automatic screening of the model, and the model is applied to the actual production environment. For example, if the issuable index is an accuracy above 98%, the model with the highest total score is selected for deployment from all the data models whose accuracy exceeds 98%.
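The screening in step 218 amounts to a filter followed by a maximum. In the sketch below, the candidate records and their field names (`name`, `accuracy`, `total_score`) are hypothetical.

```python
from typing import List, Optional


def select_final_model(candidates: List[dict], min_accuracy: float) -> Optional[dict]:
    """Pick the highest-scoring model among those that meet the issuable index."""
    publishable = [c for c in candidates if c["accuracy"] > min_accuracy]
    if not publishable:
        return None  # nothing meets the issuable index yet
    return max(publishable, key=lambda c: c["total_score"])


# Hypothetical candidates, for illustration only.
models = [
    {"name": "decision_tree", "accuracy": 0.985, "total_score": 0.21},
    {"name": "logistic_regression", "accuracy": 0.990, "total_score": 0.18},
    {"name": "fuzzy_rules", "accuracy": 0.970, "total_score": 0.30},
]
final = select_final_model(models, min_accuracy=0.98)
```

Here `fuzzy_rules` has the largest total score but is excluded because its accuracy does not exceed 98%, so the decision tree is chosen.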

In one embodiment of the present invention, preferably, the scoring result includes an accuracy score and at least one of, or a combination of, the following: a performance index score, a stability index score, and a custom index score.

In the embodiment, the scoring of the data model comprises the accuracy scoring, the performance index scoring, the stability index scoring and the user-defined index scoring, a user can select according to actual needs, and the scoring of all aspects of comprehensive consideration also ensures the reliability of the data model.

In one embodiment of the present invention, preferably, the calculation formula of the scoring result is:

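$$
\mathrm{SCORE}_{total} = \mathrm{SCORE}_{acc} \times W_{acc} + \mathrm{SCORE}_{perf} \times W_{perf} + \mathrm{SCORE}_{robust} \times W_{robust} + \mathrm{SCORE}_{cust} \times W_{cust}
$$

wherein $\mathrm{SCORE}_{total}$ is the total score, $\mathrm{SCORE}_{acc}$ is the accuracy score, $W_{acc}$ is the preset accuracy score weight, $\mathrm{SCORE}_{perf}$ is the performance index score, $W_{perf}$ is the preset performance index score weight, $\mathrm{SCORE}_{robust}$ is the stability index score, $W_{robust}$ is the preset stability index score weight, $\mathrm{SCORE}_{cust}$ is the custom index score, and $W_{cust}$ is the preset custom index score weight.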

In one embodiment of the present invention, preferably, the accuracy score formula is:

$$
\mathrm{SCORE}_{acc} =
\begin{cases}
0, & acc < acc_{threshold} \\
acc - acc_{threshold}, & acc \ge acc_{threshold}
\end{cases}
$$

wherein $acc$ is the accuracy of the data model, namely the ratio of the number of correct results output by the data model to the number of sample data, and $acc_{threshold}$ is the preset accuracy threshold.

In one embodiment of the present invention, preferably, the performance index scoring formula is:

$$
\mathrm{SCORE}_{perf} = T_{min} - T_{i}
$$

In an embodiment of the present invention, preferably, if an abnormal condition occurs during the training process of the data model and the difference between the output result of the data model under the abnormal condition and its output result under the non-abnormal condition is within a preset range, the stability index score $\mathrm{SCORE}_{robust}$ is 1; otherwise, the stability index score $\mathrm{SCORE}_{robust}$ is 0.

In one embodiment of the present invention, preferably, when the modeling algorithm is a custom algorithm, the custom index score $\mathrm{SCORE}_{cust}$ is the score given by a business expert for the effect of the data model; when the modeling algorithm is not a custom algorithm, the custom index score $\mathrm{SCORE}_{cust}$ is 0.

In this embodiment, the scoring result of the data model is the weighted sum of the accuracy score, the performance index score, the stability index score and the custom index score. A user can select one or more of these according to actual needs to score the data model and adjust the weights accordingly.

For the accuracy score, when the accuracy of the data model is smaller than the preset accuracy threshold, the data model cannot meet the production requirement and the accuracy score is zero; when the accuracy of the data model is greater than or equal to the accuracy threshold, the accuracy score is the difference between the accuracy of the data model and the accuracy threshold, so that the higher the accuracy of the data model, the higher the accuracy score.
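As an illustrative sketch of the accuracy score described above, the following Python functions compute the accuracy as the ratio of correct outputs to the number of samples and then apply the threshold rule. The function names and the parameter name `acc_threshold` (standing for the preset accuracy threshold) are assumptions made for this example.

```python
from typing import Sequence


def accuracy(predictions: Sequence, labels: Sequence) -> float:
    """Ratio of correct outputs to the number of samples."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_score(acc: float, acc_threshold: float) -> float:
    """Zero below the threshold, otherwise the margin above the threshold."""
    return acc - acc_threshold if acc >= acc_threshold else 0.0


acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])   # 3 of 4 outputs are correct
print(acc, accuracy_score(acc, 0.95))
```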

In addition, the performance index score reflects the time cost of obtaining an output result for the same sample data. The time spent in each iteration of the data model training process is recorded, and the smallest recorded time is taken as the minimum time spent training the data model. The performance index score is the difference between this minimum time and the actual time spent training the data model; the score is therefore at most zero, and the less time actually spent, the higher the score.
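A minimal sketch of the performance index score, assuming that the training time of each iteration has been recorded in seconds in a list (the function name and the sample timings are hypothetical):

```python
from typing import List


def performance_scores(iteration_times: List[float]) -> List[float]:
    """Return T_min - T_i for every recorded iteration time T_i."""
    if not iteration_times:
        raise ValueError("at least one recorded time is required")
    t_min = min(iteration_times)
    return [t_min - t_i for t_i in iteration_times]


# The fastest iteration scores 0; slower iterations score below 0.
print(performance_scores([12.0, 8.0, 10.0]))
```

Because the minimum is taken over the same recorded times, no score in this sketch exceeds zero, so the weight applied to it in the total score determines how strongly slow training is penalized.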

In addition, if an abnormal condition (such as field value control, insufficient computing resources and the like) occurs in the training process of the data model, and the output result of the data model under the abnormal condition does not differ greatly from the output result under the non-abnormal condition, the data model is relatively stable and the stability index score is 1; otherwise, the stability index score is 0. If no abnormal condition occurs, the user can set the weight of the stability index score to zero when calculating the total score, according to the actual situation.
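The stability index score can be sketched as follows. The original text does not specify how the difference between the two output results is measured; this example assumes, purely for illustration, that the difference is the fraction of samples whose outputs disagree, and that `max_diff_ratio` is the preset range.

```python
from typing import Sequence


def stability_score(normal_outputs: Sequence, abnormal_outputs: Sequence,
                    max_diff_ratio: float) -> int:
    """Return 1 if the outputs under the abnormal condition stay close to the
    outputs under the non-abnormal condition, otherwise 0.

    Assumption: closeness is the fraction of differing outputs."""
    if not normal_outputs or len(normal_outputs) != len(abnormal_outputs):
        raise ValueError("output sequences must be non-empty and equal in length")
    differing = sum(a != b for a, b in zip(normal_outputs, abnormal_outputs))
    return 1 if differing / len(normal_outputs) <= max_diff_ratio else 0


normal = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
slightly_off = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # one of ten outputs differs
print(stability_score(normal, slightly_off, max_diff_ratio=0.2))
```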

In addition, when the modeling is performed by using a custom algorithm, a custom index score, which is a score of the effect of the data model given by a business expert, needs to be set in the total score.

It should be noted that, in scoring the data model, the accuracy score is generally the most important consideration and carries the largest weight, while the performance index score, the stability index score and the custom index score are optional. The user can select one or more of them to evaluate the data model together with the accuracy score according to actual needs. For example, if a decision tree algorithm is used for modeling and an abnormality occurs during the training of the model, the user can select the accuracy score and the stability index score to evaluate the data model, and the corresponding weights need to be adjusted at the same time.
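One possible way to adjust the weights after selecting the optional scores is sketched below. The original text only states that the weights need to be adjusted; rescaling the remaining weights so that they sum to 1 is an assumption of this example, as are the names `select_weights` and `OPTIONAL_SCORES`.

```python
from typing import Dict, Iterable

OPTIONAL_SCORES = ("perf", "robust", "cust")


def select_weights(base_weights: Dict[str, float],
                   selected: Iterable[str]) -> Dict[str, float]:
    """Keep the accuracy weight and the selected optional weights, set the
    rest to zero, and rescale so that the weights sum to 1 (assumption)."""
    selected = set(selected)
    unknown = selected - set(OPTIONAL_SCORES)
    if unknown:
        raise ValueError(f"unknown optional scores: {sorted(unknown)}")
    kept = {"acc"} | selected
    raw = {name: (base_weights[name] if name in kept else 0.0)
           for name in ("acc",) + OPTIONAL_SCORES}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("the selected weights must sum to a positive value")
    return {name: value / total for name, value in raw.items()}


# Decision tree example from the text: accuracy plus stability only.
base = {"acc": 0.6, "perf": 0.2, "robust": 0.1, "cust": 0.1}
print(select_weights(base, ["robust"]))
```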

In a second aspect of the present invention, a training apparatus for a data model is provided, and fig. 5 shows a schematic block diagram of the training apparatus for a data model according to an embodiment of the present invention. The training apparatus 300 of the data model shown in fig. 5 includes:

an obtaining unit 302, configured to obtain a modeling problem type and sample data, and identify the sample data type;

a first determining unit 304, configured to determine a sample parameter and a publishable index according to the modeling problem type and the sample data;

a second determining unit 306, configured to determine a modeling algorithm according to the modeling problem type, the sample parameter, and a preset model selection strategy;

a modeling unit 308, configured to train the data model according to the modeling algorithm and input the sample data into the data model to obtain an output result;

a scoring unit 310, configured to score the output result to obtain a scoring result;

a judging unit 312, configured to judge whether the scoring result meets the publishable index;

and an optimizing unit 314, configured to optimize the preset model selection strategy when the scoring result does not meet the publishable index, and return to the step of determining the modeling algorithm according to the modeling problem type, the sample parameter, and the preset model selection strategy.

The invention provides a training device for a data model. The device first identifies the type of the acquired sample data (for example, whether the sample data is numeric or character type, continuous or discrete). According to the sample data type and the acquired modeling problem type (such as a classification problem, a clustering problem or an association problem), it determines the sample parameters (such as the classification indexes of a classification problem or the mean values of a clustering problem) and the publishable index (such as an accuracy greater than 95%), and checks the sample parameters for necessary items, value ranges and the like. It then selects one or more modeling algorithms from the modeling algorithm cluster according to the modeling problem type, the sample parameters and the preset model selection strategy, trains the data model, scores the data model using the sample data, and judges whether the scoring result meets the publishable index. If the scoring result does not meet the publishable index, the preset model selection strategy is optimized and the modeling algorithm is determined again. In this way, the corresponding mining algorithm is automatically selected through the preset model selection strategy to model the sample data, and the preset model selection strategy is automatically optimized by evaluating the data model, without manual intervention. This improves the objectivity of the model, reduces subjective omissions and errors by modeling personnel, allows a deployable model that meets the production environment to be selected, lowers the threshold for business modeling personnel to apply mining algorithms, and improves modeling accuracy and efficiency.
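The select, train, score, judge and optimize cycle described above can be sketched in Python as follows. This is a toy illustration under stated assumptions: the two "algorithms", the priority-based selection strategy, and the rule of demoting an algorithm whose model fails the publishable index are all hypothetical stand-ins for the modeling algorithm cluster and the preset model selection strategy, and accuracy alone is used as the scoring result.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[float, int]          # (feature value, class label 0 or 1)
Model = Callable[[float], int]


def train_majority(samples: List[Sample]) -> Model:
    """Toy algorithm: always predict the most frequent label."""
    ones = sum(label for _, label in samples)
    majority = 1 if 2 * ones >= len(samples) else 0
    return lambda x: majority


def train_stump(samples: List[Sample]) -> Model:
    """Toy algorithm: pick the cut point with the best training accuracy."""
    best_cut, best_correct = samples[0][0], -1
    for cut, _ in samples:
        correct = sum((1 if x >= cut else 0) == y for x, y in samples)
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return lambda x: 1 if x >= best_cut else 0


# Hypothetical algorithm cluster, keyed by modeling problem type.
ALGORITHM_CLUSTER: Dict[str, Dict[str, Callable[[List[Sample]], Model]]] = {
    "classification": {"majority": train_majority, "stump": train_stump},
}


def auto_train(problem_type: str, samples: List[Sample],
               strategy: Dict[str, int], publishable_accuracy: float,
               max_rounds: int = 5) -> Optional[Tuple[str, float]]:
    """Repeat: select by strategy, train, score, judge, optimize the strategy."""
    candidates = ALGORITHM_CLUSTER[problem_type]
    priority = {name: strategy.get(name, 0) for name in candidates}
    for _ in range(max_rounds):
        name = max(candidates, key=lambda n: priority[n])
        model = candidates[name](samples)
        acc = sum(model(x) == y for x, y in samples) / len(samples)
        if acc > publishable_accuracy:          # publishable index is met
            return name, acc
        # "Optimize" the strategy: demote the algorithm that just failed.
        priority[name] = min(priority.values()) - 1
    return None


samples = [(0.1, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
result = auto_train("classification", samples, {"majority": 2, "stump": 1}, 0.95)
print(result)
```

In this sketch the loop is bounded by `max_rounds`, which is an added safeguard; the original text does not state a termination condition for the case in which no algorithm meets the publishable index.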

FIG. 6 shows a schematic block diagram of a training apparatus for a data model according to an embodiment of the present invention. The training apparatus of the data model shown in fig. 6 includes:

an obtaining unit 402, configured to obtain a modeling problem type and sample data, and identify the sample data type;

a first determining unit 404, configured to determine a sample parameter and a publishable index according to the modeling problem type and the sample data;

a second determining unit 406, configured to determine a modeling algorithm according to the modeling problem type, the sample parameter, and a preset model selection strategy;

a modeling unit 408, configured to train a data model according to the modeling algorithm and input the sample data into the data model to obtain an output result;

a scoring unit 410, configured to score the output result to obtain a scoring result;

a judging unit 412, configured to judge whether the scoring result meets the publishable index;

an optimizing unit 414, configured to optimize the preset model selection strategy when the scoring result does not meet the publishable index, and return to the step of determining a modeling algorithm according to the modeling problem type, the sample parameter, and the preset model selection strategy;

the second determining unit 406 specifically includes:

a third determining unit 462, configured to determine a range of a modeling algorithm type according to the modeling problem type;

a selecting unit 464, configured to determine a modeling algorithm within the range of modeling algorithm types according to the sample parameters and the preset model selection strategy;

and the optimizing unit 414 is further configured to determine the data model with the highest total score as the final data model when the scoring result meets the publishable index.

When the scoring result meets the publishable index, the data model with the highest total score is selected as the final data model, so that the model is screened automatically and can be applied to the actual production environment.
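The final selection step can be sketched as follows, assuming for illustration that the publishable index is a minimum accuracy and that every candidate data model carries both its accuracy and its total score. The `Candidate` type, the function name and the sample values are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    accuracy: float
    total_score: float


def select_final_model(candidates: List[Candidate],
                       min_accuracy: float) -> Optional[Candidate]:
    """Among candidates meeting the publishable index, pick the one with the
    highest total score; return None if no candidate qualifies."""
    publishable = [c for c in candidates if c.accuracy > min_accuracy]
    if not publishable:
        return None
    return max(publishable, key=lambda c: c.total_score)


candidates = [
    Candidate("tree", 0.96, 0.20),
    Candidate("bayes", 0.97, 0.35),
    Candidate("knn", 0.90, 0.50),   # highest total score but not publishable
]
print(select_final_model(candidates, min_accuracy=0.95))
```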

In one embodiment of the present invention, preferably, the scoring result includes: an accuracy score and at least one or a combination of the following: a performance index score, a stability index score, and a custom index score.

In this embodiment, the scoring of the data model comprises the accuracy score, the performance index score, the stability index score and the custom index score, from which a user can select according to actual needs; considering these aspects together also ensures the reliability of the data model.

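In one embodiment of the present invention, preferably, the calculation formula of the scoring result is:

$$SCORE_{total} = W_{acc} \cdot SCORE_{acc} + W_{perf} \cdot SCORE_{perf} + W_{robust} \cdot SCORE_{robust} + W_{cust} \cdot SCORE_{cust}$$

wherein $SCORE_{total}$ is the total score, $SCORE_{acc}$ is the accuracy score, $W_{acc}$ is the preset accuracy score weight, $SCORE_{perf}$ is the performance index score, $W_{perf}$ is the preset performance index score weight, $SCORE_{robust}$ is the stability index score, $W_{robust}$ is the preset stability index score weight, $SCORE_{cust}$ is the custom index score, and $W_{cust}$ is the preset custom index score weight.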

The performance index scoring formula is:

$$SCORE_{perf} = T_{min} - T_i$$

wherein $T_{min}$ is the minimum time spent training the data model, and $T_i$ is the actual time spent training the data model.

In an embodiment of the present invention, preferably, if an abnormal condition occurs during the training process of the data model and the difference between the output result of the data model under the abnormal condition and its output result under a non-abnormal condition is within a preset range, the stability index score $SCORE_{robust}$ is 1; otherwise, the stability index score $SCORE_{robust}$ is 0.

In one embodiment of the present invention, preferably, when the modeling algorithm is a custom algorithm, the custom index score $SCORE_{cust}$ is a score given by a business expert to measure the effect of the data model; when the modeling algorithm is not a custom algorithm, the custom index score $SCORE_{cust}$ is 0.

In this embodiment, the scoring result of the data model is the weighted sum of the accuracy score, the performance index score, the stability index score and the custom index score. A user can select one or more of these scores according to actual needs to score the data model, and adjust the weights accordingly.

The specific embodiment is as follows:

FIG. 7 is a flowchart illustrating a mining modeling method according to an embodiment of the present invention. Comparing fig. 7 with fig. 2, it can be seen that the method of fig. 7 automatically performs modeling and evaluation on the input data, determines an optimal model and deploys it automatically. The connection is seamless and requires no manual intervention, so an available model does not need to be deployed separately in the production flow, and self-optimizing updates are completed automatically.

FIG. 8 illustrates a model diagram of an auto-training evaluation mechanism, in accordance with an embodiment of the present invention.

From the view of the whole business data flow, the automatic training evaluation device first performs automatic modeling on the sampled data of the received business (label 1), and then evaluates availability and performs self-iterative updating (label 2 and label 4) according to the model prediction results of the production environment. In addition, the business problem definition module is mainly used for determining the type of the analysis modeling problem, such as classification, clustering or association analysis; here, business personnel are required to make a basic judgment and specify it as input.

The other main parts of the technical principle of the training evaluation device are as follows:

Mining model parameter definition:

The main functions are as follows: 1. identifying the data type of the sample, such as whether column data is numeric or character type, continuous or discrete, and its missing values and data distribution; 2. determining sample parameters, such as classification indexes, the K value, and convergence rules/penalty functions; 3. checking parameters, including necessary items, value ranges and the like; 4. defining the publishable index of the mining model and its threshold (e.g., accuracy >95%).
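The first function, identifying the data type of a sample column, can be sketched as follows. The rule used here to separate continuous from discrete columns (a numeric column with more than `discrete_max_unique` distinct values is treated as continuous) is an assumption of this example and is not specified in the original text.

```python
from typing import Any, Dict, Sequence


def _is_number(value: Any) -> bool:
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def profile_column(values: Sequence[Any], discrete_max_unique: int = 10) -> Dict[str, Any]:
    """Report whether a column is numeric or character, continuous or
    discrete, and how many values are missing (None or empty string)."""
    present = [v for v in values if v is not None and v != ""]
    numeric = bool(present) and all(_is_number(v) for v in present)
    unique = len(set(present))
    continuous = numeric and unique > discrete_max_unique
    return {
        "kind": "numeric" if numeric else "character",
        "scale": "continuous" if continuous else "discrete",
        "missing": len(values) - len(present),
        "unique": unique,
    }


print(profile_column(["1.5", "2.0", "", "3.25"], discrete_max_unique=2))
print(profile_column(["a", "b", "a", None]))
```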

Modeling algorithm descriptor:

The descriptor mainly maintains the basic description information of each algorithm in the algorithm cluster, from which the basic information of an algorithm can be obtained according to the algorithm type (classification, clustering and the like), algorithm parameters, data type and the like. It supports extension with custom algorithms by providing an XML format in which information such as algorithm parameters and types is described and registered.
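Registration of a custom algorithm through an XML description might look like the following sketch. The original text does not disclose the XML schema; the element and attribute names used here (`algorithm`, `dataType`, `param`, `min`, `max`, `required`) are hypothetical, and parsing relies only on the Python standard library module `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict

# Hypothetical descriptor for a custom clustering algorithm.
DESCRIPTOR_XML = """
<algorithm name="kmeans" type="clustering">
  <dataType>numeric</dataType>
  <param name="k" type="int" min="2" max="50" required="true"/>
</algorithm>
"""


def register_algorithm(registry: Dict[str, Dict[str, Any]], xml_text: str) -> Dict[str, Any]:
    """Parse one algorithm description and store it in the registry."""
    root = ET.fromstring(xml_text.strip())
    entry = {
        "type": root.get("type"),
        "data_types": [e.text for e in root.findall("dataType")],
        "params": [dict(p.attrib) for p in root.findall("param")],
    }
    registry[root.get("name")] = entry
    return entry


registry: Dict[str, Dict[str, Any]] = {}
register_algorithm(registry, DESCRIPTOR_XML)
print(registry["kmeans"])
```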

Mining algorithm selector:

The selector mainly comprises two parts: 1. defining the applicable algorithm range within the algorithm cluster according to the modeling algorithm descriptions, in combination with the modeling problem definition and the sample parameter definition; 2. optimizing the model selection strategy according to the algorithm selection history and the scoring results, and determining the training parameters of the model, such as the selection strategy for the K value in K-MEANS.
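The two parts of the selector can be sketched as two small functions: one narrows the algorithm cluster to the applicable range, and the other orders that range using the selection history. Ranking by the mean of past total scores, and scoring untried algorithms as 0, are assumptions of this example; the algorithm names and history values are hypothetical.

```python
from typing import Dict, List, Tuple


def applicable_algorithms(descriptors: List[Dict], problem_type: str,
                          data_kind: str) -> List[str]:
    """Part 1: keep algorithms whose description matches the problem type
    and supports the sample data type."""
    return [d["name"] for d in descriptors
            if d["type"] == problem_type and data_kind in d["data_kinds"]]


def rank_by_history(names: List[str], history: List[Tuple[str, float]]) -> List[str]:
    """Part 2: order candidates by the mean of their past total scores."""
    def mean_score(name: str) -> float:
        scores = [s for n, s in history if n == name]
        return sum(scores) / len(scores) if scores else 0.0
    return sorted(names, key=mean_score, reverse=True)


descriptors = [
    {"name": "kmeans", "type": "clustering", "data_kinds": ["numeric"]},
    {"name": "dbscan", "type": "clustering", "data_kinds": ["numeric"]},
    {"name": "c45", "type": "classification", "data_kinds": ["numeric", "character"]},
]
history = [("kmeans", 0.2), ("dbscan", 0.6), ("kmeans", 0.4)]

candidates = applicable_algorithms(descriptors, "clustering", "numeric")
print(rank_by_history(candidates, history))
```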

Modeling algorithm cluster:

That is, the basic mining modeling algorithms and the algorithm packages customized by business modeling personnel. The cluster mainly comprises two parts: 1. the algorithm description, including algorithm classification, algorithm parameters, output data, evaluation parameters and the like; 2. the algorithm execution package, which supports multiple implementations, such as Java, Python, R and other runtime environments, with the runtime supporting the PFM/PMML model format.

Model scorer:

The model scorer is mainly used for evaluating the intermediate results output during the self-iteration of data model training and giving a score for each intermediate data model; the score serves as the basis for model optimization, so that an optimal data model is finally obtained. The scoring content comprises: the accuracy index, the performance index, the stability index and the custom index.

Analysis and prediction result evaluation:

The model deployed in the production environment is evaluated in real time, and when an evaluation index falls below a preset threshold (the threshold is specified when the sample parameters are defined), the self-optimization update mechanism of the model is triggered. Automatic screening and automatic updating of the model are thereby realized, and the self-optimization of the data modeling process is completed automatically.
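The trigger for the self-optimization update can be sketched as a monitor that tracks accuracy over recent production predictions. Using a sliding window of the most recent outcomes, and using accuracy as the evaluation index, are assumptions of this example; the class name `ProductionMonitor` is hypothetical.

```python
from collections import deque


class ProductionMonitor:
    """Track recent prediction outcomes and signal when the accuracy over
    the window falls below the preset threshold."""

    def __init__(self, threshold: float, window: int = 100) -> None:
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)   # oldest outcomes drop out

    def record(self, prediction, actual) -> bool:
        """Record one outcome; return True if an update should be triggered."""
        self.outcomes.append(prediction == actual)
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.threshold


monitor = ProductionMonitor(threshold=0.75, window=4)
for prediction, actual in [(1, 1), (1, 0), (1, 1), (0, 0)]:
    print(monitor.record(prediction, actual))
```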

Fig. 9 is a schematic diagram illustrating the effect of the mining modeling method applied to the data analysis platform according to the embodiment of the present invention. As shown in fig. 9, the sample data is first classified, and a data distribution map and an original data model are output; the data model is then updated iteratively and automatically; next, the preprocessed data is substituted into the updated data model to output predicted values; finally, the accuracy of the model is evaluated and visualized, which facilitates observation and adjustment by business personnel.

In the description herein, the description of the terms "one embodiment," "some embodiments," "specific embodiments," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for training a data model, comprising:

obtaining a modeling problem type and sample data, and identifying the type of the sample data;

determining sample parameters and a publishable index according to the modeling problem type and the sample data;

determining a modeling algorithm according to the modeling problem type, the sample parameters and a preset model selection strategy;

training a data model according to the modeling algorithm, and inputting the sample data into the data model to obtain an output result;

scoring the output result to obtain a scoring result;

judging whether the scoring result meets the publishable index;

and when the scoring result does not meet the publishable index, optimizing the preset model selection strategy, and returning to the step of determining a modeling algorithm according to the modeling problem type, the sample parameters and the preset model selection strategy.

2. The method for training a data model according to claim 1, wherein the determining a modeling algorithm according to the modeling problem type, the sample parameter, and a preset model selection strategy specifically comprises:

determining the range of the type of a modeling algorithm according to the type of the modeling problem;

and determining a modeling algorithm within the range of the type of the modeling algorithm according to the sample parameters and the preset model selection strategy.

3. The method of training a data model according to claim 1,

the scoring result comprises: an accuracy score and at least one or a combination of the following: a performance index score, a stability index score, and a custom index score.

4. The method of training a data model according to claim 3,

the calculation formula of the scoring result is as follows:

$$SCORE_{total} = W_{acc} \cdot SCORE_{acc} + W_{perf} \cdot SCORE_{perf} + W_{robust} \cdot SCORE_{robust} + W_{cust} \cdot SCORE_{cust}$$

wherein $SCORE_{total}$ is the total score, $SCORE_{acc}$ is the accuracy score, $W_{acc}$ is the preset accuracy score weight, $SCORE_{perf}$ is the performance index score, $W_{perf}$ is the preset performance index score weight, $SCORE_{robust}$ is the stability index score, $W_{robust}$ is the preset stability index score weight, $SCORE_{cust}$ is the custom index score, and $W_{cust}$ is the preset custom index score weight.

5. The method for training a data model according to claim 4, wherein the accuracy scoring formula is:

$$SCORE_{acc} = \begin{cases} 0, & acc < acc_{thredhold} \\ acc - acc_{thredhold}, & acc \ge acc_{thredhold} \end{cases}$$

wherein $acc$ is the accuracy of the data model, $acc_{thredhold}$ is the preset accuracy threshold, and the accuracy of the data model is the ratio of the number of correct results output by the data model to the number of sample data.

6. The method for training a data model according to claim 4, wherein the performance index scoring formula is:

$$SCORE_{perf} = T_{min} - T_i$$

7. The method of training a data model according to claim 4,

if an abnormal condition occurs in the training process of the data model and the difference between the output result of the data model under the abnormal condition and its output result under a non-abnormal condition is within a preset range, the stability index score $SCORE_{robust}$ is 1; otherwise, the stability index score $SCORE_{robust}$ is 0.

8. The method of training a data model according to claim 4,

when the modeling algorithm is a custom algorithm, the custom index score $SCORE_{cust}$ is a score given by a business expert to measure the effect of the data model;

when the modeling algorithm is not a custom algorithm, the custom index score $SCORE_{cust}$ is 0.

9. A method of training a data model according to any one of claims 1 to 8, further comprising:

and when the scoring result meets the publishable index, determining the data model with the highest total score as the final data model.

10. An apparatus for training a data model, comprising:

the acquisition unit is used for acquiring the type of the modeling problem and sample data and identifying the type of the sample data;

the first determining unit is used for determining sample parameters and a publishable index according to the modeling problem type and the sample data;

the second determining unit is used for determining a modeling algorithm according to the modeling problem type, the sample parameter and a preset model selection strategy;

the modeling unit is used for training a data model according to the modeling algorithm and inputting the sample data into the data model to obtain an output result;

the scoring unit is used for scoring the output result to obtain a scoring result;

the judging unit is used for judging whether the scoring result meets the publishable index;

and the optimizing unit is used for optimizing the preset model selection strategy when the scoring result does not meet the publishable index, and returning to the step of determining a modeling algorithm according to the modeling problem type, the sample parameter and the preset model selection strategy.

11. The apparatus for training a data model according to claim 10, wherein the second determining unit specifically includes:

the third determining unit is used for determining the range of the modeling algorithm type according to the modeling problem type;

and the selection unit is used for determining a modeling algorithm in the range of the type of the modeling algorithm according to the sample parameters and the preset model selection strategy.

12. The apparatus for training a data model according to claim 10,

13. The training apparatus for data model according to claim 12,

the calculation formula of the scoring result is as follows:

$$SCORE_{total} = W_{acc} \cdot SCORE_{acc} + W_{perf} \cdot SCORE_{perf} + W_{robust} \cdot SCORE_{robust} + W_{cust} \cdot SCORE_{cust}$$

wherein $SCORE_{total}$ is the total score, $SCORE_{acc}$ is the accuracy score, $W_{acc}$ is the preset accuracy score weight, $SCORE_{perf}$ is the performance index score, $W_{perf}$ is the preset performance index score weight, $SCORE_{robust}$ is the stability index score, $W_{robust}$ is the preset stability index score weight, $SCORE_{cust}$ is the custom index score, and $W_{cust}$ is the preset custom index score weight.

14. The apparatus for training a data model according to claim 13, wherein the accuracy scoring formula is:

$$SCORE_{acc} = \begin{cases} 0, & acc < acc_{thredhold} \\ acc - acc_{thredhold}, & acc \ge acc_{thredhold} \end{cases}$$

wherein $acc$ is the accuracy of the data model, $acc_{thredhold}$ is the preset accuracy threshold, and the accuracy of the data model is the ratio of the number of correct results output by the data model to the number of sample data.

15. The apparatus for training a data model according to claim 13, wherein the performance index scoring formula is:

$$SCORE_{perf} = T_{min} - T_i$$

wherein $T_{min}$ is the minimum time spent training the data model, and $T_i$ is the actual time spent training the data model.

16. The apparatus for training a data model according to claim 13,

17. The apparatus for training a data model according to claim 13,

18. The apparatus for training a data model according to any one of claims 10 to 17,

the optimizing unit is further configured to determine the data model with the highest total score as the final data model when the scoring result meets the publishable index.