CN111310122A - Model data processing method, electronic device and storage medium - Google Patents

Model data processing method, electronic device and storage medium

Info

Publication number
CN111310122A
Authority
CN
China
Prior art keywords
models
data set
training
data
accuracy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010082777.2A
Other languages
Chinese (zh)
Inventor
喻颍杰
尚毛毛
张卫华
杨丛丛
杨豫萍
董大为
***
康敏华
李楠
周晴
王业帅
杭玢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Economic Research Institute Of Development And Reform Commission Of Guangxi Zhuang Autonomous Region
Beijing Hongtianyu Technology Co Ltd
Original Assignee
Economic Research Institute Of Development And Reform Commission Of Guangxi Zhuang Autonomous Region
Beijing Hongtianyu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Economic Research Institute Of Development And Reform Commission Of Guangxi Zhuang Autonomous Region, Beijing Hongtianyu Technology Co Ltd
Priority to CN202010082777.2A
Publication of CN111310122A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Development Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Finance (AREA)
  • Mathematical Physics (AREA)
  • Accounting & Taxation (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Mathematical Optimization (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Mathematical Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data processing method for a model, comprising: determining an original data set; training at least three models using the original data set and evaluating them to obtain at least three original accuracy indicators; creating an evaluation benchmark from the at least three original accuracy indicators; determining a standard data set from the original data set; training the at least three models using the standard data set and evaluating them to obtain at least three standard accuracy indicators; determining at least two pending models among the at least three models according to the at least three standard accuracy indicators; performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models; and determining a selected model among the at least two optimized models.

Description

Model data processing method, electronic device and storage medium
Technical Field
The present application relates to the field of computer algorithms, and in particular, to a data processing method for a model, an electronic device, and a storage medium.
Background
The inventors of the present application have found that traditional economic analysis relies mainly on structured data, whose most obvious drawback is a strong time lag. For example, quarterly GDP published by the government often lags by about one month, and statistical yearbooks reflecting overall economic and social conditions lag by about three months, which hinders timely understanding, prediction, and early warning of the macroeconomic situation.
To address these problems, big data algorithms have been introduced to analyze and predict economic data. How to select a model among such algorithms has become a difficult problem.
Disclosure of Invention
The application aims to provide a data processing method of a model, an electronic device and a storage medium.
One embodiment of the present application provides a data processing method for a model, including: determining an original data set; training at least three models using the original data set and evaluating them to obtain at least three original accuracy indicators; creating an evaluation benchmark from the at least three original accuracy indicators; determining a standard data set from the original data set; training the at least three models using the standard data set and evaluating them to obtain at least three standard accuracy indicators; determining at least two pending models among the at least three models according to the at least three standard accuracy indicators; performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models; and determining a selected model among the at least two optimized models.
Another embodiment of the present application provides an electronic device comprising a processor and a memory, and a program stored in the memory and executable by the processor, wherein when the program is executed, the processor performs any one of the methods described above.
Another embodiment of the present application provides a storage medium storing a program executable by a processor, the processor performing any one of the methods described above when the program is executed.
With the method, the electronic device, and the storage medium, multiple candidate models can undergo multiple rounds of training, evaluation, and screening, so that the best-performing selected model is finally obtained.
With this method, timely data can be acquired and then analyzed and predicted using machine learning algorithms, while economic theory is used to explain the economic problems. Data obtained through big data can compensate for the shortcomings of traditional statistical data, improving macroeconomic prediction and analysis and bringing a new breakthrough to the field.
With this method, the growth rate for the next quarter can be predicted more accurately and closer to real time by combining internet data with traditional statistical data and an established indicator system. With machine learning, the model generalizes better, is more resistant to interference, and is more accurate and stable. Moreover, internet indicators can quantitatively describe changes in the market environment that traditional data cannot capture, indirectly reflecting influences such as the China-US trade war, which makes the overall prediction more timely.
Drawings
Fig. 1 shows a schematic flow chart of a data processing method of a model according to an embodiment of the present application.
Fig. 2 is a schematic flow chart illustrating a data processing method of a model according to another embodiment of the present application.
FIG. 3 illustrates a data histogram of raw data in an example embodiment.
FIG. 4 shows a data density distribution diagram of raw data in an example embodiment.
FIG. 5 illustrates a box plot of raw data in an example embodiment.
FIG. 6 shows a data correlation diagram of raw data in an example embodiment.
Fig. 7 is a diagram illustrating a leading-relationship analysis between the number of newly added enterprises and the year-on-year growth rate in the example embodiment.
FIG. 8 illustrates a leading-relationship analysis between industrial resources and the year-on-year growth rate in an example embodiment.
Fig. 9 is a schematic diagram illustrating the leading relationship between the number of newly added individual businesses and the year-on-year growth rate in the example embodiment.
Figure 10 shows a statistical diagram of the mean square error of the original accuracy indicator in an example embodiment.
Figure 11 shows a statistical diagram of the mean square error of the standard accuracy indicator in an example embodiment.
FIG. 12 shows a block diagram of an electronic device according to an example embodiment.
Detailed Description
The following describes embodiments of the data processing method, the electronic device, and the storage medium of the present disclosure through specific examples; those skilled in the art will understand the advantages and effects of the present disclosure from this description. The invention can be implemented in other, different embodiments, and its details can be modified in various respects without departing from the spirit and scope of the invention. The drawings are for illustration only and are not drawn to scale. The following embodiments explain the related technical content in further detail, but the disclosure is not intended to limit the scope of the invention.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the application. As used in the specification and claims of this application, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this application refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
Fig. 1 shows a schematic flow chart of a data processing method of a model according to an embodiment of the present application.
As shown in fig. 1, method 1000 may include S110, S120, S130, S140, S150, S160, S170, and S180.
In S110, the original data set may be determined. The raw data may be obtained manually from public data sources, or automatically by a computer or over a computer network, and then organized into a database of raw data, i.e., the original data set. Optionally, the raw data may be acquired from the network using crawler technology, collected by the terminal nodes of a distributed network, or obtained through communication between a cloud server and end users.
As shown in fig. 1, in S120, at least three models may be trained and evaluated using the original data set, yielding at least three original accuracy indicators. Optionally, at least three models may first be selected as candidate models. Optionally, the candidate models may include at least three of: linear regression, ridge regression, lasso regression, elastic net regression, support vector machine, random forest, extremely randomized trees, XGBoost, GBDT, and AdaBoost.
Optionally, the original data set is divided into a training data set and an evaluation data set. The at least three models may be trained using the training data set, and the training results may be evaluated using the evaluation data set to obtain at least three original accuracy indicators. Optionally, the original accuracy indicator may be the mean squared error of the training results.
Optionally, the original data set may be divided into a plurality of portions, at least one of which is used as the evaluation data set while the remaining portions are used as the training data set. The training data set is used to train a model, the evaluation data set is used to evaluate the training result, and the evaluation result serves as the original accuracy indicator.
Further, the portions may take turns serving as the evaluation data set while the remaining portions serve as the training data set. Each model may be trained and evaluated on each pairing of training and evaluation data sets in turn, yielding a plurality of evaluation results, from which the original accuracy indicator for that model is determined. For example, the plurality of evaluation results may be a plurality of mean squared errors, and the original accuracy indicator may be their average.
Further, the original data set may be divided evenly into ten parts, with each part in turn used as the evaluation data set while the other nine parts are used as the training data set. A model is trained and its training result evaluated for each such pairing; this is ten-fold cross-validation.
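The rotation described above can be sketched in plain Python. This is only an illustrative sketch, not the applicant's implementation: the helper names (`k_fold_indices`, `cross_val_mse`), the toy data, and the mean-predicting stand-in model are all assumptions introduced here.

```python
from statistics import mean

def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def cross_val_mse(fit, predict, X, y, k=10):
    """Use each fold in turn as the evaluation set; return the per-fold MSEs."""
    scores = []
    for eval_idx in k_fold_indices(len(X), k):
        held_out = set(eval_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in eval_idx]
        scores.append(mse([y[i] for i in eval_idx], preds))
    return scores

# Hypothetical stand-in model: always predicts the mean of the training targets.
def fit_mean(X_train, y_train):
    return mean(y_train)

def predict_mean(model, x):
    return model

# Hypothetical toy data, not taken from the application.
X = [[float(i)] for i in range(20)]
y = [2.0 * i + 1.0 for i in range(20)]

fold_mses = cross_val_mse(fit_mean, predict_mean, X, y, k=10)
original_accuracy = mean(fold_mses)  # average of the ten fold MSEs
```

Any real candidate model could be substituted for the stand-in by supplying its own `fit` and `predict` callables.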
As shown in fig. 1, in S130, an evaluation benchmark may be created from the at least three original accuracy indicators. The at least three original accuracy indicators may themselves be used as the evaluation benchmark, or the result of a calculation based on them may be used.
As shown in fig. 1, in S140, a standard data set may be determined from the original data set. Optionally, the data in the original data set may be rescaled to a common scale. Other linear transformations of the raw data are also possible. When standardizing the data, a Pipeline is used to standardize the data and evaluate the models together, which prevents data leakage.
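The leakage concern can be illustrated with a small sketch. The text names a "Pipeline" without further detail; it may refer to a library pipeline object, but that is an assumption. The plain-Python version below only shows the underlying idea: scaling statistics are learned from the training portion alone and then reused on the evaluation portion. The helper names and the numbers are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Learn per-column mean and standard deviation from training rows only."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns so the division below is safe.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, scaler):
    """Standardize rows with previously learned statistics."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: two indicators on very different scales.
train_rows = [[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]]
eval_rows = [[400.0, 4.0]]

scaler = fit_scaler(train_rows)             # statistics come from training data only
train_scaled = transform(train_rows, scaler)
eval_scaled = transform(eval_rows, scaler)  # reused as-is, never refit on evaluation data
```

In a cross-validation loop, `fit_scaler` would be called once per fold on that fold's training portion.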
As shown in fig. 1, in S150, the at least three models may be trained using the standard data set and evaluated to obtain at least three standard accuracy indicators. The standard data set is divided into a training data set and an evaluation data set. The at least three models may be trained using the training data set, and the training results evaluated using the evaluation data set, to obtain the standard accuracy indicator for each of the at least three models. Optionally, ten-fold cross-validation may be used to split the standard data set, train the models, and evaluate the training results.
As shown in fig. 1, in S160, at least two pending models may be determined among the at least three models according to the at least three standard accuracy indicators. Optionally, the at least two pending models may be determined according to the standard accuracy indicator of each of the at least three models. Optionally, the models with the best at least two standard accuracy indicators may be selected as the pending models.
As shown in fig. 1, in S170, parameter optimization may be performed on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models. Optionally, each pending model may be tuned against the evaluation benchmark to obtain its optimized model. Optionally, a grid search algorithm may be used to optimize the at least two pending models.
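A grid search of the kind mentioned in S170 can be sketched as an exhaustive loop over parameter combinations. The sketch below is an assumption-laden illustration: `toy_cv_mse` is a hypothetical stand-in for "cross-validated MSE of a model trained with these parameters", and the grid values are invented. The parameter names `n_estimators` and `max_depth` are the ones the description names later for the ETR model.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Try every parameter combination; return the one with the lowest score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function standing in for a cross-validated MSE.
def toy_cv_mse(n_estimators, max_depth):
    return (n_estimators - 100) ** 2 / 1000 + (max_depth - 5) ** 2

# Hypothetical grid; the application does not list the values it searched.
grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 7]}
best_params, best_score = grid_search(toy_cv_mse, grid)
```

The resulting `best_score` could then be compared against the evaluation benchmark to check whether tuning actually improved on the untuned models.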
As shown in FIG. 1, in S180, a selected model is determined among the at least two optimized models. Optionally, the at least two optimized models may be evaluated to obtain an optimization accuracy indicator for each. Optionally, the optimized model with the best optimization accuracy indicator may be chosen as the selected model. Optionally, the optimization accuracy indicator may be the mean squared error of each optimized model.
Fig. 2 is a schematic flow chart illustrating a data processing method of a model according to another embodiment of the present application.
As shown in fig. 2, method 2000 may include: S205, S210, S220, S230, S240, S250, S260, S270, and S280.
In S205, raw data may be collected. Table 1 shows the raw data in an example embodiment. In the example embodiment, the candidate models are used to predict the added value of industrial enterprises above designated size, and the original data set may include data related to industrial added value.
TABLE 1
[Table 1 is provided as images in the original publication.]
In the example embodiment, the raw data may include at least one of: industrial electricity consumption, industrial enterprise income tax, industrial enterprise value-added tax, PMI index, automobile output, electrolytic aluminum output, aluminum product output, output of ten nonferrous metals, aluminum oxide output, steel output, cement output, power generation, number of individual businesses, number of wholesale and retail enterprises, number of accommodation and catering enterprises, number of construction enterprises, number of agriculture, animal husbandry, and fishery enterprises, number of manufacturing enterprises, number of leasing and business service enterprises, and the year-on-year growth rate of industrial added value. Sub-items of the above may also be included. Optionally, the raw data is not limited to the above categories. Optionally, the raw data may include annual, quarterly, monthly, and other periodic data. Optionally, the raw data may include aperiodic data.
As shown in fig. 2, in S210, the raw data may be analyzed and organized to obtain the original data set. The raw data may be analyzed with descriptive statistics, and the analysis results presented visually, to strengthen the user's understanding of the raw data and help in constructing a suitable model.
FIG. 3 illustrates a data histogram of raw data in an example embodiment. FIG. 4 shows a data density distribution diagram of raw data in an example embodiment. FIG. 5 illustrates a box plot of raw data in an example embodiment.
Descriptive statistics include the maximum, minimum, median, and quartiles of the raw data, and are used to analyze the distribution and structure of the raw data. Graphical descriptive statistics may also be used to analyze the distribution of the data.
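As a rough illustration of these summary statistics, the sketch below computes the minimum, quartiles, median, and maximum for one indicator using Python's standard `statistics` module. The `describe` helper and the sample values are hypothetical, and the quartile convention is simply the module's default, which the application does not specify.

```python
from statistics import median, quantiles

def describe(values):
    """Return the five-number summary of a numeric sequence."""
    q1, _, q3 = quantiles(values, n=4)  # default "exclusive" quartile method
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
    }

# Hypothetical monthly values for a single indicator.
sample = [3.0, 5.0, 7.0, 8.0, 12.0, 13.0, 14.0, 18.0, 21.0]
summary = describe(sample)
```

The same five numbers are what a box plot draws, which is why the two views are usually inspected together.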
Optionally, the descriptive statistics may include statistics of the data distribution of the raw data. As shown in fig. 3, optionally, the data distribution may be shown with histograms. In the example embodiment, some data, such as Enterprises, are approximately exponentially distributed, while others, such as aluminum oxide and Steels, show bimodal distributions. As shown in fig. 4, optionally, the distribution characteristics may also be shown with density plots, which are smoother than histograms. Optionally, descriptive statistics may analyze the skewness of the raw data. As shown in FIG. 5, optionally, box plots may be used to show the skewness of the raw data.
FIG. 6 shows a data correlation diagram of raw data in an example embodiment.
Further, the pairwise associations between data indicators may be analyzed. The association between two indicators may be a single value, or a set of values, for example the degree of association between the two indicators at each time node in a series of time nodes. As shown in fig. 6, optionally, the pairwise associations may also be represented graphically.
Optionally, in S210, data preprocessing may also be performed on the raw data. Data preprocessing may include data cleansing and feature derivation. For statistical data, cleansing may include deleting missing data and outliers. A new indicator, newly added enterprises, may be derived from the features for newly added enterprises registered with the industry and commerce authorities in each sector (including the numbers of newly added individual businesses and of enterprises in wholesale and retail, accommodation and catering, construction, agriculture, forestry, animal husbandry and fishery, manufacturing, leasing and business services, and so on). A new indicator, industrial resources, may be derived from the features for various industrial products (including power generation and the output of automobiles, electrolytic aluminum, aluminum products, ten nonferrous metals, aluminum oxide, steel, cement, and so on). A new indicator, newly added individual businesses, may be derived from the highly correlated feature "number of individual businesses".
Fig. 7 is a diagram illustrating a leading-relationship analysis between the number of newly added enterprises and the year-on-year growth rate in the example embodiment. FIG. 8 illustrates a leading-relationship analysis between industrial resources and the year-on-year growth rate in an example embodiment. Fig. 9 is a schematic diagram illustrating the leading relationship between the number of newly added individual businesses and the year-on-year growth rate in the example embodiment.
As shown in fig. 2, optionally, S210 may further include an analysis of the leading relationships between indicators. As shown in figs. 7, 8, and 9, the three indicators (newly added enterprises, industrial resources, and newly added individual businesses) are highly correlated with the next month's "industrial added value growth rate (monthly)", with Pearson correlation coefficients of 0.98, 0.96, and 0.86, respectively. The relationship between the growth of these indicators and the "industrial added value growth rate (monthly)" also shows that "newly added enterprises" and "industrial resources" lead the "industrial added value growth rate (monthly)" to some extent.
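One way to check whether an indicator leads a target by one month is to correlate the indicator at month t with the target at month t + 1. The sketch below is an assumption about how such a check could be done, not the application's stated procedure; the function names and the toy series are hypothetical and are unrelated to the coefficients reported above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lead_correlation(indicator, target, lag=1):
    """Correlate the indicator at month t with the target at month t + lag."""
    if lag == 0:
        return pearson(indicator, target)
    return pearson(indicator[:-lag], target[lag:])

# Hypothetical series constructed so that target[t + 1] == 2 * indicator[t].
indicator = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
target = [0.0, 2.0, 6.0, 4.0, 10.0, 8.0]

same_month = lead_correlation(indicator, target, lag=0)
one_month_lead = lead_correlation(indicator, target, lag=1)
```

A lagged coefficient that is clearly higher than the same-month coefficient is the pattern one would expect from a leading indicator.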
As shown in fig. 2, in S220, ten candidate models may be determined: linear regression (LR), ridge regression (RIDGE), lasso regression (LASSO), elastic net regression (EN), support vector machine (SVM), random forest (RFR), extremely randomized trees (ETR), XGBoost (XGB), GBDT (GBR), and AdaBoost (ABR). The type and number of candidate models are not limited to these.
The original data set obtained in S210 may be divided into a training data set and an evaluation data set. The ten candidate models may be trained on the training data set and the training results evaluated with the evaluation data set to obtain the original accuracy indicators of the ten candidate models. Optionally, the original accuracy indicator may be the mean squared error (MSE) of the training results.
In S220, optionally, the ten candidate models may be trained on the training data set split from the original data set, using each candidate model's preset default training parameters. In the example embodiment, the original accuracy indicators of the ten candidate models are as follows.
LR:-49.458561(49.693290)
Ridge:-49.456994(49.695623)
LASSO:-47.962319(49.706692)
EN:-48.747337(49.954865)
SVM:-81.629725(49.751904)
RFR:-47.443491(40.450092)
ETR:-41.751627(34.196770)
ABR:-42.452201(37.206723)
GBR:-57.325249(73.926423)
XGB:-55.308945(62.236916)
Figure 10 shows a statistical diagram of the mean square error of the original accuracy indicator in an example embodiment.
As shown in fig. 2, in S220, optionally, the training and evaluation data sets may be determined by ten-fold cross-validation. For example, the original data set may be divided evenly into ten parts; each part in turn is used as the evaluation data set and the remaining parts as the training data set, giving ten pairs of training and evaluation data sets. The model is trained on each training data set and evaluated on the corresponding evaluation data set, yielding ten mean squared errors (MSEs). These ten MSEs can be analyzed to obtain the statistical diagram of the mean square error of the original accuracy indicator shown in fig. 10.
As shown in fig. 2, in S230, optionally, the evaluation benchmark may be determined from the original accuracy indicators of the ten candidate models. Optionally, the mean squared error of each of the ten candidate models obtained in S220 may be used as the evaluation benchmark, or a value calculated from those mean squared errors may be used. For example, the mean, maximum, or minimum of each model's ten mean squared errors may be used as the evaluation benchmark.
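A benchmark built from per-fold errors could look like the following sketch. The `build_benchmark` helper and the fold values are hypothetical; the application does not publish the individual fold MSEs, and only three folds per model are shown to keep the example short.

```python
from statistics import mean

def build_benchmark(fold_mses_by_model):
    """Summarize each model's fold MSEs as mean, minimum, and maximum."""
    return {
        name: {"mean": mean(scores), "min": min(scores), "max": max(scores)}
        for name, scores in fold_mses_by_model.items()
    }

# Hypothetical fold MSEs (illustrative only, not the application's results).
fold_mses = {
    "LR": [40.0, 50.0, 60.0],
    "ETR": [30.0, 40.0, 50.0],
}
benchmark = build_benchmark(fold_mses)
```

Later tuning results can then be compared against, for example, `benchmark["ETR"]["mean"]` to judge whether optimization helped.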
As shown in fig. 2, in S240, a standard data set may be established from the original data set. Each item of data in the original data set can be standardized so that every index has the same value range. In S240, a Pipeline may be used to perform the standardization.
As shown in fig. 2, in S250, the ten candidate models may be trained using the standard data set and evaluated to obtain ten standard accuracy indexes. The execution process of S250 is similar to S220 and will not be described in detail.
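A minimal sketch of S240 and S250 together, assuming scikit-learn's `Pipeline` and `StandardScaler`. The text names only "Pipeline"; the choice of `StandardScaler` (zero mean, unit variance) is an assumption, and `MinMaxScaler` would be the alternative if an identical numeric range is required. The synthetic data and seeds are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the original data set (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=7)

# Standardization on its own: every column ends up with mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Scaler and model chained in one Pipeline, so the scaler is re-fitted on the
# training folds only and the evaluation fold never influences the scaling.
scaled_abr = Pipeline([
    ("Scaler", StandardScaler()),
    ("ABR", AdaBoostRegressor(random_state=7)),
])

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(scaled_abr, X, y, cv=kfold, scoring="neg_mean_squared_error")
print(f"ScalerABR: {scores.mean():f} ({scores.std():f})")
```

Wrapping the scaler inside the pipeline, rather than scaling the whole data set once up front, keeps the cross-validated error estimate free of information from the evaluation fold.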
As shown in the exemplary embodiment, in S250, the mean square error of the training results of the 10 models can be as follows:
ScalerLR:-49.458561(49.693290)
ScalerRIDGE:-48.259370(49.158078)
ScalerLASSO:-42.222034(37.662790)
ScalerEN:-47.131189(41.963870)
ScalerSVM:-46.693213(30.796697)
ScalerRFR:-46.057714(35.396472)
ScalerETR:-41.115216(39.916171)
ScalerABR:-39.881764(35.669335)
ScalerGBR:-56.378297(71.365000)
ScalerXGB:-55.310488(62.237158)
Figure 11 shows a statistical diagram of the mean square error of the standard accuracy index in an example embodiment.
Fig. 11 shows the ten-fold cross-validation results obtained in S250 in the example embodiment. Optionally, the standard accuracy index may comprise the ten-fold cross-validation results shown in fig. 11.
As shown in fig. 2, in S260, two pending models can be determined from the ten models according to the standard accuracy index. Optionally, the two models with the best standard accuracy index can be selected from the ten models as the pending models. As shown in fig. 11, the AdaBoost (ABR) model in the example embodiment has the best MSE, followed by the extreme random tree regression (ETR) model. Thus, the AdaBoost (ABR) model and the extreme random tree regression (ETR) model may be selected as the two pending models.
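The selection in S260 can be reproduced from the mean values listed for S250. The sketch below is illustrative only; the helper name `select_pending` is hypothetical, and the scores are read as negated MSE values (an assumption based on their sign), so the value closest to zero is the best.

```python
# Mean cross-validated scores as listed for S250 (read as negated MSE).
standard_scores = {
    "ScalerLR": -49.458561,
    "ScalerRIDGE": -48.259370,
    "ScalerLASSO": -42.222034,
    "ScalerEN": -47.131189,
    "ScalerSVM": -46.693213,
    "ScalerRFR": -46.057714,
    "ScalerETR": -41.115216,
    "ScalerABR": -39.881764,
    "ScalerGBR": -56.378297,
    "ScalerXGB": -55.310488,
}


def select_pending(scores, k=2):
    """Return the k model names with the highest negated MSE, i.e. the lowest MSE."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


pending = select_pending(standard_scores)
print(pending)  # ['ScalerABR', 'ScalerETR']
```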
As shown in fig. 2, in S270, the AdaBoost (ABR) model and the extreme random tree regression (ETR) model may be optimized with respect to the evaluation benchmark. For example, in an example embodiment, the main parameters n_estimators and max_depth may be selected for tuning the extreme random tree (ETR) model, with the following results:
Best (MSE): 30.135471988372101 using {'n_estimators': 20, 'max_depth': 6}
For the AdaBoost (ABR) model, the parameters n_estimators and learning_rate may be selected for tuning, with the following results:
Best: 34.960919707149943 using {'learning_rate': 0.3, 'n_estimators': 30}
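A parameter search of this kind can be sketched with scikit-learn's `GridSearchCV`. This is an assumption-laden illustration: the candidate grid values, the synthetic data, and the seeds are hypothetical, and the text does not state which values were actually searched.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the original data set (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=7)

pipeline = Pipeline([
    ("Scaler", StandardScaler()),
    ("ETR", ExtraTreesRegressor(random_state=7)),
])

# Hypothetical candidate values; the "ETR__" prefix routes each parameter
# to the pipeline step named "ETR".
param_grid = {
    "ETR__n_estimators": [10, 20, 30],
    "ETR__max_depth": [4, 6, 8],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=7),
)
search.fit(X, y)

# best_score_ is the negated MSE, so flip the sign when reporting it.
print(f"Best (MSE): {-search.best_score_} using {search.best_params_}")
```

The same pattern would apply to AdaBoost by swapping the estimator and searching over `n_estimators` and `learning_rate`.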
The results of comparing the optimization results of the two models are shown in table 2.
TABLE 2
Model name                   MSE (evaluation data set)
Extreme random tree (ETR)    27.98
AdaBoost (ABR)               30.26
As shown in fig. 2, in S280, the final selected model may be determined from the two optimized models. As shown in table 2, the extreme random tree (ETR) model in the example embodiment has the lower MSE on the evaluation data set and thus fits the sample data better; therefore, the extreme random tree (ETR) model may be selected as the final selected model for the project.
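The final comparison in S280 might look like the sketch below, which fits both tuned models and compares their MSE on a held-out evaluation set. The tuned parameter values are the ones reported in S270; the synthetic data, the 80/20 split, and the seeds are assumptions, so the resulting numbers will not match table 2.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the original data set (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=7)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=7)

# The two pending models with the tuned parameters reported in S270.
optimized = {
    "ETR": make_pipeline(
        StandardScaler(),
        ExtraTreesRegressor(n_estimators=20, max_depth=6, random_state=7),
    ),
    "ABR": make_pipeline(
        StandardScaler(),
        AdaBoostRegressor(n_estimators=30, learning_rate=0.3, random_state=7),
    ),
}

# Fit each optimized model and measure its MSE on the evaluation data set.
eval_mse = {}
for name, model in optimized.items():
    model.fit(X_train, y_train)
    eval_mse[name] = mean_squared_error(y_eval, model.predict(X_eval))

# The model with the lowest evaluation MSE becomes the selected model.
selected = min(eval_mse, key=eval_mse.get)
print(eval_mse, "->", selected)
```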
FIG. 12 shows a block diagram of an electronic device according to an example embodiment.
An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 12. The electronic device 200 shown in fig. 12 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in fig. 12, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one storage unit 220, a bus 230 connecting different system components (including the storage unit 220 and the processing unit 210), a display unit 240, and the like.
The storage unit 220 stores program code executable by the processing unit 210 to cause the processing unit 210 to perform the methods according to various exemplary embodiments of the present application described herein. For example, the processing unit 210 may perform a method as illustrated in at least one of figs. 1-11.
The storage unit 220 may include readable media in the form of volatile storage units, such as a random access memory (RAM) unit 2201 and/or a cache memory unit 2202, and may further include a read only memory (ROM) unit 2203.
The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 230 may be one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor bus or local bus using any of a variety of bus architectures.
The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to as a "circuit," "module," or "system." Furthermore, the present application may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
By using the method, the electronic device and the storage medium, multiple candidate models can be trained, evaluated and screened over multiple rounds, so that the selected model with the best effect is finally obtained.
By the method, timely data can be acquired and then analyzed and predicted using artificial intelligence and machine learning algorithms, and economic problems can be effectively explained using economic theory. The data acquired from big data can compensate for the shortcomings of traditional statistical data, effectively improving macroeconomic prediction and analysis and bringing a new breakthrough to the field.
By the method, the growth rate of the next quarter can be predicted more accurately and in real time by combining internet data with traditional statistical data and an established index system. Moreover, with the machine learning method, the model generalizes better and offers strong anti-interference capability, high accuracy and strong stability. Internet indexes can also quantitatively describe market environment changes that traditional data cannot describe, indirectly reflecting influences such as the China-US trade war, so that the overall prediction is more timely.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the embodiments are described, but any such combination should be considered as falling within the scope of the present specification as long as the combined features do not contradict one another.
The embodiments of the present application have been described in detail above to illustrate the principles and implementations of the present application; the description of the embodiments is only intended to facilitate understanding of the methods of the present application and their core concepts. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes or modifications to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the application.

Claims (10)

1. A model data processing method, comprising:
determining an original data set;
training at least three models using the original data set, and evaluating them to obtain at least three original accuracy indexes;
establishing an evaluation benchmark from the at least three original accuracy indexes;
determining a standard data set according to the original data set;
training the at least three models using the standard data set, and evaluating them to obtain at least three standard accuracy indexes;
determining at least two pending models among the at least three models according to the at least three standard accuracy indexes;
performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models; and
determining a selected model among the at least two optimized models.
2. The method of claim 1, wherein the training at least three models using the original data set and evaluating them to obtain at least three original accuracy indexes comprises:
determining a training data set and an evaluation data set from the original data set;
training the at least three models using the training data set; and
evaluating, using the evaluation data set, the at least three models trained using the training data set.
3. The method of claim 2, wherein the determining a training data set and an evaluation data set from the original data set comprises:
determining the training data set and the evaluation data set by ten-fold cross-validation.
4. The method of claim 1, wherein,
the at least three original accuracy indexes comprise mean square errors of the at least three models trained using the original data set;
the at least three standard accuracy indexes comprise mean square errors of the at least three models trained using the standard data set.
5. The method of claim 1, wherein the performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models comprises:
performing parameter optimization on the at least two pending models using a grid search algorithm to obtain the at least two optimized models.
6. The method of claim 1, wherein the determining a selected model among the at least two optimized models comprises:
evaluating the at least two optimized models to obtain at least two optimized accuracy indexes;
selecting, from the at least two optimized models, the model with the best optimized accuracy index as the selected model.
7. The method of claim 6, wherein the at least two optimized accuracy indexes comprise mean square errors of the at least two optimized models.
8. The method of claim 1, wherein the at least three models are selected from the group consisting of linear regression, ridge regression, lasso regression, elastic net regression, support vector machine, random forest, extreme random tree, XGBoost, GBDT, and AdaBoost.
9. The method of claim 1, wherein processed index data related to industrial added value is used as the original data set, the method further comprising:
predicting the growth rate of industrial added value above designated size using the selected model.
10. An electronic device comprising a processor and a memory, and a program executable by the processor stored in the memory, the program, when executed, causing the processor to perform the method of at least one of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010082777.2A CN111310122A (en) 2020-02-07 2020-02-07 Model data processing method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010082777.2A CN111310122A (en) 2020-02-07 2020-02-07 Model data processing method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN111310122A true CN111310122A (en) 2020-06-19

Family

ID=71146952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010082777.2A Pending CN111310122A (en) 2020-02-07 2020-02-07 Model data processing method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN111310122A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111861264A (en) * 2020-07-31 2020-10-30 华中科技大学 Method for predicting concrete durability based on data mining and intelligent algorithm
CN113707320A (en) * 2021-08-30 2021-11-26 安徽理工大学 EN (EN) -MPA-SVM (multi-point support vector machine) -based abnormal physical sign miner distinguishing method based on correlation analysis
CN113707320B (en) * 2021-08-30 2023-08-11 安徽理工大学 Abnormal physical sign miner distinguishing method based on correlation analysis and combining EN with MPA-SVM
WO2023030282A1 (en) * 2021-09-02 2023-03-09 Huawei Technologies Co., Ltd. Methods and devices for assessing generalizability of benchmarks

Similar Documents

Publication Publication Date Title
US10606862B2 (en) Method and apparatus for data processing in data modeling
AU2018101946A4 (en) Geographical multivariate flow data spatio-temporal autocorrelation analysis method based on cellular automaton
US11308418B2 (en) Automatic selection of variables for a machine-learning model
Kuravsky et al. A numerical technique for the identification of discrete-state continuous-time Markov models
US20190251458A1 (en) System and method for particle swarm optimization and quantile regression based rule mining for regression techniques
CN111310122A (en) Model data processing method, electronic device and storage medium
US20150120263A1 (en) Computer-Implemented Systems and Methods for Testing Large Scale Automatic Forecast Combinations
US8170894B2 (en) Method of identifying innovations possessing business disrupting properties
CN110717535B (en) Automatic modeling method and system based on data analysis processing system
CN107729241B (en) Software variation test data evolution generation method based on variant grouping
Chen et al. Optimal variability sensitive condition-based maintenance with a Cox PH model
CN110825522A (en) Spark parameter self-adaptive optimization method and system
Nicholson et al. Optimal network flow: A predictive analytics perspective on the fixed-charge network flow problem
CN111476274B (en) Big data predictive analysis method, system, device and storage medium
US20200050982A1 (en) Method and System for Predictive Modeling for Dynamically Scheduling Resource Allocation
CN111339163B (en) Method, device, computer equipment and storage medium for acquiring user loss state
Bidyuk et al. An Approach to Identifying and Filling Data Gaps in Machine Learning Procedures
Almomani et al. Selecting a good stochastic system for the large number of alternatives
CN115409541A (en) Cigarette brand data processing method based on data blood relationship
CA3177037A1 (en) Forecasting based on bernoulli uncertainty characterization
KR20230052010A (en) Demand forecasting method using ai-based model selector algorithm
Cherukuri et al. Control Spare Parts Inventory Obsolescence by Predictive Modelling
CN113191540A (en) Construction method and device of industrial link manufacturing resources
Sedano et al. The application of a two-step AI model to an automated pneumatic drilling process
Kolinski et al. The assessment of the economic efficiency of production process-simulation approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 259, 1st Floor, Bowangyuan Podium, Yangfangdian, Haidian District, Beijing 100038

Applicant after: Beijing hongtianyu Technology Co.,Ltd.

Applicant after: Guangxi Zhuang Autonomous Region Macroeconomic Research Institute

Address before: No. 259, 1st Floor, Bowangyuan Podium, Yangfangdian, Haidian District, Beijing 100038

Applicant before: Beijing hongtianyu Technology Co.,Ltd.

Applicant before: Economic Research Institute of development and Reform Commission of Guangxi Zhuang Autonomous Region

Address after: 6/F, West Building, Guangxi Development Building, 111-1 Minzu Avenue, Nanning, Guangxi Zhuang Autonomous Region 530012

Applicant after: Guangxi Zhuang Autonomous Region Macroeconomic Research Institute

Applicant after: Beijing hongtianyu Technology Co.,Ltd.

Address before: No. 259, 1st Floor, Bowangyuan Podium, Yangfangdian, Haidian District, Beijing 100038

Applicant before: Beijing hongtianyu Technology Co.,Ltd.

Applicant before: Guangxi Zhuang Autonomous Region Macroeconomic Research Institute

RJ01 Rejection of invention patent application after publication

Application publication date: 20200619