CN111310122A

CN111310122A - Model data processing method, electronic device and storage medium

Info

Publication number: CN111310122A
Application number: CN202010082777.2A
Authority: CN
Inventors: 喻颍杰; 尚毛毛; 张卫华; 杨丛丛; 杨豫萍; 董大为; ***; 康敏华; 李楠; 周晴; 王业帅; 杭玢
Original assignee: Economic Research Institute Of Development And Reform Commission Of Guangxi Zhuang Autonomous Region; Beijing Hongtianyu Technology Co Ltd
Current assignee: Economic Research Institute Of Development And Reform Commission Of Guangxi Zhuang Autonomous Region; Beijing Hongtianyu Technology Co Ltd
Priority date: 2020-02-07
Filing date: 2020-02-07
Publication date: 2020-06-19

Abstract

The application relates to a data processing method for a model, comprising: determining an original data set; training at least three models using the original data set and evaluating them to obtain at least three original accuracy indicators; creating an evaluation benchmark from the at least three original accuracy indicators; determining a standard data set from the original data set; training the at least three models using the standard data set and evaluating them to obtain at least three standard accuracy indicators; determining at least two pending models among the at least three models according to the at least three standard accuracy indicators; performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models; and determining a selected model among the at least two optimized models.

Description

Model data processing method, electronic device and storage medium

Technical Field

The present application relates to the field of computer algorithms, and in particular, to a data processing method for a model, an electronic device, and a storage medium.

Background

The inventors of the present application have found that traditional economic analysis relies mainly on structured data, whose most obvious defect is a strong time lag. For example, quarterly GDP published by the government often lags by about one month, and a statistical yearbook reflecting overall economic and social conditions lags by about three months, which hinders timely understanding, prediction, and early warning of the macroeconomic situation.

To address these problems, big data algorithms have been introduced to analyze and predict economic data. How to select a model among such algorithms then becomes a difficult problem.

Disclosure of Invention

The application aims to provide a data processing method of a model, an electronic device and a storage medium.

One embodiment of the present application provides a data processing method for a model, including: determining an original data set; training at least three models using the original data set and evaluating them to obtain at least three original accuracy indicators; creating an evaluation benchmark from the at least three original accuracy indicators; determining a standard data set from the original data set; training the at least three models using the standard data set and evaluating them to obtain at least three standard accuracy indicators; determining at least two pending models among the at least three models according to the at least three standard accuracy indicators; performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models; and determining a selected model among the at least two optimized models.

Another embodiment of the present application provides an electronic device comprising a processor and a memory, and a program stored in the memory and executable by the processor, wherein when the program is executed, the processor performs any one of the methods described above.

Another embodiment of the present application provides a storage medium storing a program executable by a processor, the processor performing any one of the methods described above when the program is executed.

With the method, the electronic device, and the storage medium, multiple candidate models can be trained, evaluated, and screened over several rounds, so that the selected model with the best performance is finally obtained.

With this method, timely data can be acquired and then analyzed and predicted using machine learning algorithms, while economic theory is used to explain the economic problems. Data obtained through big data can compensate for the shortcomings of traditional statistical data, effectively improving macroeconomic prediction and analysis and bringing a new breakthrough to the field.

With this method, by combining internet data with traditional statistical data and an established indicator system, the growth rate of the next quarter can be predicted more accurately and in near real time. Moreover, with the machine learning method, the model generalizes better, is more robust to interference, and is more accurate and stable. Internet indicators can also quantitatively describe changes in the market environment that traditional data cannot describe, indirectly reflecting influences such as the China-US trade war, so that the overall prediction is more timely.

Drawings

Fig. 1 shows a schematic flow chart of a data processing method of a model according to an embodiment of the present application.

Fig. 2 is a schematic flow chart illustrating a data processing method of a model according to another embodiment of the present application.

FIG. 3 illustrates a data histogram of raw data in an example embodiment.

FIG. 4 shows a data density distribution diagram of raw data in an example embodiment.

FIG. 5 illustrates a box plot of raw data in an example embodiment.

FIG. 6 shows a correlation diagram of raw data in an example embodiment.

Fig. 7 is a schematic diagram illustrating a leading-relationship analysis between the number of newly added enterprises and the year-on-year growth rate in the example embodiment.

FIG. 8 is a schematic diagram illustrating a leading-relationship analysis between industrial resources and the year-on-year growth rate in an example embodiment.

Fig. 9 is a schematic diagram illustrating a leading-relationship analysis between the number of newly added individual industrial and commercial households and the year-on-year growth rate in the example embodiment.

Figure 10 shows a statistical diagram of the mean square error of the original accuracy indicator in an example embodiment.

Figure 11 shows a statistical diagram of the mean square error of the standard accuracy indicator in an example embodiment.

FIG. 12 shows a block diagram of an electronic device according to an example embodiment.

Detailed Description

The following is a description of embodiments of the data processing method, the electronic device, and the storage medium according to the present disclosure by specific embodiments, and those skilled in the art will understand the advantages and effects of the present disclosure from the disclosure of the present disclosure. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention. The drawings of the present invention are for illustrative purposes only and are not intended to be drawn to scale. The following embodiments will further explain the related art of the present invention in detail, but the disclosure is not intended to limit the scope of the present invention.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the application. As used in the specification and claims of this application, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this application refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As shown in fig. 1, method 1000 may include S110, S120, S130, S140, S150, S160, S170, and S180.

In S110, the original data set may be determined. The raw data may be obtained manually from a public data source, or automatically by a computer or over a computer network. The raw data is then collated into a database of the raw data, i.e., the original data set. Optionally, the raw data may be acquired from the network using crawler technology, from the terminal nodes of a distributed network, or through communication between a cloud server and end users.

As shown in fig. 1, in S120, at least three models may be trained and evaluated using the original data set to obtain at least three original accuracy indicators. Optionally, at least three models may first be selected as candidate models. Optionally, the candidate models may include at least three of: linear regression, ridge regression, lasso regression, elastic net regression, support vector machine, random forest, extremely randomized trees, XGBoost, GBDT, and AdaBoost.

Optionally, the original data set is divided into a training data set and an evaluation data set. The at least three models may be trained using the training data set, and the training results may be evaluated using the evaluation data set to obtain the at least three original accuracy indicators. Optionally, the original accuracy indicator may be the mean square error of the training results.

Optionally, the original data set may be divided into a plurality of portions, at least one of which is used as the evaluation data set while the other portions are used as the training data set. The training data set is used to train a model, the evaluation data set is used to evaluate the training result, and the resulting evaluation result is used as the original accuracy indicator.

Further, each portion may serve in turn as the evaluation data set while the remaining portions serve as the training data set. The model may be trained, and its training result evaluated, with each pair of training and evaluation data sets in turn, yielding a plurality of evaluation results. The original accuracy indicator corresponding to each model may be determined from the plurality of evaluation results. For example, the plurality of evaluation results may be a plurality of mean square errors, and the original accuracy indicator may be the average of the plurality of mean square errors.

Further, the original data set may be divided evenly into ten parts, with any one of the ten parts used as the evaluation data set and the other parts used as the training data set. Training a model and evaluating the training result on each such pair of training and evaluation data sets is the ten-fold cross-validation method.
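The rotation of evaluation and training portions described above can be sketched in plain Python. This is only a minimal illustration, not the applicants' implementation: the helper names `k_fold_indices` and `cross_validated_mse`, the toy target values, and the trivial model that predicts the training mean are all assumptions made for the example.

```python
from statistics import mean


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, eval_idx) pairs; each fold is the evaluation set once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        eval_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, eval_idx
        start += size


def cross_validated_mse(y, k=10):
    """Average the per-fold MSE of a toy model that predicts the training mean."""
    fold_mses = []
    for train_idx, eval_idx in k_fold_indices(len(y), k):
        prediction = mean(y[i] for i in train_idx)              # "training"
        errors = [(y[i] - prediction) ** 2 for i in eval_idx]   # "evaluation"
        fold_mses.append(mean(errors))
    return mean(fold_mses), fold_mses


# Hypothetical target values, e.g. 20 monthly growth rates.
y = [float(v) for v in range(20)]
original_accuracy, per_fold = cross_validated_mse(y, k=10)
print(len(per_fold), round(original_accuracy, 3))
```

Here the average of the ten per-fold mean square errors plays the role of the original accuracy indicator of one model; a real candidate model would replace the mean predictor.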

As shown in fig. 1, in S130, an evaluation benchmark may be created from the at least three original accuracy indicators. The at least three original accuracy indicators may themselves be used as the evaluation benchmark. A result calculated from the at least three original accuracy indicators may also be used as the evaluation benchmark.

As shown in fig. 1, in S140, a standard data set may be determined from the original data set. Optionally, the data in the original data set may be rescaled to a uniform scale. Other linear transformations of the original data are also possible. To prevent data leakage when standardizing the data, a Pipeline may be used to standardize the data and evaluate the models together.
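A hedged sketch of how standardization and evaluation can be combined to avoid leakage follows. It assumes that "Pipeline" refers to scikit-learn's `Pipeline` class, which the source does not state explicitly; the synthetic data, the choice of `StandardScaler`, and the `Lasso` model are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the original data set (assumption): 60 samples,
# 5 indicators on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 0.1])
y = X @ np.array([2.0, 0.3, 0.02, 0.001, 5.0]) + rng.normal(scale=0.5, size=60)

# The scaler sits inside the pipeline, so in every fold it is fitted on the
# training part only and then applied to the evaluation part.
scaled_lasso = Pipeline([
    ("scaler", StandardScaler()),
    ("model", Lasso()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(scaled_lasso, X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("ScalerLASSO: %f (%f)" % (scores.mean(), scores.std()))
```

Scaling the whole data set before splitting would let statistics of the evaluation portion influence training; fitting the scaler inside each fold avoids that.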

As shown in fig. 1, in S150, the at least three models may be trained using the standard data set and evaluated to obtain at least three standard accuracy indicators. The standard dataset is divided into a training dataset and an evaluation dataset. The at least three models can be trained by using the training data set, and the training result is evaluated by using the evaluation data set, so as to obtain the standard accuracy index corresponding to each model of the at least three models. Alternatively, a ten-fold cross-validation method may be used to separate the standard data set, train the model, and evaluate the training results.

As shown in fig. 1, in S160, at least two pending models may be determined among the at least three models according to the at least three standard accuracy indicators. Optionally, the at least two pending models may be determined according to the standard accuracy indicator corresponding to each of the at least three models. Optionally, the models corresponding to the at least two best standard accuracy indicators may be selected as the pending models.

As shown in fig. 1, in S170, the at least two pending models may be optimized according to the evaluation benchmark to obtain at least two optimized models. Optionally, parameter optimization may be performed on each of the at least two pending models based on the evaluation benchmark to obtain an optimized model for each. Optionally, the at least two pending models may be optimized using a grid search algorithm to obtain the at least two optimized models.

As shown in fig. 1, in S180, a selected model is determined among the at least two optimized models. Optionally, the at least two optimized models may be evaluated to obtain an optimized accuracy indicator corresponding to each of them. Optionally, the optimized model corresponding to the best optimized accuracy indicator may be chosen as the selected model. Optionally, the optimized accuracy indicator may be the mean square error of each optimized model.

As shown in fig. 2, method 2000 may include: S205, S210, S220, S230, S240, S250, S260, S270, and S280.

In S205, raw data may be collected. Table 1 shows the raw data in an example embodiment. In the example embodiment, the candidate models may be used to predict the added value of industrial enterprises above a designated size, and the raw data set may include data related to the industrial added value.

TABLE 1

In the example embodiment, the raw data may include at least one of: industrial electricity consumption, industrial enterprise income tax, industrial enterprise value-added tax, the PMI index, automobile output, electrolytic aluminum output, aluminum product output, output of ten nonferrous metals, aluminum oxide output, steel product output, cement output, power generation, the number of individual industrial and commercial households, the number of wholesale and retail enterprises, the number of lodging and catering enterprises, the number of construction enterprises, the number of agriculture, forestry, animal husbandry and fishery enterprises, the number of manufacturing enterprises, the number of leasing and business service enterprises, and the year-on-year growth rate of industrial added value. Breakdown items of the above may also be included. Optionally, the raw data is not limited to the above data categories. Optionally, the raw data may include annual data, quarterly data, monthly data, and other data. Optionally, the raw data may include aperiodic data.

As shown in fig. 2, in S210, the raw data may be analyzed and collated to obtain the original data set. The raw data may be analyzed with descriptive statistics, and the results of the analysis may be presented visually, so as to strengthen the user's understanding of the raw data and facilitate the construction of a suitable model.

FIG. 3 illustrates a data histogram of raw data in an example embodiment. FIG. 4 shows a data density distribution diagram of raw data in an example embodiment. FIG. 5 illustrates a box plot of raw data in an example embodiment.

Descriptive statistics include the maximum, minimum, median, and quartiles of the raw data, and are used to analyze the distribution and structure of the raw data. Graphical descriptive statistics may also be used to analyze the distribution of the data.
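As a small illustration of these summary statistics, the following sketch uses only the Python standard library. The function name `describe`, the sample values, and the choice of the "inclusive" quartile method are assumptions for the example, not details taken from the source.

```python
from statistics import median, quantiles


def describe(values):
    """Return the minimum, quartiles, median and maximum of one indicator."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "max": max(values),
    }


# Hypothetical monthly values for a single indicator.
cement_output = [12.0, 15.0, 11.0, 18.0, 14.0, 16.0, 13.0, 17.0, 19.0]
summary = describe(cement_output)
print(summary)
```

The same five numbers are what a box plot of the indicator would display.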

Optionally, the descriptive statistics may include statistics of the data distribution of the raw data. As shown in fig. 3, the data distribution of the raw data may optionally be shown using a histogram. In the example embodiment, some data, such as Enterprises, follow an approximately exponential distribution, while other data, such as aluminum oxide and Steels, show bimodal distributions. As shown in fig. 4, the data distribution of the raw data may optionally also be shown using a density distribution diagram, which can be smoother than the histogram. Optionally, descriptive statistics may analyze the skewness of the raw data. As shown in fig. 5, a box plot may optionally be used to show the skewness of the raw data.

FIG. 6 shows a correlation diagram of raw data in an example embodiment.

Further, pairwise correlations between data indexes may be analyzed. The correlation between two data indexes may be a single numerical value. It may also be a set of values; for example, it may be the degree of correlation between the two data indexes at each time node in a series of time nodes. As shown in fig. 6, the correlation between each pair of data indexes may optionally also be represented graphically.

Optionally, in S210, data preprocessing may also be performed on the raw data. Data preprocessing may include data cleansing and feature derivation. Data cleansing may include deleting missing data and outliers from the statistical data. A new index, the new enterprise index, may be derived from the features of newly registered enterprises in various industries (including the numbers of newly added individual industrial and commercial households and of newly added enterprises in wholesale and retail, lodging and catering, construction, agriculture, forestry, animal husbandry and fishery, manufacturing, leasing and business services, and the like). A new index, the industrial resource index, may be derived from the features of various industrial products (including the output of generated electricity, automobiles, electrolytic aluminum, aluminum materials, ten nonferrous metals, aluminum oxide, steel, cement, and the like). A new index, newly added individual industrial and commercial households, may be derived from the highly correlated feature "number of individual industrial and commercial households".

Fig. 7 is a schematic diagram illustrating a leading-relationship analysis between the number of newly added enterprises and the year-on-year growth rate in the example embodiment. FIG. 8 is a schematic diagram illustrating a leading-relationship analysis between industrial resources and the year-on-year growth rate in an example embodiment. Fig. 9 is a schematic diagram illustrating a leading-relationship analysis between the number of newly added individual industrial and commercial households and the year-on-year growth rate in the example embodiment.

As shown in fig. 2, S210 may optionally further include a leading-relationship analysis between the indexes. As shown in figs. 7, 8, and 9, the three indexes of newly added enterprises, industrial resources, and newly added individual industrial and commercial households are highly correlated with the "growth rate of industrial added value (monthly)" of the following month, with Pearson correlation coefficients of 0.98, 0.96, and 0.86, respectively. Meanwhile, the trend relationship between these indexes and the growth rate of industrial added value (monthly) shows that the "newly added enterprises" and "industrial resources" indexes lead the "growth rate of industrial added value (monthly)" to a certain extent.
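One way such a relationship with the following month could be quantified is a Pearson correlation between an index and the target shifted by one month. The sketch below is an assumption about how this might be done, not the procedure used in the example embodiment; the function names and both series are hypothetical, and the toy target is constructed to repeat the index one month later.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lead_correlation(indicator, target, lead=1):
    """Correlate indicator[t] with target[t + lead]."""
    if lead == 0:
        return pearson(indicator, target)
    return pearson(indicator[:-lead], target[lead:])


# Hypothetical monthly series: the target repeats the index one month later.
new_enterprises = [3.0, 5.0, 4.0, 8.0, 6.0, 9.0, 7.0, 10.0]
growth_rate = [2.0] + new_enterprises[:-1]

same_month = lead_correlation(new_enterprises, growth_rate, lead=0)
next_month = lead_correlation(new_enterprises, growth_rate, lead=1)
print(round(same_month, 2), round(next_month, 2))
```

A correlation that is higher at a positive lead than at lead zero is the kind of evidence that would suggest the index moves ahead of the target.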

As shown in fig. 2, in S220, ten candidate models may be determined: linear regression (LR), ridge regression (RIDGE), lasso regression (LASSO), elastic net regression (EN), support vector machine (SVM), random forest (RFR), extremely randomized trees (ETR), XGBoost (XGB), GBDT (GBR), and AdaBoost (ABR). The type and number of the candidate models are not limited thereto.

The original data set obtained in S210 may be divided into a training data set and an evaluation data set. The ten candidate models may be trained using the training data set, and the training results evaluated using the evaluation data set, to obtain the original accuracy indicators of the ten candidate models. Optionally, the original accuracy indicator may be the mean square error (MSE) of the training results.

In S220, optionally, the ten candidate models may be trained on the training data set separated from the original data set, using the preset default training parameters of each candidate model. In the example embodiment, the original accuracy indicators of the ten candidate models may be as follows.

LR: -49.458561 (49.693290)
RIDGE: -49.456994 (49.695623)
LASSO: -47.962319 (49.706692)
EN: -48.747337 (49.954865)
SVM: -81.629725 (49.751904)
RFR: -47.443491 (40.450092)
ETR: -41.751627 (34.196770)
ABR: -42.452201 (37.206723)
GBR: -57.325249 (73.926423)
XGB: -55.308945 (62.236916)

As shown in fig. 2, in S220, optionally, the training data sets and evaluation data sets may be determined using the ten-fold cross-validation method. For example, the original data set may be divided evenly into ten parts; each part in turn is used as the evaluation data set and the rest as the training data set, so that ten pairs of training and evaluation data sets are obtained. The model may be trained separately with the training data set of each pair, and the training result evaluated with the corresponding evaluation data set, to obtain ten mean square errors (MSEs). The ten mean square errors may be analyzed to obtain a statistical diagram of the mean square error of the original accuracy indicator, as shown in fig. 10.
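The comparison of candidate models with default parameters might look like the following sketch. It assumes scikit-learn regressors and the `neg_mean_squared_error` scorer, which reports MSE with a negative sign; the source does not name the library, so this is an illustration only. The data is synthetic, XGBoost is left out because it lives in a separate package, and `random_state` is fixed only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the original data set (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

# Candidate models, otherwise left at their default parameters.
candidates = {
    "LR": LinearRegression(),
    "RIDGE": Ridge(),
    "LASSO": Lasso(),
    "EN": ElasticNet(),
    "SVM": SVR(),
    "RFR": RandomForestRegressor(random_state=7),
    "ETR": ExtraTreesRegressor(random_state=7),
    "ABR": AdaBoostRegressor(random_state=7),
    "GBR": GradientBoostingRegressor(random_state=7),
}

cv = KFold(n_splits=10, shuffle=True, random_state=7)
original_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    # Mean and standard deviation of the ten per-fold scores.
    original_accuracy[name] = (scores.mean(), scores.std())
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))
```

Each printed line has the shape "name: mean (standard deviation)", with scores closer to zero indicating a lower mean square error.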

As shown in fig. 2, in S230, optionally, the evaluation benchmark may be determined according to the original accuracy indicators of the ten candidate models. Optionally, the mean square error of each of the ten candidate models obtained in S220 may be used as the evaluation benchmark. A result calculated from the mean square errors of each of the ten candidate models may also be used as the evaluation benchmark. For example, the mean, maximum, or minimum of the ten mean square errors of each model may be used as the evaluation benchmark.
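A minimal sketch of deriving such summary figures from per-fold mean square errors follows. The per-fold values are hypothetical, and the helper `improves_on`, which compares a tuned MSE with the benchmark mean, is only one assumed way a benchmark could later be used; the source does not specify the comparison.

```python
from statistics import mean


def evaluation_benchmark(fold_mses):
    """Summarise one model's per-fold MSE values into benchmark figures."""
    return {"mean": mean(fold_mses), "min": min(fold_mses), "max": max(fold_mses)}


def improves_on(benchmark, tuned_mse):
    """Assumed rule: a tuned model should beat the benchmark's mean MSE."""
    return tuned_mse < benchmark["mean"]


# Hypothetical per-fold MSE values for two candidate models.
per_fold_mse = {
    "ETR": [30.0, 45.0, 38.0, 52.0, 41.0, 36.0, 47.0, 39.0, 44.0, 48.0],
    "ABR": [33.0, 41.0, 37.0, 49.0, 40.0, 35.0, 46.0, 38.0, 43.0, 48.0],
}
benchmarks = {name: evaluation_benchmark(v) for name, v in per_fold_mse.items()}
print(benchmarks["ETR"], improves_on(benchmarks["ETR"], 30.0))
```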

As shown in fig. 2, in S240, a standard data set may be established from the original data set. Each item of data in the original data set may be standardized so that every index has the same value range. In S240, a Pipeline may be used to perform the standardization.

As shown in fig. 2, in S250, the ten candidate models may be trained using the standard data set and evaluated to obtain ten standard accuracy indexes. The execution process of S250 is similar to S220 and will not be described in detail.

As shown in the exemplary embodiment, in S250, the mean square error of the training results of the 10 models can be as follows:

ScalerLR: -49.458561 (49.693290)
ScalerRIDGE: -48.259370 (49.158078)
ScalerLASSO: -42.222034 (37.662790)
ScalerEN: -47.131189 (41.963870)
ScalerSVM: -46.693213 (30.796697)
ScalerRFR: -46.057714 (35.396472)
ScalerETR: -41.115216 (39.916171)
ScalerABR: -39.881764 (35.669335)
ScalerGBR: -56.378297 (71.365000)
ScalerXGB: -55.310488 (62.237158)

Fig. 11 shows the ten-fold cross-validation results obtained in S250 in the example embodiment. Optionally, the standard accuracy indicator may comprise the ten-fold cross-validation results shown in fig. 11.

As shown in fig. 2, in S260, two pending models may be determined from the ten models according to the standard accuracy indicators. Optionally, the two models with the best standard accuracy indicators may be selected from the ten models as the pending models. As shown in fig. 11, in the example embodiment the AdaBoost (ABR) model has the best MSE, followed by the extremely randomized trees regression (ETR) model. Thus, the AdaBoost (ABR) model and the extremely randomized trees regression (ETR) model may be selected as the two pending models.
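The selection can be reproduced from the figures listed for the standardized run in this example embodiment. The sketch assumes those figures are negated MSE means, so that values closer to zero are better; that reading is an interpretation of the reported numbers rather than something the source states, and the helper name `select_pending` is hypothetical.

```python
# Mean scores as listed for the standardized run in the example embodiment.
# Assumption: they are negated MSE values, so a higher value is better.
standard_accuracy = {
    "ScalerLR": -49.458561,
    "ScalerRIDGE": -48.259370,
    "ScalerLASSO": -42.222034,
    "ScalerEN": -47.131189,
    "ScalerSVM": -46.693213,
    "ScalerRFR": -46.057714,
    "ScalerETR": -41.115216,
    "ScalerABR": -39.881764,
    "ScalerGBR": -56.378297,
    "ScalerXGB": -55.310488,
}


def select_pending(scores, count=2):
    """Return the names of the `count` models with the best (highest) score."""
    return sorted(scores, key=scores.get, reverse=True)[:count]


pending_models = select_pending(standard_accuracy)
print(pending_models)  # ['ScalerABR', 'ScalerETR']
```

Under that reading, the ranking agrees with the choice of AdaBoost and the extreme random tree model described above.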

As shown in fig. 2, in S270, the AdaBoost (ABR) model and the extremely randomized trees regression (ETR) model may be optimized with respect to the evaluation benchmark. For example, in the example embodiment, the main parameters n_estimators and max_depth may be selected for adjustment for the extremely randomized trees (ETR) model, with the following result:

Best (MSE): 30.135471988372101 using {'n_estimators': 20, 'max_depth': 6}

For the AdaBoost (ABR) model, the parameters n_estimators and learning_rate may be selected for adjustment, with the following result:

Best (MSE): 34.960919707149943 using {'learning_rate': 0.3, 'n_estimators': 30}
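A grid search over two such parameters can be sketched as follows, assuming scikit-learn's `GridSearchCV` and `ExtraTreesRegressor`; the source names the grid search algorithm and the parameters but not the library. The candidate grid values and the synthetic data are hypothetical, so the best parameters found here are not expected to match the ones reported above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standard data set (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", ExtraTreesRegressor(random_state=7)),
])

# Hypothetical candidate values; the "model__" prefix addresses the
# regressor step inside the pipeline.
param_grid = {
    "model__n_estimators": [10, 20, 30],
    "model__max_depth": [2, 4, 6],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=7),
)
search.fit(X, y)

best_mse = -search.best_score_  # undo the sign flip of the scorer
print("Best (MSE): %f using %s" % (best_mse, search.best_params_))
```

The same pattern would apply to an AdaBoost regressor by swapping the model step and using `model__n_estimators` and `model__learning_rate` in the grid.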

The results of comparing the optimization results of the two models are shown in table 2.

TABLE 2

Model name	MSE (evaluation data set)
Extreme random tree (ETR)	27.98
AdaBoost (ABR)	30.26

As shown in fig. 2, in S280, the final selected model may be determined from the two optimized models. As shown in Table 2, the extremely randomized trees (ETR) model in the example embodiment fits the sample data better, and therefore the extremely randomized trees (ETR) model may be chosen as the final selected model for the project.

An electronic device 200 according to this embodiment of the present application is described below with reference to fig. 12. The electronic device 200 shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 12, the electronic device 200 is embodied in the form of a general purpose computing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one storage unit 220, a bus 230 connecting different system components (including the storage unit 220 and the processing unit 210), a display unit 240, and the like.

The storage unit 220 stores program code executable by the processing unit 210 to cause the processing unit 210 to perform the methods according to various exemplary embodiments of the present application described herein. For example, the processing unit 210 may perform a method as illustrated in at least one of figs. 1-11.

The storage unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203.

The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 200 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 250. Also, the electronic device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 260. The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to as a "circuit," "module," or "system." Furthermore, the present application may take the form of a computer program product embodied in any tangible expression medium having computer-usable program code embodied in the medium.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a given embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described, but any combination should be considered within the scope of the present specification as long as the combined features do not contradict one another.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the description of the embodiments is intended only to facilitate understanding of the methods of the present application and their core concepts. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and to the scope of application. In view of the above, the description should not be taken as limiting the application.

Claims

1. A method of data processing of a model, comprising:

determining an original data set;

training at least three models by utilizing the original data set, and evaluating to obtain at least three original accuracy indicators;

creating an evaluation benchmark from the at least three original accuracy indicators;

determining a standard data set according to the original data set;

training the at least three models by using the standard data set, and evaluating to obtain at least three standard accuracy indicators;

determining at least two pending models among the at least three models according to the at least three standard accuracy indicators;

performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models;

determining a selected model among the at least two optimized models.

2. The method of claim 1, wherein training at least three models by utilizing the original data set, and evaluating to obtain at least three original accuracy indicators, comprises:

determining a training data set and an evaluation data set from the original data set;

training the at least three models using the training data set; and

evaluating the at least three trained models using the evaluation data set.

3. The method of claim 2, wherein determining a training data set and an evaluation data set from the original data set comprises:

determining the training data set and the evaluation data set by a ten-fold cross-validation method.

4. The method of claim 1, wherein,

the at least three original accuracy indicators comprise mean square errors of the at least three models trained using the original data set;

the at least three standard accuracy indicators comprise mean square errors of the at least three models trained using the standard data set.

5. The method of claim 1, wherein performing parameter optimization on the at least two pending models according to the evaluation benchmark to obtain at least two optimized models comprises:

performing parameter optimization on the at least two pending models by utilizing a grid search algorithm to obtain the at least two optimized models.

6. The method of claim 1, wherein determining the selected model among the at least two optimized models comprises:

evaluating the at least two optimized models to obtain at least two optimization accuracy indicators;

selecting, from the at least two optimized models, the model with the best optimization accuracy indicator as the selected model.

7. The method of claim 6, wherein the at least two optimization accuracy indicators comprise mean square errors of the at least two optimized models.

8. The method of claim 1, wherein the at least three models are selected from the group consisting of linear regression, ridge regression, lasso regression, elastic net regression, support vector machine, random forest, extremely randomized trees, XGBoost, GBDT, and AdaBoost.

9. The method of claim 1, wherein processed data on indicators related to industrial added value is used as the original data set, the method further comprising:

predicting the growth rate of industry above designated size by using the selected model.

10. An electronic device comprising a processor and a memory, and a program executable by the processor stored in the memory, the program, when executed, causing the processor to perform the method of at least one of claims 1-8.
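By way of non-limiting illustration only, the flow recited in claims 1 to 7 might be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the present application: the data are synthetic and have a single feature, the three candidate models are one-feature linear, ridge, and lasso regression in closed form, the standard data set is taken to be a z-score standardization of the original feature, and the evaluation benchmark is taken to be the lowest original mean square error. The names `fit`, `cv_mse`, `standardize`, and `select_model` are hypothetical, and for brevity the standardization is applied once to the whole data set rather than inside each fold.

```python
import random
from statistics import fmean, pstdev

KINDS = ("linear", "ridge", "lasso")  # three candidate models (assumed choice)

def fit(kind, alpha, xs, ys):
    """Closed-form one-feature fit; returns (slope, intercept)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = fmean((x - mx) ** 2 for x in xs)
    sxy = fmean((x - mx) * (y - my) for x, y in zip(xs, ys))
    if kind == "linear":
        slope = sxy / sxx
    elif kind == "ridge":  # L2 penalty shrinks the slope
        slope = sxy / (sxx + alpha)
    else:  # "lasso": L1 penalty soft-thresholds the covariance
        shrunk = max(abs(sxy) - alpha, 0.0)
        slope = (shrunk if sxy > 0 else -shrunk) / sxx
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean square error of a (slope, intercept) model."""
    slope, intercept = model
    return fmean((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

def cv_mse(kind, alpha, xs, ys, folds=10):
    """Ten-fold cross-validated mean square error (folds by index modulo)."""
    scores = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        model = fit(kind, alpha, [xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in test], [ys[i] for i in test]))
    return fmean(scores)

def standardize(xs):
    """Z-score standardization (assumed meaning of 'standard data set')."""
    m, s = fmean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

def select_model(xs, ys, default_alpha=1.0, grid=(0.001, 0.01, 0.1, 1.0)):
    # Original accuracy indicators and the benchmark derived from them.
    original = {k: cv_mse(k, default_alpha, xs, ys) for k in KINDS}
    benchmark = min(original.values())  # assumption: best original MSE
    # Standard accuracy indicators on the standardized data.
    zs = standardize(xs)
    standard = {k: cv_mse(k, default_alpha, zs, ys) for k in KINDS}
    # Keep the two models with the lowest standard MSE as pending models.
    pending = sorted(KINDS, key=standard.get)[:2]
    # Grid search over the penalty strength for each pending model.
    optimized = {}
    for kind in pending:
        best = min(grid, key=lambda a: cv_mse(kind, a, zs, ys))
        optimized[kind] = (best, cv_mse(kind, best, zs, ys))
    # Assumed use of the benchmark: flag models that match or beat it.
    meets = {k: v[1] <= benchmark for k, v in optimized.items()}
    selected = min(optimized, key=lambda k: optimized[k][1])
    return {"original": original, "benchmark": benchmark,
            "standard": standard, "pending": pending,
            "optimized": optimized, "meets_benchmark": meets,
            "selected": selected}

# Synthetic data for demonstration only.
rng = random.Random(0)
xs = [rng.uniform(0, 100) for _ in range(100)]
ys = [3 * x + 5 + rng.gauss(0, 1) for x in xs]
result = select_model(xs, ys)
print(result["pending"], result["selected"])
```

In this sketch the default penalty of 1.0 is included in the search grid, so the optimized mean square error of each pending model cannot exceed its standard mean square error. A practical implementation would more likely rely on an established machine learning library and multi-feature models such as those listed in claim 8.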