CN104504475A

CN104504475A - AR*-SVM (support vector machine) hybrid modeling based haze time series prediction method

Info

Publication number: CN104504475A
Application number: CN201410837471.8A
Authority: CN
Inventors: 李卫民; 张礼名; 周扬; 王盛; 毛敏娟
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2014-12-24
Filing date: 2014-12-24
Publication date: 2015-04-08

Abstract

The invention relates to an AR*-SVM (support vector machine) hybrid modeling based haze time series prediction method. The method includes: firstly, establishing AR* models for a haze time series; secondly, performing modeling of an AR*-SVM hybrid model on an original series and an innovation series acquired from the AR* models by applying an SVM module; wherein the AR*-SVM hybrid model acquires linear and nonlinear parts of a haze time series stream through the AR* and SVM models, and the AR* and SVM models are combined to improve modeling and prediction performances of the whole haze time series stream. The AR* models and the SVM models are combined, and different aspects of hidden patterns in the time series stream are captured through the models, so that degree of fitting of the models is improved, prediction accuracy of the haze series is improved, and tests prove that the hybrid modeling method has better results than the two methods applied independently.

Description

Haze time series prediction method based on AR*-SVM hybrid modeling

Technical field

The present invention relates to a haze prediction method, and in particular to a haze time series prediction method based on AR*-SVM hybrid modeling.

Background technology

A multi-factor haze time series stream describes how the studied object develops and changes over a period of time. Predicting the characteristic values of such a stream means starting from a set of measured time series of haze indices and pollution-source indices for the studied object, applying mathematical analysis to find the variation characteristics, development trends and regularities of the data, and then estimating the state of the object at some future moment. In this way, all factors that influence the studied object are described together over time. Because a multi-factor haze time series stream is itself noisy, unstable and chaotic, it is very difficult to extract all the information contained in the historical data, and it is therefore not easy to establish a functional relationship between future values and the historical record.

A haze time series stream is a non-stationary sequence. It is very difficult to map a non-stationary sequence onto a suitable linear model, so predictions based on time series models such as the autoregressive moving average (ARMA) model are usually unsatisfactory. A large body of research shows that the volatility of time series streams exhibits conditional heteroskedasticity: the variance not only changes over time, but large and small fluctuations also tend to concentrate in separate periods, a feature known as volatility clustering. This phenomenon also appears in fields such as finance, electric power and weather. GARCH is the generalized ARCH model, and the purpose of GARCH modeling is to understand and model the volatility of a time series stream. By modeling the time-varying conditional variance, the GARCH model predicts variances and covariances accurately and handles excess kurtosis and the volatility clustering described above well.

For a haze time series stream, it is difficult to build a suitable GARCH model, and predictions based on it are also unsatisfactory. Because GARCH has many shortcomings in modeling haze data, alternative methods such as nonlinear models have been proposed for this task to improve modeling and forecasting.

The present invention refers to the AR, ARMA, ARIMA, ARCH and GARCH models collectively as AR* models. For a time series stream, it is difficult to build a suitable AR* model, and predictions based on it are also unsatisfactory. Because the linear time series models AR* have many shortcomings in modeling time series streams, alternative methods such as nonlinear models have been proposed for this task to improve modeling and forecasting.

Neural network models are widely applied, and in recent years more and more researchers have used them to predict the trend of time series streams. Some researchers have compared the prediction performance of neural network models and linear models on time series streams using data from industry, finance, weather and microeconomics, and the experimental results show that neural network techniques have a clear advantage over linear models. The main advantage of a neural network model is its nonlinear modeling ability, and many studies provide their own experiments showing that nonlinear neural network models predict time series streams more accurately than linear models.

The support vector machine (SVM) is a learning algorithm based on the structural risk minimization principle, proposed by Vapnik et al. to better solve practical problems involving small samples, nonlinearity and high dimensionality. The SVM method performs prediction by linear regression in a high-dimensional feature space, adds almost no computational complexity, and avoids the curse of dimensionality that can arise when raising the dimension. It also avoids problems of methods such as artificial neural networks, including network structures that are hard to determine, over-learning and under-learning, and local minima, and it is regarded as the best current theory for small-sample classification and regression problems.

SVM performs better than AR* time series stream models. However, for a given haze time series stream it is hard to judge whether it is a purely linear process or a nonlinear one, so it is hard to select a suitable model for fitting and modeling. Although the literature reports many methods applied to time series stream prediction with fairly accurate results, haze time series streams contain many unstable factors, so SVM and other neural network techniques are not the best models for predicting them.

Summary of the invention

The object of the present invention is to address the defects of the prior art by providing a haze time series prediction method based on AR*-SVM hybrid modeling, so as to obtain better haze time series predictions.

To solve the above technical problem, the present invention adopts the following technical solution:

A haze time series prediction method based on AR*-SVM hybrid modeling, with the following operation steps:

In the first step, an AR* model is built for the haze time series stream: the order of the model is identified first, the parameters of the AR* model are determined and estimated, and the AR* model is finally used to analyze the linear part of the stream data. The information of this linear part is obtained by using the AR* model to derive the innovation sequence {ε_{t}} of the time series, which contains the statistical and volatility information of the time series stream. Using it as part of the AR*-SVM hybrid model not only reduces the noise level but also improves prediction accuracy by capturing the statistical and volatility information of the time series stream;

In the second step, an SVM model is applied to the original sequence and to the innovation sequence obtained from the AR* model to build the AR*-SVM hybrid model. The hybrid model AR*-SVM thus obtains the linear and nonlinear parts of the haze time series stream through the AR* and SVM models respectively, and combines them to improve the modeling and prediction performance for the whole haze time series stream.
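The first step can be illustrated with a minimal sketch. It assumes the simplest member of the AR* family, an AR(1) model without intercept fitted by least squares, purely for illustration; the function names `fit_ar1` and `innovations` and the toy series are hypothetical and do not come from the patent.

```python
def fit_ar1(x):
    """Least-squares estimate of a in x_t = a * x_{t-1} + eps_t (assumed AR(1), no intercept)."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


def innovations(x, a):
    """Innovation sequence eps_t = x_t - a * x_{t-1} for t = 1 .. len(x) - 1."""
    return [x[t] - a * x[t - 1] for t in range(1, len(x))]


# Hypothetical toy series, not data from the patent.
series = [1.0, 2.0, 4.0, 8.0]
a_hat = fit_ar1(series)
eps = innovations(series, a_hat)
```

In the hybrid scheme, a sequence such as `eps` would be passed to the SVM together with the original series; a higher-order ARMA or GARCH fit would replace `fit_ar1` in practice.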

Preferably, the AR* model comprises linear models such as the autoregressive moving average (ARMA) model, the autoregressive integrated moving average (ARIMA) model and the generalized autoregressive conditional heteroskedasticity (GARCH) model, each of which is as follows:

1) ARMA is the autoregressive moving average model:

x_{t} = Σ_{m = 1}^{p} a_{m} x_{t - m} + ϵ_{t} + Σ_{n = 1}^{q} b_{n} ϵ_{t - n}

where x_{t} is the observation at time t, x_{t-m} is the observation at time t-m, (a_{1}, a_{2}, ..., a_{p}) are called the autoregressive coefficients, (b_{1}, b_{2}, ..., b_{q}) are called the moving average coefficients, {ε_{t}} is a white noise sequence, also called the innovation sequence, and (p, q) is the order of the ARMA model. To build an ARMA(p, q) model of a time series, the values of p and q must first be determined.

The orders p and q of the ARMA(p, q) model can be determined by calculating the AIC. The AIC, or Akaike information criterion, is a standard for measuring how well a statistical model fits. The AIC value is calculated for different values of p and q for the sequence, and the AR* model chosen for the sequence is the one with the minimum AIC value. The AIC is calculated as follows:

AIC (p, q) = \ln {\hat{σ}}_{α}^{2} (p, q) + 2 (p + q) / N

where {\hat{σ}}_{α}^{2} is the estimate of the noise variance and N is the length of the sequence.
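As a rough illustration of order selection, the sketch below evaluates AIC(p, q) = ln(σ̂²) + 2(p + q)/N, the formula given in embodiment one, and picks the order with the smallest value. The residual variances are hypothetical placeholders, and `aic` and `select_order` are illustrative names rather than part of the patent.

```python
import math


def aic(sigma2_hat, p, q, n):
    """AIC(p, q) = ln(sigma2_hat) + 2 * (p + q) / n."""
    return math.log(sigma2_hat) + 2 * (p + q) / n


def select_order(residual_variances, n):
    """Return the (p, q) pair with the smallest AIC.

    residual_variances maps (p, q) to the estimated noise variance of the fitted model.
    """
    return min(
        residual_variances,
        key=lambda pq: aic(residual_variances[pq], pq[0], pq[1], n),
    )


# Hypothetical residual variances for three candidate orders (not from the patent).
candidates = {(1, 1): 1.30, (2, 1): 1.29, (2, 2): 1.10}
best = select_order(candidates, n=75)
```

The penalty term 2(p + q)/N grows with model size, so a higher order is only selected when it lowers the residual variance enough to compensate.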

2) ARIMA is the autoregressive integrated moving average model:

If the d-th order difference y_{t} = (1 - B)^{d} x_{t} of the haze time series {x_{t}} is a stationary ARMA(p, q) sequence, where B is the one-step delay operator that shifts the current sequence value back by one time step, i.e. B x_{t} = x_{t-1}, and d ≥ 1 is an integer, then {x_{t}} is called an autoregressive integrated moving average (ARIMA) process of orders p, d and q, denoted {x_{t}} ~ ARIMA(p, d, q);
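The differencing operator (1 - B)^d can be sketched in a few lines. This is an illustrative helper only; the name `difference` and the toy series are assumptions.

```python
def difference(x, d=1):
    """Apply (1 - B)^d to x, where B x_t = x_{t-1}; each pass shortens the series by one."""
    y = list(x)
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y


# Hypothetical toy series with a quadratic trend.
x = [1, 4, 9, 16, 25]
first = difference(x, d=1)
second = difference(x, d=2)
```

Here the second difference is constant, which is the kind of stationary sequence an ARMA(p, q) model would then be fitted to.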

3) GARCH is the generalized autoregressive conditional heteroskedasticity model:

y_{t} = f(t - 1, X) + ε_{t}

σ_{t}^{2} = ω + Σ_{i = 1}^{p} a_{i} ϵ_{t - i}^{2} + Σ_{j = 1}^{q} β_{j} σ_{t - j}^{2}

The first equation, for y_{t}, is the mean equation of the sequence X with error term ε_{t}. The second equation gives σ_{t}^{2}, the one-period-ahead forecast variance based on past information, and consists of three parts: first, the mean term ω; second, the ARCH (autoregressive conditional heteroskedasticity) terms ϵ_{t-i}^{2}, the lagged squared residuals of the mean equation, which measure the volatility information obtained from earlier periods; third, the GARCH (generalized autoregressive conditional heteroskedasticity) terms σ_{t-j}^{2}, the forecast variances of earlier periods. Here a_{i} and β_{j} are parameters.
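The variance recursion can be traced numerically with a small sketch. The start-up rule (using a fixed `init` value wherever a lagged term falls before the start of the data) is an assumption made here for illustration, as are the parameter values and the name `garch_variance`.

```python
def garch_variance(eps, omega, a, beta, init):
    """sigma2[t] = omega + sum_i a[i-1] * eps[t-i]^2 + sum_j beta[j-1] * sigma2[t-j].

    Terms with a negative time index fall back to `init` (an assumed start-up value).
    """
    sigma2 = []
    for t in range(len(eps)):
        arch = sum(
            a_i * (eps[t - i] ** 2 if t - i >= 0 else init)
            for i, a_i in enumerate(a, start=1)
        )
        garch = sum(
            b_j * (sigma2[t - j] if t - j >= 0 else init)
            for j, b_j in enumerate(beta, start=1)
        )
        sigma2.append(omega + arch + garch)
    return sigma2


# Hypothetical residuals and GARCH(1,1) parameters, not values from the patent.
variances = garch_variance([1.0, 2.0, 0.5], omega=0.1, a=[0.2], beta=[0.5], init=1.0)
```

The large residual at the second position raises the following conditional variance, which is how the model reproduces volatility clustering.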

The AR*(p, q) model in the AR*-SVM framework can be determined by the procedure above.

Preferably, the support vector machine (SVM) is as follows:

The theoretical basis of the SVM is statistical learning theory. Like multilayer perceptron networks and radial basis function networks, it can be used for pattern classification and nonlinear regression. Its core idea is to map samples from the input space into a high-dimensional feature space through a kernel transformation and to find the optimal separating surface in that space, thereby distinguishing the samples. The choice of kernel type, and of the related parameters once the kernel is fixed, is therefore key to SVM performance. Because there is not yet an effective way to construct a suitable kernel for a particular problem, standard kernels such as the polynomial kernel, the RBF kernel and the perceptron kernel are mostly used in practice. The RBF kernel is a general-purpose kernel that can be adapted to samples of arbitrary distribution by adjusting its parameters;

Given training samples (x_{i}, y_{i}), i = 1, 2, ..., l, with x_{i} ∈ R^{m} and y_{i} ∈ R, where x_{i} is the input vector, y_{i} is the corresponding output value and l is the number of samples, support vector regression maps the data x_{i} into a high-dimensional feature space G through a nonlinear mapping ξ and performs linear regression in that space:

y = g(x) = σ^{T} ξ(x) + b

where σ is the weight vector of the hyperplane and b is the bias term;

The analytic expression of the linear regression hyperplane determined by the SVM is as follows:

f (x) = Σ_{i = 1}^{l} ({\bar{α}}_{i}^{*} - {\bar{α}}_{i}) K (x_{i}, x) + \bar{b}

where f(x) is the decision function, i = 1, 2, 3, ..., l, l is the number of training samples, K (x_{i}, x) is the Gaussian radial basis function kernel, {\bar{α}}_{i}^{*} and {\bar{α}}_{i} are the optimal solution of the dual problem, and \bar{b} is the threshold.
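A minimal sketch of evaluating this decision function is given below. It assumes the common parametrization K(u, v) = exp(-γ‖u - v‖²) for the Gaussian RBF kernel, which the text does not spell out; the support vectors, dual coefficients and the names `rbf_kernel` and `svr_predict` are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian RBF kernel K(u, v) = exp(-gamma * ||u - v||^2) (assumed parametrization)."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """f(x) = sum_i dual_coefs[i] * K(x_i, x) + b, with dual_coefs[i] = alpha_i* - alpha_i."""
    return sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs)
    ) + b


# Hypothetical support vectors and coefficients, not fitted values.
svs = [[0.0], [1.0]]
coefs = [2.0, -1.0]
y0 = svr_predict([0.0], svs, coefs, b=1.0, gamma=1.0)
```

Only samples with a non-zero difference between the two dual variables contribute to the sum, which is why the prediction depends on the support vectors alone.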

Compared with the prior art, the present invention has the following evident substantive features and notable technical progress:

The present invention overcomes the inability of AR* models to perform nonlinear modeling on time series streams and also compensates for the weakness of the SVM in predicting linear sequences. The proposed AR*-SVM hybrid modeling method draws on the strengths of both classes of model and is suited to both linear and nonlinear modeling, which is a sound strategy for practical applications. The invention combines an AR* class model with an SVM model so that the two capture different aspects of the hidden patterns in the time series stream, which improves the goodness of fit and the prediction accuracy for haze sequences. Tests also show that the hybrid modeling method gives better results than either method used alone. The invention compares AR*-SVM with the generalized regression neural network (GRNN) and SVM models: the predicted values are compared with the actual values of the target data, and MAE, RMSE and MAPE are calculated for each to give the test results.

Description of the drawings

Fig. 1 is a structural schematic diagram of the AR*-SVM model.

Fig. 2 shows the PM2.5 data for Shanghai haze from January to March.

Fig. 3 shows the AQI index data for Shanghai over the 24 hours of September 17.

Embodiment

The preferred embodiments of the present invention are described below with reference to the accompanying drawings:

The present invention proposes the AR*-SVM hybrid modeling method, which draws on the advantages of both the AR* model and the SVM model and is suited to both linear and nonlinear modeling; this is a sound strategy for practical applications. The invention combines an AR* class model with an SVM model, giving the AR*-SVM hybrid modeling method, so that the two classes of model capture different aspects of the hidden patterns in the time series stream.

AR* model is the general name for the ARMA, ARIMA, ARCH and GARCH models, which include linear models such as the autoregressive moving average (ARMA) model, the autoregressive integrated moving average (ARIMA) model and the generalized autoregressive conditional heteroskedasticity (GARCH) model. The SVM model is the support vector machine model.

The SVM technique describes the nonlinear function between the historical observations (X_{t-1}, X_{t-2}, ..., X_{t-k}, ε) and the future value X_{t}; the relation between them can be expressed as:

X_{t} = F(X_{t-1}, X_{t-2}, ..., X_{t-k}, ε)

where ε is the parameter vector and F denotes the function formed by the SVM and its connection weights. Fig. 1 shows the architecture of the hybrid model AR*-SVM; the input data of the SVM model comprise the original stream data and the innovation sequence derived by the AR* model. The mathematical expression relating the input vector (Y_{i-1}) and the output vector (Y_{i}) is as follows:

Y_{i} = F(Y_{i-1}, ε),  i = t, t+1, ..., t+l;

The vector (Y_{i-1}) corresponds to (X_{i-1}, X_{i-2}, ..., X_{i-k}) and contains not only the historical observations but also the innovation sequence. Like a nonlinear autoregressive model, the AR*-SVM network adds a component that extracts the volatility and statistical information of the time series stream. If the vector (Y_{i-1}) is m-dimensional, then an m-step prediction is performed.
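One way to assemble such input vectors is sketched below: each target value is paired with the k previous observations and the k previous innovations. The exact layout of the input vector is not specified in the text, so this concatenation, the assumption that the innovation sequence is aligned index by index with the observations, and the name `build_samples` are all illustrative.

```python
def build_samples(x, eps, k):
    """Pair each target x[i] with the k previous observations and k previous innovations."""
    inputs, targets = [], []
    for i in range(k, len(x)):
        lags_x = [x[i - j] for j in range(1, k + 1)]
        lags_eps = [eps[i - j] for j in range(1, k + 1)]
        inputs.append(lags_x + lags_eps)
        targets.append(x[i])
    return inputs, targets


# Hypothetical values; eps is assumed to be aligned index-by-index with x.
x = [1.0, 2.0, 3.0, 4.0]
eps = [0.1, 0.2, 0.3, 0.4]
inputs, targets = build_samples(x, eps, k=2)
```

The resulting `inputs` and `targets` are in the row-per-sample form that a support vector regression routine would typically expect for training.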

The test data of the present invention are haze index data. The prediction is evaluated by the mean absolute error (MAE), the root mean square error (RMSE) and the mean absolute percentage error (MAPE), where X is the given time series stream and Y is the predicted value. The smaller the difference between the predicted values and the given sequence values, the more accurate the prediction; therefore, the smaller the values of MAE, RMSE and MAPE, the more accurate and effective the method. The formulas are as follows:

MAE = \frac{1}{N} Σ_{t = 1}^{N} | X_{t} - Y_{t} |

RMSE = \sqrt{\frac{1}{N} Σ_{t = 1}^{N} {(X_{t} - Y_{t})}^{2}}

MAPE = \frac{1}{N} Σ_{t = 1}^{N} | \frac{X_{t} - Y_{t}}{Y_{t}} |
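The three measures can be computed directly from their definitions, as in the following sketch. The function names and the sample values are hypothetical; `mape` follows the formula as written above, dividing by the predicted value Y_{t}.

```python
import math


def mae(x, y):
    """Mean absolute error between given values x and predictions y."""
    return sum(abs(xt - yt) for xt, yt in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean square error between given values x and predictions y."""
    return math.sqrt(sum((xt - yt) ** 2 for xt, yt in zip(x, y)) / len(x))


def mape(x, y):
    """Mean absolute percentage error, dividing by the predicted value y_t as in the text."""
    return sum(abs((xt - yt) / yt) for xt, yt in zip(x, y)) / len(x)


# Hypothetical given values and predictions, not results from the patent.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
scores = (mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

All three are zero for a perfect prediction, so lower values indicate a better method when comparing AR*-SVM with GRNN and SVM.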

The comparison is as follows:

Table 1. Comparative analysis of MAE, RMSE and MAPE for each method

Two preferred embodiments of the haze time series prediction method based on the AR*-SVM hybrid model are described below:

Embodiment one:

Referring to Fig. 1, this haze time series prediction method based on the AR*-SVM hybrid model makes full use of the advantages of both models and overcomes the shortcomings of each. The operation steps are as follows:

Preferably, the AR* model comprises linear models such as the autoregressive moving average (ARMA) model, the autoregressive integrated moving average (ARIMA) model and the generalized autoregressive conditional heteroskedasticity (GARCH) model, each of which is as follows:

1) ARMA is the autoregressive moving average model:

x_{t} = Σ_{m = 1}^{p} a_{m} x_{t - m} + ϵ_{t} + Σ_{n = 1}^{q} b_{n} ϵ_{t - n}

The orders p and q of the ARMA(p, q) model can be determined by calculating the AIC. The AIC, or Akaike information criterion, is a standard for measuring how well a statistical model fits; it was founded and developed by the Japanese statistician Hirotugu Akaike, from whom it takes its name. It is built on the concept of entropy and weighs the complexity of the estimated model against how well the model fits the data. The AIC value is calculated for different values of p and q for the sequence, and the AR* model chosen for the sequence is the one with the minimum AIC value. The AIC is calculated as follows:

AIC (p, q) = \ln {\hat{σ}}_{α}^{2} (p, q) + 2 (p + q) / N

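where {\hat{σ}}_{α}^{2} is the estimate of the noise variance and N is the length of the sequence.

2) ARIMA is the autoregressive integrated moving average model:

3) GARCH is the generalized autoregressive conditional heteroskedasticity model:

y_{t} = f(t - 1, X) + ε_{t}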

σ_{t}^{2} = ω + Σ_{i = 1}^{p} a_{i} ϵ_{t - i}^{2} + Σ_{j = 1}^{q} β_{j} σ_{t - j}^{2}

Preferably, the support vector machine (SVM) is as follows:

y = g(x) = σ^{T} ξ(x) + b

where σ is the weight vector of the hyperplane and b is the bias term;

f (x) = Σ_{i = 1}^{l} ({\bar{α}}_{i}^{*} - {\bar{α}}_{i}) K (x_{i}, x) + \bar{b}

Referring to Fig. 2, the test data for this haze time series prediction method based on AR*-SVM hybrid modeling are haze index data for Shanghai over the most recent three months. The data are the PM2.5 data for Shanghai haze from January to March shown in Fig. 2, where the abscissa represents time and the ordinate represents the measured index data.

1) The original haze data are divided into two parts: the first 75 observations serve as the training data of the AR*-SVM model, and the last 10 observations serve as its target data.
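The split in step 1) can be sketched as follows. The helper name `split_series` and the placeholder series are assumptions; the actual measurements from Fig. 2 are not reproduced here.

```python
def split_series(series, n_train, n_target):
    """Return the first n_train values for training and the last n_target values as targets."""
    if n_train + n_target > len(series):
        raise ValueError("series is too short for the requested split")
    return series[:n_train], series[len(series) - n_target:]


# Placeholder series of 85 values standing in for the measured haze index data.
placeholder = list(range(85))
train, target = split_series(placeholder, n_train=75, n_target=10)
```

Keeping the target observations at the end of the series means the model is always evaluated on values that come after everything it was trained on.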

2) The training data are processed: their volatility is modeled and the AR* model is built, and the same data are used to determine the SVM model parameters.

For the ARMA model of the processed data, calculating the AIC values of different orders (shown in Table 2) indicates that ARMA(3,3) is the suitable model.

Table 2. AIC values of ARMA models of different orders

Order (p, q)	AIC value
(1, 1)	99.2171
(2, 1)	100.5325
(1, 2)	100.6693
(2, 2)	102.3874
(3, 3)	98.8735
(4, 4)	104.3692

For the GARCH model of the processed data, the AIC values of different orders are calculated in the same way (shown in Table 3), which indicates that GARCH(1,1) is the suitable model.

Table 3. AIC values of GARCH models of different orders

Order (p, q)	AIC value
(1, 1)	115.6939
(2, 1)	117.6939
(1, 2)	116.2857
(2, 2)	118.2970
(3, 3)	121.6797
(4, 4)	124.5786
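The order selection in step 2) amounts to taking the minimum over the tabulated AIC values. The short sketch below uses the values from Tables 2 and 3; the variable names are illustrative.

```python
# AIC values copied from Table 2 (ARMA) and Table 3 (GARCH), keyed by order (p, q).
arma_aic = {
    (1, 1): 99.2171, (2, 1): 100.5325, (1, 2): 100.6693,
    (2, 2): 102.3874, (3, 3): 98.8735, (4, 4): 104.3692,
}
garch_aic = {
    (1, 1): 115.6939, (2, 1): 117.6939, (1, 2): 116.2857,
    (2, 2): 118.2970, (3, 3): 121.6797, (4, 4): 124.5786,
}

# The selected order is the key with the smallest AIC value.
best_arma = min(arma_aic, key=arma_aic.get)
best_garch = min(garch_aic, key=garch_aic.get)
```

This reproduces the choices stated in the text: ARMA(3,3) and GARCH(1,1).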

3) The training data and the AR*-SVM model that has been built are used to predict the target data.

Using the model combination method of Fig. 1, the AR* model built from the training data is combined with the SVM model to predict the data. The different prediction methods are then compared, giving the results shown in Table 1.

Embodiment two:

This embodiment is substantially the same as embodiment one; its particular features are as follows:

As shown in Fig. 3, the AQI in the figure is the air quality index (AQI), a dimensionless index that quantitatively describes air quality. The test data of this embodiment are the haze index data for Shanghai over the 24 hours of September 17, 2014. The original haze data are divided into two parts: the first 14 observations serve as the training data of the AR*-SVM model, and the last 10 observations serve as its target data. The specific steps are the same as in embodiment one.

Tests of short-term prediction on the index data show that, regardless of the size of the sample data, the method proposed by the present invention gives stable predictions.

Table 4. Comparative analysis of MAE, RMSE and MAPE for each method

Using the model combination method of Fig. 1, the AR* and SVM models built from the training data are combined to predict the data. The different prediction methods are then compared, giving the results shown in Table 4.

In summary, compared with traditional prediction methods, the haze time series prediction method based on the AR*-SVM hybrid model provided by the present invention gives more stable prediction results with higher accuracy.

Claims

1. A haze time series prediction method based on an AR*-SVM hybrid model, characterized in that the operation steps are as follows:

In the second step, an SVM model is applied to the original sequence and to the innovation sequence obtained from the AR* model to build the AR*-SVM hybrid model; the hybrid model AR*-SVM obtains the linear and nonlinear parts of the haze time series stream through the AR* and SVM models respectively, and combines them to improve the modeling and prediction performance for the whole haze time series stream.

2. The haze time series prediction method based on the AR*-SVM hybrid model according to claim 1, characterized in that: the AR* model comprises linear models such as the autoregressive moving average (ARMA) model, the autoregressive integrated moving average (ARIMA) model and the generalized autoregressive conditional heteroskedasticity (GARCH) model, each of which is as follows:

1) ARMA is the autoregressive moving average model:

where x_{t} is the observation at time t, x_{t-m} is the observation at time t-m, (a_{1}, a_{2}, ..., a_{p}) are called the autoregressive coefficients, (b_{1}, b_{2}, ..., b_{q}) are called the moving average coefficients, {ε_{t}} is a white noise sequence, also called the innovation sequence, and (p, q) is the order of the ARMA model; to build an ARMA(p, q) model of a time series, the values of p and q must first be determined;

The orders p and q of the ARMA(p, q) model can be determined by calculating the AIC; the AIC, or Akaike information criterion, is a standard for measuring how well a statistical model fits; the AIC value is calculated for different values of p and q for the sequence, and the AR* model chosen for the sequence is the one with the minimum AIC value; the AIC is calculated as follows:

where {\hat{σ}}_{α}^{2} is the estimate of the noise variance and N is the length of the sequence;

2) ARIMA is the autoregressive integrated moving average model:

3) GARCH is the generalized autoregressive conditional heteroskedasticity model:

y_{t} = f(t - 1, X) + ε_{t}

The AR*(p, q) model in the AR*-SVM framework is determined by the procedure above.

3. The haze time series prediction method based on the AR*-SVM hybrid model according to claim 1, characterized in that: the support vector machine (SVM) is as follows:

The theoretical basis of the SVM is statistical learning theory; like multilayer perceptron networks and radial basis function networks, it can be used for pattern classification and nonlinear regression; its core idea is to map samples from the input space into a high-dimensional feature space through a kernel transformation and to find the optimal separating surface in that space, thereby distinguishing the samples; the choice of kernel type, and of the related parameters once the kernel is fixed, is therefore key to SVM performance; because there is not yet an effective way to construct a suitable kernel for a particular problem, standard kernels such as the polynomial kernel, the RBF kernel and the perceptron kernel are mostly used in practice; the RBF kernel is a general-purpose kernel that can be adapted to samples of arbitrary distribution by adjusting its parameters;

y = g(x) = σ^{T} ξ(x) + b

where σ is the weight vector of the hyperplane and b is the bias term;

where f(x) is the decision function, i = 1, 2, 3, ..., l, l is the number of training samples, K (x_{i}, x) is the Gaussian radial basis function kernel; {\bar{α}}_{i}^{*} and {\bar{α}}_{i} are the optimal solution of the dual problem, and \bar{b} is the threshold.