CN110647980A - Time series prediction method based on GRU neural network

Info

Publication number: CN110647980A
Application number: CN201910883548.8A (filed by Chengdu University of Technology)
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: data, value, neural network, prediction, GRU
Inventors: 柳丽召, 蔡彪, 刘洋
Current assignee: Chengdu University of Technology

Classifications

    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"


Abstract

The invention belongs to the technical field of network information prediction and discloses a time series prediction method based on a GRU neural network, comprising the following steps: collecting the raw data to be predicted; preprocessing the collected raw data; standardizing the preprocessed data; raising the dimension of the original time-series data in code; training the input data with a GRU neural network to obtain and store a trained time-series prediction model; predicting the time-series data with the GRU-SES model to obtain a preliminary predicted value; applying second-order exponential smoothing to the preliminary prediction to obtain the final predicted data value; and outputting the prediction result. The proposed prediction method improves the precision of time-series prediction and is significant for time-series analysis in industrial production and everyday life.

Description

Time series prediction method based on GRU neural network
Technical Field
The invention belongs to the technical field of network information prediction, and particularly relates to a time series prediction method based on a GRU neural network.
Background
Currently, the state of the art commonly used in the industry is as follows. Time series are ubiquitous in daily life and production: everything in the world develops over time, and variable sequences formed along the time dimension exist in fields as varied as weather, finance, traffic, industry and agriculture. Daily temperature changes in weather, passenger-flow changes in traffic, and stock-price changes in finance all take the form of a time series. Some other data in life, such as gene data, can also be converted into a time-series representation. A time series is a special form of random process: a series of data obtained by arranging observed values in time order.
A time series is a sequence of values observed statistically along the time dimension and is a special form of random process. In real life there is a large amount of time-based data in weather, finance, industry, agriculture, transportation and many other industries. Time-series prediction is an important topic in big-data analysis and mining: by analyzing and mining time-series data collected in the past, one finds the regularities hidden in the data and predicts future data. Time-series prediction has great research significance in many fields, since it allows better planning for future development and reduces unnecessary losses.
Time-series prediction draws on multiple disciplines and is a cross-disciplinary study. Early time-series prediction usually employed mathematical statistics, using quantitative or qualitative prediction methods; this generally gives poor prediction efficiency and low accuracy. With the rise of data mining and machine learning in recent years, it has become clear that artificial neural networks perform well in time-series prediction. A recurrent neural network can be trained well on a time series with some regularity and thereby produce a predicted value for the series.
Time-series prediction is an important topic in mathematical statistics and big-data analysis: from collected historical data, hidden regularities are found and mined, and future data are predicted and evaluated. Time-series prediction is not merely a simple forecast of future data; by studying the changes and fluctuations of the data along the time axis, the temporal behavior of the problem is understood in depth, providing a better theoretical basis and methodological support for analysis, decision and processing of the problem. The traditional time-series prediction methods are based on probability-statistical time-series analysis; they have well-developed theories, methods and tools and are widely applied. With the rise of big data and machine learning in recent years, attention has turned to analyzing and modeling time series with neural networks. Data volumes have grown enormously, and how to find and mine more valuable information is the main research problem. Meanwhile, time series are large in quantity and noisy, so time-series prediction research still faces challenges and continues to develop.
The various existing prediction models have different advantages and disadvantages, and no single simple prediction method yields a good prediction result. For example, the function forms used in traditional probability-statistical models are relatively fixed; they obtain good predictions only on idealized data and fail on the more complex data encountered in practice. A great deal of research also shows that artificial neural networks are highly efficient on complex systems but suffer from difficulties such as parameter optimization and local optima.
At the present stage, time-series prediction research falls into three general directions: (1) traditional probability-statistical models; (2) artificial intelligence models; (3) hybrid models.
1. Traditional probability-statistical models are a set of prediction models with a long history of development and corresponding maturity. They take mathematical statistics as their theoretical basis and use functions to model the relations among the data in a time series. They mainly include regression models, exponential smoothing (ES) models, autoregressive moving average (ARMA) models, and the like.
1) The regression model is based on the principles of mathematical statistics: the original data are first processed mathematically into a standard form, then the correlation between the original data and the predicted data is determined, a regression equation is established from that correlation, and the regression equation is used to predict future data. Regression methods mainly divide into unary linear regression, multiple linear regression, and the like.
2) The ES model is grounded in statistical theory and was developed in the 1950s; at first it analyzed the trend of time-series data with a simple computation scheme. Various kinds of parameter estimation were introduced later, giving parameterized ES models. Five trend types are now distinguished: no trend, additive trend, damped additive trend, multiplicative trend, and damped multiplicative trend, together with three seasonal modes: non-seasonal, additive seasonal, and multiplicative seasonal. The most commonly used are the simple ES model (no trend, non-seasonal), the Holt-Winters non-seasonal model (additive trend, non-seasonal), and the Holt-Winters seasonal product model (additive trend, multiplicative seasonal).
3) The ARMA model evolved from the autoregressive (AR) model and the moving average (MA) model; mixing AR and MA improves prediction accuracy at the cost of more complex parameter estimation. In 1970, the book Time Series Analysis: Forecasting and Control by Box and Jenkins elaborated the simulation, estimation, modeling, prediction and control of ARMA models, forming a complete system. The autoregressive integrated moving average model (ARIMA), evolved from ARMA, can effectively mitigate non-stationarity of the time series.
2. Artificial intelligence models have developed rapidly in recent years with the rise of big-data mining and machine learning. Traditional prediction methods do not work well on highly complex, nonlinear time series, whereas artificial intelligence models solve complex systems by computer simulation of natural phenomena or human intelligence. Parameter estimation in an artificial intelligence model lets the computer learn adaptively, continuously adjusting and correcting the parameters to minimize the prediction error. Typical artificial intelligence models mainly include decision trees, Bayesian networks, support vector machines (SVM), artificial neural networks (ANN), and the like.
The decision tree is to use the information gain method in the information theory to find the variable with the maximum information quantity in the input data as a node of the decision tree, and to set up the next layer branch of the decision tree by setting the corresponding threshold, so as to obtain the whole decision tree by recursion. In the learning iteration of the model, the structure and the threshold value of each layer of tree are continuously adjusted, so that a complete decision tree is obtained, and the time series data are predicted.
The Bayesian network is to mine the relation among data in a time sequence, and the theory is a graph mode which reflects the connection probability in each data based on an uncertain knowledge system of graph theory. Nodes in the Bayesian network represent data, directed edges represent relations among the data, and the Bayesian network determines relations among the data and existing rules by continuously adjusting probability measure weights, so that time series data can be predicted.
The support vector machine describes a non-linear model by using a linear model, maps data from a low dimension to a high dimension, and further analyzes and predicts the data by using a linear regression model again in a high dimension space.
Artificial neural networks are computer simulations that are formed to mimic the connections between neurons in a biological neural network. The learning process of the artificial neural network is actually an iterative process of connecting the weight and the structure of the neuron edges and minimizing fitting errors by continuously adjusting the threshold and the activation function among the neurons.
Artificial intelligence prediction also offers many other optimization algorithms, such as particle swarm optimization, artificial immune algorithms, and ant colony optimization, all of which can be applied to time-series prediction.
3. The hybrid model mixes several prediction methods to achieve better prediction accuracy and performance. The traditional probability-statistical models and the artificial intelligence models each have their own flaws, but every algorithm also has its own strengths; combining multiple algorithms effectively can compensate for the shortcomings of each method to the greatest extent and improve the analysis capability of the hybrid model.
In summary, the problems of the prior art are as follows:
(1) the function form used in the existing prediction model adopting the traditional probability statistics is relatively fixed, and a better result cannot be obtained by predicting a high-complexity and nonlinear time sequence in practice;
(2) the existing artificial neural network prediction model has the defects of parameter optimization, local optimization and the like.
The significance of solving these technical problems is as follows. Time-series prediction is an important topic within time-series research and an important application in the field of data analysis: it predicts future data from an analysis of existing data. The original historical data of a time series can be used to predict a single future variable, and observed data can likewise be used to predict some future variable. For example, in commodity inventory, determining the future capacity of a warehouse is very valuable: too little inventory hinders the sale of goods, while too much inventory complicates inventory management and raises its cost. If future inventory can be predicted, this problem is solved well, the inventory capacity is optimized, and the economic benefit is high. Time-series prediction is widely applied in transportation, economics, industry, agriculture and many other industries. Improving the performance of time-series prediction therefore has important guiding significance for management, control and decision-making in the related fields.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a time sequence prediction method based on a GRU neural network.
The invention is realized in such a way that a time series prediction method based on a GRU neural network comprises the following steps:
step one, collecting the raw data to be predicted;
step two, preprocessing the collected raw data by cleaning, integration, conversion, discretization and reduction;
step three, standardizing the preprocessed data with the Z-score method and taking the standardized data as the input data of the neural network;
step four, raising the dimension of the original time-series data in code into the form [[x_1][x_2][x_3]…[x_n]];
step five, building a time-series prediction model with TensorFlow and training the input data with a GRU neural network to obtain and store a trained time-series prediction model;
step six, reading the stored trained time-series prediction model and predicting the time-series data with the GRU-SES model to obtain a preliminary predicted value;
step seven, performing second-order exponential smoothing and inverse standardization on the obtained preliminary prediction data to obtain the final predicted data value;
and step eight, performing the inverse of the data standardization on the prediction result, restoring the predicted value to the original data type, and outputting the result.
Further, in step two, the data preprocessing specifically comprises (see the sketch after this list):
data cleaning: eliminating noise in the raw data and filling in missing values, improving the quality of low-quality records and thus the overall quality of the raw data;
data integration: eliminating redundant data and storing the data uniformly in a file or database;
data conversion: converting the data into the format required by the algorithm;
data reduction: removing the less important feature attributes in the data to obtain a refined data set.
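A minimal preprocessing sketch with pandas, assuming the raw series sits in a single-column file; the file name raw_series.csv and the column name value are illustrative, and linear interpolation stands in for whatever gap-filling rule the data actually calls for:

    import pandas as pd

    df = pd.read_csv("raw_series.csv")            # raw data collected in step one
    df = df.drop_duplicates()                     # integration: remove redundant rows
    df["value"] = df["value"].interpolate()       # cleaning: fill missing points
    df["value"] = df["value"].astype("float32")   # conversion: numeric format
    series = df["value"].to_numpy()               # reduced, refined data set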
Further, in step three, the data standardization with the Z-score method specifically comprises:
first, determining the central value of the data series, computing the (positive or negative) difference between each data point and that central value, and dividing each difference by the average distance to obtain the standardized result;
the Z-score standardization formula is:

S_X = (X_i - X̄) / S (1)

In formula (1), the standardized result is denoted S_X, the mean of the series is denoted X̄, and the standard deviation of the series is denoted S.
Formula (2) computes the mean of the series: each data point X_i in the n-term series is summed and the sum is divided by the number of terms n, giving the arithmetic mean of the series:

X̄ = (1/n) · Σ_{i=1..n} X_i (2)

Formula (3) computes the standard deviation of the series: the mean X̄ is subtracted from each data point X_i, the difference is squared, the arithmetic mean of the squares gives the variance of the series, and its square root gives the standard deviation:

S = √( (1/n) · Σ_{i=1..n} (X_i - X̄)² ) (3)
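A minimal numpy sketch of formulas (1) through (3), together with the inverse operation of formula (11) used later in step eight; the sample values are illustrative:

    import numpy as np

    def z_score(x):
        # formula (2): arithmetic mean; formula (3): standard deviation
        mean, std = x.mean(), x.std()
        return (x - mean) / std, mean, std      # formula (1)

    def inverse_z_score(s, mean, std):
        # formula (11): restore the original data scale
        return s * std + mean

    x = np.array([3.0, 5.0, 7.0, 9.0])
    s, m, sd = z_score(x)
    restored = inverse_z_score(s, m, sd)        # equals x up to rounding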
Further, in step five, the time-series prediction model specifically comprises:
in the time-series prediction model, inputs denotes the input values and rnn_layer is the defined neural network; within the rnn_layer module, weights are the training weight values and biases the training bias values; rnn is the core unit of the GRU neural network, and gru_cell inside rnn is a neuron of the GRU neural network, which also contains the weight and bias values of the training gates and candidate states; the input values in the inputs module flow into the rnn_layer module for the learning and training of the neural network.
Further, in step five, training the input data with the GRU neural network specifically comprises:
1) the GRU neural network is specifically the GRU neural network module in TensorFlow; the number of hidden units of the GRU neural network is chosen as 8, i.e. there are 8 hidden neurons in the GRU; the output value of a hidden neuron is defined by the formula:

y = W·x + b (4)

In formula (4), y denotes the output value, x the input value, W the weight value, and b the bias value;
2) when training the input data with the GRU neural network, a loss function evaluates the gap between the network's target and its actual output; the smaller the function value, the smaller the difference between the actual output and the target output, i.e. the more appropriate the weights;
the loss function used in training is defined by the formula:

loss = (1/n) · Σ_{i=1..n} (y_i - y_real)² (5)

In formula (5), loss denotes the loss value, y_i the output value of a neuron at a given moment, and y_real the true raw data; the true value y_real is subtracted from the neuron's output y_i, the difference is squared, and finally the average is taken; the resulting variance is the loss value of the loss function;
3) the loss function is optimized with an optimizer in TensorFlow using the Adam algorithm; the learning rate of the gradient descent is defined as 0.003, optimizing the descent gradient of the training model at each step.
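A minimal TensorFlow/Keras sketch of steps four and five, not the patent's exact code: the 8 hidden units, the mean-squared-error loss of formula (5), and the Adam learning rate 0.003 follow the text, while the window length, epoch count, and the random stand-in data are illustrative assumptions:

    import numpy as np
    import tensorflow as tf

    window = 10  # assumed look-back length; the text does not fix this value

    model = tf.keras.Sequential([
        tf.keras.layers.GRU(8, input_shape=(window, 1)),  # 8 hidden neurons
        tf.keras.layers.Dense(1),                         # y = W·x + b, formula (4)
    ])
    # mean squared error is the loss of formula (5); Adam optimizes it (lr = 0.003)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
                  loss="mse")

    # x_train: windows of the standardized series; y_train: the next value of each
    x_train = np.random.rand(200, window, 1).astype("float32")
    y_train = np.random.rand(200, 1).astype("float32")
    model.fit(x_train, y_train, epochs=10, verbose=0)
    model.save("gru_model.keras")  # store the trained model for step six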
further, in the seventh step, the obtained preliminary prediction data is subjected to secondary exponential smoothing processing and inverse standardization processing; the obtaining of the final predicted data value specifically includes:
1) calculating time sequence by using quadratic exponential smoothing formulaAndin which the time sequence is
Figure BDA0002206611300000053
And
Figure BDA0002206611300000054
generally selecting first data in an original data column; computing
Figure BDA0002206611300000055
Using the input value x at time ttAnd at time t-1
Figure BDA0002206611300000056
To obtain
Figure BDA0002206611300000057
A value of (d); computing
Figure BDA0002206611300000058
Using time t
Figure BDA0002206611300000059
And value of (d) and time t-1
Figure BDA00022066113000000510
To obtain a value of
Figure BDA00022066113000000511
A value of (d);
the quadratic exponential smoothing formula is as follows:
Figure BDA00022066113000000512
alpha is a smoothing coefficient (0, 1) (6)
2) Selecting a smoothing coefficient alpha, wherein when the change of the original time sequence data has an obvious change trend, the alpha is generally selected to be between 0.3 and 0.5; when the change of the original time sequence data is relatively smooth, the value of alpha is generally selected to be between 0.1 and 0.3;
3) based on calculated time sequence series
Figure BDA00022066113000000513
And
Figure BDA00022066113000000514
using the formula (7) and the formula (8) to calculate ATAnd BT(ii) a The x of the predicted T period after the quadratic smoothing exponential processing is calculated using equation (9)t+TThe predicted value of (2);
Figure BDA00022066113000000515
Figure BDA00022066113000000516
xt+T=AT-BTt, T is the number of future forecasts (9)
4) And (3) performing arithmetic mean calculation on the predicted value calculated by the GRU neural network and the predicted value calculated by the quadratic smoothing index:
the arithmetic mean calculated by equation (10) is the final predicted data value predicted using the GRU-SES model.
Further, in step eight, the inverse operation of the data standardization specifically comprises:
performing the inverse of the standardization formula to restore the prediction to the original data type, with the mathematical formula:

x_iT = X_iT · S + X̄ (11)

In formula (11), x_iT is the predicted value restored to the original data type, X_iT is the predicted value computed by the GRU-SES model, S is the standard deviation of the original data computed by formula (3), and X̄ is the arithmetic mean of the original data computed by formula (2).
In summary, the advantages and positive effects of the invention are:
the prediction method provided by the invention improves the precision of time sequence prediction, and has important significance for time sequence analysis in industrial production or actual life.
Through the dimension-raising of the data, the invention can better uncover details in the data, and at the same time the artificial neural network can be trained better and faster, yielding better results.
The GRU neural network module of TensorFlow is used as the neural network to train the data, giving high flexibility, portability, and strong performance.
The invention optimizes the loss function with the Adam algorithm, which further optimizes the learning rate: the learning rate of each parameter is adjusted dynamically using the first and second moment estimates of the gradient, and since Adam applies bias correction at every step, the iterative learning rate stays within a definite range, keeping the parameters stable. Optimizing the loss function with the Adam algorithm also handles sparse gradients and non-stationary targets, and different adaptive learning rates can be computed for different parameters.
The adoption of a hybrid model therefore improves the performance of time-series prediction more than using any single prediction model alone.
The invention obtains a more accurate prediction result by applying second-order exponential smoothing to the output of the GRU neural network prediction.
Drawings
Fig. 1 is a flowchart of a time series prediction method based on a GRU neural network according to an embodiment of the present invention.
Fig. 2 is a flowchart of a GRU-SES model provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of a timing prediction training model according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of time series classification provided by an embodiment of the present invention;
in the figure: (a) a stationary time series; (b) a non-directional time sequence; (c) a trend-type time series; (d) intervening event-type time sequences.
Fig. 5 is a schematic diagram of a conventional prediction method according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an artificial intelligence prediction method according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of machine learning classification according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of a neuron operational model according to an embodiment of the present invention.
Fig. 9 is a schematic diagram of an exemplary recurrent neural network model provided by an embodiment of the present invention.
Fig. 10 is a schematic diagram of a feedforward neural network expansion model provided in an embodiment of the present invention.
Fig. 11 is a schematic diagram of a recurrent neural network expansion model provided in an embodiment of the present invention.
FIG. 12 is a schematic diagram of a BPTT forward propagation model provided by an embodiment of the invention.
FIG. 13 is a schematic diagram of a BPTT backpropagation model provided by an embodiment of the invention.
Fig. 14 is a schematic diagram of images of tanh and sigmoid functions according to an embodiment of the present invention.
Fig. 15 is a schematic structural diagram of a GRU neural network according to an embodiment of the present invention.
Fig. 16 is a schematic structural diagram of a GRU neural network neuron provided in the embodiment of the present invention.
FIG. 17 is a schematic diagram of an update gate model according to an embodiment of the present invention.
Fig. 18 is a schematic diagram of a reset gate model according to an embodiment of the present invention.
Fig. 19 is a schematic diagram of a model of a pending output value according to an embodiment of the present invention.
Fig. 20 is a schematic diagram of an output value model according to an embodiment of the present invention.
Fig. 21 is a schematic diagram of two-dimensional to three-dimensional conversion provided by the embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The technical scheme and the technical effect of the invention are explained in detail in the following with the accompanying drawings.
As shown in fig. 1, a time series prediction method based on a GRU neural network provided in an embodiment of the present invention includes:
s101, collecting original data needing prediction.
S102, preprocessing the collected raw data by cleaning, integration, conversion, discretization and reduction.
And S103, carrying out standardization processing on the preprocessed data by adopting a Z scoring method, and taking the standardized data as input data of the neural network.
S104, raising the dimension of the original time-series data in code into the form [[x_1][x_2][x_3]…[x_n]], as shown in the sketch after step S108.
And S105, building a time sequence prediction model by using the TensorFlow, training input data by using the GRU neural network, obtaining the trained time sequence prediction model and storing the trained time sequence prediction model.
And S106, reading the stored trained time sequence prediction model, and predicting time sequence data by using the GRU-SES model to obtain a preliminary prediction value.
S107, performing second-order exponential smoothing and inverse standardization on the obtained preliminary prediction data to obtain the final predicted data value.
And S108, performing data standardization inverse operation on the prediction result, reducing the prediction data value into an original data type, and outputting the result.
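A minimal numpy sketch of the dimension-raising of S104, under the assumption that the target form [[x_1][x_2]…[x_n]] is the column (2-D) layout that TensorFlow RNN layers then consume as (batch, time steps, features):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])   # original one-dimensional series
    x2d = x.reshape(-1, 1)               # [[1.], [2.], [3.], [4.]]
    x3d = x2d.reshape(1, -1, 1)          # one batch of n time steps, one feature each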
The technical solution of the present invention will be further described with reference to the following specific examples.
The invention predicts the time series with a GRU (Gated Recurrent Unit) neural network, a structurally optimized recurrent neural network. Because the plain recurrent neural network cannot keep retaining earlier data, its structure was optimized and improved into the LSTM (Long Short-Term Memory) neural network, which avoids defects of the traditional recurrent neural network such as long-range dependency and vanishing gradients. The GRU neural network is a structural optimization of the LSTM neural network; by merging and reducing the gate structures, it further improves the training efficiency of the network.
1. Overview of temporal sequence prediction
(1) Time series
1) Overview of the stochastic Process
A random process studies a random phenomenon as it evolves in time; describing such a phenomenon requires infinitely many random variables. Let T be an infinite set of real numbers. A family of random variables depending on the parameter t ∈ T is called a random process, denoted {X(t), t ∈ T}, where X(t) is a random variable for each t ∈ T. T is called the parameter set. Regarding t as time, X(t) is the state of the process at time t; if X(t_1) = x (a real number), the process is said to be in state x at t = t_1. The set of all possible values of X(t) over all t ∈ T is called the state space of the random process.
2) Overview of time series
A time series is a series of observations made in time order. In reality, much data is presented as time series, such as a company's daily sales and inventory, daily weather forecasts, sunspot eruptions, and so on. The fields that time series touch are quite extensive.
The purpose of time series analysis is to generalize and estimate time series patterns that reflect historical data by analyzing the autocorrelation of time series data, as well as various morphologies, such as trends, seasons, intervening events, and the like.
The time series can be divided into five forms depending on the sequence characteristics and the fluctuation situation, as shown in fig. 4.
A stationary time series whose observations vary between the same fixed level and a fixed region, and which features do not change with time. Without special changes or outliers, it can be reasonably inferred that future observations of such sequences still vary at the same level or region; furthermore, the dependency between successive observations can be used to improve the prediction.
A non-directional time series, when subjected to interference, fluctuates without a fixed direction. The external shocks have cumulative effects on the series, so the series cannot hold a fixed level, which makes estimating a predicted value difficult.
Trend-type time series are generally influenced by long-term factors, so that the average level of the series shows a fixed trend change, but the data scatter variation of each time point is fixed. The average level of this type of sequence changes over time, so it can be assumed that this long-term factor will continuously and fixedly affect the sequence, yielding a sequence predictor.
Seasonal time series, similar fluctuations can be observed at fixed time intervals. Since the average level of the sequence varies periodically, the sequence prediction value can be obtained by assuming that the periodic factor will continuously and fixedly affect the sequence.
Intervening event-type time series, where a few observations in the series behave differently than others due to interference from a single incident. Since the average level of this type of sequence does not vary and a single event is often unpredictable, it can be assumed that the sequence will maintain the average level and variation to obtain a sequence prediction value.
(2) Time series prediction method
1) Conventional prediction method
The traditional prediction methods for time series grew out of mathematical statistics. Common prediction methods divide into quantitative and qualitative ones. Quantitative analysis divides into causal prediction methods and trend prediction methods: causal prediction includes unary linear regression, multiple linear regression, and the like, while trend prediction includes the moving average method, the weighted moving average method, the trend analysis method, and the like. Qualitative analysis includes the Delphi method and expert judgment. The traditional prediction methods are shown in fig. 5.
The regression method in the causal prediction method utilizes the principle of mathematical statistics, firstly, the original data is processed mathematically to obtain the data to be processed which accords with the standard, then the correlation relationship between the original data and the prediction data is determined, so that a relatively perfect regression equation is established according to the relationship, and the regression equation is used for predicting the future data.
The moving average method in the trend prediction method is to perform item-by-item translation according to a time sequence, firstly, sequentially calculate the average number of data in the time sequence according to the order, thereby obtaining the time sequence of the averaged average number, and then, correspondingly predict the original data according to the average number sequence. The method is more accurate in short-term prediction, but is poorer in long-term prediction.
The exponential smoothing method in the trend prediction method is developed on the basis of a moving average method, mainly calculates exponential smoothing values of a time series, and predicts future data through the exponential smoothing values. The method not only keeps the average value of all data, but also keeps the near-term value of the moving average method, and only exerts the influence of slow attenuation on the past data, so that the method is more suitable for prediction of medium and short term.
The Delphi method among the qualitative analysis methods is an anonymous inquiry method with feedback that relies mainly on human subjective judgment: each participant is first consulted by letter according to a set procedure, and all participants submit opinions anonymously, so that future data are predicted from the participants' opinions.
Below, three methods in the trend-analysis family of the traditional prediction methods, namely the moving average method, the weighted moving average method, and the exponential smoothing method, are compared in detail.
The moving average method is a simple trend-analysis method with very wide application. It computes the average of the series over an agreed number of terms, starting from the first term, and then moves backwards to obtain a new, moving-averaged series. The moving average can eliminate the interference of accidental factors in the series, so the long-term trend of the time series can be observed from the new series. The moving average requires a parameter k that determines the length of the time interval; expressed mathematically, the smoothing function is:

M_t = (x_t + x_{t-1} + … + x_{t-k+1}) / k (12)

In the moving average method, the determination of the parameter k is the most critical; generally, k is selected by the following principles:
a. A moderate number of terms. Choosing more terms from the series shows the long-term trend better, but loses more information in the averaging; choosing few terms makes it hard to eliminate the interference of accidental factors.
b. For periodic series, choose the period length or an integer multiple of it. For example, when the series has a time period, the parameter k should be the number of terms in the period or an integer multiple of it; for monthly data, the moving-average parameter should be 12.
c. Prefer an odd k. The result of a moving average best corresponds to the middle term of the original series; if an even k is chosen, the moving-averaged results must be averaged again over two terms to align with the original series.
The weighted moving average method can be viewed as a generalization of the moving average method; it needs the parameter k determining the length of the time interval and the weights ω_i (i = 1, 2, …, k) of the k periods. Expressed mathematically, the smoothing function is:

M_t = (ω_1·x_t + ω_2·x_{t-1} + … + ω_k·x_{t-k+1}) / (ω_1 + ω_2 + … + ω_k) (13)

In the exponential smoothing method the weights are expressed by a smoothing coefficient; it is a weighted-average method that constructs the next period's predicted value jointly from the current period's observed value and the current period's predicted value. The exponential smoothing method decreases the coefficients of earlier data in the series by an exponential rule, giving earlier data smaller weights. When the series has many terms, the influence of the initial value on the overall exponentially smoothed prediction is very limited, so when exponential smoothing is used, the initial smoothed value can be chosen as the first value of the series. The exponential smoothing method requires a decay rate α; expressed mathematically, the smoothing function is:

S_t = α·x_t + (1 - α)·S_{t-1} (14)

The exponential smoothing method eliminates the influence of irregular factors well; it is better suited to series with an inertial trend and less suited to series with a pronounced curvilinear trend.
The moving average method, the weighted moving average method, and the exponential smoothing method are summarized in comparison, as shown in table 1.
Table 1. Comparative summary of the three methods
2) Artificial intelligence prediction method
With the rise of artificial intelligence in recent years, people are also aware of the fact that an artificial intelligence algorithm can be used for time series data with instability, nonlinearity and high complexity to replace the traditional prediction method for time series prediction. The traditional prediction mode is based on the basic theory of mathematical statistics, and the artificial intelligence model adopts a new research method, so that various problems encountered are solved by simulating human thinking by a computer, and more effective and accurate results can be obtained. And the selectable function form in the artificial intelligence model is also more variable and flexible than the function form in the traditional model. The artificial intelligence model can lead a computer to learn iterative parameters by itself, and the parameters in the model are better estimated by continuous adjustment and correction, while the parameter estimation in the traditional model is estimated according to the inherent theory in mathematical statistics.
In the model study of time series prediction, an artificial intelligence model is used as shown in fig. 6.
Decision tree algorithms classify according to the branch structure of a tree; the essential principle is to realize the model by step-by-step judgments and decisions on the feature variables through if-then rules. In the model's iterative loop, branches and leaves are continuously pruned and optimized, and the tree's structure and thresholds are modified so that the decision tree becomes a regular model usable for partitioning and prediction. A decision tree has the advantage of directly generating rules understandable to humans, which neural networks and Bayesian networks cannot; its accuracy in decision-making is also high, making it a very effective algorithm. For time-series prediction, the specific process is: after the decision tree is built, incoming data are judged against the existing decision-tree logic; if a new datum agrees with the judgment of the current node, it enters the next branch, finally producing a result at a leaf node. The main variants of the decision-tree algorithm are C4.5, C5.0, ID3, CART, and the like.
A random forest is composed of many decision trees, each built by a random method, hence the name random forest for the ensemble. The decision trees in a random forest are unconnected; after data enter the random forest, each tree classifies them, the classification results of all trees are tallied, and the classification chosen by the most trees is the final result. Random forests can handle both discrete and continuous data. The main process divides into random sampling and complete splitting: random sampling first samples the input data by rows and columns; complete splitting then splits the sampled data fully, building a decision tree whose leaf nodes either cannot be split further or contain samples belonging to a single class. Random forests offer good multi-class capability, fast training and prediction, good tolerance of faulty data, good handling of high-dimensional data, no overfitting problem, and easy parallelization.
A Bayesian network is a directed acyclic graph together with a conditional probability table. Each node of the graph represents a random variable, and the directed edges represent the dependencies between the random variables. Each entry stored in the conditional probability table corresponds one-to-one to a node of the graph and holds the joint conditional probability of that node with its immediate predecessor nodes. A Bayesian network can be viewed as a nonlinear extension of a Markov chain, and it computes joint probability distributions easily. Bayesian networks are more complex than naive Bayes, and training and constructing them is very involved. Their advantage is that some variables need not be designated as inputs; any set of variables can serve as inputs, and inference can be carried out from the probabilities of the other variable sets.
Support Vector Machines (SVMs) are generalized linear classifiers that can minimize empirical errors and maximize geometric margin. Meanwhile, the SVM can map a low-dimensional vector into a high-dimensional space, and a hyperplane with the maximum interval is established in the space. When data are separated, two hyperplanes which are parallel to each other are established on two sides of the hyperplane, the hyperplane is separated to enable the distance between the two parallel hyperplanes to be maximum, and the larger the distance between the two parallel hyperplanes is, the smaller the total error of the SVM is. The essence of the SVM algorithm is to find a hyperplane that maximizes a value, i.e., the minimum distance, in the input data. The SVM has the advantages of solving the problem of small samples, solving the problem of nonlinearity, well processing a high-dimensional data set and the like.
An artificial neural network (ANN) is a pattern-matching algorithm that simulates a biological neural network; its development to date has been enormous, covering perceptron networks, back-propagation networks, Hopfield networks, convolutional neural networks, recurrent neural networks, and more. A typical neural network consists of an input layer, an output layer, and hidden layers. Each layer is composed of neurons, and each neuron in turn comprises an input unit, an output unit, and a computation unit; what connects the neurons are the weights, and training a neural network means training the weights to be optimal so that the prediction of the whole network is optimal. The main advantages of artificial neural networks are high accuracy, strong parallel processing, strong distributed storage, and strong fault tolerance.
2. Overview of Artificial neural networks
(1) Machine learning
The pioneer of machine learning, Arthur Samuel, defined machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed." Later, in 1997, Tom Mitchell defined machine learning in the book Machine Learning: "The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience," and "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." This short formalism shows how data are collected (E), decisions made (T), and results evaluated (P).
Machine learning is classified into supervised learning, unsupervised learning, and semi-supervised learning according to the nature of the training data set, as shown in fig. 7.
Supervised Learning: every valid input datum has a corresponding output after training; supervised learning is trained on labeled input data. It supports classification, such as image processing, text processing, and spam detection, and regression, such as stock forecasts and house-price forecasts.
Unsupervised Learning: valid input data have no corresponding outputs; unsupervised learning is trained on unlabeled input data and attempts to find "hidden" structures in the data. It supports clustering, such as product grouping, article classification, and outlier detection.
Semi-supervised Learning: only part of the input data have corresponding outputs; semi-supervised learning is trained on partially labeled input data.
(2) Artificial neural network
An Artificial Neural Network (ANN) is a computer-simulated information processing system proposed by simulating a biological Neural Network, and the basic theory is based on the related knowledge of Network topology. The artificial neural network can efficiently process various problems of large concurrency and complex data. After the artificial neurons in the neural network acquire information from the outside, the network structure formed by the connection among the neurons in the neural network is used for training the neural network by using different training algorithms, so that output data meeting requirements are obtained.
An artificial neural network is formed by connecting a large number of neurons, each neuron uses an activation function (activation function) as an output function, and the common activation function is shown in table 2. A connection is made between every two neurons, and a weight (weight) is used as a weight value of the connection signal. The output of the overall neural network is dependent on the structure of the network, the manner in which the network is connected, the activation function, and the weights.
Table 2. Common activation functions
The overall process of the artificial neural network is divided into different stages:
a learning (learning) phase, in which connection is established between neurons, and weights and deviations between neurons are continuously corrected, so that activation functions of neurons can be adjusted.
A recall stage: after the neural network processes the input data, it outputs the corresponding data according to the existing network architecture.
An induction (induction) stage, which is to observe a part of the neural network to deduce the characteristics of the whole neural network, can provide an efficient memory and storage mode.
The operational basis of the whole neural network is the individual neuron; the operational model of a neuron k is shown in fig. 8.
For neuron k, suppose there are input values x_i, i = 0, 1, 2, …, p. The weights of neuron k are denoted w_ik; the magnitude of a weight indicates the strength of the connection between neurons. If a weight is positive, the input x_i excites the neuron; if a weight is negative, x_i inhibits the neuron. The full input of neuron k is X = (x_0, x_1, x_2, …, x_p) with corresponding weights W = (w_0k, w_1k, w_2k, …, w_pk), and the sum of their products is:

net_k = Σ_{i=0..p} w_ik·x_i (15)

Here w_0k of neuron k is also known as the threshold. In analogy with biological neural networks, w_0k is typically set to a negative value and x_0 is typically 1. The weighted sum of the remaining inputs is thus compared against the threshold term x_0·w_0k by subtraction: if the result is greater than or equal to 0, the neuron is excited; if it is less than 0, the neuron is inhibited. The output of the neuron is then computed from net_k as:

y_k = f(net_k) (16)

where f is the activation function, which may be linear or nonlinear, converting net_k into the neuron's output value.
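A minimal numpy sketch of formulas (15) and (16); the input values, weights, and the tanh activation are illustrative:

    import numpy as np

    def neuron_output(x, w, f=np.tanh):
        net = np.dot(w, x)   # formula (15): sum of w_ik * x_i, threshold included
        return f(net)        # formula (16): y_k = f(net_k)

    x = np.array([1.0, 0.5, -0.2])   # x[0] = 1 carries the threshold term
    w = np.array([-0.3, 0.8, 0.4])   # w[0] is the (negative) threshold weight
    y = neuron_output(x, w)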
The artificial neural network has the following characteristics:
High parallelism: an artificial neural network is composed of many identical processing neurons. Although each neuron's computation is very simple, a large number of neurons working on a problem together gives the network strong parallel processing capability. The neural network is parallel not only in structure but also in computation and processing, which greatly raises the processing speed.
High non-linearity: each neuron in the neural network receives inputs from other neurons, and the neurons influence one another, realizing a non-linear mapping from input data to output data; this non-linearity improves both the storage capacity and the fault tolerance.
Associative memory function: through the weights determined by the connections between neurons, the neural network stores the information contained in the data in the weights, giving it an associative memory function. The memorized information cannot be read off from a single weight, since the neural network stores it in distributed form, which further enables processing such as feature extraction; features can be learned from incomplete pictures or texts and corresponding decisions made. The behaviour of the whole network depends not only on the characteristics of the individual neurons but on the interactions and connections between them.
Good adaptivity: the neural network learns and trains the weights in its structure, and the whole learning and training process simulates biological nerve conduction, so the network adapts well to its environment. Unlike traditional symbolic logic, the output value is obtained by learning and training on the input data rather than directly from predefined rules between problems, which gives the network its self-adaptive capability.
Distributed storage: the stored characteristics do not reside in individual neurons but are distributed over the whole neural network. The responses to the input data activate neurons distributed across the network, and the learned and trained features are recorded on the connection weights; when the same data are input again into the same structure, the network can judge quickly and produce the output.
It is these characteristics that allow the artificial neural network to overcome the shortcomings of traditional artificial intelligence in intuition-like processing, such as speech and pattern recognition and unstructured information. The artificial neural network is applied in prediction, pattern recognition, intelligent control and many other areas.
(3) Recurrent neural networks
Recurrent Neural Networks (RNNs) differ from conventional Feed-forward Neural Networks (FNNs) in that they can handle the temporal context between input data. In the traditional neural network model, input data pass from the input layer through the hidden layer to the output layer; successive layers are fully connected, but the nodes within each layer are not connected to each other. This is inadequate for many tasks: for example, to predict each word of a sentence, a later word depends on the earlier ones; the words in a sentence are not independent, and there are connections between them. A recurrent neural network is designed for such sequences, where an output value depends on previous outputs. Concretely, the current output of a recurrent neural network memorizes the earlier information and uses it together with the current input; the nodes within the hidden layer are connected to each other, and the input of the hidden layer contains both the input value at the current moment and the hidden output at the previous moment. A typical RNN is shown in fig. 9.
A typical recurrent neural network model contains input units, whose data set is denoted {x_0, x_1, …, x_t, x_{t+1}, …}; output units, whose data set is denoted {y_0, y_1, …, y_t, y_{t+1}, …}; and hidden units, whose output data set is denoted {h_0, h_1, …, h_t, h_{t+1}, …}. These hidden units are the main computation units of the entire recurrent network; the black squares represent time delays. As in fig. 9, one information flow runs unidirectionally from the input units to the hidden units, and another runs unidirectionally from the hidden units to the output units. In some variants of the recurrent neural network this is loosened: some information is fed back from the output units to the hidden units, and the input of a hidden unit can also come from the previous hidden unit, i.e., the nodes in the hidden layer can be self-connected or interconnected.
Fig. 9 also shows the recurrent neural network unrolled in time, where x_t is the input data at time t, h_t is the hidden state at time t, and y_t is the output data at time t. h_t is obtained from the hidden state h_{t-1} at time t-1 and the input x_t at time t:

$$h_t = f(U x_t + W h_{t-1})$$

where f is usually chosen as a non-linear function such as tanh or ReLU.
Recurrent neural networks also differ from traditional neural networks in parameter selection: a traditional network usually uses different parameters for each layer, whereas a recurrent network shares the same parameters W, U, V across all time steps. This greatly reduces the number of parameters to be learned; only the input value changes from one step to the next.
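A minimal sketch of this parameter sharing, assuming tanh as the non-linearity f and a linear read-out through V (the random example values are illustrative):

```python
import numpy as np

def rnn_forward(xs, U, W, V, h0):
    # U, W, V are shared across all time steps; only x_t changes.
    h, ys = h0, []
    for x in xs:
        h = np.tanh(U @ x + W @ h)   # h_t = f(U x_t + W h_{t-1})
        ys.append(V @ h)             # y_t read out from the hidden state
    return ys

# Usage: 4 time steps, 2 input features, 3 hidden units
rng = np.random.default_rng(0)
xs = [rng.standard_normal(2) for _ in range(4)]
U = rng.standard_normal((3, 2))
W = rng.standard_normal((3, 3))
V = rng.standard_normal((1, 3))
print(rnn_forward(xs, U, W, V, np.zeros(3)))
```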
Here the feedforward neural network and the recurrent neural network are compared on time-series prediction to bring out their differences. For the feedforward neural network, the predictions at different time points are independent of each other, so a windowing method is needed to capture the correlation between neighbouring time points; a window size of 3 is chosen, as shown in fig. 10.
Here x_t is the input of the input unit at time t, y_t is the output of the output unit at time t, h_t is the output of the hidden unit at time t, and W_t is the weight in the neurons at time t. x_{t-1}, x_t, x_{t+1} is one data group of the window method with window size 3. As a formula:

$$y_t = f(W_t \cdot \mathrm{concat}(x_{t-1}, x_t, x_{t+1}) + b)$$

where concat denotes concatenating the vectors into one vector of larger dimension. The formula shows that the weight W_t and bias b must be learned many times, and much data must be stored and computed.
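A sketch of this window method under the stated assumptions (window size 3, tanh as f, names illustrative):

```python
import numpy as np

def window_predict(xs, t, Wt, b, f=np.tanh):
    # Window size 3: x_{t-1}, x_t and x_{t+1} are concatenated into one
    # higher-dimensional vector before the time-specific weight Wt applies.
    window = np.concatenate([xs[t - 1], xs[t], xs[t + 1]])
    return f(Wt @ window + b)
```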
For the recurrent neural network, the output value at each moment depends not only on the input value at the current moment but also on the output value at the previous moment, as shown in fig. 11.
Here x_t is the input of the input unit at time t, y_t is the output of the output unit at time t, h_t is the output of the hidden unit at time t, and W is the weight in the neurons. As a formula:

$$y_t = f(W \cdot \mathrm{concat}(h_{t-1}, x_t) + b)$$

The output value y_t at time t is related not only to the input x_t but also to the output h_{t-1} of the hidden unit at time t-1. It can be seen that the feedforward neural network needs 3 moments to learn the weight W_t once, whereas the recurrent neural network learns the weight W three times over the same 3 moments. That is, the neuron weights in the recurrent neural network are shared, which is one of its most prominent advantages.
RNNs have found great utility in a number of natural language processing applications, such as speech recognition, text analysis generation, machine translation, time-series processing, and the like.
However, the conventional recurrent neural network also has a serious disadvantage, the most important being the gradient vanishing problem, which arises in the Back Propagation Through Time (BPTT) algorithm. First, consider the basic formulas of the recurrent neural network:

$$s_t = \tanh(U x_t + W s_{t-1})$$

$$\hat{y}_t = \mathrm{softmax}(V s_t)$$
the loss value is defined herein as the cross-entropy loss,
transforming the formula of the above formula
Figure BDA0002206611300000146
Here ytRepresenting the original correct data at time t,
Figure BDA0002206611300000147
representing predicted data at time t. The whole training process is to use each input data as a training sample, and the overall error value is the sum of the error values at each time. As shown in fig. 12.
The goal of training is to calculate the gradient of the error value with respect to the parameters U, V, W and to learn the parameters using stochastic gradient descent. Just as the error values are summed, the gradients of each training sample at each time instant are summed:

$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$$
for the calculation of the gradient, the chain-type derivation rule will be used, mainly to propagate the error backwards with the back propagation algorithm. By way of example, using E3As a sample, it is mainly for convenience of description. For E3There are data formulated as follows:
Figure BDA0002206611300000152
it can be seen from the formula
Figure BDA0002206611300000157
Then s3Is that it is required to depend on s2In the same way, s can be known2Is required to depend on s1And W. Therefore, in order to obtain a gradient of W, s cannot be reduced2As a constant. The chain rule needs to be used again, and the following results are obtained:
Figure BDA0002206611300000153
the above expression is obtained by adding the gradient values obtained at each time, and W is used for each calculation. The gradient at time t-3 is now propagated back to time t-0, as shown in fig. 13.
From the above analysis, the standard back-propagation algorithm is used in both deep feedforward neural networks and recurrent neural networks. In a recurrent neural network, standard back propagation sums the different gradients of the weight W over all time instants; in a traditional neural network, parameters are not shared between layers, so no such summation is needed. Back Propagation Through Time (BPTT) is the adaptation of the standard back-propagation algorithm to recurrent neural networks. Training a standard recurrent neural network is very expensive, because the training sequences are usually very long and back propagation must traverse many additional layers; in practical applications, back propagation is therefore usually truncated after a number of steps.
The recurrent neural network has a further problem, the gradient vanishing problem, which makes it difficult for the network to learn long-range dependence. Continuing the E_3 example, equation (25) contains the factor ∂s_3/∂s_k, which itself requires the chain rule:

$$\frac{\partial s_3}{\partial s_k} = \prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}$$

Here a vector function is differentiated with respect to a vector, so each factor is a matrix, called a Jacobian matrix, whose elements are the derivatives of each datum. Rewriting equation (25) gives:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3}\frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial \hat{y}_3}{\partial s_3}\left(\prod_{j=k+1}^{3}\frac{\partial s_j}{\partial s_{j-1}}\right)\frac{\partial s_k}{\partial W} \qquad (26)$$

From equation (26) it can be confirmed that the two-norm of the Jacobian matrix has an upper bound of 1: if tanh is used as the activation function, it maps all values into the interval (-1, 1), and its derivative is also bounded above by 1.
Now consider the graphs of the tanh and sigmoid functions and of their derivatives, shown in fig. 14.
It is clear from fig. 14 that the derivative values, i.e., the gradient values, of tanh and sigmoid approach 0 at both ends of the graph, where the curves approach horizontal lines. When this occurs, the corresponding neuron is close to saturation. When the gradient of such a neuron is 0, the gradients of the neurons in the earlier layers are also driven to 0. When small values appear in the matrix, repeated matrix multiplication makes the gradient values fall at an exponential rate, and after a few steps the gradient vanishes entirely. The gradient contributed by distant time steps is then 0, and those time steps contribute nothing to the learning and training process, so the whole learning process has no long-range dependence.
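The exponential decay can be demonstrated numerically; the saturation level z = 2.5 and the step count are illustrative assumptions:

```python
import numpy as np

# tanh'(z) = 1 - tanh(z)^2 is at most 1 and nearly 0 for a saturated neuron.
# Multiplying one such factor per time step, as in the Jacobian product of
# equation (26), shrinks the long-range gradient at an exponential rate.
grad = 1.0
for _ in range(10):
    z = 2.5                        # a neuron operating near saturation
    grad *= 1.0 - np.tanh(z)**2    # one Jacobian-like factor per step
print(grad)                        # on the order of 1e-16 after only 10 steps
```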
There is also the problem of gradient explosion, which can occur when the values in the Jacobian matrix are too large, depending on the activation function and the network parameters. Gradient vanishing nevertheless receives more attention than gradient explosion, mainly because:
a. Gradient explosion is easily observed: when the gradient values overflow during training and become NaN (not a number), the program crashes.
b. Gradient explosion is easily avoided: clipping the gradient at a preset threshold prevents it effectively (see the sketch after this list).
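For example, a minimal sketch of such clipping (the threshold 5.0 is an arbitrary illustrative value):

```python
import numpy as np

gradient = np.array([50.0, -80.0, 3.0])   # an exploding gradient
gradient = np.clip(gradient, -5.0, 5.0)   # clipped at the preset threshold
```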
Because gradient explosion is easier to handle, while gradient vanishing is sometimes not obvious and therefore hard to treat, the common remedy for gradient vanishing in recurrent neural networks is to change the network structure, as in the widely used Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). LSTM was proposed in 1997 and is now a well-established solution for NLP problems. The GRU, proposed in 2014, is an improved structure derived from LSTM that optimizes the number of gates. Both recurrent network structures solve the gradient vanishing problem well and can therefore handle long-range dependence effectively.
(4) GRU neural network
A Gated Recurrent Unit (GRU) neural network is an improved model of the Long Short-Term Memory (LSTM) neural network, proposed by Kyunghyun Cho in 2014. The LSTM neural network is a variant of the recurrent neural network whose most notable achievement is overcoming the long-range dependence problem of recurrent networks. However, the LSTM neural network model is complex in form and suffers from long training times, long prediction times and similar problems. The LSTM neural network introduced a Forget Gate, an Input Gate and an Output Gate; these gates weaken or strengthen the information entering the neural cell unit so as to control the cell state. To improve on this, the GRU neural network refines the gate design: the forget gate and input gate of the LSTM are merged into a single Update Gate, i.e., the original cell structure of three gates is optimized into a cell structure of two gates; the cell state is likewise fused, among other improvements. The overall GRU neural network structure is shown in fig. 15.
Like a conventional recurrent neural network, the GRU neural network is a chain model formed by repeating neural unit modules. In the conventional recurrent neural network, the neuron A may contain only a single simple tanh or ReLU function, whereas in the GRU the neuron A is a more complex gated structure. The detailed structure of a single neuron is shown in fig. 16.
The neurons of a GRU neural network are expressed mathematically as:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \qquad (27)$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \qquad (28)$$

$$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t]) \qquad (29)$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t \qquad (30)$$
A detailed breakdown of a single neural cell in the GRU neural network follows. First, fig. 17 shows the update gate in the GRU neural network. In fig. 17 and equation (27), z_t denotes the update gate (Update Gate), h_{t-1} the output of the previous neuron, x_t the input of the current neuron, W_z the weight of the update gate, and σ the sigmoid function. The update gate z_t is obtained by joining the previous output h_{t-1} with the current input x_t, multiplying by the update gate weight W_z, and applying the sigmoid function. The larger the value of the update gate z_t, the more information of the current neuron is retained and the less information of the previous neuron is retained.
Next, fig. 18 shows the Reset Gate model in the GRU neural network. In fig. 18 and equation (28), r_t denotes the reset gate (Reset Gate), h_{t-1} the output of the previous neuron, x_t the input of the current neuron, W_r the weight of the reset gate, and σ the sigmoid function. The reset gate r_t is obtained by joining the previous output h_{t-1} with the current input x_t, multiplying by the reset gate weight W_r, and applying the sigmoid function. When the value of the reset gate r_t is 0, the information transmitted from the previous neuron is discarded and only the current input is used; the current neuron can thereby drop information from the previous neuron that it does not need.
Next, fig. 19 shows the model of the candidate output value in the GRU neural network. In fig. 19 and equation (29), h̃_t denotes the candidate output value of the neuron, r_t the reset gate, h_{t-1} the output of the previous neuron, x_t the input of the current neuron, W the weight, and tanh the hyperbolic tangent function. The candidate output h̃_t is obtained by multiplying the previous output h_{t-1} by the reset gate r_t, joining the result with the input x_t, multiplying by the weight W, and applying the hyperbolic tangent function.
Finally, fig. 20 shows the output value model in the GRU neural network. In fig. 20 and equation (30), h_t denotes the output value of the current neuron, z_t the update gate, h_{t-1} the output of the previous neuron, and h̃_t the candidate output value of the current neuron. The output h_t is obtained by multiplying the previous output h_{t-1} by (1 - z_t) and adding the update gate z_t multiplied by the candidate output h̃_t of the current neuron.
From the formulas and the individual exploded views of the GRU neural network it can be seen that each neuron decides how much of each piece of information to pass on, so there are dependency relationships between the neurons. In general, reset gates are active in short-range learning, while update gates are active in long-range learning.
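Putting equations (27)-(30) together, a single GRU neuron can be sketched as follows; bias terms are omitted to match the formulas above, and the matrix shapes are left implicit (a sketch, not the patent's own code):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x_t, h_prev, Wz, Wr, W):
    hx = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    z_t = sigmoid(Wz @ hx)                      # update gate, equation (27)
    r_t = sigmoid(Wr @ hx)                      # reset gate, equation (28)
    h_cand = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # equation (29)
    return (1.0 - z_t) * h_prev + z_t * h_cand  # output h_t, equation (30)
```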
LSTM and GRU have been compared in prior work; experiments show that GRU performs comparably to LSTM on many problems while being easier to train. GRU is therefore increasingly used in machine learning; for example, GRU neural networks replace LSTM neural networks in some speech models.
3. GRU-SES model design and implementation
(1) Overview of GRU-SES model
The GRU-SES model is an optimization of the GRU neural network: the prediction data generated by the GRU neural network are smoothed again with second-order exponential smoothing, so that the prediction produced by the GRU-SES model is more accurate than using the GRU neural network directly. The flow chart of the GRU-SES model is shown in fig. 2.
The GRU-SES model process mainly comprises the following steps (a high-level sketch of the pipeline follows the list):
a. Raw data collection: collect the data that need to be predicted.
b. Data preprocessing: perform data preprocessing on the collected raw data.
c. Data normalization: normalize the preprocessed data.
d. Data dimension raising: perform the dimension-raising operation on the data.
e. Neural network training: train the data with the GRU neural network to obtain a trained model file.
f. Data prediction: predict the data with the GRU-SES model.
g. Prediction data processing: inverse-normalize the predicted data.
h. Prediction result output: output the predicted result.
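The sketch below strings the steps together; train_gru, predict_gru and smooth are hypothetical callables standing in for the stages detailed in the following subsections, not APIs defined by the patent:

```python
import numpy as np

def gru_ses_pipeline(raw, horizon, train_gru, predict_gru, smooth):
    data = np.asarray(raw, dtype=float)       # a/b. collected, preprocessed data
    mean, std = data.mean(), data.std()       # c. Z-score normalization
    data = (data - mean) / std
    data = data.reshape(-1, 1)                # d. dimension raising [[x1] [x2] ...]
    model = train_gru(data)                   # e. GRU training
    pred = predict_gru(model, data, horizon)  # f. preliminary prediction
    pred = smooth(pred)                       # g. second-order exponential smoothing
    return pred * std + mean                  # h. inverse normalization and output
```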
(2) Data pre-processing
Data preprocessing performs the necessary cleaning, integration, conversion, discretization, reduction and similar operations on the raw data before they enter the computation, so that the preprocessed data better match the data structure required by the algorithm. The main tasks of data preprocessing are:
Data cleaning: eliminate noise data in the raw data and make up for missing data; raise the quality of low-quality records and thereby the overall quality of the raw data.
Data integration: store the raw data uniformly, in a file or a database, eliminating redundant data.
Data conversion: convert the raw data into a format that satisfies the requirements of the algorithm.
Data reduction: eliminate the less important feature attributes of the raw data to obtain a more refined data set.
For the time-series prediction addressed by the invention, preprocessing guarantees, on the one hand, that the data are ordered by time, with no wrong time points and no disorder in the sequence; on the other hand, that the training data used for the prediction are complete and correct, so that the data fed to the neural network meet the training requirements.
(3) Data normalization process
Data normalization scales the preprocessed data by a certain ratio so that the transformed data fall into a specific interval. If the data carry units, normalization converts them into dimensionless pure values, so that data indexes of different magnitudes and different units can be compared or weighted. Common normalization methods mainly include the Z-score method, the range normalization transformation method, and the standard range transformation method.
The Z-score method determines the central value of the data sequence, computes the (positive or negative) difference between each datum and the central value, computes the average absolute distance of the data from the central value, and divides each datum's difference from the central value by this average distance to obtain the normalized result; the Z-score method is very similar to the standardizing transformation of the normal distribution. In mathematical statistics, the mean is taken as the central value of the data, and the standard deviation reflects the average distance of the data from that central value, so the Z-score normalization formulas are:

$$S_X = \frac{X_i - \bar{X}}{S} \qquad (1)$$

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (2)$$

$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} \qquad (3)$$

In formula (1), the normalized result is denoted by S_X, the mean of the sequence by X̄, and the standard deviation of the sequence by S. Formula (2) gives the mean: the data X_i of the n-term sequence are summed and the sum is divided by the number of terms n, yielding the arithmetic mean X̄. Formula (3) gives the standard deviation: the mean X̄ is subtracted from each datum X_i, the result is squared, the arithmetic mean of the squares gives the variance of the sequence, and taking the square root gives the standard deviation. The results of Z-score processing can be positive or negative and range over (-∞, +∞); the processed data have mean 0 and variance 1. The Z-score method is widely used in statistical analysis.
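A minimal sketch of the Z-score transform of formulas (1)-(3) and its inverse (NumPy's std computes the population standard deviation of formula (3)):

```python
import numpy as np

def z_score(x):
    mean = x.mean()              # formula (2): arithmetic mean
    std = x.std()                # formula (3): population standard deviation
    return (x - mean) / std, mean, std   # formula (1)

def z_score_inverse(s, mean, std):
    return s * std + mean        # inverse operation, cf. formula (11)

x = np.array([3.0, 7.0, 5.0, 9.0])
s, m, sd = z_score(x)
print(s.mean(), s.var())         # ~0 and 1, as stated in the text
```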
The range normalization transformation method arranges all data from small to large, takes the minimum value as the comparison standard, and computes the distance of each datum from the minimum. Among these distances, the distance between the maximum and the minimum of the sequence is the largest; it is called the range or full range (R). The range R is then used as the scale: the distance of each datum is divided by the range to give a relative distance, which is the result of the range normalization transformation method. Its formulas are:

$$S_X = \frac{X_i - X_{min}}{R} \qquad (31)$$

$$R = X_{max} - X_{min} \qquad (32)$$

In formula (31), the normalized result is denoted by S_X, the minimum of the sequence by X_min, and the range of the sequence by R. Formula (32) computes the range: the minimum X_min of the sequence is subtracted from the maximum X_max. The results of the range normalization transformation method lie between 0 and 1, a span of 1. The method is commonly used when estimating characteristic values of a data column such as the median, mode, and quantiles.
The standard range transformation method takes the central value of all data in the sequence (generally the mean) as the comparison standard: the difference between each datum and the central value is computed first, and the difference is then divided by the range (R) to obtain the result of the standard range transformation method. Its formula is:

$$S_X = \frac{X_i - \bar{X}}{R} \qquad (33)$$

In formula (33), the normalized result is denoted by S_X, the mean of the sequence by X̄, and the range of the sequence by R; the mean is calculated with formula (2) and the range R with formula (32). The results of the standard range transformation method lie between -1 and 1.
The data normalization in the invention adopts the Z-score method, and the normalized data serve as the input data of the neural network. Normalization does not impair the effect of the neural network, but greatly improves the efficiency and accuracy of its algorithm.
(4) Data upscaling processing
Dimensions are also called dimensionality in mathematics; conventionally a point is zero-dimensional, a line one-dimensional, a plane two-dimensional and a solid three-dimensional, while higher dimensions cannot be visualized directly. As an example, consider describing a cube: describing it in three-dimensional space means describing its length, width and height, and this group of three one-dimensional quantities forms a three-dimensional vector. If the cube's weight is also described, there are four one-dimensional quantities describing the cube, and the group forms a four-dimensional vector. In other words, when an object is described along several directions or details, the group of descriptors forms a multi-dimensional vector. Fig. 21 shows a two-dimensional array raised to three dimensions.
In fig. 21, the left diagram is a two-dimensional array vector, which is raised to the representation form in the right diagram.
The model performs dimension raising on the time-series data; the raised data series is realized in code as follows. The original time series has the form [x_1, x_2, x_3, …, x_n]; after dimension raising it has the form [[x_1], [x_2], [x_3], …, [x_n]]. After the data are raised in dimension, details in the data can be discovered more easily, and training the neural network yields better results. Because the invention raises the dimension of the input data, the weight values trained in the neural network are raised in dimension correspondingly.
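One way to realize this dimension raising in code (a sketch; the patent does not give its implementation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # original series [x1 x2 x3 ... xn]
x_up = x.reshape(-1, 1)              # raised series [[x1] [x2] [x3] ... [xn]]
print(x_up.shape)                    # (4, 1): one extra dimension
```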
Of course, higher dimensionality is not always better: high-dimensional data suffer from the curse of dimensionality, meaning that the data become sparsely distributed in a high-dimensional space, so distance or similarity calculations incur large errors, and algorithms designed for low-dimensional data become inefficient when applied to high-dimensional data.
(5) GRU neural network model training
The GRU neural network in the present invention uses the GRU neural network module in TensorFlow. TensorFlow was originally a machine learning system developed by the Google Brain team and can be used to implement applications and research in machine learning and deep neural networks. TensorFlow introduces the tensor, defined as the familiar N-dimensional array, and computes data in the manner of a dataflow graph: the computation of tensors flows from one end of the graph to the other. TensorFlow is now widely used in machine learning fields such as speech recognition, word processing, machine translation, and image recognition. Its main features are:
High flexibility: developers using TensorFlow need only define the dataflow graph they require and write the inner loop; the ready-made neural network interfaces provided by TensorFlow then let them build their own deep learning framework quickly.
Portability: TensorFlow runs on a CPU or a GPU, and directly on computers, servers, and mobile devices. Its hardware requirements are modest, so it can run on a wide range of devices.
Multi-language support: TensorFlow provides interfaces for multiple languages; a TensorFlow deep learning framework can be written in Python, Java, Go and other languages.
Performance optimization: TensorFlow provides multithreading, asynchrony, queues and similar operations, and can exploit the full performance of the CPU or GPU hardware according to the written code.
In the GRU neural network of the present invention, the number of hidden units (hidden_num_units) is chosen as 8, i.e., there are 8 hidden neurons in the GRU. The output value of a hidden neuron is defined by the formula:

y = W·x + b    (4)

In formula (4), y represents the output value, x the input value, W the weight value, and b the bias value.
The loss function used in training is defined by the formula:

$$loss = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_{real}\right)^2 \qquad (5)$$

In formula (5), loss represents the loss value, y_i the output value of the neuron at one time instant, and y_real the corresponding real raw datum. The true value y_real is subtracted from the neuron output y_i, the difference is squared, and finally the average is computed to obtain the variance, which is the loss value of the loss function. The loss function evaluates the difference between the target output and the actual output of the neural network: the smaller its value, the smaller the difference between actual and target output, i.e., the more appropriate the weight values.
Finally, the objective function is optimized with gradient descent: the learning rate (learning_rate) of the gradient descent method, i.e., the descent step of each optimization pass of the training model, is defined as 0.003. The invention optimizes the loss function directly with an optimizer in TensorFlow, namely AdamOptimizer. Common optimization algorithms include Stochastic Gradient Descent (SGD), Adagrad, and Adam (Adaptive Moment Estimation).
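The patent does not give its training code; a rough equivalent of this configuration (8 hidden GRU units, MSE loss, Adam at learning rate 0.003) in the present-day tf.keras API, offered here as an assumption, might look like:

```python
import tensorflow as tf

# 8 hidden GRU units; a Dense(1) layer realizes the read-out y = W.x + b of
# formula (4); 'mse' is the loss of formula (5); Adam runs at learning rate 0.003.
model = tf.keras.Sequential([
    tf.keras.layers.GRU(8, input_shape=(None, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
              loss='mse')
# model.fit(train_x, train_y, ...)   # train_x shaped (samples, time steps, 1)
```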
The stochastic gradient descent algorithm is the most frequently used optimizer: in each iteration it computes the gradient of one batch and then updates the parameters. SGD is simple in form and therefore widely used. Its drawback is that the learning rate is difficult to choose: the same learning rate is applied to all parameters, but with sparse data features one would like features that appear rarely in the data to be updated faster and features that appear frequently to be updated more slowly, which SGD cannot provide. Moreover, gradient descent easily converges to a local optimum. Other optimizer algorithms were proposed to address these deficiencies of SGD.
The Adagrad algorithm optimizes the learning rate by imposing a constraint on it. When the early gradients are small, the constraint term defined by the Adagrad algorithm is large, so the gradient is amplified; when the later gradients are large, the constraint term is small, so the gradient is restrained. The learning rate is thus regularized, and the algorithm adapts well to sparse gradients. Adagrad also has shortcomings: if the learning rate is set too high, the constraint term becomes too sensitive and the adjustment of the gradient is too large.
The Adam algorithm is essentially an optimization of the Adagrad algorithm: it optimizes the learning rate further, dynamically adjusting the learning rate of each parameter using first-moment and second-moment estimates of the gradient. After bias correction, each iteration's learning rate lies within a definite range, which keeps the parameters stable. The Adam algorithm can handle sparse gradients and non-stationary targets, can compute different adaptive learning rates for different parameters, and is suitable for high-dimensional data.
Here the time-series data are raised in dimension, and the time series may be a non-stationary target. For the choice of optimizer, the Adam algorithm is therefore selected, since it suits high-dimensional data and handles non-stationary targets well.
The model trained by the GRU neural network is stored; in the later data prediction stage, the trained model is loaded to predict the time-series data.
(6) GRU-SES predictive data processing
First, the trained model stored in the previous step is loaded, and the GRU neural network is used to predict the time-series data, giving a preliminary predicted value.
Secondary exponential smoothing is then performed on the preliminary prediction data to further improve their accuracy. The formulas for quadratic exponential smoothing are:

$$S_t^{(1)} = \alpha x_t + (1-\alpha)S_{t-1}^{(1)}, \qquad S_t^{(2)} = \alpha S_t^{(1)} + (1-\alpha)S_{t-1}^{(2)}, \qquad \alpha \in (0, 1) \qquad (6)$$

where α is the smoothing coefficient. The sequences S_t^{(1)} (first-order smoothed values) and S_t^{(2)} (second-order smoothed values) are calculated first; their initial values S_0^{(1)} and S_0^{(2)} are typically taken to be the first datum of the original data column. Computing S_t^{(1)} uses the input value x_t at time t and the value of S_{t-1}^{(1)} at time t-1; computing S_t^{(2)} uses the value of S_t^{(1)} at time t and the value of S_{t-1}^{(2)} at time t-1. For the selection of the smoothing coefficient α in formula (6): when the original time-series data show an obvious trend, α is generally chosen between 0.3 and 0.5; when the original time-series data vary smoothly, α is generally chosen between 0.1 and 0.3. Once the sequences S_t^{(1)} and S_t^{(2)} have been calculated, the predicted value after secondary exponential smoothing can be calculated using formulas (7) to (9).
$$A_T = 2S_t^{(1)} - S_t^{(2)} \qquad (7)$$

$$B_T = \frac{\alpha}{1-\alpha}\left(S_t^{(1)} - S_t^{(2)}\right) \qquad (8)$$

$$x_{t+T} = A_T - B_T \cdot T, \quad T \text{ being the number of future periods to forecast} \qquad (9)$$

Before using formula (9), A_T and B_T must be calculated with formula (7) and formula (8). For A_T in formula (7), S_t^{(1)} is multiplied by 2 and S_t^{(2)} is subtracted. For B_T in formula (8), a coefficient α/(1-α) is first computed from the smoothing coefficient α and then multiplied by the difference S_t^{(1)} - S_t^{(2)}. With A_T and B_T calculated, formula (9) gives the predicted value x_{t+T} for T periods ahead.
Finally, the arithmetic mean of the predicted value calculated by the GRU neural network and the predicted value calculated by quadratic exponential smoothing is computed:

$$x_{final} = \frac{x_{GRU} + x_{SES}}{2} \qquad (10)$$

The arithmetic mean calculated with formula (10) is the final predicted data value of the GRU-SES model.
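A sketch of this GRU-SES post-processing of formulas (6)-(10); the value of alpha is illustrative, and because the text does not fully specify which GRU prediction enters the average of formula (10), pairing it with the latest preliminary prediction is an assumption here:

```python
import numpy as np

def gru_ses(pred, alpha=0.4, T=1):
    # pred: preliminary predictions from the GRU neural network
    s1 = s2 = pred[0]                      # initial values: first datum
    for x in pred:
        s1 = alpha * x + (1 - alpha) * s1  # first-order smoothing, formula (6)
        s2 = alpha * s1 + (1 - alpha) * s2 # second-order smoothing, formula (6)
    A = 2 * s1 - s2                        # formula (7)
    B = alpha / (1 - alpha) * (s1 - s2)    # formula (8)
    x_ses = A - B * T                      # formula (9) as printed in the text;
                                           # classical Brown smoothing uses A + B*T
    return (pred[-1] + x_ses) / 2.0        # formula (10): arithmetic mean

print(gru_ses(np.array([1.0, 1.2, 1.4, 1.6])))
```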
(7) Predicted data result output
The predicted value obtained is still in the normalized data form, so the inverse operation of data normalization must be performed to restore the prediction to the type of the original data. The formula used for data normalization is inverted to recover the predicted data in the original data type:

$$x_{iT} = X_{iT} \cdot S + \bar{X} \qquad (11)$$

In formula (11), x_{iT} is the predicted datum restored to the original data type, X_{iT} is the predicted datum calculated by the GRU-SES model, S is the standard deviation of the original data calculated by formula (3), and X̄ is the arithmetic mean of the original data calculated by formula (2). The prediction in the original data type is obtained by multiplying the predicted datum by the standard deviation and adding the mean. The number of prediction periods T can be set according to how far into the future the time-series data need to be predicted.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (7)

1. A time series prediction method based on a GRU neural network is characterized by comprising the following steps:
step one, collecting the raw data that need to be predicted;
step two, performing data preprocessing on the collected raw data: cleaning, integration, conversion, discretization and reduction;
step three, normalizing the preprocessed data with the Z-score method, the normalized data serving as the input data of the neural network;
step four, using code to raise the original time-series data from the form [x_1, x_2, x_3, …, x_n] into the form [[x_1], [x_2], [x_3], …, [x_n]];
step five, building a time-series prediction model with TensorFlow and training the input data with the GRU neural network to obtain and store the trained time-series prediction model;
step six, reading the stored trained time-series prediction model and predicting the time-series data with the GRU-SES model to obtain a preliminary predicted value;
step seven, performing secondary exponential smoothing processing and inverse normalization processing on the obtained preliminary prediction data to obtain a final predicted data value;
step eight, performing the inverse operation of data normalization on the prediction result, restoring the predicted data value to the original data type, and outputting the result.
2. The method for predicting time series based on a GRU neural network as claimed in claim 1, wherein in step two, the data preprocessing specifically comprises:
data cleaning: eliminating noise data in the original data, and making up for missing data in the original data; improving the quality of the data with lower quality in the original data, and improving the overall quality of the original data;
data integration: eliminating redundant data, and uniformly storing the data in a file or a database;
data conversion: converting the data into a format meeting the algorithm requirement;
and (3) data reduction: and removing the less important characteristic attributes in the data to obtain a refined data set.
3. The method for predicting time series based on a GRU neural network as claimed in claim 1, wherein in step three, the step of performing data normalization processing by using the Z-score method specifically comprises:
firstly, the central value of the data sequence is determined; the (positive or negative) difference between each datum and the central value is calculated, the average absolute distance of the data from the central value is computed, and each datum's difference from the central value is divided by this average distance to obtain the normalized result;
the Z-score normalization formulas are as follows:

$$S_X = \frac{X_i - \bar{X}}{S} \qquad (1)$$

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (2)$$

$$S = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} \qquad (3)$$

in formula (1), the normalized result is denoted by S_X, the mean of the sequence by X̄, and the standard deviation of the sequence by S;
formula (2) calculates the mean of the sequence: the data X_i of the n-term sequence are summed and the sum is divided by the number of terms n, yielding the arithmetic mean X̄;
formula (3) calculates the standard deviation of the sequence: the mean X̄ is subtracted from each datum X_i, the result is squared, the arithmetic mean of the squares gives the variance of the sequence, and taking the square root gives the standard deviation of the sequence.
4. The method according to claim 1, wherein in step five, the time series prediction model specifically includes:
in the time-series prediction model, inputs represents the input values; rnn_layer is the defined neural network, in which weights are the training weight values and biases the training bias values; rnn is the core unit of the GRU neural network, and gru_cell within rnn is a neuron of the GRU neural network, which also contains the weight and bias values of the training gates and candidate; the input values in the inputs module flow into the rnn_layer module for the learning and training of the neural network.
5. The method for time series prediction based on a GRU neural network as claimed in claim 1, wherein in step five, the training of the input data using the GRU neural network specifically comprises:
1) the GRU neural network is specifically the GRU neural network module in TensorFlow; the number of hidden units (hidden_num_units) of the GRU neural network is 8, i.e., there are 8 hidden neurons in the GRU; the output value of a hidden neuron is defined by the formula:

y = W·x + b    (4)

in formula (4), y represents the output value, x the input value, W the weight value, and b the bias value;
2) when the GRU neural network is used to train the input data, the loss function evaluates the difference between the target output and the actual output of the neural network; the smaller the function value, the smaller the difference between actual and target output, i.e., the more appropriate the weights;
the loss function in training is defined by the formula:

$$loss = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_{real}\right)^2 \qquad (5)$$

in formula (5), loss represents the loss value, y_i the output value of the neuron at one time instant, and y_real the real raw data; the true value y_real is subtracted from the neuron output y_i, the difference is squared, and finally the average is computed to obtain the variance, which is the loss value of the loss function;
3) the loss function is optimized with the optimizer in TensorFlow using the Adam algorithm; the learning rate of the gradient descent method is defined as 0.003, optimizing the descent gradient of the training model each time.
6. The method for predicting time series based on a GRU neural network as claimed in claim 1, wherein in step seven, processing the obtained preliminary prediction data by quadratic exponential smoothing and inverse normalization to obtain the final predicted data value specifically comprises:
1) calculating the smoothed sequences S_t^{(1)} and S_t^{(2)} with the quadratic exponential smoothing formula, where the initial values S_0^{(1)} and S_0^{(2)} are generally taken to be the first datum of the original data column; computing S_t^{(1)} uses the input value x_t at time t and the value of S_{t-1}^{(1)} at time t-1; computing S_t^{(2)} uses the value of S_t^{(1)} at time t and the value of S_{t-1}^{(2)} at time t-1;
the quadratic exponential smoothing formula is as follows:

$$S_t^{(1)} = \alpha x_t + (1-\alpha)S_{t-1}^{(1)}, \qquad S_t^{(2)} = \alpha S_t^{(1)} + (1-\alpha)S_{t-1}^{(2)} \qquad (6)$$

2) selecting the smoothing coefficient α: when the original time-series data show an obvious trend, a value of α between 0.3 and 0.5 is selected; when the original time-series data vary smoothly, a value of α between 0.1 and 0.3 is selected;
3) based on the calculated sequences S_t^{(1)} and S_t^{(2)}, A_T and B_T are calculated using formula (7) and formula (8); the predicted value x_{t+T} for the T-th future period after quadratic exponential smoothing is calculated using formula (9):

$$A_T = 2S_t^{(1)} - S_t^{(2)} \qquad (7)$$

$$B_T = \frac{\alpha}{1-\alpha}\left(S_t^{(1)} - S_t^{(2)}\right) \qquad (8)$$

$$x_{t+T} = A_T - B_T \cdot T, \quad T \text{ being the number of future periods to forecast} \qquad (9)$$

4) the arithmetic mean of the predicted value calculated by the GRU neural network and the predicted value calculated by quadratic exponential smoothing is computed:

$$x_{final} = \frac{x_{GRU} + x_{SES}}{2} \qquad (10)$$

the arithmetic mean calculated by formula (10) is the final predicted data value predicted with the GRU-SES model.
7. The method for predicting time series based on a GRU neural network as claimed in claim 1, wherein in step eight, the inverse operation of the data normalization specifically comprises:
the formula used for data normalization is inverted to restore the predicted data to the original data type; the mathematical formula is as follows:

$$x_{iT} = X_{iT} \cdot S + \bar{X} \qquad (11)$$

in formula (11), x_{iT} is the predicted datum restored to the original data type, X_{iT} is the predicted datum calculated by the GRU-SES model, S is the standard deviation of the original data calculated by formula (3), and X̄ is the arithmetic mean of the original data calculated by formula (2).
CN201910883548.8A 2019-09-18 2019-09-18 Time sequence prediction method based on GRU neural network Pending CN110647980A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910883548.8A CN110647980A (en) 2019-09-18 2019-09-18 Time sequence prediction method based on GRU neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910883548.8A CN110647980A (en) 2019-09-18 2019-09-18 Time sequence prediction method based on GRU neural network

Publications (1)

Publication Number Publication Date
CN110647980A true CN110647980A (en) 2020-01-03

Family

ID=68991343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910883548.8A Pending CN110647980A (en) 2019-09-18 2019-09-18 Time sequence prediction method based on GRU neural network

Country Status (1)

Country Link
CN (1) CN110647980A (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242369A (en) * 2020-01-09 2020-06-05 中国人民解放军国防科技大学 PM2.5 data prediction method based on multiple fusion convolution GRU
CN111309909A (en) * 2020-02-13 2020-06-19 北京工业大学 Text emotion classification method based on hybrid model
CN111340312A (en) * 2020-03-24 2020-06-26 国家电网有限公司 RNN (radio network) -based method for predicting material purchasing demand
CN111353258A (en) * 2020-02-10 2020-06-30 厦门快商通科技股份有限公司 Echo suppression method based on coding and decoding neural network, audio device and equipment
CN111400973A (en) * 2020-04-21 2020-07-10 中国水利水电科学研究院 Method for constructing flow-water surface width relation curve based on hydrologic monitoring data
CN111445009A (en) * 2020-03-25 2020-07-24 国家电网有限公司 Method for predicting material purchasing demand based on GRU network
CN111477239A (en) * 2020-03-31 2020-07-31 厦门快商通科技股份有限公司 Noise removing method and system based on GRU neural network
CN111651935A (en) * 2020-05-25 2020-09-11 成都千嘉科技有限公司 Multi-dimensional expansion prediction method and device for non-stationary time series data
CN111754033A (en) * 2020-06-15 2020-10-09 西安工业大学 Non-stationary time sequence data prediction method based on recurrent neural network
CN112001480A (en) * 2020-08-11 2020-11-27 中国石油天然气集团有限公司 Small sample amplification method for sliding orientation data based on generation of countermeasure network
CN112016673A (en) * 2020-07-24 2020-12-01 浙江工业大学 Mobile equipment user authentication method and device based on optimized LSTM
CN112101482A (en) * 2020-10-26 2020-12-18 西安交通大学 Method for detecting abnormal parameter mode of missing satellite data
CN112132333A (en) * 2020-09-16 2020-12-25 安徽泽众安全科技有限公司 Short-term water quality and water quantity prediction method and system based on deep learning
CN112200208A (en) * 2020-05-14 2021-01-08 北京理工大学 Cloud workflow task execution time prediction method based on multi-dimensional feature fusion
CN112257892A (en) * 2020-08-27 2021-01-22 中国石油化工股份有限公司 Method for optimizing complex gas reservoir drainage gas recovery process system
CN112308434A (en) * 2020-11-03 2021-02-02 中国交通通信信息中心 Traffic safety risk assessment method and system
CN112990554A (en) * 2021-02-23 2021-06-18 中国大唐集团科学技术研究院有限公司华中电力试验研究院 Method for predicting short-term voltage trend of generator outlet PT fuse in slow melting
CN113048807A (en) * 2021-03-15 2021-06-29 太原理工大学 Air cooling unit backpressure abnormity detection method
CN113127469A (en) * 2021-04-27 2021-07-16 国网内蒙古东部电力有限公司信息通信分公司 Filling method and system for missing value of three-phase unbalanced data
CN113139340A (en) * 2021-04-23 2021-07-20 中铁十六局集团北京轨道交通工程建设有限公司 Real-time prediction method, terminal and medium for shield tunneling line
CN113408190A (en) * 2021-05-28 2021-09-17 中交第一公路勘察设计研究院有限公司 Bayes-LSTM model-based surrounding rock deformation prediction method during construction of highway tunnel
CN113570042A (en) * 2021-07-30 2021-10-29 昕海智创(深圳)科技有限公司 Filtering algorithm based on band-pass filtering software
CN114218051A (en) * 2021-09-22 2022-03-22 成都网丁科技有限公司 Time delay abnormity detection method
CN114626409A (en) * 2022-02-21 2022-06-14 中铁第四勘察设计院集团有限公司 Near-fault acceleration pulse identification method, storage medium and computer equipment
CN114689122A (en) * 2022-03-31 2022-07-01 国网北京市电力公司 Equipment fault monitoring method, device, equipment and medium
CN114707257A (en) * 2022-02-18 2022-07-05 江苏赛德力制药机械制造有限公司 Mechanical residual service life prediction method based on all-state attention and BiLSTM
CN115526300A (en) * 2022-11-14 2022-12-27 南京邮电大学 Sequence rearrangement method based on cyclic neural network
CN115660217A (en) * 2022-11-14 2023-01-31 成都秦川物联网科技股份有限公司 Smart city garbage cleaning amount prediction method and Internet of things system
CN116821799A (en) * 2023-08-28 2023-09-29 成都理工大学 Ground disaster early warning data classification method based on GRU-DNN
CN117725843A (en) * 2024-02-08 2024-03-19 天津大学 House structure deformation prediction method based on deep learning
CN117744504A (en) * 2024-02-20 2024-03-22 成都理工大学 Flood discharge atomization rain intensity analysis model building method and device

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242369B (en) * 2020-01-09 2023-06-02 中国人民解放军国防科技大学 PM2.5 data prediction method based on multiple fusion convolution GRU
CN111242369A (en) * 2020-01-09 2020-06-05 中国人民解放军国防科技大学 PM2.5 data prediction method based on multiple fusion convolution GRU
CN111353258A (en) * 2020-02-10 2020-06-30 厦门快商通科技股份有限公司 Echo suppression method based on coding and decoding neural network, audio device and equipment
CN111309909A (en) * 2020-02-13 2020-06-19 北京工业大学 Text emotion classification method based on hybrid model
CN111309909B (en) * 2020-02-13 2021-07-30 北京工业大学 Text emotion classification method based on hybrid model
CN111340312A (en) * 2020-03-24 2020-06-26 国家电网有限公司 RNN (radio network) -based method for predicting material purchasing demand
CN111445009A (en) * 2020-03-25 2020-07-24 国家电网有限公司 Method for predicting material purchasing demand based on GRU network
CN111477239A (en) * 2020-03-31 2020-07-31 厦门快商通科技股份有限公司 Noise removing method and system based on GRU neural network
CN111400973A (en) * 2020-04-21 2020-07-10 中国水利水电科学研究院 Method for constructing flow-water surface width relation curve based on hydrologic monitoring data
CN112200208B (en) * 2020-05-14 2022-11-15 北京理工大学 Cloud workflow task execution time prediction method based on multi-dimensional feature fusion
CN112200208A (en) * 2020-05-14 2021-01-08 北京理工大学 Cloud workflow task execution time prediction method based on multi-dimensional feature fusion
CN111651935A (en) * 2020-05-25 2020-09-11 成都千嘉科技有限公司 Multi-dimensional expansion prediction method and device for non-stationary time series data
CN111754033A (en) * 2020-06-15 2020-10-09 西安工业大学 Non-stationary time sequence data prediction method based on recurrent neural network
CN112016673A (en) * 2020-07-24 2020-12-01 浙江工业大学 Mobile equipment user authentication method and device based on optimized LSTM
CN112001480A (en) * 2020-08-11 2020-11-27 中国石油天然气集团有限公司 Small sample amplification method for sliding orientation data based on generation of countermeasure network
CN112001480B (en) * 2020-08-11 2024-01-26 中国石油天然气集团有限公司 Sliding orientation data small sample amplification method based on generation of countermeasure network
CN112257892A (en) * 2020-08-27 2021-01-22 中国石油化工股份有限公司 Method for optimizing complex gas reservoir drainage gas recovery process system
CN112257892B (en) * 2020-08-27 2024-04-16 中国石油化工股份有限公司 Optimization method for drainage gas production process system of complex gas reservoir
CN112132333A (en) * 2020-09-16 2020-12-25 安徽泽众安全科技有限公司 Short-term water quality and water quantity prediction method and system based on deep learning
CN112132333B (en) * 2020-09-16 2024-02-02 安徽泽众安全科技有限公司 Short-term water quality and quantity prediction method and system based on deep learning
CN112101482A (en) * 2020-10-26 2020-12-18 西安交通大学 Method for detecting abnormal parameter mode of missing satellite data
CN112308434A (en) * 2020-11-03 2021-02-02 中国交通通信信息中心 Traffic safety risk assessment method and system
CN112990554A (en) * 2021-02-23 2021-06-18 中国大唐集团科学技术研究院有限公司华中电力试验研究院 Method for predicting short-term voltage trend of generator outlet PT fuse in slow melting
CN113048807A (en) * 2021-03-15 2021-06-29 太原理工大学 Air cooling unit backpressure abnormity detection method
CN113139340B (en) * 2021-04-23 2023-08-22 中铁十六局集团北京轨道交通工程建设有限公司 Real-time prediction method, terminal and medium for shield tunneling line
CN113139340A (en) * 2021-04-23 2021-07-20 中铁十六局集团北京轨道交通工程建设有限公司 Real-time prediction method, terminal and medium for shield tunneling line
CN113127469B (en) * 2021-04-27 2023-03-24 国网内蒙古东部电力有限公司信息通信分公司 Filling method and system for missing value of three-phase unbalanced data
CN113127469A (en) * 2021-04-27 2021-07-16 国网内蒙古东部电力有限公司信息通信分公司 Filling method and system for missing value of three-phase unbalanced data
CN113408190A (en) * 2021-05-28 2021-09-17 中交第一公路勘察设计研究院有限公司 Bayes-LSTM model-based surrounding rock deformation prediction method during construction of highway tunnel
CN113408190B (en) * 2021-05-28 2023-09-08 中交第一公路勘察设计研究院有限公司 Surrounding rock deformation prediction method for highway tunnel construction period based on Bayes-LSTM model
CN113570042A (en) * 2021-07-30 2021-10-29 昕海智创(深圳)科技有限公司 Filtering algorithm based on band-pass filtering software
CN114218051A (en) * 2021-09-22 2022-03-22 成都网丁科技有限公司 Time delay abnormity detection method
CN114218051B (en) * 2021-09-22 2022-07-22 成都网丁科技有限公司 Time delay abnormity detection method
CN114707257A (en) * 2022-02-18 2022-07-05 江苏赛德力制药机械制造有限公司 Mechanical residual service life prediction method based on all-state attention and BiLSTM
CN114626409B (en) * 2022-02-21 2023-09-26 中铁第四勘察设计院集团有限公司 Near-fault acceleration pulse identification method, storage medium and computer equipment
CN114626409A (en) * 2022-02-21 2022-06-14 中铁第四勘察设计院集团有限公司 Near-fault acceleration pulse identification method, storage medium and computer equipment
CN114689122A (en) * 2022-03-31 2022-07-01 国网北京市电力公司 Equipment fault monitoring method, device, equipment and medium
CN114689122B (en) * 2022-03-31 2023-11-10 国网北京市电力公司 Equipment fault monitoring method, device, equipment and medium
CN115660217A (en) * 2022-11-14 2023-01-31 成都秦川物联网科技股份有限公司 Smart city garbage cleaning amount prediction method and Internet of things system
CN115660217B (en) * 2022-11-14 2023-06-09 成都秦川物联网科技股份有限公司 Smart city garbage cleaning amount prediction method and Internet of things system
CN115526300A (en) * 2022-11-14 2022-12-27 南京邮电大学 Sequence rearrangement method based on recurrent neural network
CN116821799A (en) * 2023-08-28 2023-09-29 成都理工大学 Geological disaster early warning data classification method based on GRU-DNN
CN116821799B (en) * 2023-08-28 2023-11-07 成都理工大学 Geological disaster early warning data classification method based on GRU-DNN
CN117725843A (en) * 2024-02-08 2024-03-19 天津大学 House structure deformation prediction method based on deep learning
CN117725843B (en) * 2024-02-08 2024-04-30 天津大学 House structure deformation prediction method based on deep learning
CN117744504A (en) * 2024-02-20 2024-03-22 成都理工大学 Method and device for building a flood discharge atomization rainfall intensity analysis model

Similar Documents

Publication Title
CN110647980A (en) Time sequence prediction method based on GRU neural network
Ivakhnenko et al. The review of problems solvable by algorithms of the group method of data handling (GMDH)
Lin et al. Temporal convolutional attention neural networks for time series forecasting
Ghimire et al. A novel approach based on integration of convolutional neural networks and echo state network for daily electricity demand prediction
Sun et al. Spatiotemporal wind power forecasting approach based on multi-factor extraction method and an indirect strategy
Hsu Multi-period time series modeling with sparsity via Bayesian variational inference
CN110738363B (en) Photovoltaic power generation power prediction method
Berradi et al. Effect of architecture in recurrent neural network applied on the prediction of stock price
Marwala Forecasting the stock market index using artificial intelligence techniques
Okwuchi et al. Deep learning ensemble based model for time series forecasting across multiple applications
Borysenko et al. Intelligent forecasting in multi-criteria decision-making
CN116632834A (en) Short-term power load prediction method based on SSA-BiGRU-Attention
Pulido et al. Genetic algorithm and Particle Swarm Optimization of ensemble neural networks with type-1 and type-2 fuzzy integration for prediction of the Taiwan Stock Exchange
Janikova et al. Prediction of production line performance using neural networks
CN115713144A (en) Short-term wind speed multi-step prediction method based on combined CGRU model
Zhou Cross-validation research based on RBF-SVR model for stock index prediction
Khandelwal et al. Financial Data Time Series Forecasting Using Neural Networks and a Comparative Study
CN114021995A (en) Internal control system evaluation control model
Ramlall Artificial intelligence: neural networks simplified
PhamToan et al. Improving forecasting model for fuzzy time series using the Self-updating clustering and Bi-directional Long Short Term Memory algorithm
Huang et al. Multi-rule combination prediction of compositional data time series based on multivariate fuzzy time series model and its application
Devi et al. Multi-Agent System For Portfolio Profit Optimization For Future Stock Trading
Lazcano et al. Back to Basics: The Power of the Multilayer Perceptron in Financial Time Series Forecasting
Broadhurst et al. Data Analytics On Nasdaq Stock Prices: Reddit Social Media Case Study
Vaičiūnaitė Forecasting nonstationary and nearly nonstationary time series using machine learning methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200103)