CN115495991A

CN115495991A - Rainfall interval prediction method based on time convolution network

Info

Publication number: CN115495991A
Application number: CN202211197406.4A
Authority: CN
Inventors: 冯钧; 邵萍萍; 王文鹏; 丁昱凯; 严乐
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2022-09-29
Filing date: 2022-09-29
Publication date: 2022-12-20

Abstract

The invention discloses a rainfall interval prediction method based on a time convolution network (TCN). The method comprises: collecting rainfall data and meteorological factor data to form an initial time-series data set and preprocessing the data; analyzing the characteristics of the meteorological factors, constructing a feature extraction algorithm, obtaining the association between the meteorological factors and precipitation, and using a maximum information coefficient-asynchronous principal component analysis algorithm to screen out the meteorological factor time series that strongly influence precipitation; and constructing a TCN model for predicting basin areal rainfall, training the model on the training set, tuning the model parameters and evaluating the model until it converges, and finally predicting on the test set. The method exploits the ability of the TCN to capture effective long time-series features and uses the optimized LUBE coverage-width evaluation criterion as one of the training loss objectives to generate probabilistic interval predictions of future observations, which effectively improves the precision of rainfall prediction.

Description

Rainfall interval prediction method based on time convolution network

Technical Field

The invention belongs to the technical field of precipitation forecast, and particularly relates to a precipitation interval prediction method based on a time convolution network.

Background

Medium- and long-term precipitation prediction is an important basis for the precise management of water resources in a river basin, and data-driven models are one of the main ways of carrying it out. Monthly rainfall determines the start, duration and end of the rainy season and describes the distribution of rainfall within a year more accurately than seasonal rainfall, so rainfall prediction needs to be studied on a monthly time scale. However, the sample size of monthly rainfall statistics is limited, while the dimensionality of the available predictors is high. Screening out highly correlated meteorological factors under a limited sample size and constructing a robust data-driven model are therefore the keys to successful medium- and long-term rainfall prediction.

The models commonly used in precipitation prediction are either physically (process) driven or data driven. A physically driven model is based on a hydrological conceptual model: it analyzes the factors that form precipitation and establishes general physical equations suited to a given region to simulate the precipitation process. Its parameters can be calibrated only through experience, manual interaction and continuous iteration, so prediction accuracy depends on the knowledge and experience of the modeler and on the completeness of the data. The parameters have physical meaning and are easy to interpret, but the complex factors that influence medium- and long-term precipitation make calibration harder. In recent years, data-driven models such as linear regression, neural networks and support vector machines have been applied to precipitation forecasting. These models treat the hydrological process as a black box and build a mapping between input and output samples without considering the physical mechanisms inside the system. Because accurate precipitation data and its influencing factors (precipitation and related meteorological factors) cannot be obtained, the forecasting performance of such intelligent models is limited. In precipitation prediction every meteorological factor is uncertain and the factors are correlated with one another, so physically driven and data-driven models each have their own shortcomings.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to provide a rainfall interval prediction method based on a time convolution network, which is suitable for the prediction of the rainfall interval in a medium-term and long-term period and has higher prediction precision.

The technical scheme is as follows: the invention provides a rainfall interval prediction method based on a time convolution network, which comprises the following steps:

(1) Collecting precipitation data and meteorological factor data to form an initial time sequence data set, and preprocessing the data;

(2) Analyzing the characteristics of meteorological factors, constructing a characteristic extraction algorithm, acquiring the incidence relation between the meteorological factors and precipitation, and screening out a time sequence of the meteorological factors which have great influence on precipitation by adopting a maximum information coefficient-asynchronous principal component analysis algorithm;

(3) Constructing a precipitation interval prediction model based on a time convolution network: dividing the screened data into a training set and a test set by the hold-out method, then sampling with a sliding window to prepare the input data groups for the model;

(4) According to the input historical precipitation and highly-correlated historical meteorological factor sequence, a TCN time convolution network model is selected to construct a precipitation interval prediction model based on a time convolution network, and an LUBE interval prediction method is introduced to output an interval prediction value so as to train the precipitation interval prediction model based on the time convolution network and complete comprehensive modeling of a precipitation process;

(5) Inputting new precipitation data into the trained precipitation interval prediction model based on the time convolution network to complete the prediction of the precipitation interval.

Further, the preprocessing in step (1) includes data cleaning and completion of missing data, where missing values are filled by linear interpolation along the time dimension.

Further, in step (2), the time series of the meteorological factors that strongly influence precipitation are screened out with the maximum information coefficient-asynchronous principal component analysis algorithm as follows:

an interpolated, asynchronously correlated time series is constructed by dynamic time warping (DTW): organize a data set $X_{m \times n}$, where $m$ is the length of the univariate time series in each column and $n$ is the number of attributes in each row; normalize each time series in the data matrix $X$ with the Z-score method so that the normalized $X'_i$ has zero mean and unit standard deviation, i.e. $X'_i \sim N(0, 1)$; compute by DTW the optimal warping path between the normalized $X'_i$ and $X'_j$, a sequence of $k$ aligned pairs $p_l = (p_l(1), p_l(2))$, $l = 1, \dots, k$.

The elements of the optimal warping path form the elements of the interpolated time series: $X''_i = \{p_1(1), p_2(1), \dots, p_k(1)\}$ and $X''_j = \{p_1(2), p_2(2), \dots, p_k(2)\}$. All normalized time series are extended in this way into time series that reflect the corresponding asynchronous correlation, giving the new interpolated time series $X = X''$. The scatter plot of two new time series (point set $D$) is partitioned by an $x \times y$ grid $G$: the points are divided into $x$ bins by their $X''_i$ values and into $y$ bins by their $Y$ values. Compute the frequency distribution $D|G$ of the points of $D$ falling in the cells of a given grid $G$. Choose an upper bound $B(n)$ on the grid size, where $B(n)$ is a function of the sample size and constrains the total number of cells of $G$ to be smaller than $B(n)$; here $B(n) = n^{0.6}$. Different grids $G$ give different probability distributions, and $I(D|G)$ denotes the mutual information of the point set under the distribution $D|G$. Over all possible distributions $D|G$ of the $x \times y$ grids $G$, find the maximum mutual information $\max I(D|G)$ and record it as $I'(D, m, n)$, which gives the characteristic matrix $I'(D)$ of the two-dimensional data set $D$;

the correlation between each predictor and the actual precipitation sequence is then measured by the maximum information coefficient; the factors strongly correlated with the actual precipitation are screened out and recorded as high-information factors, and factors carrying largely overlapping information are removed from them.

Further, in step (4), the TCN time convolution network model is used to construct the precipitation interval prediction model based on the time convolution network as follows:

in the case of a univariate sequence, given a one-dimensional precipitation input sequence $X$ and a filter function $\omega: \{0, \dots, k-1\} \to \mathbb{R}$ of size $k$, the dilated causal convolution operation $f$ on successive layers is defined on element $s$ of the sequence as:

$$f(s) = (X *_d \omega)(s) = \sum_{i=0}^{k-1} \omega(i) \cdot X_{s - d \cdot i}$$

where $s$ is an element of the sequence and $d$ is the dilation parameter, which increases exponentially with network depth, $d = 2^i$ at level $i$ of the network; $s - d \cdot i$ points in the direction of the past; and $*_d$ denotes the dilated convolution operation, to distinguish it from the ordinary convolution operation.

Further, the implementation process of introducing the LUBE interval prediction method to output the predicted value of the interval in the step (4) is as follows:

the LUBE interval prediction method is evaluated with two indices: PICP measures the probability that the actual observation lies between the lower and upper bounds of the prediction interval, and PINAW measures the width between the upper and lower bounds of the prediction interval:

$$\mathrm{PICP} = \frac{1}{n} \sum_{k=1}^{n} c_k, \qquad c_k = \begin{cases} 1, & y_k \in [L(x_k), U(x_k)] \\ 0, & \text{otherwise} \end{cases}$$

$$\mathrm{PINAW} = \frac{1}{n\,a} \sum_{k=1}^{n} \bigl( U(x_k) - L(x_k) \bigr)$$

where $U(x_k)$ is the upper bound of the interval prediction, $L(x_k)$ is the lower bound, $y_k$ is the measured value, $n$ is the number of samples of the measured value, and $a$ is the range of the target variable, i.e. the difference between its maximum and minimum values; the CWC, which measures both width and coverage probability, can then be defined as:

adding penalty parameters, the index PCWC, which considers the width, the coverage probability and the mean deviation, is defined as:

where the parameter $\tau$ linearly amplifies PINAW and a second parameter exponentially amplifies the difference between PICP and $e^{-\eta(\mathrm{PICP}-\mu)}$; if PCWC becomes too small, the hyperparameters $\varepsilon$ and $\tau$ keep it from vanishing.

Advantageous effects: compared with the prior art, the invention has the following beneficial effects. The method exploits the ability of the TCN to capture effective long time-series features and uses the optimized LUBE coverage-width evaluation criterion as one of the training loss objectives to generate probabilistic interval predictions of future observations, which effectively improves the precision of rainfall prediction. The input data are preprocessed, an interpolated time series is constructed by DTW, and the MIC matrix is introduced into the APCA method for feature screening, which further mines the asynchronous correlation between meteorological factors and rainfall and is better suited to feature selection among high-dimensional meteorological factors. The method is better suited to medium- and long-term precipitation interval forecasting and achieves higher forecasting precision than models such as the traditional support vector machine.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of causal dilation convolution in a TCN time convolution network model in an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

As shown in FIG. 1, the invention provides a rainfall interval prediction method based on a time convolution network. First, hydrological data of the study basin are collected and stored in a historical database. Second, the historical hydrological data are preprocessed (missing-value completion, correction of anomalous data and normalization), and after feature screening they are divided into a training set and a test set. A rainfall interval prediction model based on a time convolution network is then constructed and trained on the training set, the model parameters are tuned and the model is evaluated until it converges, and finally predictions are made on the test set. The method specifically comprises the following steps:

Step 1: Precipitation data and meteorological factor data are collected to form an initial time-series data set, the data are cleaned, and missing data are filled in.

Hydrological data of the target basin (Qijiang basin) for 1979-2019 are collected at a granularity of one record per month. Each record contains the precipitation in the basin, atmospheric circulation factors, sea temperature factors and other meteorological influence factors. The data are stored in a historical database, then retrieved and preprocessed, including completion of missing data. Missing data are filled by linear interpolation along the time dimension, i.e. a missing value is determined from the values at the time points before and after the gap.
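The linear interpolation along the time dimension can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `fill_missing_linear`, the use of `None` for a missing month, the sample values, and the rule that gaps at the start or end copy the nearest observed value are all assumptions made for the example.

```python
from typing import List, Optional


def fill_missing_linear(series: List[Optional[float]]) -> List[float]:
    """Fill None gaps by linear interpolation between the nearest observations.

    Gaps at the start or end have only one neighbour, so they copy it
    (an assumption; the text does not say how edges are handled).
    """
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    # Edges: copy the nearest observed value.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: straight line between the bracketing observations.
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled


# Made-up monthly precipitation values with three missing months.
monthly_precip = [80.0, None, None, 110.0, 95.0, None]
print(fill_missing_linear(monthly_precip))
# [80.0, 90.0, 100.0, 110.0, 95.0, 95.0]
```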

Step 2: Analyze the characteristics of the meteorological factors, construct a feature extraction algorithm for the input data, use maximum information coefficient-asynchronous principal component analysis to screen out the meteorological factor time series that strongly influence precipitation, and obtain the association between the meteorological factors and precipitation.

An interpolated, asynchronously correlated time series is constructed by DTW. The correlation between each predictor and the actual precipitation sequence is measured by the maximum information coefficient, and the factors strongly correlated with the actual precipitation are screened out; these are generally high-information factors with an obvious influence on the forecast variable. Factors carrying largely overlapping information are then removed from the high-information factors. The specific process is as follows:

Organize a data set (or data matrix) $X_{m \times n}$, where $m$ is the length of the univariate time series in each column and $n$ is the number of attributes (or variables) in each row. Normalize each time series in the data matrix $X$ with the Z-score method so that the normalized $X'_i$ has zero mean and unit standard deviation, i.e. $X'_i \sim N(0, 1)$. Then obtain the interpolated time series: compute by DTW the optimal warping path between the normalized $X'_i$ and $X'_j$, a sequence of $k$ aligned pairs $p_l = (p_l(1), p_l(2))$, $l = 1, \dots, k$.

The elements of the optimal warping path form the elements of the interpolated time series: $X''_i = \{p_1(1), p_2(1), \dots, p_k(1)\}$ and $X''_j = \{p_1(2), p_2(2), \dots, p_k(2)\}$. All normalized time series are extended in this way into time series that reflect the corresponding asynchronous correlation, giving the new interpolated time series $X = X''$. The scatter plot of two new time series (point set $D$) is partitioned by an $x \times y$ grid $G$: the points are divided into $x$ bins by their $X''_i$ values and into $y$ bins by their $Y$ values. Compute the frequency distribution $D|G$ of the points of $D$ falling in the cells of a given grid $G$. Choose an upper bound $B(n)$ on the grid size, where $B(n)$ is a function of the sample size and constrains the total number of cells of $G$ to be smaller than $B(n)$; here $B(n) = n^{0.6}$. Different grids $G$ give different probability distributions, and $I(D|G)$ denotes the mutual information of the point set under the distribution $D|G$. Over all possible distributions $D|G$ of the $x \times y$ grids $G$, find the maximum mutual information $\max I(D|G)$ and record it as $I'(D, m, n)$, which gives the characteristic matrix $I'(D)$ of the two-dimensional data set $D$. The maximum of these results is the maximum information coefficient.
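The Z-score normalization and the DTW-based construction of the interpolated series can be sketched as below. This is an illustrative sketch under stated assumptions: it uses the textbook DTW recurrence with absolute difference as the local cost and no window constraint, neither of which the text specifies, and the names `zscore`, `dtw_path` and `interpolated_series` as well as the sample data are hypothetical.

```python
import math
from typing import List, Tuple


def zscore(x: List[float]) -> List[float]:
    """Scale a series to zero mean and unit (population) standard deviation."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    return [(v - mean) / std for v in x]  # assumes the series is not constant


def dtw_path(a: List[float], b: List[float]) -> List[Tuple[int, int]]:
    """Return the optimal warping path as a list of aligned index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Backtrack from the last cell, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def interpolated_series(a: List[float], b: List[float]):
    """Expand both series along the warping path so they have equal length k."""
    path = dtw_path(a, b)
    return [a[i] for i, _ in path], [b[j] for _, j in path]


# Made-up example: the second series lags the first by one step.
factor = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
precip = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
xi, xj = interpolated_series(zscore(factor), zscore(precip))
print(len(xi), len(xj))
```

After warping, the lagged peak in `precip` lines up with the peak in `factor`, which is the asynchronous correlation the grid-based statistic is then computed on.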

Compute the maximum information coefficient $\mathrm{MIC}(X''_i, Y)$ between each meteorological predictor and the measured precipitation, sort the values in descending order, and screen out the top-ranked factors to form a new factor set $X' = \{X_1, X_2, \dots, X_r\}$, where $r \le n$. Compute the MIC values $\mathrm{MIC}(X''_i, X''_j)$ ($i, j \le n$) between the factors in the factor set X to form the MIC characteristic matrix. Compute the eigenvalues of the MIC characteristic matrix and arrange them in descending order, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m \ge 0$. Determine the eigenvector $e_i$ ($i = 1, 2, \dots, m$) corresponding to each eigenvalue $\lambda_i$, where $\|e_i\| = 1$.
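The grid-based statistic behind these MIC values can be approximated as in the sketch below. Several simplifications are assumptions of this example rather than part of the described method: only equal-frequency (rank-based) grids are searched instead of all partitions, each mutual information value is normalized by $\log_2 \min(x, y)$ as in the standard MIC definition, and the function names are hypothetical. It keeps the constraint that the number of grid cells stays below $B(n) = n^{0.6}$.

```python
import math
from collections import Counter
from typing import List


def _rank_bins(values: List[float], bins: int) -> List[int]:
    """Assign each value to one of `bins` equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank * bins // len(values)
    return out


def grid_mutual_information(xb: List[int], yb: List[int]) -> float:
    """Mutual information (in bits) of the cell frequencies of one grid."""
    n = len(xb)
    joint = Counter(zip(xb, yb))
    px, py = Counter(xb), Counter(yb)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def approx_mic(x: List[float], y: List[float]) -> float:
    """Largest normalized grid mutual information over the allowed grids."""
    n = len(x)
    budget = n ** 0.6  # B(n) = n^0.6, upper bound on the number of cells
    best = 0.0
    for gx in range(2, int(budget) + 1):
        for gy in range(2, int(budget) + 1):
            if gx * gy >= budget:
                continue  # total number of cells must stay below B(n)
            mi = grid_mutual_information(_rank_bins(x, gx), _rank_bins(y, gy))
            best = max(best, mi / math.log2(min(gx, gy)))
    return best


# Made-up data: a perfectly monotone relation and a scrambled one.
x = [float(i) for i in range(30)]
y_dependent = [2.0 * v + 1.0 for v in x]
y_scrambled = [float((7 * i) % 30) for i in range(30)]
print(approx_mic(x, y_dependent), approx_mic(x, y_scrambled))
```

Because ties are split arbitrarily by rank, this sketch is only meaningful for series with mostly distinct values. A full implementation would search the grid partitions as well as the grid sizes.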

Compute the principal component contribution rate and the cumulative contribution rate of the eigenvalues:

$$L_1 = \frac{\lambda_i}{\sum_{k=1}^{m} \lambda_k}, \qquad L_2 = \frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{m} \lambda_k}$$

where $L_1$ is the principal component contribution rate, $L_2$ is the cumulative contribution rate, and $i = 1, 2, \dots, m$. The $q$ principal components corresponding to the first $q$ eigenvalues, whose cumulative contribution rate reaches 80%-95%, are generally taken, giving the final predictor set $X'' = \{X''_1, X''_2, \dots, X''_q\}$. Because $q < m$, the dimensionality is reduced.
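The choice of $q$ from the contribution rates can be illustrated with a short sketch. The eigenvalues below are made up, the 0.85 threshold is just one assumed value inside the 80%-95% band, and the function names are hypothetical. Computing the eigenvalues of the MIC matrix itself is left out.

```python
from typing import List, Tuple


def contribution_rates(eigenvalues: List[float]) -> Tuple[List[float], List[float]]:
    """Return (individual, cumulative) contribution rates, largest eigenvalue first."""
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    individual = [v / total for v in lam]
    cumulative, running = [], 0.0
    for v in lam:
        running += v
        cumulative.append(running / total)
    return individual, cumulative


def choose_q(eigenvalues: List[float], threshold: float = 0.85) -> int:
    """Smallest q whose cumulative contribution rate reaches the threshold."""
    _, cumulative = contribution_rates(eigenvalues)
    for q, rate in enumerate(cumulative, start=1):
        if rate >= threshold:
            return q
    return len(cumulative)


# Made-up eigenvalues of an MIC matrix with m = 5 factors.
eigs = [2.5, 1.5, 0.6, 0.3, 0.1]
print(choose_q(eigs))  # 3: the first three components reach 92% of the total
```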

Step 3: Construct a rainfall interval prediction model based on a time convolution network, divide the screened data into a training set and a test set by the hold-out method, then sample with a sliding window to prepare the input data groups of the model.

The feature-screened data are divided by the hold-out method into two mutually exclusive sets, a training set and a test set, in a ratio of 8:2: the training set covers 1979-2011 and the test set covers 2011-2019. The model input data are prepared by sliding-window sampling.
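A chronological hold-out split followed by sliding-window sampling might look like the sketch below. The window length, the one-step forecast horizon, the position of the precipitation column and the function names are assumptions chosen for illustration; the text does not state them.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def holdout_split(rows: List[Row], train_fraction: float = 0.8):
    """Chronological hold-out: the earliest rows train, the latest rows test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def sliding_windows(
    rows: List[Row], target_index: int, window: int, horizon: int = 1
) -> List[Tuple[List[Row], float]]:
    """Pair each `window` of consecutive rows with the target `horizon` steps later."""
    samples = []
    for start in range(len(rows) - window - horizon + 1):
        history = rows[start:start + window]
        target = rows[start + window + horizon - 1][target_index]
        samples.append((history, target))
    return samples


# Made-up data: column 0 is precipitation, column 1 is one screened factor.
rows = [[float(t), float(10 * t)] for t in range(10)]
train, test = holdout_split(rows)
samples = sliding_windows(train, target_index=0, window=3)
print(len(train), len(test), len(samples))  # 8 2 5
```

Splitting before windowing keeps every test target strictly later than the training data, so no window straddles the boundary.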

Step 4: Based on the input historical precipitation and the highly correlated historical meteorological factor sequences, a TCN time convolution network model is used to construct the precipitation interval prediction model, and the LUBE interval prediction method is introduced to output the interval prediction, so that the precipitation interval prediction model based on the time convolution network is trained and comprehensive modeling of the precipitation process is completed.

A TCN time convolution network model is constructed from dilated causal convolutions and residual connections. The dilated causal convolution is shown in FIG. 2. Causal convolution means that the output at time t is computed only from inputs no later than t, so it acts as a filter in time; unlike standard convolution, the output at time t does not depend on future values, which allows the causal relationship between historical data and future predicted values to be captured better. The TCN uses a 1D FCN (one-dimensional fully convolutional network) architecture in which every hidden layer has the same length as the input layer, and zero padding is used to keep subsequent layers the same length. A longer prediction horizon makes the target value harder to predict. Dilated causal convolution allows input values to be skipped at a given step: it differs from conventional convolution in that the input of the convolution is sampled at intervals. This enlarges the input range of the network, giving access to longer input subsequences. The concrete structure is as follows:

In the case of a univariate sequence, given a one-dimensional precipitation input sequence $X$ and a filter function $\omega: \{0, \dots, k-1\} \to \mathbb{R}$ of size $k$, the dilated causal convolution operation $f$ of a hidden layer is defined on element $s$ of the sequence as:

$$f(s) = (X *_d \omega)(s) = \sum_{i=0}^{k-1} \omega(i) \cdot X_{s - d \cdot i}$$

where $s$ is an element of the sequence, $d$ is the dilation parameter, and $s - d \cdot i$ points in the direction of the past. $*_d$ denotes the dilated convolution operation, to distinguish it from the ordinary convolution operation; the ordinary convolution operator ($*$) is the special case of the dilated convolution with $d = 1$. The receptive field of a causal network can be enlarged in two ways: by increasing the filter size $k$, or by increasing the dilation factor $d$. In this study, $d$ increases exponentially with network depth to ensure that a long history is covered effectively; specifically, $d = 2^i$ at level $i$ of the network.
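A single dilated causal convolution over a one-dimensional sequence can be sketched in plain Python as below. The weights and input are made up, positions before the start of the sequence are treated as zero padding, and the receptive-field helper assumes exactly one convolution per level with $d = 2^i$; a real TCN residual block would normally stack more than one convolution per level.

```python
from typing import List


def dilated_causal_conv(x: List[float], weights: List[float], dilation: int) -> List[float]:
    """Output at s sums weights[i] * x[s - dilation * i] over i = 0 .. k-1."""
    k = len(weights)
    out = []
    for s in range(len(x)):
        total = 0.0
        for i in range(k):
            idx = s - dilation * i
            if idx >= 0:  # positions before the start act as zero padding
                total += weights[i] * x[idx]
        out.append(total)
    return out


def receptive_field(kernel_size: int, levels: int) -> int:
    """History covered by stacked levels with d = 2**i, one convolution per level."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(levels))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dilated_causal_conv(x, weights=[1.0, 1.0], dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]  (each output is x[s] + x[s - 2])
print(receptive_field(kernel_size=2, levels=3))  # 8
```

With `dilation=1` the same function reduces to an ordinary causal convolution, matching the special case noted above.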

The LUBE (lower upper bound estimation) interval prediction method is introduced at the output layer to quantify the uncertainty of the prediction result and generate probabilistic interval predictions of future observations. The coverage-width evaluation criterion (PCWC) of LUBE is optimized, which effectively improves the precision of rainfall prediction. Specifically:

The LUBE interval prediction method has two main evaluation indices: PICP measures the probability that the actual observation lies between the lower and upper bounds of the prediction interval, and PINAW measures the width between the upper and lower bounds of the prediction interval:

$$\mathrm{PICP} = \frac{1}{n} \sum_{k=1}^{n} c_k, \qquad c_k = \begin{cases} 1, & y_k \in [L(x_k), U(x_k)] \\ 0, & \text{otherwise} \end{cases}$$

$$\mathrm{PINAW} = \frac{1}{n\,a} \sum_{k=1}^{n} \bigl( U(x_k) - L(x_k) \bigr)$$

where $U(x_k)$ is the upper bound of the interval prediction, $L(x_k)$ is the lower bound, $y_k$ is the measured value, $n$ is the number of samples of the measured value, and $a$ is the range of the target variable, i.e. the difference between its maximum and minimum values. The CWC, which measures both width and coverage probability, can then be defined as:

In rainfall prediction, under the influence of high-dimensional meteorological factors, the accuracy of the prediction result is often strongly affected by the features, and the balance between interval coverage and interval width is difficult to adjust through φ. Penalty parameters are therefore added, and an index that considers the width, the coverage probability and the mean deviation can be defined as:

where the parameter $\tau$ linearly amplifies PINAW and a second parameter exponentially amplifies the difference between PICP and $e^{-\eta(\mathrm{PICP}-\mu)}$; if PCWC becomes too small, the hyperparameters $\varepsilon$ and $\tau$ keep it from vanishing.
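The two interval indices described above can be computed as in the sketch below. The `cwc` function uses the form commonly given in the LUBE literature, $\mathrm{CWC} = \mathrm{PINAW}\,(1 + \gamma\, e^{-\eta(\mathrm{PICP}-\mu)})$ with $\gamma = 1$ only when PICP falls below the nominal coverage $\mu$; that form, the default values of `mu` and `eta`, and the sample data are assumptions of this example. The penalized PCWC variant is not implemented because its exact form is not assumed here.

```python
import math
from typing import List


def picp(y: List[float], lower: List[float], upper: List[float]) -> float:
    """Share of observations that fall inside their prediction interval."""
    hits = sum(1 for v, lo, hi in zip(y, lower, upper) if lo <= v <= hi)
    return hits / len(y)


def pinaw(y: List[float], lower: List[float], upper: List[float]) -> float:
    """Mean interval width, normalized by the range of the target variable."""
    value_range = max(y) - min(y)
    total_width = sum(hi - lo for lo, hi in zip(lower, upper))
    return total_width / (len(y) * value_range)


def cwc(y, lower, upper, mu: float = 0.9, eta: float = 50.0) -> float:
    """Coverage width criterion in its commonly cited form (assumed here)."""
    coverage = picp(y, lower, upper)
    gamma = 1.0 if coverage < mu else 0.0  # penalize only insufficient coverage
    return pinaw(y, lower, upper) * (1.0 + gamma * math.exp(-eta * (coverage - mu)))


# Made-up observations and interval bounds.
obs = [10.0, 20.0, 30.0, 40.0]
low = [5.0, 18.0, 32.0, 35.0]
high = [15.0, 25.0, 38.0, 45.0]
print(picp(obs, low, high))   # 0.75: the third observation is not covered
print(pinaw(obs, low, high))  # total width 33 over 4 samples and range 30
print(cwc(obs, low, high))
```

Used as a training loss, a criterion of this kind rewards narrow intervals but grows sharply once coverage drops below the nominal level.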

Step 5: Input new precipitation data into the trained precipitation interval prediction model based on the time convolution network to complete the prediction of the precipitation interval.

The optimized PCWC index is used for interval prediction training of the model. In addition, two evaluation criteria are used to evaluate model performance, the root mean square error (RMSE) and the Nash efficiency coefficient (NSE):

Root mean square error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

Nash efficiency coefficient:

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

where $y_i$ is the observed value at time $i$, $\hat{y}_i$ is the predicted value at time $i$, $\bar{y}$ is the mean of the observations, and $N$ is the number of observations in the test period.
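Both criteria are straightforward to compute; the sketch below uses their standard definitions on made-up observed and predicted values, with hypothetical function names.

```python
import math
from typing import List


def rmse(obs: List[float], pred: List[float]) -> float:
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs: List[float], pred: List[float]) -> float:
    """Nash efficiency: 1 is a perfect fit, 0 matches predicting the mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
print(rmse(observed, predicted))  # sqrt(0.5), about 0.707
print(nse(observed, predicted))   # 1 - 2/20 = 0.9
```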

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, many modifications and adaptations can be made without departing from the principle of the present invention, and such modifications and adaptations should also be considered to be within the scope of the present invention.

Claims

1. A rainfall interval prediction method based on a time convolution network is characterized by comprising the following steps:

(2) Analyzing meteorological factor characteristics, constructing a characteristic extraction algorithm, acquiring the incidence relation between meteorological factors and precipitation, and screening out a time sequence of the meteorological factors which have great influence on the precipitation by adopting a maximum information coefficient-asynchronous principal component analysis algorithm;

2. The method for predicting the precipitation interval based on the time convolution network as claimed in claim 1, wherein the preprocessing of the data in the step (1) includes data cleaning and data missing completion, wherein the data missing completion adopts a linear interpolation processing mode in a time dimension.

3. The rainfall interval prediction method based on the time convolution network of claim 1, wherein the time series implementation process for screening out the meteorological factors having a large influence on the rainfall amount by using the maximum information coefficient-asynchronous principal component analysis algorithm in the step (2) is as follows:

constructing an interpolated, asynchronously correlated time series by DTW, comprising the following steps: organizing a data set $X_{m \times n}$, where $m$ is the length of the univariate time series in each column and $n$ is the number of attributes in each row; normalizing each time series in the data matrix $X$ with the Z-score method so that the normalized $X'_i$ has zero mean and unit standard deviation, i.e. $X'_i \sim N(0, 1)$; computing by DTW the optimal warping path between the normalized $X'_i$ and $X'_j$, a sequence of $k$ aligned pairs $p_l = (p_l(1), p_l(2))$, $l = 1, \dots, k$.

The elements of the optimal warping path form the elements of the interpolated time series: $X''_i = \{p_1(1), p_2(1), \dots, p_k(1)\}$ and $X''_j = \{p_1(2), p_2(2), \dots, p_k(2)\}$; all normalized time series are extended in this way into time series that reflect the corresponding asynchronous correlation, giving the new interpolated time series $X = X''$; the scatter plot of two new time series (point set $D$) is partitioned by an $x \times y$ grid $G$, the points being divided into $x$ bins by their $X''_i$ values and into $y$ bins by their $Y$ values; the frequency distribution $D|G$ of the points of $D$ falling in the cells of a given grid $G$ is computed, with an upper bound $B(n)$ on the grid size, where $B(n)$ is a function of the sample size and constrains the total number of cells of $G$ to be smaller than $B(n)$, $B(n) = n^{0.6}$; different grids $G$ give different probability distributions, and $I(D|G)$ denotes the mutual information of the point set under the distribution $D|G$; over all possible distributions $D|G$ of the $x \times y$ grids $G$, the maximum mutual information $\max I(D|G)$ is found and recorded as $I'(D, m, n)$, which gives the characteristic matrix $I'(D)$ of the two-dimensional data set $D$;

4. The method for predicting the precipitation interval based on the time convolution network is characterized in that in the step (4), the TCN time convolution network model is selected to construct the precipitation interval prediction model based on the time convolution network, and the implementation process comprises the following steps:

5. The method for predicting the precipitation interval based on the time convolutional network of claim 1, wherein the step (4) of introducing the LUBE interval prediction method to output the interval prediction value is implemented as follows:

where $U(x_k)$ is the upper bound of the interval prediction, $L(x_k)$ is the lower bound of the interval prediction, $n$ is the number of samples of the measured value, and $a$ is the range of the target variable, i.e. the difference between its maximum and minimum values; the CWC, which measures both width and coverage probability, can be defined as:

where the parameter $\tau$ linearly amplifies PINAW and a second parameter exponentially amplifies the difference between PICP and $e^{-\eta(\mathrm{PICP}-\mu)}$; if PCWC becomes too small, the hyperparameters $\varepsilon$ and $\tau$ keep it from vanishing.