CN115310674A - Long-time sequence prediction method based on parallel neural network model LDformer - Google Patents
- Publication number
- CN115310674A (application number CN202210834021.8A)
- Authority
- CN
- China
- Prior art keywords
- attention
- data
- layer
- prediction
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
Long-time sequence prediction is an important problem with applications in many fields, such as stocks, traffic, and power. Existing time sequence prediction methods suffer from high time complexity, a large number of parameters, and low prediction precision, and are therefore not suitable for high-precision long-term prediction of real-world data. To address these problems, the invention provides a parallel time series prediction model, LDformer. First, the Informer framework is combined with an LSTM so that the deep characteristics of the time series are fully considered. Then, a probability sparse attention mechanism combined with UniDrop is provided, which reduces the risk of losing key connections in the sequence. Finally, taking the stability of the data and the number of parameters into consideration, features are extracted through one-dimensional convolution in the distillation operation. Experimental results for different prediction lengths on three power data sets, ETTm1, ETTh1 and ETTh2, show that the proposed method is superior to state-of-the-art baselines in long-time sequence prediction, and ablation experiments verify the effectiveness of the key component designs.
Description
Technical Field
The invention relates to the field of power prediction, in particular to a long-time sequence prediction method based on a parallel neural network model LDformer.
Background
With the advent of the big data age, data has penetrated every industry. Sensors and applications of many kinds continuously collect large-scale time series, such as sales of goods in retail stores and supermarkets, passenger flow in railway and aviation departments, traffic flow in cities, load demand in electric power departments, stock prices in finance, and weather conditions in meteorology. The power distribution problem is that the grid must manage the distribution of power to different customer areas according to demand that changes over time. The future demand of a particular customer area is difficult to predict because it varies with factors such as weekdays, holidays, season, weather, and temperature. Existing time sequence prediction methods are not suited to high-precision long-term prediction on real-world data, and any wrong prediction can have serious consequences. Because there is currently no effective way to predict future power usage, managers have to make decisions based on empirical values, which are typically much higher than the actual demand. Such conservative strategies waste power and cause unnecessary equipment depreciation. It is worth noting that the oil temperature of a transformer effectively reflects the working condition of the power transformer. Long-time sequence predictive modeling is therefore the key to solving this problem. However, long-time sequence prediction still faces serious challenges: most time sequence models target short-term prediction and do not perform well on long time sequences, which involve a large amount of historical data, high computational complexity, and high accuracy requirements. Research on long-time sequence prediction is therefore particularly important.
Most current research on time series prediction concerns short time series prediction based on machine learning and deep learning. In machine learning research, many scholars adopt ARIMA and SVM, but these models are better suited to stationary time series, and real time series data is almost never purely stationary, so the application of these models is limited by the data characteristics and their generality is poor. Scholars have also proposed a Bayesian Temporal Factorization (BTF) framework for modeling multidimensional time series, in particular spatio-temporal data, in the presence of missing values, but machine learning methods cannot obtain sufficiently accurate results for complex prediction problems. With the development of deep learning, researchers have found that deep learning is better suited to complex problems. In recent years, the Transformer has been applied to long-term sequence prediction tasks in many fields. However, it has high time complexity and a large number of parameters. Researchers therefore proposed Informer, an improved algorithm based on the Transformer, but its attention mechanism may lose some key connections in the sequence, and the prediction accuracy still needs to be improved. The invention therefore makes further improvements on the basis of Informer.
Disclosure of Invention
The invention improves the prediction precision of long-time sequence prediction and overcomes the defects of the traditional Transformer model, such as high time complexity, a large number of parameters, low running speed, and a tendency to lose key connections within sequences. It provides a long-time sequence prediction method based on a parallel neural network model LDformer, which predicts future values from existing historical data.
The invention mainly comprises five parts: (1) determining the input and output of the model; (2) preprocessing the data set; (3) determining the temporal characteristics of the data and encoding them; (4) constructing a parallel neural network model LDformer for long-time sequence prediction; (5) verifying the validity of the method.
The contents of the above five parts are described as follows:
1. Determining the input and output of the model. A power data set is input to the method, with each data point consisting of the target value "Oil Temperature (OT)" and 6 different types of external load values: "High UseFul Load (HUFL)", "High UseLess Load (HULL)", "Middle UseFul Load (MUFL)", "Middle UseLess Load (MULL)", "Low UseFul Load (LUFL)", and "Low UseLess Load (LULL)". An appropriate training data set is selected to predict the target value "OT" from the 6 external load values. A mini-batch of m samples of the six features $X^{(1)},X^{(2)},X^{(3)},X^{(4)},X^{(5)},X^{(6)}$ is drawn from the training set and used to predict a sequence of n target values "OT".
2. Preprocessing the data set. The data set preprocessing mainly consists of standardization. Because time series data collected in power measurement contain outliers and considerable noise, standardization is used to center the data, which indirectly limits the influence of outliers and extreme values.
3. Determining the temporal characteristics of the data and encoding them from multiple angles. In the long-time sequence prediction modeling problem, not only local timing information but also hierarchical timing information, such as week, month, and year, and burst timestamp information (events, certain holidays, etc.) is required. The conventional self-attention mechanism is difficult to adapt directly, which can cause a mismatch between queries and keys between the encoder and the decoder and ultimately degrade the prediction.
4. Constructing a parallel neural network model LDformer for long-time sequence prediction. The LDformer consists of a multi-angle embedding layer (Embedding), a long short-term memory network (LSTM), an encoder (Encoder), and a decoder (Decoder). The embedding layer works from three angles: data coding, position coding, and time stamp coding are performed separately, and each result is expanded to a uniform dimension d_model. The LSTM receives the input data and performs feature extraction to capture deep representations of the time series. The data then enters the encoder, which adopts a multi-channel parallel mode to improve the robustness of the model. To accept long sequence inputs, the encoder uses probability sparse self-attention combined with the UniDrop technique, which reduces the number of parameters and overfitting. A distillation operation is added between the encoder modules to reduce the redundant combinations of the value V in the encoder's feature map. The decoder accepts a long sequence input and generates the predicted output elements in a single forward pass.
5. Verifying the validity of the method. Experiments on a real power data set and comparison with other state-of-the-art work show that the prediction accuracy of the method in the long-time sequence prediction problem is clearly higher than that of the comparison methods, and that the method improves on the shortcomings of the existing algorithms.
The detailed implementation steps adopted by the invention to realize the purpose are as follows:
Step 1: Determine the input and output of the model according to the power data set, and select an appropriate proportion to divide the data set. The model inputs are defined as six load characteristics and a target value, $\{X^{(1)},X^{(2)},X^{(3)},X^{(4)},X^{(5)},X^{(6)},Y\}$, where the six load characteristics are "High UseFul Load (HUFL)", "High UseLess Load (HULL)", "Middle UseFul Load (MUFL)", "Middle UseLess Load (MULL)", "Low UseFul Load (LUFL)", and "Low UseLess Load (LULL)". The target value is the Oil Temperature (OT).
Step 2: Data preprocessing. The input training data set is first standardized. The standardization uses StandardScaler() so that each dimension of the data has variance 1 and mean 0, and the test results are not dominated by features whose values are too large in certain dimensions. The conversion function is
$x^{*}=\frac{x-\mu}{\sigma}$
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
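As a minimal sketch of the standardization in step 2, the following NumPy code applies the conversion (x − μ)/σ column by column. The function names `fit_standardizer` and `standardize` and the guard for constant columns are illustrative assumptions; the description itself only names StandardScaler().

```python
import numpy as np


def fit_standardizer(train):
    """Return per-column mean and standard deviation of the training data."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)  # population standard deviation (ddof=0)
    # Assumption: constant columns are left unscaled to avoid division by zero.
    sigma = np.where(sigma == 0.0, 1.0, sigma)
    return mu, sigma


def standardize(x, mu, sigma):
    """Apply x* = (x - mu) / sigma with previously fitted statistics."""
    return (x - mu) / sigma
```

Fitting the statistics on the training split and reusing them for validation and test data keeps information from the held-out data out of the preprocessing.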
Step 3: The training data set obtained in step 2 enters the embedding layer (Embedding). In the long-time sequence prediction modeling problem, not only local timing information but also hierarchical timing information is required. The invention therefore works from three angles: data coding, position coding and time stamp coding are performed separately, each result is expanded to the uniform dimension d_model, and the results are summed to obtain the final embedding. Step 3.1 is data coding, step 3.2 is position coding, and step 3.3 is time stamp coding.
Step 3.1: Data coding. The data embedding converts the data to the uniform dimension d_model using one-dimensional convolution. The formula is as follows:
DE=conv1d(x) (1)
Step 3.2: Position coding. The elements of the input sequence are processed together rather than one by one as in an RNN. Although this increases speed, it ignores the order of the elements in the sequence, so position coding is added. The formula is as follows:
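The position coding formula is not reproduced in this text. One common choice, assumed here, is the sinusoidal encoding of the original Transformer, PE(pos, 2j) = sin(pos / 10000^(2j/d_model)) and PE(pos, 2j+1) = cos(pos / 10000^(2j/d_model)). The base 10000 and the function name `positional_encoding` are assumptions; the invention may use a different constant.

```python
import numpy as np


def positional_encoding(seq_len, d_model, base=10000.0):
    """Sinusoidal position codes of shape (seq_len, d_model); d_model must be even."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_j = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    angle = pos / np.power(base, two_j / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even feature indices
    pe[:, 1::2] = np.cos(angle)  # odd feature indices
    return pe
```

Because the codes depend only on the position index, they can be computed once and added to the data coding and time stamp coding of every sample.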
Step 3.3: Time stamp coding. The available time stamp codings are month_embedded, day_embedded, weekday_embedded, hour_embedded and minute_embedded. The data sets used in the method have time slices of 15 minutes and 1 hour respectively, so minute_embedded and hour_embedded are selected to obtain the time stamp coding result.
Step 4: Construct the parallel neural network model LDformer for long-time sequence prediction. After the data has been divided and preprocessed, the parallel neural network model LDformer is constructed for time sequence prediction. After the data passes through the embedding layer, the main steps are as follows:
Step 4.1: The LSTM receives the input data and performs feature extraction to capture deep representations of the time series. The LSTM adds a gating mechanism (an input gate, a forget gate and an output gate) to the recurrent neural network to decide which information is stored and which is discarded, which alleviates the vanishing and exploding gradient problems of ordinary recurrent neural networks when training on long sequences. In short, the LSTM performs better than an ordinary RNN on longer sequences.
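To make the gating mechanism of step 4.1 concrete, here is a minimal sketch of one time step of a standard LSTM cell in NumPy. The weight layout (gates stacked in the order input, forget, output, candidate) and the names `lstm_step`, `W`, `U` and `b` are assumptions chosen for illustration; the description does not specify the LSTM configuration used in LDformer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    x_t: (D,) input, h_prev and c_prev: (H,) previous hidden and cell state.
    W: (4H, D), U: (4H, H), b: (4H,), stacked as [input, forget, output, candidate].
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to store
    f = sigmoid(z[H:2 * H])        # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell update
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

The additive update of `c_t` is what lets gradients flow across many time steps, which is the property the step above relies on.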
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time sequence input. Its overall architecture is roughly the same as that of the Transformer encoder: it mainly comprises two sublayers, a multi-head attention layer (the probability sparse attention mechanism combined with UniDrop) and a feedforward layer consisting of two linear mappings; each sublayer is followed by a batch normalization layer, and skip connections are placed between the sublayers. The difference is that the encoder adopts a multi-channel parallel mode, in which four channels with time sequence input lengths L, L/2, L/4 and L/8 are executed in parallel, combined with distillation operations to improve model robustness. Before the output of one layer is sent to the multi-head attention module of the next layer, the distillation operation uses one-dimensional convolution to trim the dimension and reduce memory usage. The number of distillation operations is always one fewer than the number of encoder layers. The attention in the encoder is the UniDrop-combined probability sparse self-attention mechanism constructed by the invention.
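The four-channel input of step 4.2 can be sketched as follows. How each channel obtains its shorter input is not stated here; this sketch assumes each channel keeps the most recent L, L/2, L/4 and L/8 time steps, and the function name `channel_inputs` is hypothetical.

```python
import numpy as np


def channel_inputs(x, n_channels=4):
    """Split an embedded sequence x of shape (L, d) into inputs of length L, L/2, L/4, ...

    Assumption: every channel keeps the most recent time steps of the sequence.
    """
    L = x.shape[0]
    if L % (2 ** (n_channels - 1)) != 0:
        raise ValueError("sequence length must be divisible by 2**(n_channels - 1)")
    return [x[L - L // (2 ** k):] for k in range(n_channels)]
```

For an input of length 96 this yields channels of length 96, 48, 24 and 12, each of which would then pass through its own encoder stack before the outputs are combined.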
Step 4.2.1: The probability sparse self-attention mechanism combined with UniDrop. Considering the time complexity and the risk of losing key connections in the sequence, the invention proposes a probability sparse attention mechanism combined with UniDrop. The canonical self-attention mechanism maps a query (Q) and a set of key-value pairs (K, V) to an output, where Q, K, V and the output are all vectors. The output is computed as a weighted sum of V, where the weight assigned to each V is computed by a compatibility function of Q with the corresponding K. The formula is as follows:
$A(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$ (4)
where $Q\in\mathbb{R}^{L_Q\times d}$, $K\in\mathbb{R}^{L_K\times d}$, $V\in\mathbb{R}^{L_V\times d}$, and d is the input dimension. Because the attention mechanism has many parameters, it overfits easily and can lose key connections within sequences, so the invention introduces the UniDrop technique. Feature dropout (FD) randomly suppresses certain neurons in the network with a given probability. FD-1 is applied to the attention weights A to improve the generalization of multi-head attention. FD-2 is applied after the activation function between the two linear transformations of the feed-forward sublayer. However, applying FD-1 directly to the weights A may drop a value A(i, j), which means that the relationship between token i and token j is ignored, so a larger FD-1 rate means a greater risk of losing critical information at some sequence positions. To mitigate this risk, dropout is added to Q, K and V separately before attention is computed. FD-4 is applied to the output features before the linear transformation. With $q_i$, $k_i$, $v_i$ denoting the i-th rows of Q, K and V after dropout, the attention of the i-th query is defined as a kernel smoother in probabilistic form:
$A(q_i,K,V)=\sum_{j}\frac{k(q_i,k_j)}{\sum_{l}k(q_i,k_l)}v_j=\mathbb{E}_{p(k_j|q_i)}[v_j]$ (5)
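The idea of applying dropout to Q, K and V before the attention weights are computed can be sketched as below. This is only an illustration under stated assumptions: it uses standard inverted dropout, a single head, and the hypothetical names `dropout` and `attention_with_qkv_dropout`; it is not the invention's exact implementation.

```python
import numpy as np


def dropout(x, p, rng, training=True):
    """Inverted dropout: zero entries with probability p and rescale the rest."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_with_qkv_dropout(Q, K, V, p=0.1, rng=None, training=True):
    """Scaled dot-product attention with dropout applied to Q, K and V first."""
    if rng is None:
        rng = np.random.default_rng()
    Q = dropout(Q, p, rng, training)
    K = dropout(K, p, rng, training)
    V = dropout(V, p, rng, training)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))  # (L_Q, L_K) attention weights
    return A @ V, A
```

Dropping features of Q, K and V perturbs every attention weight a little instead of deleting whole entries A(i, j), which is the motivation given above for not relying on FD-1 alone.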
where the attention of the i-th query to all keys is defined as a probability $p(k_j|q_i)=k(q_i,k_j)/\sum_{l}k(q_i,k_l)$, with the asymmetric exponential kernel $k(q_i,k_j)=\exp(q_i k_j^{T}/\sqrt{d})$, and the output is the combination of this probability with the values V. Dominant dot-product pairs push the attention probability distribution of the corresponding query away from the uniform distribution. If $p(k_j|q_i)$ is close to the uniform distribution $q(k_j|q_i)=1/L_K$, self-attention degenerates into a plain sum of the values V, which is redundant with respect to the input. The "similarity" between the distributions p and q can therefore be used to distinguish the "important" queries. The KL divergence is used to measure this "similarity", as follows:
$KL(q\|p)=\ln\sum_{l=1}^{L_K}e^{q_i k_l^{T}/\sqrt{d}}-\frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{T}}{\sqrt{d}}-\ln L_K$ (6)
Dropping the constant, the sparsity measurement of the i-th query can be defined as:
$M(q_i,K)=\ln\sum_{j=1}^{L_K}e^{q_i k_j^{T}/\sqrt{d}}-\frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{T}}{\sqrt{d}}$ (7)
where the first term is the Log-Sum-Exp (LSE) of $q_i$ over all the keys and the second term is their arithmetic mean. If the i-th query has a larger $M(q_i,K)$, its attention probability p is more "diverse" and it is more likely to contain the dominant dot-product pairs in the head of the long-tail self-attention distribution. However, traversing all queries to compute $M(q_i,K)$ requires every dot-product pair, i.e. quadratic $O(L_Q L_K)$ computation, and the LSE operation also has potential numerical stability problems. The measurement is therefore improved, and the final sparsity measurement formula is:
$\bar{M}(q_i,K)=\max_{j}\left\{\frac{q_i k_j^{T}}{\sqrt{d}}\right\}-\frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{T}}{\sqrt{d}}$ (8)
Taking the queries with the largest measurement gives the probability sparse self-attention combined with UniDrop. The formula is as follows:
$A(Q,K,V)=\mathrm{Softmax}\left(\frac{\bar{Q}K^{T}}{\sqrt{d}}\right)V$ (9)
where $\bar{Q}$ is a sparse matrix of the same size as Q that contains only the Top-u queries under the sparsity measurement $\bar{M}(q,K)$, i.e. the part with larger probability. Here $u=c\cdot\ln L_Q$, controlled by a constant sampling factor c.
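The Top-u query selection can be sketched in NumPy as follows. Several points are assumptions made for illustration: the measurement is taken as the maximum score minus the mean score of each query (the max-mean form used by Informer), u is rounded up to an integer, unselected queries output the mean of V as in Informer, and the full score matrix is computed for clarity even though an efficient implementation would sample keys instead. The name `prob_sparse_attention` is hypothetical, and the UniDrop dropout is omitted here.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def prob_sparse_attention(Q, K, V, c=5.0):
    """Attend only with the Top-u queries; the rest fall back to the mean of V."""
    L_Q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (L_Q, L_K), full matrix for clarity
    measure = scores.max(axis=1) - scores.mean(axis=1)  # assumed max-mean sparsity measurement
    u = min(L_Q, max(1, int(np.ceil(c * np.log(L_Q)))))  # u = c * ln(L_Q), rounded up
    top = np.argsort(-measure)[:u]                     # indices of the Top-u queries
    out = np.tile(V.mean(axis=0), (L_Q, 1))            # assumption: lazy queries get mean(V)
    out[top] = softmax(scores[top]) @ V                # active queries get full attention
    return out, top
```

With L_Q = 32 and c = 1, only 4 of the 32 queries compute a softmax over the keys, which is where the reduction in cost comes from once the scores themselves are estimated from sampled keys.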
Step 4.2.2: The distillation operation. As a natural consequence of the attention mechanism, the encoder's feature map contains redundant combinations of the value V. At the next layer, the invention therefore uses a distillation operation that privileges the dominant features and generates a focused self-attention feature map, which sharply trims the time dimension of the input. Convolutional neural networks recognize simple patterns in data well and build more complex patterns in higher layers. One-dimensional convolution is very effective at extracting features of interest from data in which the position of the feature is not highly relevant, and it is well suited to time series analysis of sensor data. One-dimensional convolution is therefore selected to extract features, and the convolution kernel is set to 3x3. The distillation operation proceeds from the j-th layer to the (j+1)-th layer. The formula is as follows:
where $[\cdot]_{AB}$ denotes the attention block, which contains the multi-head attention and its basic operations, and Conv1d() is executed along the time dimension with the LeakyReLU() activation function. LeakyReLU() is a variant of ReLU that changes the response for inputs less than 0, which reduces the sparsity of ReLU while inheriting its advantages: it can accelerate convergence, alleviate the vanishing and exploding gradient problems, and simplify calculation. The LeakyReLU() activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (11)
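A minimal sketch of the distillation step follows, under several assumptions: the convolution has kernel width 3 with zero padding along the time axis, and it is followed by a stride-2 max-pooling that halves the sequence length, as in Informer. The pooling step, the default slope and the names `conv1d_time`, `max_pool_time` and `distill` are not taken from the text above. `leaky_relu` implements Eq. (11) directly.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.01):
    """Eq. (11): max(0, x) + negative_slope * min(0, x)."""
    return np.maximum(0, x) + negative_slope * np.minimum(0, x)


def conv1d_time(x, W, b):
    """Width-3 convolution along time. x: (L, d_in), W: (3, d_in, d_out), b: (d_out,)."""
    L = x.shape[0]
    x_pad = np.pad(x, ((1, 1), (0, 0)))  # zero padding keeps the length L
    return sum(x_pad[k:k + L] @ W[k] for k in range(3)) + b


def max_pool_time(x, size=2):
    """Non-overlapping max pooling along time (assumed; halves the length)."""
    L = x.shape[0] - x.shape[0] % size
    return x[:L].reshape(L // size, size, -1).max(axis=1)


def distill(x, W, b, negative_slope=0.01):
    """Conv1d -> LeakyReLU -> max pooling, applied between encoder layers."""
    return max_pool_time(leaky_relu(conv1d_time(x, W, b), negative_slope))
```

Applied to an attention-block output of length 96, `distill` returns a feature map of length 48 under these assumptions, so each further encoder layer works on half as many time steps.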
Step 4.3: Construct the decoder. The decoder generates the time series output through a single forward procedure, and part of its structure follows the decoder in the Transformer. The decoder includes two attention layers and a feedforward layer of linear mappings. The input vector of the decoder is:
$X_{de}=\mathrm{Concat}(X_{token},X_{0})\in\mathbb{R}^{(L_{token}+L_{y})\times d_{model}}$ (12)
where $X_{token}$ is the start token and $X_{0}$ is the placeholder for the target sequence (with its scalar values set to 0). The first attention layer is the probability sparse self-attention combined with UniDrop, as in step 4.2.1. In the masked multi-head attention, masked dot products are set to −∞, which prevents each position from attending to future positions and thereby avoids auto-regression. The second attention layer is ordinary self-attention. Generative inference is used here to mitigate the drop in speed in long-term prediction. Each of the two attention layers is followed by an Add & Norm layer. Add & Norm consists of two parts, Add and Norm, and is calculated as follows:
LayerNorm(X+MultiHeadAttention(X)) (13)
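Eq. (13) can be sketched in a few lines of NumPy. The sketch assumes layer normalization over the feature dimension without the learnable gain and bias that a full implementation would normally include; `eps` and the function names are illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each position to zero mean and unit variance over the feature axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, as in Eq. (13)."""
    return layer_norm(x + sublayer_out)
```

Here `sublayer_out` stands for the output of the attention sublayer evaluated on `x`.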
Finally, the prediction result is output directly through a fully connected layer.
Step 5: Train and optimize the LDformer model. The LDformer model constructed in step 4 is trained and optimized to bring it to its best state. The MSE loss function is used when predicting the target sequence, and the loss is back-propagated from the output of the decoder through the whole model. The Adam optimizer is used to optimize the whole model; the learning rate starts from a set initial value and decays by a factor of 2 each epoch, a total number of epochs is set, and training is stopped early when appropriate. The predicted values are then obtained from the trained model. The true values are compared with the predicted values, and the MAE, MSE and RMSE of the predictions are calculated as follows:
$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$ (14)
$MSE=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}$ (15)
$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^{2}}$ (16)
where n is the number of predicted points, $y_i$ is the true value and $\hat{y}_i$ is the predicted value.
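The three evaluation indexes of step 5 use their standard definitions, sketched here in NumPy. The function names are illustrative, and the inputs are assumed to be arrays of true and predicted values of equal shape.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))


def mse(y_true, y_pred):
    """Mean squared error, also used as the training loss."""
    return np.mean((y_true - y_pred) ** 2)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return np.sqrt(mse(y_true, y_pred))
```

For example, true values [1, 2, 3] and predictions [2, 2, 5] have errors [1, 0, 2], giving MAE = 1, MSE = 5/3 and RMSE = sqrt(5/3).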
The key effects of the method are as follows. A parallel neural network model LDformer is provided that addresses the long-time sequence prediction problem in electric power prediction; the Informer framework is combined with an LSTM so that the deep characteristics of the time sequence are fully considered; and a probability sparse self-attention mechanism combined with UniDrop is devised, which mitigates the large parameter count of the original attention mechanism and its risk of losing key connections within sequences. The method is simple to implement, can be applied not only to power data sets but also to time series data sets in other fields, and is well suited to large, complex data scenarios.
Drawings
FIG. 1 is a diagram of a model framework of the LDformer of the present invention.
FIG. 2 is a structure diagram of the multi-angle Embedding layer.
Fig. 3 is a block diagram of a parallel encoder module of the present invention.
FIG. 4 is the overall structure of the UniDrop in the attention mechanism of the present invention.
Fig. 5 is a block diagram of the decoder of the present invention.
FIG. 6 is a histogram of the mean error between the true and predicted values for different prediction lengths of the four models.
FIG. 7 is a plot of the convergence of loss as a function of learning rate for two data sets at different lengths in the four models.
FIG. 8 is a graph of the runtime variation of the four models in a dataset.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The method models the long-time sequence prediction problem in electric power prediction. A parallel neural network model LDformer for long time sequences is provided, and the method is also suitable for time series data collected in most other fields, such as weather prediction, air quality prediction and traffic flow prediction. The invention is implemented in the Python language in the PyCharm environment. An example scenario is shown in FIG. 1, the model framework diagram of the LDformer of the present invention, which includes an Embedding layer, an LSTM, an encoder, a decoder, and a final fully connected output layer. The encoder module uses a four-channel parallel model whose outputs are concatenated and fed to the decoder, and after decoding the decoder outputs the prediction result through the fully connected layer. The specific implementation is as follows:
Step 1: Taking an electric power data set as an example, the invention provides a parallel neural network model LDformer for long-time sequence prediction. First, the input and output of the model are determined and an appropriate training data set is selected. The model inputs are six load characteristics and a target value, $\{X^{(1)},X^{(2)},X^{(3)},X^{(4)},X^{(5)},X^{(6)},Y\}$. A mini-batch of m samples of the six features $\{X^{(1)},X^{(2)},X^{(3)},X^{(4)},X^{(5)},X^{(6)}\}$ is drawn from the training set and used to predict a sequence of n target values "OT". Then proceed to step 2.
Step 2: Data preprocessing. The input training data set is first standardized using StandardScaler() so that each dimension of the data has variance 1 and mean 0, and the test results are not dominated by features whose values are too large in certain dimensions. The conversion function is
$x^{*}=\frac{x-\mu}{\sigma}$
where μ is the mean of all sample data and σ is the standard deviation of all sample data. Then proceed to step 3.
Step 3: The data set obtained in step 2 is fed into the multi-aspect Embedding layer. Data encoding, position encoding and timestamp encoding are performed separately, each is expanded to the unified dimension 512, and the three results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding uses a one-dimensional convolution to map the data to the unified dimension 512. The formula is:
DE=conv1d(x) (17)
Step 3.2: Position encoding. The elements of the input sequence are processed together, unlike an RNN, which processes them one by one. This is faster, but it ignores the order of the elements in the sequence, so a position encoding is added. The formula is as follows:
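As an illustration of a position encoding with the unified dimension 512, the sketch below assumes the standard sinusoidal encoding of the Transformer, $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$. Both this choice of formula and the function name `positional_encoding` are assumptions made for the example.

```python
import math

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal position encoding of shape (seq_len, d_model)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # i runs over the even column indices, so i / d_model equals 2i'/d.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# One encoding vector per time step of a 96-step input window.
pe = positional_encoding(96)
```

Because the encoding depends only on the position, it can be computed once and added to the data encoding and the timestamp encoding of every sample.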
Step 3.3: Timestamp encoding. The timestamp encoding comprises month, day, weekday, hour and minute embeddings. The data sets used in the method have time granularities of 15 minutes and 1 hour respectively, so the minute and hour embeddings are selected to obtain the timestamp encoding result.
Step 4: Construct the parallel neural network model LDformer for long time-series prediction. After the data passes through the embedding layer, the main steps are as follows:
Step 4.1: The LSTM receives the input data and performs feature extraction to obtain a deep representation of the time series. On top of the recurrent neural network, the LSTM adds a gating mechanism (an input gate, a forget gate and an output gate) that decides which information is stored and which is discarded. This addresses the vanishing gradient and exploding gradient problems of ordinary recurrent neural networks when training on long sequences.
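The gating mechanism can be made concrete with a scalar LSTM cell. This is a minimal sketch of the textbook LSTM update, not the configuration used by the invention; the function names and the weight dictionary layout are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    `w` maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))     # input gate: how much new information to store
    f = sigmoid(pre("forget"))    # forget gate: how much old state to keep
    o = sigmoid(pre("output"))    # output gate: how much state to expose
    g = math.tanh(pre("cand"))    # candidate cell content
    c = f * c_prev + i * g        # additive cell update
    h = o * math.tanh(c)
    return h, c

def run_lstm(sequence, w):
    """Feed a whole sequence through the cell and return all hidden states."""
    h, c, states = 0.0, 0.0, []
    for x in sequence:
        h, c = lstm_step(x, h, c, w)
        states.append(h)
    return states
```

The additive update of the cell state `c` is what lets gradients flow over many time steps without vanishing as quickly as in a plain recurrent network.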
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input. It consists mainly of two sublayers, a multi-head attention layer (the probabilistic sparse attention mechanism combined with UniDrop) and a feedforward layer made of two linear mappings; each sublayer is followed by a batch normalization layer, and the sublayers are joined by skip connections. The encoder is multi-path and parallel: four paths with time-series lengths L, L/2, L/4 and L/8 are executed in parallel. A distillation operation is combined with the encoder to improve model robustness. Before the output of one layer is passed to the multi-head attention module of the next layer, the distillation operation uses a one-dimensional convolution to trim the dimensions and reduce memory usage. The number of distillation operations is always one fewer than the number of encoder layers. The attention in the encoder is the probabilistic sparse self-attention mechanism combined with UniDrop that is constructed by the invention.
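How the inputs of the four parallel paths might be formed can be sketched as follows. The text only gives the lengths L, L/2, L/4 and L/8; that each shorter path keeps the most recent time steps is an assumption of this example, as is the function name `parallel_path_inputs`.

```python
def parallel_path_inputs(sequence, n_paths=4):
    """Return the inputs of the parallel encoder paths (L, L/2, L/4, L/8).

    Assumption: path k keeps the last L // 2**k time steps of the sequence.
    """
    length = len(sequence)
    if length < 2 ** (n_paths - 1):
        raise ValueError("sequence too short for the requested number of paths")
    return [sequence[length - length // 2 ** k:] for k in range(n_paths)]

# A 96-step input window yields paths of 96, 48, 24 and 12 steps.
paths = parallel_path_inputs(list(range(96)))
```

Each path would then run through its own encoder stack, and the outputs would be concatenated before being passed to the decoder.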
Step 4.2.1: Probabilistic sparse self-attention combined with UniDrop. Feature Dropout (FD) in the UniDrop technique randomly suppresses neurons in the network with a given probability. FD-1 is applied to the attention weights A to improve the generalization of multi-head attention. FD-2 is applied after the activation function between the two linear transformations of the feedforward sublayer. However, applying FD-1 directly to the weights A may drop the value A(i, j), which means ignoring the relationship between token i and token j, so a larger FD-1 rate carries a greater risk of losing critical information from some sequence positions. To mitigate this risk, dropout is added to Q, K and V respectively before attention is computed. FD-4 is applied to the output features before the linear transformation. Let $q_i$, $k_i$ and $v_i$ denote the i-th rows of Q, K and V after dropout. The attention of the i-th query is defined as a kernel smoother in probabilistic form, as shown in the following equation:
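The idea of applying dropout to Q, K and V and then treating attention as a probability-weighted average of the values can be sketched in plain Python. The scaled dot-product kernel $\exp(q_i k_j^{\top}/\sqrt{d})$ is assumed here, and the helper names are illustrative only.

```python
import math
import random

def feature_dropout(rows, p, rng):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if p <= 0.0:
        return [row[:] for row in rows]
    keep = 1.0 - p
    return [[x / keep if rng.random() < keep else 0.0 for x in row] for row in rows]

def attention_probs(q_i, K):
    """p(k_j | q_i): softmax over the scaled dot products of one query with all keys."""
    d = len(q_i)
    scores = [sum(a * b for a, b in zip(q_i, k)) / math.sqrt(d) for k in K]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_i, K, V):
    """Kernel smoother: expectation of the values under p(k_j | q_i)."""
    p = attention_probs(q_i, K)
    return [sum(p_j * v[c] for p_j, v in zip(p, V)) for c in range(len(V[0]))]

rng = random.Random(0)
K = feature_dropout([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], 0.0, rng)
V = [[3.0], [6.0], [9.0]]
out = attend([0.0, 0.0], K, V)  # a zero query attends uniformly
```

With a zero query the probabilities are uniform and the output is simply the mean of the values, which is the redundant case the sparsity measure is meant to detect.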
where the attention of the i-th query on all keys is defined as a probability $p(k_j \mid q_i)$, and the output is its combination with the values V. The attention mechanism pushes the attention probability distribution of the corresponding query away from the uniform distribution. If $p(k_j \mid q_i)$ is close to the uniform distribution $q(k_j \mid q_i) = 1/L_K$, self-attention degenerates into a plain sum of the values V and becomes redundant with respect to the input. The "similarity" between the distributions p and q can therefore be used to distinguish the "important" queries, which avoids this problem; the "similarity" is measured with the KL divergence, as follows:
Dropping the constant term, the sparsity measure of the i-th query can be defined as:
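Written out under the assumption that the kernel is the scaled dot product $q_i k_j^{\top}/\sqrt{d}$ (the scaling and the symbol d are assumptions of this sketch), a measure made of a Log-Sum-Exp term minus an arithmetic mean over all $L_K$ keys takes the form:

```latex
M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\frac{q_i k_j^{\top}}{\sqrt{d}}}
            - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}
```

For a query whose scores are all equal, the two terms differ only by $\ln L_K$, the smallest value the measure can take.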
where the first term is the Log-Sum-Exp (LSE) of $q_i$ over all keys and the second term is their arithmetic mean. If the i-th query obtains a larger $M(q_i, K)$, its attention probability p is more "diverse" and it is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, computing $M(q_i, K)$ for all queries requires every dot-product pair, which means a quadratic $O(L_Q L_K)$ cost, and the LSE operation also has potential numerical stability issues. The formula is therefore improved, and the final sparsity measure is obtained as follows:
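The two measures can be compared numerically. In the sketch below, `sparsity_lse` implements the Log-Sum-Exp form described in the text, while `sparsity_max_mean` assumes that the improved measure replaces the Log-Sum-Exp term by the maximum score, as in the usual ProbSparse formulation; that substitution, the scaled dot product and all function names are assumptions of this example.

```python
import math

def scaled_scores(q_i, K):
    """Scaled dot products of one query with every key."""
    d = len(q_i)
    return [sum(a * b for a, b in zip(q_i, k)) / math.sqrt(d) for k in K]

def sparsity_lse(q_i, K):
    """M(q_i, K): Log-Sum-Exp of the scores minus their arithmetic mean."""
    s = scaled_scores(q_i, K)
    top = max(s)
    lse = top + math.log(sum(math.exp(v - top) for v in s))  # stable LSE
    return lse - sum(s) / len(s)

def sparsity_max_mean(q_i, K):
    """Assumed improved measure: maximum score minus arithmetic mean."""
    s = scaled_scores(q_i, K)
    return max(s) - sum(s) / len(s)

K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flat = sparsity_lse([0.0, 0.0], K)         # uniform attention
peaked = sparsity_max_mean([4.0, 0.0], K)  # attention concentrated on some keys
```

The max-based form avoids the exponentials entirely and stays within $\ln L_K$ of the Log-Sum-Exp form, so it ranks queries in a similar way at lower numerical risk. This sketch still evaluates every dot product and therefore does not reduce the quadratic cost by itself.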
The queries with the largest measure are thus selected, which gives the probabilistic sparse self-attention combined with UniDrop. The formula is as follows:
where the sparse query matrix has the same size as q and contains only the Top-u queries under the sparsity measure, i.e. the part with larger probability. The number of queries is set to $u = c \cdot \ln L_Q$, controlled by a constant sampling factor c; the invention sets c = 5.
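The Top-u selection can be sketched as follows. Rounding $c \cdot \ln L_Q$ up to the next integer and clamping it to the range from 1 to $L_Q$ are assumptions of this example, since the text does not state how u is rounded; the function names are illustrative.

```python
import math

def num_active_queries(L_Q, c=5):
    """u = c * ln(L_Q), rounded up and clamped to [1, L_Q] (assumed rounding)."""
    return max(1, min(L_Q, math.ceil(c * math.log(L_Q))))

def top_u_queries(measures, c=5):
    """Indices of the u queries with the largest sparsity measure, in time order."""
    u = num_active_queries(len(measures), c)
    order = sorted(range(len(measures)), key=lambda i: measures[i], reverse=True)
    return sorted(order[:u])

# With 96 queries and c = 5, 23 queries keep full attention.
u = num_active_queries(96)
active = top_u_queries([float(i) for i in range(96)])
```

Only the selected queries would compute attention against all keys; the remaining rows of the sparse query matrix carry no attention of their own.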
Step 4.2.2: Distillation operation. As a natural consequence of the attention mechanism, the feature map of the encoder contains redundant combinations of the values V. In the next layer, the distillation operation is therefore used to privilege the dominant features and generate a focused self-attention feature map, trimming the time dimension of the input. One-dimensional convolution is very effective at extracting features of interest from data whose features are not strongly tied to position, and it is well suited to the time-series analysis of sensor data. One-dimensional convolution is therefore chosen to extract features, with the convolution kernel size set to 3. The distillation operation passes from the j-th layer to the (j + 1)-th layer. The formula is as follows:
where $[\cdot]_{AB}$ contains the multi-head attention and the basic operations of the attention block, and conv1d() is performed along the time dimension with the LeakyReLU() activation function. The LeakyReLU() activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (26)
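A single-channel sketch of the distillation step follows: a one-dimensional convolution with kernel size 3 along time, then LeakyReLU. The final max-pooling with stride 2, which halves the time dimension, is an assumption of this example (the text only states that the time dimension is trimmed), as are the kernel weights and the function names.

```python
def leaky_relu(x, negative_slope=0.01):
    return max(0.0, x) + negative_slope * min(0.0, x)

def conv1d(seq, kernel):
    """Single-channel 1-D convolution along time with zero padding (same length).

    Computed as cross-correlation, as is usual in deep-learning libraries.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(w * padded[t + k] for k, w in enumerate(kernel))
            for t in range(len(seq))]

def max_pool(seq, size=3, stride=2, pad=1):
    """Assumed down-sampling step: max-pooling that halves the sequence length."""
    neg_inf = float("-inf")
    padded = [neg_inf] * pad + list(seq) + [neg_inf] * pad
    return [max(padded[s:s + size])
            for s in range(0, len(padded) - size + 1, stride)]

def distill(seq, kernel=(0.25, 0.5, 0.25)):
    """Conv1d (kernel size 3) -> LeakyReLU -> max-pooling along time."""
    return max_pool([leaky_relu(v) for v in conv1d(seq, kernel)])

# A 96-step feature sequence is reduced to 48 steps for the next layer.
shorter = distill([float(t) for t in range(96)])
```

In a real model the convolution would operate on all 512 channels at once and its weights would be learned; the fixed smoothing kernel here only makes the example deterministic.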
Step 4.3: Construct the decoder. The decoder generates the time-series output in a single forward pass; part of its structure follows the decoder of the Transformer. The decoder comprises two attention mechanisms and a feedforward layer of linear mappings. The input vector of the decoder is:
where the first part is the start token and the second part is a placeholder for the target sequence (with its scalar values set to 0). The first attention layer is the probabilistic sparse self-attention combined with UniDrop, as in step 4.2.1. In this masked multi-head self-attention, the masked dot products are set to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Each of the two attention layers is followed by an Add&Norm layer. Add&Norm consists of two parts, Add and Norm, and is computed as follows:
LayerNorm(X+MultiHeadAttention(X)) (28)
Finally, the prediction result is output directly through a fully connected layer; for example, if the prediction target sequence length is 24, then 24 values are output.
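The decoder pieces described in step 4.3 can be sketched in plain Python: building the decoder input from a start token and a zero placeholder, masking future positions with −∞ before the softmax, and the Add&Norm step. The function names, the start-token length of 48 and the omission of the learnable scale and shift of LayerNorm are assumptions of this example.

```python
import math

def decoder_input(start_token, pred_len):
    """Concatenate the start token with a zero placeholder for the target."""
    n_features = len(start_token[0])
    placeholder = [[0.0] * n_features for _ in range(pred_len)]
    return [row[:] for row in start_token] + placeholder

def causal_mask(length):
    """mask[i][j] is -inf where position i must not attend to position j > i."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(length)]
            for i in range(length)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores after adding the mask."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    top = max(shifted)
    exps = [math.exp(v - top) for v in shifted]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector (no learnable scale or shift)."""
    mu = sum(vec) / len(vec)
    var = sum((x - mu) ** 2 for x in vec) / len(vec)
    return [(x - mu) / math.sqrt(var + eps) for x in vec]

def add_and_norm(x, sublayer_out):
    """LayerNorm(X + Sublayer(X)), applied row by row."""
    return [layer_norm([a + b for a, b in zip(xr, sr)])
            for xr, sr in zip(x, sublayer_out)]

# Hypothetical sizes: 48 known steps as start token, 24 steps to predict.
dec_in = decoder_input([[1.0] * 7 for _ in range(48)], pred_len=24)
weights = masked_softmax([1.0, 1.0, 1.0], causal_mask(3)[1])
```

Because the placeholder covers the whole target window, all 24 values are produced in one forward pass rather than step by step.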
Step 5: Run the simulation on a power data set to complete the training and optimization of the LDformer model.
On data sets with different time granularities, 96 historical data points of the six load features are used to predict 24, 36 and 48 data points of the target; a comparison of the errors between the true values and the predicted values is shown in Fig. 6.
The MAE, MSE and RMSE of the predicted values obtained from the prediction model are then calculated. The index formulas are as follows:
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
where y is the true data, $\hat{y}$ is the predicted data, and n is the size of the data set.
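The three indexes can be computed as in the following minimal sketch, which uses their standard definitions; the function names and sample values are illustrative.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical true and predicted values.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
scores = (mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

RMSE is in the same unit as the target, which makes it easier to compare with MAE than MSE is.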
With predictions generated on the same data set, the simulation evaluates the performance of the model with the three indexes MAE, MSE and RMSE, compares its performance when predicting sequences of different lengths, and also compares the loss value and the running time of the model in detail. The results are presented as line graphs, as shown in Figs. 7 and 8. The main simulation parameters are as follows:
Network structure: LDformer
Batch size: 64
Learning rate: 1e-4 to 1.25e-5
Number of iterations: at most 10, stopping early when appropriate
Optimization algorithm: Adam
Loss function: MSE.
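One way these parameters could fit together is sketched below. The halving of the learning rate after every iteration (which takes 1e-4 to 1.25e-5 in three steps), the patience of 3 and the callback `run_epoch` are all assumptions of this example; the text only gives the learning-rate range, the maximum of 10 iterations and the fact that training stops when appropriate.

```python
def train(run_epoch, max_epochs=10, lr=1e-4, min_lr=1.25e-5, patience=3):
    """Training loop with an assumed halving schedule and early stopping.

    `run_epoch(lr)` is a hypothetical callback that trains for one epoch
    and returns the validation loss (MSE).
    """
    best, bad_epochs, history = float("inf"), 0, []
    for epoch in range(max_epochs):
        loss = run_epoch(lr)
        history.append((epoch, lr, loss))
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # stop when the loss has not improved for a while
        lr = max(lr / 2, min_lr)  # assumed schedule: halve, but not below min_lr
    return history

# Usage with made-up validation losses in place of a real model.
fake_losses = iter([1.0, 0.9, 0.95, 0.96, 0.97, 0.98])
history = train(lambda lr: next(fake_losses))
```

In a real run, `run_epoch` would perform one pass of Adam updates with batch size 64 and the MSE loss.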
Claims (1)
1. A long time-series prediction method based on the parallel neural network model LDformer in the field of power prediction, wherein the model comprises a multi-aspect embedding layer, a long short-term memory network (LSTM), an encoder and a decoder. The encoder uses multiple parallel paths combined with a distillation operation and applies a probabilistic sparse attention mechanism combined with UniDrop. The decoder includes two attention mechanisms: the first is a masked probabilistic sparse attention mechanism combined with UniDrop, which prevents each position from attending to future positions and thereby avoids autoregression, and the second is ordinary self-attention. The method comprises the following specific steps:
Step 1: Taking an electric power data set as an example, a long time-series prediction method based on the parallel neural network model LDformer is provided to solve the long time-series prediction problem. First, the input and output of the model are determined and a suitable training data set is selected. The model input consists of six load features and a target value, $\{X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)}, X^{(5)}, X^{(6)}, Y\}$. A mini-batch of m samples of the six features $\{X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)}, X^{(5)}, X^{(6)}\}$ is drawn from the training set to predict a sequence of n values of the target "OT". Then proceed to step 2.
Step 2: Data preprocessing. First, the input training data set is standardized with StandardScaler(), so that each dimension of the data has variance 1 and mean 0. This prevents the results from being dominated by dimensions whose feature values are too large. The conversion function is $x^{*} = (x - \mu)/\sigma$, where μ is the mean of the sample data and σ is the standard deviation of the sample data. Then proceed to step 3.
Step 3: The data set obtained in step 2 is fed into the multi-aspect Embedding layer. Data encoding, position encoding and timestamp encoding are performed separately, each is expanded to the unified dimension d_model, and the three results are summed to obtain the final embedding. Step 3.1 is data encoding, step 3.2 is position encoding, and step 3.3 is timestamp encoding.
Step 3.1: Data encoding. The data embedding uses a one-dimensional convolution to map the data to the unified dimension d_model. The formula is as follows:
DE=conv1d(x) (1)
Step 3.2: Position encoding. The elements of the input sequence are processed together, unlike an RNN, which processes them one by one. This is faster, but it ignores the order of the elements in the sequence, so a position encoding is added. The formula is as follows:
Step 3.3: Timestamp encoding. The timestamp encoding comprises month, day, weekday, hour and minute embeddings. The data sets used in the method have time granularities of 15 minutes and 1 hour respectively, so the minute and hour embeddings are selected to obtain the timestamp encoding result.
Step 4: Construct the parallel neural network model LDformer for long time-series prediction. After the data passes through the embedding layer, the main steps are as follows:
Step 4.1: The LSTM receives the input data and performs feature extraction to obtain a deep representation of the time series. On top of the recurrent neural network, the LSTM adds a gating mechanism (an input gate, a forget gate and an output gate) that decides which information is stored and which is discarded. This addresses the vanishing gradient and exploding gradient problems of ordinary recurrent neural networks when training on long sequences.
Step 4.2: Construct the encoder module. The encoder is designed to extract robust long-range dependencies from the time-series input. It consists mainly of two sublayers, a multi-head attention layer (the probabilistic sparse attention mechanism combined with UniDrop) and a feedforward layer made of two linear mappings; each sublayer is followed by a batch normalization layer, and the sublayers are joined by skip connections. The encoder is multi-path and parallel: four paths with time-series lengths L, L/2, L/4 and L/8 are executed in parallel. A distillation operation is combined with the encoder to improve model robustness. Before the output of one layer is passed to the multi-head attention module of the next layer, the distillation operation uses a one-dimensional convolution to trim the dimensions and reduce memory usage. The number of distillation operations is always one fewer than the number of encoder layers. The attention in the encoder is the probabilistic sparse self-attention mechanism combined with UniDrop that is constructed by the invention.
Step 4.2.1: Probabilistic sparse self-attention combined with UniDrop. Feature Dropout (FD) in the UniDrop technique randomly suppresses neurons in the network with a given probability. FD-1 is applied to the attention weights A to improve the generalization of multi-head attention. FD-2 is applied after the activation function between the two linear transformations of the feedforward sublayer. However, applying FD-1 directly to the weights A may drop the value A(i, j), which means ignoring the relationship between token i and token j, so a larger FD-1 rate carries a greater risk of losing critical information from some sequence positions. To mitigate this risk, dropout is added to Q, K and V respectively before attention is computed. FD-4 is applied to the output features before the linear transformation. Let $q_i$, $k_i$ and $v_i$ denote the i-th rows of Q, K and V after dropout. The attention of the i-th query is defined as a kernel smoother in probabilistic form, as shown in the following equation:
where the attention of the i-th query on all keys is defined as a probability $p(k_j \mid q_i)$, and the output is its combination with the values V. The attention mechanism pushes the attention probability distribution of the corresponding query away from the uniform distribution. If $p(k_j \mid q_i)$ is close to the uniform distribution, self-attention degenerates into a plain sum of the values V and becomes redundant with respect to the input. The "similarity" between the distributions p and q can therefore be used to distinguish the "important" queries, which avoids this problem; the "similarity" is measured with the KL divergence, as follows:
Dropping the constant term, the sparsity measure of the i-th query can be defined as:
where the first term is the Log-Sum-Exp (LSE) of $q_i$ over all keys and the second term is their arithmetic mean. If the i-th query obtains a larger $M(q_i, K)$, its attention probability p is more "diverse" and it is more likely to contain the dominant dot-product pairs in the head of the long-tailed self-attention distribution. However, computing $M(q_i, K)$ for all queries requires every dot-product pair, which means a quadratic $O(L_Q L_K)$ cost, and the LSE operation also has potential numerical stability issues. The formula is therefore improved, and the final sparsity measure is obtained as follows:
The queries with the largest measure are thus selected, which gives the probabilistic sparse self-attention combined with UniDrop. The formula is as follows:
where the sparse query matrix has the same size as q and contains only the Top-u queries under the sparsity measure, i.e. the part with larger probability. The number of queries is set to $u = c \cdot \ln L_Q$, controlled by a constant sampling factor c.
Step 4.2.2: Distillation operation. As a natural consequence of the attention mechanism, the feature map of the encoder contains redundant combinations of the values V. In the next layer, the distillation operation is therefore used to privilege the dominant features and generate a focused self-attention feature map, trimming the time dimension of the input. One-dimensional convolution is very effective at extracting features of interest from data whose features are not strongly tied to position, and it is well suited to the time-series analysis of sensor data. One-dimensional convolution is therefore chosen to extract features, with the convolution kernel size set to 3. The distillation operation passes from the j-th layer to the (j + 1)-th layer. The formula is as follows:
where $[\cdot]_{AB}$ contains the multi-head attention and the basic operations of the attention block, and conv1d() is performed along the time dimension with the LeakyReLU() activation function. The LeakyReLU() activation function is:
LeakyReLU(x)=max(0,x)+negative_slope·min(0,x) (10)
Step 4.3: Construct the decoder. The decoder generates the time-series output in a single forward pass; part of its structure follows the decoder of the Transformer. The decoder comprises two attention mechanisms and a feedforward layer of linear mappings. The input vector of the decoder is:
where the first part is the start token and the second part is a placeholder for the target sequence (with its scalar values set to 0). The first attention layer is the probabilistic sparse self-attention combined with UniDrop, as in step 4.2.1. In this masked multi-head self-attention, the masked dot products are set to −∞, which prevents each position from attending to future positions and thereby avoids autoregression. The second attention layer is ordinary self-attention. Each of the two attention layers is followed by an Add&Norm layer. Add&Norm consists of two parts, Add and Norm, and is computed as follows:
LayerNorm(X+MultiHeadAttention(X)) (12)
Finally, the prediction result is output directly through a fully connected layer.
Step 5: Run the simulation on a power data set to complete the training and optimization of the LDformer model.
On data sets with different time granularities, 96 historical data points of the six load features are used to predict 24, 36 and 48 data points of the target; a comparison of the average errors between the true values and the predicted values is shown in Fig. 6.
The MAE, MSE and RMSE of the predicted values obtained from the prediction model are then calculated. The index formulas are $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, where y is the true data, $\hat{y}$ is the predicted data, and n is the size of the data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210834021.8A CN115310674A (en) | 2022-07-14 | 2022-07-14 | Long-time sequence prediction method based on parallel neural network model LDformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210834021.8A CN115310674A (en) | 2022-07-14 | 2022-07-14 | Long-time sequence prediction method based on parallel neural network model LDformer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115310674A true CN115310674A (en) | 2022-11-08 |
Family
ID=83857039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210834021.8A Pending CN115310674A (en) | 2022-07-14 | 2022-07-14 | Long-time sequence prediction method based on parallel neural network model LDformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115310674A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115795351A (en) * | 2023-01-29 | 2023-03-14 | 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) | Elevator big data risk early warning method based on residual error network and 2D feature representation |
CN116128158A (en) * | 2023-04-04 | 2023-05-16 | 西南石油大学 | Oil well efficiency prediction method of mixed sampling attention mechanism |
CN116612393A (en) * | 2023-05-05 | 2023-08-18 | 北京思源知行科技发展有限公司 | Solar radiation prediction method, system, electronic equipment and storage medium |
CN117275723A (en) * | 2023-09-15 | 2023-12-22 | 上海全景医学影像诊断中心有限公司 | Early parkinsonism prediction method, device and system |
CN117290706A (en) * | 2023-10-31 | 2023-12-26 | 兰州理工大学 | Traffic flow prediction method based on space-time convolution fusion probability sparse attention mechanism |
2022-07-14: CN application CN202210834021.8A (patent/CN115310674A/en), status: active, Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115795351A (en) * | 2023-01-29 | 2023-03-14 | 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) | Elevator big data risk early warning method based on residual error network and 2D feature representation |
CN115795351B (en) * | 2023-01-29 | 2023-06-09 | 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) | Elevator big data risk early warning method based on residual error network and 2D feature representation |
CN116128158A (en) * | 2023-04-04 | 2023-05-16 | 西南石油大学 | Oil well efficiency prediction method of mixed sampling attention mechanism |
CN116612393A (en) * | 2023-05-05 | 2023-08-18 | 北京思源知行科技发展有限公司 | Solar radiation prediction method, system, electronic equipment and storage medium |
CN117275723A (en) * | 2023-09-15 | 2023-12-22 | 上海全景医学影像诊断中心有限公司 | Early parkinsonism prediction method, device and system |
CN117275723B (en) * | 2023-09-15 | 2024-03-15 | 上海全景医学影像诊断中心有限公司 | Early parkinsonism prediction method, device and system |
CN117290706A (en) * | 2023-10-31 | 2023-12-26 | 兰州理工大学 | Traffic flow prediction method based on space-time convolution fusion probability sparse attention mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mo et al. | Remaining useful life estimation via transformer encoder enhanced by a gated convolutional unit | |
CN115310674A (en) | Long-time sequence prediction method based on parallel neural network model LDformer | |
Wang et al. | Correlation aware multi-step ahead wind speed forecasting with heteroscedastic multi-kernel learning | |
US20230018125A1 (en) | Processing Multi-Horizon Forecasts For Time Series Data | |
CN114548592A (en) | Non-stationary time series data prediction method based on CEMD and LSTM | |
Chen et al. | House price prediction based on machine learning and deep learning methods | |
CN116340796A (en) | Time sequence data analysis method, device, equipment and storage medium | |
CN114117852B (en) | Regional heat load rolling prediction method based on finite difference working domain division | |
Wang et al. | A New Hybrid Forecasting Model Based on SW‐LSTM and Wavelet Packet Decomposition: A Case Study of Oil Futures Prices | |
Samin-Al-Wasee et al. | Time-series forecasting of ethereum price using long short-term memory (lstm) networks | |
Kim et al. | A convolutional transformer model for multivariate time series prediction | |
CN117094451B (en) | Power consumption prediction method, device and terminal | |
Liu et al. | Maintenance spare parts demand forecasting for automobile 4S shop considering weather data | |
Jaiswal et al. | A comparative analysis on stock price prediction model using deep learning technology | |
CN116404637A (en) | Short-term load prediction method and device for electric power system | |
Wang et al. | Risk assessment of customer churn in telco using FCLCNN-LSTM model | |
CN115759343A (en) | E-LSTM-based user electric quantity prediction method and device | |
CN115423091A (en) | Conditional antagonistic neural network training method, scene generation method and system | |
Wang et al. | MIANet: Multi-level temporal information aggregation in mixed-periodicity time series forecasting tasks | |
Lin et al. | Design a hybrid framework for air pollution forecasting | |
Raju et al. | Dual Deep Learning model for Electricity Price Forecasting: Bi-LSTM and GRU fusion | |
Bian et al. | Iterative convolutional enhancing self-attention Hawkes process with time relative position encoding | |
Özen et al. | A comprehensive country-based day-ahead wind power generation forecast model by coupling numerical weather prediction data and CatBoost with feature selection methods for Turkey | |
Vargo et al. | Similarity Scoring with Random Field Models for Traffic Flow Management Applications | |
Broadhurst et al. | Data Analytics On Nasdaq Stock Prices: Reddit Social Media Case Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||