CN115982567A - Refrigerating system load prediction method based on sequence-to-sequence model - Google Patents

Refrigerating system load prediction method based on sequence-to-sequence model

Info

Publication number
CN115982567A
Authority
CN
China
Prior art keywords
sequence
model
input
output
characteristic diagram
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211626813.2A
Other languages
Chinese (zh)
Inventor
李佳佳
宁德军
陈逸君
王天逸
郭千朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Carbon Soot Energy Service Co ltd
Original Assignee
Shanghai Carbon Soot Energy Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Carbon Soot Energy Service Co ltd
Priority to CN202211626813.2A
Publication of CN115982567A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a refrigeration system load prediction method based on a sequence-to-sequence model. Historical data are organized into a multi-dimensional time series MTS with a timestamp attribute, the MTS is used to train a sequence-to-sequence prediction model, and time-series data of a given window are then used as input to the trained prediction model to predict the load trend. The prediction model comprises an encoder and a decoder. The encoder adopts a multi-layer structure in which each layer consists of a sparse self-attention module and a distillation module; the vector formed by embedding and encoding the time-series data is used as the input of the initial layer, and the output of each layer is used as the input of the next layer until the final feature map is output. The input of each layer is divided into two paths: the first path passes sequentially through the sparse self-attention module and the distillation module to output a feature map I, and the second path passes through downsampling to output a feature map II; feature map I and feature map II are then fused by a residual connection module to output the feature map of the current layer.

Description

Refrigerating system load prediction method based on sequence-to-sequence model
Technical Field
The invention relates to the technical field of energy efficiency of a refrigerating machine room, in particular to a refrigerating system load prediction method based on a sequence-to-sequence model.
Background
Energy-efficiency optimization of high-efficiency power systems is increasingly important, given that the power-station rooms of many energy-intensive enterprises account for nearly 50% of total enterprise energy consumption; effective energy-efficiency optimization strategies, however, rely on accurate prediction of the load of such high-efficiency power systems.
Existing load prediction methods mostly rely on regression algorithms and short-horizon time-series methods such as recurrent neural networks (LSTM, GRU); when applied to medium- and long-horizon forecasting problems, they perform poorly because errors accumulate rapidly. The recent success of sequence-to-sequence architectures such as autoencoders and Transformers in natural language processing and related fields offers a new direction for medium- and long-horizon prediction of high-efficiency power system loads; however, these models have the following problems:
1. The use of the attention mechanism makes the computational cost and time complexity of the network far larger than those of conventional CNN and RNN deep-learning methods;
2. The embedded coding of the time series does not fully account for long-range dependence, positional dependence, special time points, or the relative weighting among them;
3. To obtain more features, many models stack multiple encoders, which not only adds a large number of network parameters but also makes training and convergence difficult.
Disclosure of Invention
To solve these problems, the invention provides a refrigeration system load prediction method based on a sequence-to-sequence model, which addresses the error accumulation and low accuracy of medium- and long-horizon prediction of high-efficiency power system loads, achieves more accurate medium- and long-horizon load prediction for power-station room systems, and meets the accuracy requirements of engineering-level power system energy-efficiency optimization applications.
The invention can be realized by the following technical scheme:
a refrigerating system load prediction method based on a sequence-to-sequence model is characterized in that a large amount of historical data is arranged into a multidimensional time sequence MTS with a timestamp attribute, a sequence-to-sequence prediction model is trained, then time sequence data of a given window is used as input, the trained prediction model is used for predicting load trend in the next period of time,
the prediction model comprises an encoder and a decoder, wherein the encoder is of a multilayer structure consisting of a sparse self-attention module and a distillation module, a vector formed by embedding and encoding time sequence data is used as the input of an initial layer structure, the output of a previous layer structure is used as the input of a next layer structure until a final feature map is output, the input of each layer structure is divided into two paths, the first path sequentially passes through the sparse self-attention module and the distillation module to output a feature map I, the second path sequentially passes through a downsampling output feature map II, and then the feature map I and the feature map II are subjected to feature fusion through a residual error connection module to output a current layer feature map;
the decoder fills the target elements to be predicted into zero, then carries out embedded coding to generate a vector input mask sparse self-attention module, then uses the generated feature map as a query vector and a final feature map output by the encoder to sequentially input the full self-attention module and a full connection layer, and finally outputs the predicted target elements in real time in a generating mode.
Further, the sparse self-attention module uses a fully connected layer to project the fused features of the query matrix Q and the key matrix K into a new probability space, computes the importance score of each query with the formula I(Q) = FC(Q + K), and sets n = c·ln(L_Q), where L_Q denotes the number of rows of the matrix Q; the n query vectors with the highest scores are selected for the subsequent attention computation. Here FC(·) denotes a fully connected operation whose number of input channels equals the feature size and whose number of output channels is 1, and I(Q) has shape L_Q × 1.
Further, the process of "refining" feature map I from the j-th sparse self-attention module to the (j+1)-th sparse self-attention module is defined as:

X_{j+1}^t = DS(X_j^t) + MaxPool( F([X_j^t]_AB) ) + γ · AvePool( F([X_j^t]_AB) )

where t denotes the current time period, [·]_AB denotes the attention block, γ denotes a learnable parameter, DS(·) denotes the downsampling operation, and F(·) is computed as

F(x) = ELU( Conv1d(x) )

where Conv1d(·) denotes one-dimensional convolution filtering in the time dimension and ELU(·) is the activation function.
Further, the multi-dimensional time series MTS is expressed in matrix form, first subjected to z-score standardization and then divided into batches by rows; embedded coding is then applied to form the vector used as the input of the prediction model. The embedded-coding result of the multi-dimensional time series, for rows i ∈ {1, …, L_x}, where t and L_x respectively denote the current time period and the number of data rows, is the sum of a value embedding, a temporal encoding and a positional encoding; the feature dimension after encoding is denoted d_model. The value embedding is obtained by projecting the input multi-dimensional time series through a one-dimensional convolution to a vector of feature dimension d_model. The positional encoding is

PE(pos, 2k) = sin( pos / 10000^(2k / d_model) )
PE(pos, 2k+1) = cos( pos / 10000^(2k / d_model) )

and is then projected to d_model dimensions using a one-dimensional convolution, where pos denotes the current position. A learnable parameter adjusts the weights of the positional encoding and the temporal encoding and is computed as Relu(Conv1d(·)), where Relu(·) is the activation function and Conv1d(·) is a one-dimensional convolution whose number of input channels is d_model and whose number of output channels is 1.
Further, the z-score standardization is performed using the following formula:

d'_(i,j) = ( d_(i,j) - Mean(D_(:,j)) ) / Std(D_(:,j))

where d_(i,j) is the value in row i of column j of the multi-dimensional time series MTS, D_(:,j) denotes all values in column j, and Mean(·) and Std(·) denote the mean and standard deviation of column j of the data set, respectively.
The beneficial technical effects of the invention are as follows:
1. A sparse self-attention mechanism based on a neural network is proposed: a learnable neural network selects the dot-product pairs that contribute most to attention, further improving on the time complexity and memory usage of the Transformer self-attention mechanism;
2. The stacking scheme of conventional deep-model encoders is improved: a multi-type pooling and residual distillation mechanism obtains as many features as possible without stacking multiple encoders;
3. A new time-series embedded-coding scheme is proposed, making the local positional encoding and the global temporal encoding more robust.
Drawings
FIG. 1 is a data presentation diagram of the present invention;
FIG. 2 is a schematic diagram of the overall structure of the prediction model of the present invention;
fig. 3 is a partial detailed view of the encoder structure of the present invention.
Detailed Description
The following detailed description of the preferred embodiments will be made with reference to the accompanying drawings.
As shown in FIG. 1, the invention provides a refrigeration system load prediction method based on a sequence-to-sequence model: a large amount of historical data is organized into a multi-dimensional time series MTS with a timestamp attribute and used to train a sequence-to-sequence prediction model; time-series data of a given window are then used as input, and the trained prediction model predicts the load trend over the following period.
The prediction model comprises an encoder and a decoder. The encoder adopts a multi-layer structure in which each layer consists of a sparse self-attention module and a distillation module; the vector formed by embedding and encoding the time-series data is used as the input of the initial layer, and the output of each layer is used as the input of the next layer until the final feature map is output. The input of each layer is divided into two paths: the first path passes sequentially through the sparse self-attention module and the distillation module to output a feature map I, and the second path passes through downsampling to output a feature map II; feature map I and feature map II are then fused by a residual connection module to output the feature map of the current layer;
the decoder fills the target elements to be predicted with zeros and applies embedded coding to generate a vector that is input to a masked sparse self-attention module; the generated feature map, used as the query, and the final feature map output by the encoder are then fed sequentially through a full self-attention module and a fully connected layer, and the predicted target elements are finally output in one pass in a generative manner.
The method comprises the following specific steps:
step 1: time series data preparation and preprocessing
Historical data from the high-efficiency power system are the key to training and building the model, so the method first processes the data further: a large amount of historical data is organized into a multi-dimensional time series MTS with a timestamp attribute, and the prediction model, a deep network, is trained on the MTS. Given time-series data of a certain window, the trained deep network can then predict the load trend over the following period, as shown in FIG. 1.
When the maximum and minimum values of an attribute in the MTS are unknown, or outliers are present, min-max normalization of the data is not applicable, so we define the z-score normalization of the data as follows:
d'_(i,j) = ( d_(i,j) - Mean(D_(:,j)) ) / Std(D_(:,j))

where d_(i,j) is the value in row i of column j of the MTS data set D, D_(:,j) denotes all values in column j, and Mean(·) and Std(·) denote the mean and standard deviation of column j of the data set, respectively. Note that the normalized data D' is used only for training the model, to keep differences in data magnitude from hindering training, while the validation and test data sets, which are cut from the whole data set, do not require z-score normalization.
Generally, we split the training data D' row-wise into several mini-batches; for example, with 100 rows of training data and a mini-batch size of 10 rows, there are B = 100/10 = 10 batches in total, and k indexes the rows within each batch. The input of the deep network can be defined as

X^b = { x^b_k ∈ R^(D_x) | k = 1, …, L_x },  b ∈ {1, …, B}

where b is the index of the mini-batch, k is the row index within the mini-batch, L_x is the total number of rows of the input time series, and D_x is the feature dimension of the mini-batch. The output predicted value is

Y^b = { y^b_k ∈ R^(D_y) | k = 1, …, L_y },  b ∈ {1, …, B}

where L_y is the total number of predicted rows in the time series, covering H time steps after the current timestamp, k is the row index in the output, and D_y is the feature dimension of the output.
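A minimal NumPy sketch of this preprocessing step, assuming a purely numeric matrix; the function names zscore_normalize and make_batches are illustrative and do not appear in the patent:

```python
import numpy as np

def zscore_normalize(data: np.ndarray) -> np.ndarray:
    """Column-wise z-score: d' = (d - mean) / std, applied to the training split only."""
    mean = data.mean(axis=0, keepdims=True)
    std = data.std(axis=0, keepdims=True)
    return (data - mean) / (std + 1e-8)  # small epsilon guards against constant columns

def make_batches(data: np.ndarray, rows_per_batch: int = 10):
    """Split the normalized series row-wise into B mini-batches of shape (rows_per_batch, D_x)."""
    n_batches = len(data) // rows_per_batch
    return [data[b * rows_per_batch:(b + 1) * rows_per_batch] for b in range(n_batches)]

# Example: 100 rows of 5-dimensional data -> B = 10 mini-batches of 10 rows each.
mts = np.random.rand(100, 5)
batches = make_batches(zscore_normalize(mts))
print(len(batches), batches[0].shape)  # 10 (10, 5)
```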
Step 2: constructing a sequence-to-sequence model;
the overall architecture of the present invention is shown in fig. 2 and follows an encoder-decoder architecture. In the encoding process, the input is embedded to form a vector, then the vector enters the sparse self-attention module provided by the invention, the output of the sparse self-attention module needs to pass through the distillation module and then output a feature map, in order to ensure that the loss is reduced in the feature forward propagation process, the embedded vector is subjected to down-sampling and then is fused with the output of the distillation module to obtain the feature map, and the process is called as a residual error connection module.
The decoder receives the long-sequence input, fills the target elements to be predicted with zeros, and applies the same embedded coding as the encoder; the resulting vector enters a masked sparse self-attention module, after which the generated feature map is fed, as the query, together with the key and value vectors output by the encoder, into a full self-attention module. A fully connected layer then predicts the output elements in one pass in a generative manner.
2.1 Embedded coding
A multi-dimensional time series is a chronologically ordered sequence of data whose values lie in a continuous space. In most cases, the raw data serve as model input only after embedding, so the embedded coding determines how well the data are represented. Previous work designed data embeddings by manually adding different time windows, lag operators and other hand-crafted feature derivations; this approach, however, is cumbersome and requires domain-specific knowledge. In deep learning models, neural-network-based embedding methods are widely used. In particular, considering that positional semantics and timestamp information influence the embedding of the data, the invention proposes the following embedded-coding scheme:
The embedded-coding result of the multi-dimensional time series, for rows i ∈ {1, …, L_x} and with encoded feature dimension d_model, is the sum of a value embedding, a positional encoding and a temporal encoding. The value embedding is obtained by projecting the input multi-dimensional time series through a one-dimensional convolution to a vector of feature dimension d_model.
The positional encoding PE is

PE(pos, 2k) = sin( pos / 10000^(2k / d_model) )
PE(pos, 2k+1) = cos( pos / 10000^(2k / d_model) )

and is then projected to d_model dimensions using a one-dimensional convolution. In other words, once the input sequence length L_x and the feature dimension d_model are given, the position embedding is fixed; pos denotes the current position.
The temporal encoding embeds each global timestamp through a learnable value: year, month, day, hour, minute, second and holiday are one-hot encoded and mapped by a full connection to vectors with the same feature dimension d_model as the other encodings. For example, take the timestamp 11:11:11 on 11 November 2022, and suppose all times range from 2000 to 2022; then:
Year: [0, 0, …] is a 23-dimensional vector; for 2022 the 1st element is set to 1, and the vector is then projected to 512 dimensions by a full connection.
Month: [0, 0, …] is a 12-dimensional vector; for month 11 the 11th element is set to 1, then projected to 512 dimensions by a full connection.
Day: [0, 0, …] is a 31-dimensional vector; for day 11 the 11th element is set to 1, then projected to 512 dimensions by a full connection.
Hour: [0, 0, …] is a 24-dimensional vector; for hour 11 the 11th element is set to 1, then projected to 512 dimensions by a full connection.
Minute: [0, 0, …] is a 60-dimensional vector; for minute 11 the 11th element is set to 1, then projected to 512 dimensions by a full connection.
Second: [0, 0, …] is a 60-dimensional vector; for second 11 the 11th element is set to 1, then projected to 512 dimensions by a full connection.
In addition, the invention uses a learnable parameter to adjust the weights of the positional encoding and the temporal encoding; it is computed as Relu(Conv1d(·)), where Relu(·) is the activation function and Conv1d(·) is a one-dimensional convolution whose number of input channels is d_model and whose number of output channels is 1. Such an embedding method not only mines more MTS features but also eases training, and it is used for both the encoder and the decoder embeddings.
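As a rough PyTorch illustration of this embedding scheme, the sketch below combines a convolutional value projection, a sinusoidal positional table and a one-hot timestamp projection. The module name, the kernel sizes, and in particular the convex-combination use of the learnable weight are assumptions made for illustration, since the patent gives the exact combining formula only as an image:

```python
import math
import torch
import torch.nn as nn

class TimeSeriesEmbedding(nn.Module):
    """Value projection + sinusoidal positional encoding + one-hot timestamp encoding,
    with a learnable weight balancing the two encodings (a sketch of the scheme above)."""
    def __init__(self, d_in: int, d_time: int, d_model: int = 512, max_len: int = 5000):
        super().__init__()
        self.value_proj = nn.Conv1d(d_in, d_model, kernel_size=3, padding=1)   # projects d_in -> d_model
        self.time_proj = nn.Linear(d_time, d_model)                            # one-hot timestamp -> d_model
        self.alpha_conv = nn.Conv1d(d_model, 1, kernel_size=3, padding=1)      # learnable weight, d_model in / 1 out

        pe = torch.zeros(max_len, d_model)                                     # fixed sinusoidal position table
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor, t_onehot: torch.Tensor) -> torch.Tensor:
        # x: (B, L, d_in) raw series, t_onehot: (B, L, d_time) concatenated one-hot timestamp fields
        v = self.value_proj(x.transpose(1, 2)).transpose(1, 2)                 # (B, L, d_model) value embedding
        alpha = torch.relu(self.alpha_conv(v.transpose(1, 2))).transpose(1, 2) # (B, L, 1) learnable weight
        pos_enc = self.pe[: x.size(1)].unsqueeze(0)                            # (1, L, d_model)
        time_enc = self.time_proj(t_onehot)                                    # (B, L, d_model)
        # How alpha weights PE against the temporal encoding is an assumption here.
        return v + alpha * pos_enc + (1 - alpha) * time_enc
```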
2.2 sparse self-attention Module
The conventional full self-attention module operates on tuple inputs, i.e. queries, keys and values, and can be described as

A(Q, K, V) = Softmax( Q K^T / sqrt(d_k) ) V

where Q, K, V are the query, key and value matrices respectively and d_k is the input dimension. Further, if q_i, k_i, v_i denote the i-th rows of the Q, K, V matrices, the i-th row of the output can be written as

A(q_i, K, V) = Σ_j [ k(q_i, k_j) / Σ_l k(q_i, k_l) ] v_j

where k(q_i, k_j) is in fact the asymmetric exponential kernel exp( q_i k_j^T / sqrt(d_k) ). This means the output is a weighted sum of the value vectors (the V matrix), which requires quadratic dot-product computation and O(L_Q · L_K) memory usage; this is a major limitation when extending prediction capability.
Numerous studies have shown that self-attention scores form a long-tailed distribution: a few dot-product pairs contribute most of the attention, and the remaining pairs can be ignored. In this case, if the most important n query vectors can be found from the relationship between Q and K, the O(L_Q · L_K) cost can be optimized. The invention therefore implements the query-filtering process through neural-network learning, defined as follows:

I(Q) = FC(Q + K)

where I(Q) represents the importance score of the queries, with shape L_Q × 1, and FC(·) denotes a fully connected operation whose number of input channels equals the feature size and whose number of output channels is 1.

In this way we abandon the conventional way of computing attention scores in the Transformer and instead use a fully connected layer to project the fused features of Q and K into a new probability space. We then take the query scores, set n = c·ln(L_Q), and select the n query vectors with the highest scores. We name this process the sparse self-attention method; with it, the time and space complexity of the self-attention module is reduced from O(L²) to O(L·lnL). A sketch of this query-filtering step follows the list of advantages below.
In summary, our method has the following advantages:
1. The computational workload of filtering the query vectors is reduced;
2. Faster training speed and lower GPU usage are obtained;
3. Good continuity of the feature fields is achieved.
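The following PyTorch sketch shows the query-filtering idea: I(Q) = FC(Q + K) scores the queries, the top n = c·ln(L_Q) rows attend to all keys, and the remaining rows fall back to the mean of V. That fallback, the class name, and the assumption L_Q = L_K are illustrative choices, not taken from the patent:

```python
import math
import torch
import torch.nn as nn

class SparseSelfAttention(nn.Module):
    """Neural-network-based query filtering: score queries with FC(Q + K),
    keep the top n = c * ln(L_Q), and compute full attention only for those rows."""
    def __init__(self, d_model: int, c: float = 5.0):
        super().__init__()
        self.score_fc = nn.Linear(d_model, 1)   # FC layer: feature-size channels in, 1 out
        self.c = c

    def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # Q, K, V: (B, L, d_model); L_Q == L_K assumed for simplicity.
        B, L, d = Q.shape
        scores = self.score_fc(Q + K).squeeze(-1)                  # I(Q): (B, L)
        n = max(1, min(L, int(self.c * math.log(L))))              # n = c * ln(L_Q)
        top_idx = scores.topk(n, dim=-1).indices                   # indices of the n most important queries
        Q_top = torch.gather(Q, 1, top_idx.unsqueeze(-1).expand(-1, -1, d))      # (B, n, d)
        attn = torch.softmax(Q_top @ K.transpose(1, 2) / math.sqrt(d), dim=-1)   # attention only for top queries
        out = V.mean(dim=1, keepdim=True).expand(B, L, d).clone()  # lazy queries get the mean of V (assumption)
        out.scatter_(1, top_idx.unsqueeze(-1).expand(-1, -1, d), attn @ V)
        return out
```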
2.3 encoder
To extract robust long-range dependence from long-sequence inputs, we propose a single-encoder feature extraction method and improve the distillation operation, as represented in fig. 3. After the input embedded-coding vector has been processed by our sparse self-attention module, we obtain the n-head weight matrix of the attention module shown in the figure, and our "refining" process from the j-th attention block to the (j+1)-th attention block can be defined as

X_{j+1}^t = DS(X_j^t) + MaxPool( F([X_j^t]_AB) ) + γ · AvePool( F([X_j^t]_AB) )

where [·]_AB denotes the attention block and γ is a learnable parameter; DS(·) denotes the downsampling operation, for which we use global mean pooling (stride = 2). In addition, F(·) is computed as

F(x) = ELU( Conv1d(x) )

where Conv1d(·) performs one-dimensional convolution filtering in the time dimension (convolution kernel size 3) and ELU(·) is the activation function. Although downsampling reduces the dimensionality of the features, some semantic information is lost. To mitigate this, we obtain as much semantic information as possible by applying a max pooling layer (MaxPool) and a mean pooling layer (AvePool) in parallel (both with stride 2), and we add the learnable γ to adjust the relative importance of these two pooling operations. Furthermore, to prevent gradients and features from vanishing, we add a residual connection. After encoding, the length of the feature map is one quarter of the original length. Compared with stacking encoders, this method has fewer parameters, computes faster, and still obtains as many features as possible.
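The following PyTorch sketch illustrates one encoder layer of this kind. The pooling kernel sizes, the use of average pooling on the residual branch, and the way γ weights the two pooling paths are assumptions made for illustration; the patent specifies the structure only through fig. 3 and the prose above:

```python
import torch
import torch.nn as nn

class DistillResidualLayer(nn.Module):
    """One encoder layer: attention output -> Conv1d(k=3) + ELU -> MaxPool and AvgPool in parallel
    (stride 2), weighted by a learnable gamma, plus a downsampled residual from the layer input."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.max_pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
        self.avg_pool = nn.AvgPool1d(kernel_size=3, stride=2, padding=1)
        self.gamma = nn.Parameter(torch.tensor(0.5))                         # weights max vs. mean pooling
        self.residual_ds = nn.AvgPool1d(kernel_size=3, stride=2, padding=1)  # DS(.) on the skip path

    def forward(self, x_in: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x_in: layer input, attn_out: sparse self-attention output, both (B, L, d_model)
        h = self.act(self.conv(attn_out.transpose(1, 2)))                    # (B, d_model, L)
        pooled = self.gamma * self.max_pool(h) + (1 - self.gamma) * self.avg_pool(h)
        skip = self.residual_ds(x_in.transpose(1, 2))                        # downsampled residual branch
        return (pooled + skip).transpose(1, 2)                               # roughly (B, L/2, d_model)
```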
2.4 decoder
The input of the decoder comprises two parts: one part is the output of the encoder (keys and values), and the other is the query vector computed by the masked sparse self-attention module from the embedded vector in which the target elements are filled with 0. In contrast to the sparse self-attention module, the masked sparse self-attention module masks the future positions before computing Softmax(·) and fills each query with the sum of the V vectors at all preceding time points; this filling prevents the model from attending to future information. Finally, the query, key and value are passed to a conventional full self-attention module and then through a fully connected layer to obtain the prediction result.
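A compact sketch of this decoder flow, with the standard nn.MultiheadAttention standing in for both the masked sparse self-attention and the full self-attention modules (so the sparsification and the V-sum filling described above are not reproduced); module and argument names are illustrative:

```python
import torch
import torch.nn as nn

class GenerativeDecoder(nn.Module):
    """Decoder sketch: masked self-attention over [known prefix, zero-filled targets],
    cross-attention against the encoder feature map, then a linear head that emits
    all predicted steps in one forward pass."""
    def __init__(self, d_model: int, d_out: int, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, d_out)

    def forward(self, dec_emb: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # dec_emb: (B, L_token + L_pred, d_model), the last L_pred positions embedded from zeros
        # enc_feat: (B, L_enc, d_model), the final encoder feature map (keys and values)
        L = dec_emb.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=dec_emb.device), diagonal=1)
        q, _ = self.self_attn(dec_emb, dec_emb, dec_emb, attn_mask=causal)   # mask future positions
        out, _ = self.cross_attn(q, enc_feat, enc_feat)                      # queries vs. encoder keys/values
        return self.head(out)   # (B, L_token + L_pred, d_out); read off the last L_pred rows
```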
Step 3: Experimental setup
3.1 data set
The data set used in the invention is the refrigeration-system data of a refrigeration machine room; the time span is 2022.05.01 to 2022.08.31 with a 1-minute interval, giving 177,120 records in 5 dimensions: time, chiller load (kW), chiller primary-side load (kW), cooling-tower load (kW) and cooling-pump load (kW). We split the data into training, validation and test sets at a ratio of 6:2:2 and process the data set with a sliding window: the input sequence length is N, and the following M steps are taken as the ground truth for MSE training against the model's predictions.
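A NumPy sketch of this 6:2:2 split and sliding-window construction; N and M correspond to n_in and m_out below, while the function name and the random placeholder data are illustrative:

```python
import numpy as np

def sliding_windows(series: np.ndarray, n_in: int, m_out: int):
    """Turn a (T, D) series into (X, Y) pairs: X holds n_in input steps, Y the next m_out steps."""
    X, Y = [], []
    for start in range(len(series) - n_in - m_out + 1):
        X.append(series[start:start + n_in])
        Y.append(series[start + n_in:start + n_in + m_out])
    return np.stack(X), np.stack(Y)

# Chronological 6:2:2 split before windowing, as in the text.
data = np.random.rand(177120, 5)                      # stand-in for the 1-minute, 5-dimensional records
n_train, n_val = int(0.6 * len(data)), int(0.2 * len(data))
train, val, test = np.split(data, [n_train, n_train + n_val])
X_train, Y_train = sliding_windows(train, n_in=60, m_out=30)
```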
3.2 Experimental setup
The deep network model of the invention is implemented in the PyTorch framework and trained with the Adam optimizer: the initial learning rate is 1e-4, the weight decay is 5e-4, the momentum is 0.9, the batch size is 32, there are 20 iterations, and the learning rate decays by a factor of 0.5 every 5 epochs. Training is performed on an NVIDIA GeForce GTX 3090Ti GPU and an Intel(R) Core(TM) i9-10900K CPU.
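These hyper-parameters translate roughly into the following PyTorch training scaffold; `model` and `train_loader` are placeholders, StepLR stands in for the 0.5 decay every 5 epochs, and Adam's default beta1 = 0.9 is taken to correspond to the stated momentum:

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, device: str = "cuda", epochs: int = 20):
    """Training loop using the hyper-parameters reported in the text."""
    model = model.to(device)
    criterion = nn.MSELoss()                                           # MSE against the ground-truth window
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                                      # x: (32, L_x, D_x), y: (32, L_y, D_y)
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()                                               # halve the learning rate every 5 epochs
```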
3.3 evaluation index
For the evaluation of the deep model of the invention we use CORR, MAE and MSE, where CORR is the empirical correlation coefficient, MAE is the mean absolute error and MSE is the mean squared error. They are defined as follows:

CORR = Σ_i ( y_i - mean(y) )( ŷ_i - mean(ŷ) ) / sqrt( Σ_i ( y_i - mean(y) )² · Σ_i ( ŷ_i - mean(ŷ) )² )

MAE = (1/n) Σ_i | y_i - ŷ_i |

MSE = (1/n) Σ_i ( y_i - ŷ_i )²

where y and ŷ denote the ground-truth signal and the system's predicted signal, respectively; further, y = y_1, y_2, …, y_n and ŷ = ŷ_1, ŷ_2, …, ŷ_n, and n is the number of samples.
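A NumPy sketch of these three metrics (the function names are illustrative):

```python
import numpy as np

def corr(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Empirical correlation coefficient between ground truth and prediction."""
    y_c, p_c = y - y.mean(), y_hat - y_hat.mean()
    return float((y_c * p_c).sum() / (np.sqrt((y_c ** 2).sum() * (p_c ** 2).sum()) + 1e-12))

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.abs(y - y_hat).mean())

def mse(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean squared error."""
    return float(((y - y_hat) ** 2).mean())
```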
Step 4: Model execution
Based on the above sequence-to-sequence load prediction model for the refrigeration machine room, the historical load data of the refrigeration machine room to be analysed are taken as input, and the corresponding load prediction result is obtained and used for energy-efficiency optimization of the refrigeration machine room.
Due to the adoption of the technical scheme, the invention has the beneficial effects that:
1) Compared with conventional prediction methods, for medium- and long-horizon prediction (more than 30 steps) the CORR obtained is improved by 3%, the MAE by 25%, and the MSE by 70%;
2) Compared with the self-attention module of the conventional Transformer, the invention reduces the time and space complexity from O(L²) to O(L·lnL);
3) Compared with the conventional Transformer model, the invention reduces the number of model parameters by more than 50%;
4) Compared with the self-attention module of the conventional Transformer, when the prediction horizon exceeds 100 steps the training time is reduced by more than 5 times and the GPU memory usage by more than 2 times.
In addition, application experiments in further fields have been carried out, as follows:
example 1: load prediction for refrigeration system of cold room
The method is applied to load prediction of a refrigeration system: the input data have a step length of 60 and 5 dimensions, and the prediction step length is 30 with a prediction dimension of 5 (time, chiller load (kW), chiller primary-side load (kW), cooling-tower load (kW), cooling-pump load (kW)); the total load is the sum of the loads of all devices. The model was constructed as shown in FIG. 2, and the CORR, MAE and MSE between the predicted and true values were CORR: 0.954, MAE: 0.188, MSE: 0.079, whereas the conventional Transformer method obtained CORR: 0.917.
Example 2: exchange-rate prediction based on the deep algorithm of the invention
We collected the daily exchange rates of eight countries, including Australia, the United Kingdom, Canada, Switzerland, China, Japan, New Zealand and Singapore, from 1990 to 2016, for a total of 7,588 records, and split them into training, validation and test sets at a ratio of 6:2:2. We apply the method of this patent to exchange-rate prediction with the input step length set to 120 and the prediction step length set to 60, and build the model of the invention according to FIG. 2. The CORR between the predicted and true values is 0.911.
Example 3:
We used a public statistical data set of residential electricity usage as a test. The data set covers two years from one region of China, with one record per minute, and contains 1,051,200 data points over the 2 years. Each data point has 8-dimensional features, including the recording date, the predicted value "oil temperature" and 6 different types of external load values. We split the data into training, validation and test sets at a ratio of 6:2:2, apply the method of this patent to this data set with the input step length set to 720 and the prediction step length set to 360, and build the model of the invention according to FIG. 2. The CORR and MAE between the predicted and true values are CORR: 0.772 and MAE: 0.308.
It will be appreciated by those skilled in the art that these are merely examples and that many variations or modifications may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is therefore defined by the appended claims.

Claims (5)

1. A refrigeration system load prediction method based on a sequence-to-sequence model, characterized in that: a large amount of historical data is organized into a multi-dimensional time series MTS with a timestamp attribute and used to train a sequence-to-sequence prediction model; time-series data of a given window are then used as input, and the trained prediction model predicts the load trend over the following period;
the prediction model comprises an encoder and a decoder, wherein the encoder adopts a multi-layer structure in which each layer consists of a sparse self-attention module and a distillation module; the vector formed by embedding and encoding the time-series data is used as the input of the initial layer, and the output of each layer is used as the input of the next layer until the final feature map is output; the input of each layer is divided into two paths, the first path passes sequentially through the sparse self-attention module and the distillation module to output a feature map I, the second path passes through downsampling to output a feature map II, and feature map I and feature map II are then fused by a residual connection module to output the feature map of the current layer;
the decoder fills the target elements to be predicted with zeros and applies embedded coding to generate a vector that is input to a masked sparse self-attention module; the generated feature map, used as the query, and the final feature map output by the encoder are then fed sequentially through a full self-attention module and a fully connected layer, and the predicted target elements are finally output in one pass in a generative manner.
2. The sequence-to-sequence-model-based refrigeration system load prediction method of claim 1, wherein: the sparse self-attention module uses a fully connected layer to project the fused features of the query matrix Q and the key matrix K into a new probability space, computes the importance score of each query with the formula I(Q) = FC(Q + K), and sets n = c·ln(L_Q), where L_Q denotes the number of rows of the matrix Q; the n query vectors with the highest scores are selected for the subsequent attention computation, where FC(·) denotes a fully connected operation whose number of input channels equals the feature size and whose number of output channels is 1, and I(Q) has shape L_Q × 1.
3. The sequence-to-sequence-model-based refrigeration system load prediction method of claim 2, wherein: the process of refining feature map I from the j-th sparse self-attention module to the (j+1)-th sparse self-attention module is defined as

X_{j+1}^t = DS(X_j^t) + MaxPool( F([X_j^t]_AB) ) + γ · AvePool( F([X_j^t]_AB) )

where t denotes the current time period, [·]_AB denotes the attention block, γ denotes a learnable parameter, DS(·) denotes the downsampling operation, and F(·) = ELU(Conv1d(·)), with Conv1d(·) denoting one-dimensional convolution filtering in the time dimension and ELU(·) the activation function.
4. The sequence-to-sequence-model-based refrigeration system load prediction method of claim 1, wherein: the multi-dimensional time series MTS is expressed in matrix form, first subjected to z-score standardization and then divided into batches by rows, after which embedded coding is applied to form the vector used as the input of the prediction model; the embedded-coding result of the multi-dimensional time series, for rows i ∈ {1, …, L_x}, where t and L_x respectively denote the current time period and the number of data rows, is the sum of a value embedding, a temporal encoding and a positional encoding, the feature dimension after encoding being d_model; the value embedding is obtained by projecting the input multi-dimensional time series through a one-dimensional convolution to a vector of feature dimension d_model; the positional encoding is

PE(pos, 2k) = sin( pos / 10000^(2k / d_model) )
PE(pos, 2k+1) = cos( pos / 10000^(2k / d_model) )

and is then projected to d_model dimensions using a one-dimensional convolution, where pos denotes the current position; a learnable parameter adjusts the weights of the positional encoding and the temporal encoding and is computed as Relu(Conv1d(·)), where Relu(·) is the activation function and Conv1d(·) is a one-dimensional convolution whose number of input channels is d_model and whose number of output channels is 1.
5. The sequence-to-sequence-model-based refrigeration system load prediction method of claim 4, wherein: the z-score standardization is performed using the formula

d'_(i,j) = ( d_(i,j) - Mean(D_(:,j)) ) / Std(D_(:,j))

where d_(i,j) is the value in row i of column j of the multi-dimensional time series MTS, D_(:,j) denotes all values in column j, and Mean(·) and Std(·) denote the mean and standard deviation of column j of the data set, respectively.
CN202211626813.2A 2022-12-16 2022-12-16 Refrigerating system load prediction method based on sequence-to-sequence model Pending CN115982567A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211626813.2A CN115982567A (en) 2022-12-16 2022-12-16 Refrigerating system load prediction method based on sequence-to-sequence model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211626813.2A CN115982567A (en) 2022-12-16 2022-12-16 Refrigerating system load prediction method based on sequence-to-sequence model

Publications (1)

Publication Number Publication Date
CN115982567A true CN115982567A (en) 2023-04-18

Family

ID=85969298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211626813.2A Pending CN115982567A (en) 2022-12-16 2022-12-16 Refrigerating system load prediction method based on sequence-to-sequence model

Country Status (1)

Country Link
CN (1) CN115982567A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612393A (en) * 2023-05-05 2023-08-18 北京思源知行科技发展有限公司 Solar radiation prediction method, system, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111161535A (en) Attention mechanism-based graph neural network traffic flow prediction method and system
CN111210633B (en) Short-term traffic flow prediction method based on deep learning
CN115240425B (en) Traffic prediction method based on multi-scale space-time fusion graph network
CN112330951B (en) Method for realizing road network traffic data restoration based on generation of countermeasure network
CN111861013B (en) Power load prediction method and device
CN110059896A (en) A kind of Prediction of Stock Index method and system based on intensified learning
CN110619419B (en) Passenger flow prediction method for urban rail transit
CN109829495A (en) Timing image prediction method based on LSTM and DCGAN
CN111915081B (en) Peak sensitive travel demand prediction method based on deep learning
CN115273464A (en) Traffic flow prediction method based on improved space-time Transformer
CN113051811B (en) Multi-mode short-term traffic jam prediction method based on GRU network
CN105024886B (en) A kind of Fast W eb service QoS Forecasting Methodologies based on user metadata
CN117096867A (en) Short-term power load prediction method, device, system and storage medium
CN115982567A (en) Refrigerating system load prediction method based on sequence-to-sequence model
CN115099461A (en) Solar radiation prediction method and system based on double-branch feature extraction
CN115310677A (en) Flight path prediction method and device based on binary coded representation and multi-classification
CN115840893A (en) Multivariable time series prediction method and device
Liang Optimization of quantitative financial data analysis system based on deep learning
CN116596150A (en) Event prediction method of transform Hoxwell process model based on multi-branch self-attention
CN116743182B (en) Lossless data compression method
CN116778709A (en) Prediction method for traffic flow speed of convolutional network based on attention space-time diagram
CN115796029A (en) NL2SQL method based on explicit and implicit characteristic decoupling
CN114564512A (en) Time series prediction method, time series prediction device, electronic equipment and storage medium
CN115481788A (en) Load prediction method and system for phase change energy storage system
CN111859263B (en) Accurate dosing method for tap water treatment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination