CN115622047B - Power Transformer load prediction method based on Transformer model


Info

Publication number
CN115622047B
CN115622047B (application CN202211379043.6A)
Authority
CN
China
Prior art keywords
layer
power transformer
data
model
head attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211379043.6A
Other languages
Chinese (zh)
Other versions
CN115622047A (en)
Inventor
何霆
王屾
朱文龙
陈世茂
曾建华
杨子骥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhonghai Energy Storage Technology Beijing Co Ltd
Original Assignee
Zhonghai Energy Storage Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhonghai Energy Storage Technology Beijing Co Ltd filed Critical Zhonghai Energy Storage Technology Beijing Co Ltd
Priority to CN202211379043.6A
Publication of CN115622047A
Application granted
Publication of CN115622047B
Legal status: Active

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/003Load forecast, e.g. methods or systems for forecasting future load demand
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Power Engineering (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides a power transformer load prediction method based on a Transformer model, which comprises the following steps: collecting load data of a power transformer and arranging the collected data in time order to obtain a sequential sample data set; dividing the data set into a training set, a test set and a verification set, ensuring that each set's sampling period represents characteristic variation over the same period; defining and establishing an interactive multi-head attention Transformer model and initializing the network's internal parameters and learning rate; and constructing a three-layer decoder from a multi-head attention layer and a multi-head attention interaction layer. The proposed method better captures long-range dependencies in long sequence data, thereby achieving accurate prediction of the power transformer load, and has practical value in smart grid construction.

Description

Power Transformer load prediction method based on Transformer model
Technical Field
The invention belongs to the technical field of power metering data processing, and particularly relates to a method for predicting the load of a power transformer.
Background
A smart grid achieves reliable, safe, economical, efficient and environmentally friendly operation of the power grid through advanced sensing, measurement and control technologies. The power transformer is a key device in power grid construction, and making accurate long-term load predictions from historical operating data is an important precondition for building a smart grid. Power transformer load prediction takes historical time series data as the data source, establishes a mathematical load prediction model using techniques such as data mining and deep learning, and predicts the transformer load from the established model, which helps realize reasonable power distribution and reduce power waste.
With the continuously increasing installed capacity of wind power, grid-connected wind power has an ever larger technical and economic effect on the main grid, and transformer data processing becomes more challenging. Because grid-connected operation of a wind farm can negatively affect the grid's power quality, voltage stability and safety, accurate prediction of the power transformer load can effectively improve power quality and voltage stability. Therefore, reasonably estimating the power transformer load can effectively reduce unnecessary power waste and fully exploit the smart grid's decision-support role.
A power transformer has a complex structure and nonlinear material parameters, so during power distribution the transformer is usually adjusted conservatively. In practice, predicting the load of a power transformer is very difficult, because it is affected by factors such as weather, temperature, season and environment and therefore exhibits complex variation characteristics. Existing power transformer load prediction methods generally fall into two categories: statistical models represented by ARIMA and Prophet, and autoregressive models represented by RNNs. These methods typically make short-term predictions from single or few variables, offer limited prediction horizon and accuracy, and struggle with the large volumes of high-dimensional data and complex temporal dependencies found in real applications, so they are not well suited to practical use.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a power transformer load prediction method based on an interactive multi-head attention Transformer model. Built on the encoder-decoder framework of the Transformer model, it uses depthwise separable convolution to realize information interaction among the different subspaces of conventional multi-head attention, improving the model's ability to fit the data, and uses a max pooling layer to distill the time series data, reducing memory overhead during model training and achieving accurate prediction of the power transformer load.
A second object of the invention is to propose an application using the above prediction method.
A third object of the invention is to propose a device using the above prediction method.
The technical scheme for realizing the purposes of the invention is as follows:
a power Transformer load prediction method based on a transducer model comprises the following steps:
s1, collecting load data of a power transformer, and arranging the collected load data of the power transformer according to time to obtain a sequence sample data setx i Indicating the value of the observed variable at time i, L x Represents the length of the observed time series, d x Representing the number of observed variables;
carrying out normalization processing on the sequence sample data set so that sample values lie between 0 and 1, and using the resulting data set as samples for supervised learning;
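For illustration only, a minimal sketch of such min-max normalization (the exact scaling used by the invention is not specified; column-wise scaling and the epsilon guard are assumptions here):

```python
import numpy as np

def min_max_normalize(data: np.ndarray, eps: float = 1e-8):
    """Scale each observed variable to [0, 1] column-wise.

    data: array of shape (L_x, d_x) -- time steps x observed variables.
    Returns the scaled array and the (min, max) pair needed to invert
    the transform on predictions later.
    """
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    scaled = (data - col_min) / (col_max - col_min + eps)
    return scaled, (col_min, col_max)
```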
s2, dividing the normalized data set into a training set, a testing set and a verification set, and ensuring that each data set sampling period (the acquisition interval time) can represent a characteristic change sample of the same period;
s3: defining and establishing an interactive multi-head attention transducer model, and initializing network internal parameters and learning rate; the original data is converted into a feature vector with position information after passing through an embedding layer and a position coding layer, wherein the time sequence coding comprises a global time sequence coding and a local time sequence coding, the global time sequence coding consists of year, month and week information in a data time stamp, and a local time sequence coding formula is as follows:
$$PE_{(pos,\,2j)} = \sin\!\left(\frac{pos}{10000^{2j/d_{model}}}\right), \qquad PE_{(pos,\,2j+1)} = \cos\!\left(\frac{pos}{10000^{2j/d_{model}}}\right)$$

where $PE$ denotes the position encoding, $pos$ the position, $j$ the dimension index, and $d_{model}$ the model dimension;
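A minimal sketch of this local sinusoidal encoding (assuming an even model dimension; the concrete dimension is not fixed by the text):

```python
import torch

def local_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones.
    Assumes d_model is even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_j = torch.arange(0, d_model, 2, dtype=torch.float32)        # the "2j" values
    div = torch.pow(10000.0, two_j / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)
    pe[:, 1::2] = torch.cos(pos / div)
    return pe  # added to the embedded inputs to inject position information
```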
s4, the transducer model consists of an encoder and a decoder, wherein in the encoder, a multi-head attention layer and a multi-head attention interaction layer are adopted for feature extraction, and the method comprises the following steps: inputting the vector with the timing information into the multi-head attention layer to obtain an intermediate value:
$$Q = XW^Q,\quad K = XW^K,\quad V = XW^V,\quad \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $W^Q$, $W^K$, $W^V$ are the weight matrices and $Q$, $K$, $V$ are the resulting query, key and value vectors;
the multi-head output consists of multiple parts, each representing a subspace:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(O_1, \ldots, O_h), \qquad O_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V);$$
information interaction across the different subspaces is achieved using a depthwise separable convolution:

$$O' = \mathrm{Conv1}\big(\mathrm{Elu}(\mathrm{Conv2}(O))\big)$$

where Conv1 and Conv2 denote the depth-wise convolution and the point-wise convolution, respectively, and Elu denotes the activation function;
then, a linear transformation layer is used for feature dimension conversion, and finally a pooling layer is used for downsampling to obtain the output.
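A minimal PyTorch sketch of one such encoder block, with standard multi-head attention followed by the interaction layer (point-wise convolution, ELU, depth-wise convolution, linear transformation, stride-2 max pooling). The layer sizes and kernel width are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class AttentionInteractionBlock(nn.Module):
    """Multi-head attention plus interaction layer (sketch).

    The interaction layer mixes information across attention subspaces
    with a depthwise-separable convolution, then halves the sequence
    length with stride-2 max pooling (the 'distillation' step)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, kernel: int = 3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pointwise = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.act = nn.ELU()
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size=kernel,
                                   padding=kernel // 2, groups=d_model)
        self.linear = nn.Linear(d_model, d_model)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        o, _ = self.attn(x, x, x)
        o = o.transpose(1, 2)                             # (batch, d_model, seq_len)
        o = self.depthwise(self.act(self.pointwise(o)))   # subspace interaction
        o = self.linear(o.transpose(1, 2))                # feature-dimension conversion
        o = self.pool(o.transpose(1, 2)).transpose(1, 2)  # halve seq_len
        return o
```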
s5: adopting a multi-head attention layer and a multi-head attention interaction layer to construct a three-layer decoder; first using features f from the multi-head attention interaction layer 1 And feature f from residual connection 2 Calculating weight ratioWherein->Representing a weight matrix, b g Representing the bias, sigmoid represents the activation function. Then based on the ratio, the two features f are compared 1 And f 2 Weighted summation is performed
$$\mathrm{Fusion}(f_1, f_2) = g \odot f_1 + (1 - g) \odot f_2$$
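A sketch of this gated fusion; computing the gate from the concatenation $[f_1, f_2]$ is an assumption, since the text does not spell out the exact input to $W_g$:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """g = sigmoid(W_g [f1, f2] + b_g); out = g * f1 + (1 - g) * f2."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)  # holds W_g and b_g

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([f1, f2], dim=-1)))
        return g * f1 + (1.0 - g) * f2
```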
S6: constructing a decoder from a multi-head attention layer and a multi-head attention interaction layer. The multi-head attention layer performs an inner product between the Query matrix and the Key matrix to obtain contribution scores, then multiplies the scores by the Value matrix to obtain feature vectors. The multi-head attention interaction layer performs subspace information interaction on the resulting feature vectors, and finally a linear transformation layer outputs the final prediction sequence.
The data points in S1 are arranged in time order; sampling may be at intervals of 1 hour, 15 minutes or 1 minute, and the shorter the interval the finer the data. In the feature extraction part of S4, the conventional multi-head attention mechanism splits the features into several blocks without considering information interaction between the different subspaces, which limits the model's ability to extract features from time series data. The invention improves the attention mechanism in the model: through convolution, the blocks become interrelated and longer-term data can be predicted. Specifically, a multi-head attention interaction layer is introduced on top of the multi-head attention mechanism, using depthwise separable convolution to realize information interaction across subspaces. This reduces memory overhead during model training, allows features to be selected adaptively, and filters out redundant information.
The sensors collect data related to the power transformer load using temperature measuring elements, ammeters and voltmeters; the data comprise one or more of load, oil temperature, location, climate and demand.
Further, in step S4:
the output vector generated by the multi-head attention layer undergoes information interaction through the multi-head attention interaction layer, which consists of a depthwise separable convolution, a linear transformation layer and a max pooling layer. For the output tensor formed by the multi-head self-attention mechanism, a 1×1 point-wise convolution first aggregates information along the channel dimension; after an ELU activation function, a depth-wise convolution performs information interaction along the spatial dimension, so that spatial correlation and inter-channel correlation are learned simultaneously; finally, a max pooling layer with a stride of 2 performs the time series distillation operation. This halves the sequence length at each encoder layer and filters out redundant information, thereby reducing memory consumption during training.
In S2, the preprocessed data set is divided into a training set, a test set and a verification set in a 7:2:1 ratio, and each data set's sampling period (i.e. the acquisition interval) represents characteristic variation over the same period.
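A sketch of such a chronological 7:2:1 split (order-preserving, so each split spans whole acquisition periods; the helper name is illustrative):

```python
def chronological_split(data, train_frac=0.7, test_frac=0.2):
    """Split a time-ordered array into train/test/verification (7:2:1),
    preserving temporal order."""
    n = len(data)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    return (data[:n_train],                       # training set
            data[n_train:n_train + n_test],       # test set
            data[n_train + n_test:])              # verification set
```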
Further, in step S4:
the input part of the decoder is denoted asWherein (1)>The value of the last k time steps from the Encoder input,/->Placeholders (filled with 0 s) as target sequences to be predicted; finally, full connectionThe layer is used to output a predicted value whose dimension depends on the number of variables that need to be predicted.
In step S4, a mean squared error (MSE) loss function and the Adam stochastic gradient descent algorithm are used during the network convergence process.
Adam dynamically adapts the learning rate of each parameter and introduces momentum, giving parameter updates more opportunity to escape local optima and thereby accelerating and stabilizing network convergence.
The training process feeds inputs to the model and iterates via gradient descent to reduce the error.
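As an illustration, one training epoch with the MSE loss and Adam might look as follows (the learning rate is an assumption, not a value given by the text):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of MSE training; x is the history window, y the target."""
    criterion = nn.MSELoss()
    model.train()
    total = 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr illustrative
```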
The power transformer prediction method based on the Transformer model further comprises S7: performing a fitting evaluation of the model and using early stopping to prevent overfitting during training; after each training round, the model is validated on the verification set obtained in step S2, and if the test error on the verification set rises as the number of training rounds increases, training is stopped; the weights at stopping are taken as the final network parameters.
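A minimal early-stopping helper in the spirit of S7 (the patience value is an illustrative assumption; the patent only states that training stops once the verification error rises):

```python
import copy

class EarlyStopping:
    """Track verification loss; stop after `patience` rounds without
    improvement and keep the best weights as the final parameters."""
    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0
        self.best_state = None

    def step(self, val_loss: float, model) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            self.best_state = copy.deepcopy(model.state_dict())
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```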
Application of the power transformer prediction method based on the Transformer model, wherein the model is used for prediction: after model evaluation and verification, the test set data obtained in step S2 are input into the model verified in step S7 to predict future time values.
The method may be used in wind farms or other facilities with similar characteristics, preferably for transformer load prediction in wind farms.
The power transformer load prediction model based on the interactive multi-head attention Transformer receives a historical load sequence as input and predicts load values for several future time steps; information interaction between the attention heads improves the model's ability to extract features from long sequence data, thereby achieving high-precision long-term prediction of the power transformer load.
An apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method when executing the program.
The invention has the beneficial effects that:
the power Transformer load prediction method based on the interactive multi-head attention transducer model provided by the invention has the advantages that compared with the existing prediction method, the power Transformer load prediction method based on the interactive multi-head attention transducer model has the following advantages: the traditional time sequence prediction method cannot accurately predict long sequence data, introduces interactive multi-head attention on the basis of a transducer, is used for enhancing the characteristic extraction capability of a model on the sequence data, and simultaneously realizes the distillation operation on the sequence data by utilizing a maximum pooling layer in order to reduce the memory overhead in the model training process.
The power transformer load prediction method provided by the invention can better capture the dependency relationship between long sequence data, thereby realizing accurate prediction of the power transformer load and having certain practicability in smart grid construction.
The prediction method utilizes the maximum pooling layer to distill time series data, reduces memory overhead in the model training process, and realizes accurate prediction of the power transformer load.
Drawings
FIG. 1 is a flow chart of the power transformer load prediction based on the interactive multi-head attention Transformer model of the present invention;
FIG. 2 is a model diagram of the power transformer load prediction based on the interactive multi-head attention Transformer model of the present invention;
FIG. 3 compares the prediction results of the proposed method IMHAN with the real data.
Detailed Description
The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
Unless otherwise indicated, all technical means employed in the specification are those known in the art.
The invention, a power transformer load prediction method based on an interactive multi-head attention Transformer model, is described in further detail below with reference to the drawings and embodiments.
The training data set used in this embodiment records the load conditions of power transformers in two different areas of the same Chinese province from 2016 to 2018. The variant with one data point recorded every 15 minutes is labeled m and named ETT-small-m1; it contains 2 years × 365 days × 24 hours × 4 = 70,080 data points. The data set also provides variants at one-hour granularity (labeled h), namely ETT-small-h1 and ETT-small-h2. Each data point contains 8-dimensional features: the recording date, the prediction target "oil temperature", and 6 different types of external load values, namely High UseFul Load, High UseLess Load, Middle UseFul Load, Middle UseLess Load, Low UseFul Load and Low UseLess Load.
Example 1:
Fig. 1 is a flowchart of the power transformer load prediction method based on the interactive multi-head attention Transformer model according to the present invention. The method specifically comprises the following steps:
s1, collecting load data of a power transformer, and arranging the collected load data of the power transformer according to time to obtain a sequence sample data setx i Indicating the value of the observed variable at time i, L x Represents the length of the observed time series, d x Representing the number of observed variables;
carrying out normalization processing on the sequence sample data set so that sample values lie between 0 and 1, and using the resulting data set as samples for supervised learning;
s2, normalizing the data set according to 7:2:1 is divided into a training set, a test set and a verification set, and each data set sampling period can represent a characteristic change sample of the same period.
And ensuring that each data set sampling period can represent a characteristic change sample of the same period;
s3: defining and establishing an interactive multi-head attention transducer model, and initializing network internal parameters and learning rate; the original data is converted into a feature vector with position information after passing through an embedding layer and a position coding layer, wherein the time sequence coding comprises a global time sequence coding and a local time sequence coding, the global time sequence coding consists of year, month and week information in a data time stamp, and a local time sequence coding formula is as follows:
$$PE_{(pos,\,2j)} = \sin\!\left(\frac{pos}{10000^{2j/d_{model}}}\right), \qquad PE_{(pos,\,2j+1)} = \cos\!\left(\frac{pos}{10000^{2j/d_{model}}}\right)$$

where $PE$ denotes the position encoding, $pos$ the position, $j$ the dimension index, and $d_{model}$ the model dimension;
s4, the transducer model consists of an encoder and a decoder, wherein in the encoder, a multi-head attention layer and a multi-head attention interaction layer are adopted for feature extraction, and the method comprises the following steps: inputting the vector with the timing information into the multi-head attention layer to obtain an intermediate value:
$$Q = XW^Q,\quad K = XW^K,\quad V = XW^V,\quad \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $W^Q$, $W^K$, $W^V$ are the weight matrices and $Q$, $K$, $V$ are the resulting query, key and value vectors;
the multi-head output consists of multiple parts, each representing a subspace:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(O_1, \ldots, O_h), \qquad O_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V);$$
information interaction across the different subspaces is achieved using a depthwise separable convolution:

$$O' = \mathrm{Conv1}\big(\mathrm{Elu}(\mathrm{Conv2}(O))\big)$$

where Conv1 and Conv2 denote the depth-wise convolution and the point-wise convolution, respectively, and Elu denotes the activation function;
then, a linear transformation layer is used for feature dimension conversion, and finally a pooling layer is used for downsampling to obtain the output.
In step S4, the output vector generated by the multi-head attention layer undergoes information interaction through the multi-head attention interaction layer, which consists of a depthwise separable convolution, a linear transformation layer and a max pooling layer. For the output tensor formed by the multi-head self-attention mechanism, a 1×1 point-wise convolution first aggregates information along the channel dimension; after an ELU activation function, a depth-wise convolution performs information interaction along the spatial dimension, so that spatial correlation and inter-channel correlation are learned simultaneously; finally, a max pooling layer with a stride of 2 performs the time series distillation operation.
In step S4, the output vector generated by each attention head, $O_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, undergoes information interaction through the multi-head attention interaction layer.
The input part of the decoder is denoted $X_{de} = \mathrm{Concat}(X_{token}, X_0)$, where $X_{token}$ is the value of the last $k$ time steps of the encoder input and $X_0$ is a placeholder for the target sequence to be predicted (filled with 0s); finally, a fully connected layer outputs the predicted value, whose dimension depends on the number of variables to be predicted.
In step S4, a mean squared error (MSE) loss function and the Adam stochastic gradient descent algorithm are used during the network convergence process.
S5: constructing a three-layer decoder from a multi-head attention layer and a multi-head attention interaction layer; first, the feature $f_1$ from the multi-head attention interaction layer and the feature $f_2$ from the residual connection are used to compute a weight ratio $g = \mathrm{Sigmoid}(W_g[f_1, f_2] + b_g)$, where $W_g$ denotes a weight matrix, $b_g$ the bias, and Sigmoid the activation function; the two features are then weighted and summed according to this ratio: $\mathrm{Fusion}(f_1, f_2) = g \odot f_1 + (1 - g) \odot f_2$.
S6: constructing a decoder from a multi-head attention layer and a multi-head attention interaction layer. The multi-head attention layer performs an inner product between the Query matrix and the Key matrix to obtain contribution scores, then multiplies the scores by the Value matrix to obtain feature vectors. The multi-head attention interaction layer performs subspace information interaction on the resulting feature vectors, and finally a linear transformation layer outputs the final prediction sequence.
S7: performing a fitting evaluation of the model and using early stopping to prevent overfitting during training; after each training round, the model is validated on the verification set obtained in step S2, and if the test error on the verification set rises as the number of training rounds increases, training is stopped; the weights at stopping are taken as the final network parameters.
After model evaluation and verification, the test set data obtained in step S2 are input into the model verified in step S7 to predict future time values. Fig. 3 shows partial prediction results of the method on the ETT data set, and Tables 1 and 2 compare the method with other prediction methods under univariate and multivariate conditions, respectively; the effectiveness and advancement of the model can be seen from them.
Table 1 univariate time series prediction results
In Table 1, IMHAN is the method proposed by the present invention, and Informer, LSTMa, DeepAR, ARIMA and Prophet are the comparison methods.
MAE (mean absolute error) and MSE (mean squared error) are the evaluation indices.
Example 2:
the same power Transformer load prediction method as in example 1 was used to obtain a Transformer model. In this embodiment, a plurality of variables including load, oil temperature, location, climate, demand are input for prediction. The data in the original data set is obtained by means of temperature measuring elements, current and user side power measurement and the like. The present embodiment predicts the load as a variable using multiple variables; the dimensions of the formula input are different from those of example 1.
The results obtained by the Transformer model are shown in Table 2:
TABLE 2 multivariate time series prediction results
In Table 2, IMHAN is the method presented herein, and Informer, LSTMa and LSTNet are the comparison prediction methods.
Example 3: application of
After model evaluation and verification, the test set data obtained in step S2 are input into the model verified in step S7 to predict future time values, thereby guiding the selection and configuration of transformers in the power grid.
For transformers at wind turbine grid-connection points, the location and climate variables among the inputs change with the wind farm configuration; the prediction method is therefore particularly suitable for load prediction of wind farm transformers.
Although the invention has been described above by way of examples, those skilled in the art will appreciate that modifications and variations may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A power transformer load prediction method based on a Transformer model, characterized by comprising the following steps:
s1, collecting load data of a power transformer, and arranging the collected load data of the power transformer according to time to obtain a sequence sample data setx i Indicating the value of the observed variable at time i, L x Represents the length of the observed time series, d x Representing the number of observed variables;
carrying out normalization processing on the sequence sample data set so that sample values lie between 0 and 1, and using the resulting data set as samples for supervised learning;
s2, dividing the normalized data set into a training set, a testing set and a verification set, and ensuring that each data set sampling period can represent a characteristic change sample in the same period;
s3: defining and establishing an interactive multi-head attention transducer model, and initializing network internal parameters and learning rate; the original data is converted into a feature vector with position information after passing through an embedding layer and a position coding layer, wherein the time sequence coding comprises a global time sequence coding and a local time sequence coding, the global time sequence coding consists of year, month and week information in a data time stamp, and a local time sequence coding formula is as follows:
$$PE_{(pos,\,2j)} = \sin\!\left(\frac{pos}{10000^{2j/d_{model}}}\right), \qquad PE_{(pos,\,2j+1)} = \cos\!\left(\frac{pos}{10000^{2j/d_{model}}}\right)$$

where $PE$ denotes the position encoding, $pos$ the position, $j$ the dimension index, and $d_{model}$ the model dimension;
s4, the transducer model consists of an encoder and a decoder, wherein in the encoder, a multi-head attention layer and a multi-head attention interaction layer are adopted for feature extraction, and the method comprises the following steps: the vector with timing information is input into the multi-head attention layer to obtain an intermediate value:
$$Q = XW^Q,\quad K = XW^K,\quad V = XW^V,\quad \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $W^Q$, $W^K$, $W^V$ are the weight matrices and $Q$, $K$, $V$ are the resulting query, key and value vectors;
the multi-head output consists of multiple parts, each representing a subspace:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(O_1, \ldots, O_h), \qquad O_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V);$$
information interaction across the different subspaces is achieved using a depthwise separable convolution:

$$O' = \mathrm{Conv1}\big(\mathrm{Elu}(\mathrm{Conv2}(O))\big)$$

where Conv1 and Conv2 denote the depth-wise convolution and the point-wise convolution, respectively, and Elu denotes the activation function;
then, a linear transformation layer is used for feature dimension conversion, and finally a pooling layer is used for downsampling to obtain the output;
s5: adopting a multi-head attention layer and a multi-head attention interaction layer to construct a three-layer decoder; first using features f from the multi-head attention interaction layer 1 And feature f from residual connection 2 Calculating weight ratioWherein->Representing a weight matrix, b g Representing the bias, sigmoid represents the activation function; then based on the ratio, the characteristic f is calculated 1 And f 2 Weighted summation Fusion (f 1, f 2) =g+.f 1 +(1-g)f 2
S6: constructing a decoder from a multi-head attention layer and a multi-head attention interaction layer; the multi-head attention layer performs an inner product between the Query matrix and the Key matrix to obtain contribution scores, then multiplies the scores by the Value matrix to obtain feature vectors; the multi-head attention interaction layer performs subspace information interaction on the resulting feature vectors, and finally a linear transformation layer outputs the final prediction sequence.
2. The power transformer load prediction method based on the Transformer model according to claim 1, wherein sensors comprising temperature measuring elements, ammeters and voltmeters collect data related to the power transformer load, the data comprising one or more of load, oil temperature, location, climate and demand.
3. The power Transformer load prediction method based on the Transformer model according to claim 1, wherein in step S4:
the output vector generated by the multi-head attention layer undergoes information interaction through the multi-head attention interaction layer, which consists of a depthwise separable convolution, a linear transformation layer and a max pooling layer; for the output tensor formed by the multi-head self-attention mechanism, a 1×1 point-wise convolution first aggregates information along the channel dimension; after an ELU activation function, a depth-wise convolution performs information interaction along the spatial dimension, so that spatial correlation and inter-channel correlation are learned simultaneously; finally, a max pooling layer with a stride of 2 performs the time series distillation operation.
4. The power transformer load prediction method based on the Transformer model according to claim 1, wherein in S2 the preprocessed data set is divided into a training set, a test set and a verification set in a 7:2:1 ratio, and each data set's sampling period represents characteristic variation over the same period.
5. The power Transformer load prediction method based on the Transformer model according to claim 1, wherein in step S4:
the input part of the decoder is denoted $X_{de} = \mathrm{Concat}(X_{token}, X_0)$, where $X_{token}$ is the value of the last $k$ time steps of the encoder input and $X_0$ is a placeholder for the target sequence to be predicted, filled with 0s; finally, a fully connected layer outputs the predicted value, whose dimension depends on the number of variables to be predicted.
6. The power transformer load prediction method based on the Transformer model according to claim 1, wherein the network convergence process in step S4 uses a mean squared error (MSE) loss function and the Adam stochastic gradient descent algorithm.
7. The power Transformer load prediction method based on the Transformer model according to any one of claims 1 to 6, further comprising S7:
performing a fitting evaluation of the model and using early stopping to prevent overfitting during training; after each training round, validating the model on the verification set obtained in step S2, and stopping training if the test error on the verification set rises as the number of training rounds increases; the weights at stopping are taken as the final network parameters.
8. Use of the power transformer load prediction method according to any one of claims 1 to 7, characterized in that the prediction is performed by means of the model: after model evaluation and verification, the test set data obtained in step S2 are input into the model verified in step S7 to predict future time values.
9. The use according to claim 8, characterized by a transformer load prediction for a wind farm.
10. A computer program running device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 7 when executing the program.
CN202211379043.6A 2022-11-04 2022-11-04 Power Transformer load prediction method based on Transformer model Active CN115622047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211379043.6A CN115622047B (en) 2022-11-04 2022-11-04 Power Transformer load prediction method based on Transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211379043.6A CN115622047B (en) 2022-11-04 2022-11-04 Power Transformer load prediction method based on Transformer model

Publications (2)

Publication Number Publication Date
CN115622047A (en) 2023-01-17
CN115622047B (en) 2023-07-18

Family

ID=84877989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211379043.6A Active CN115622047B (en) 2022-11-04 2022-11-04 Power Transformer load prediction method based on Transformer model

Country Status (1)

Country Link
CN (1) CN115622047B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116070799B (en) * 2023-03-30 2023-05-30 南京邮电大学 Photovoltaic power generation amount prediction system and method based on attention and deep learning
CN117034175B (en) * 2023-10-07 2023-12-05 北京麟卓信息科技有限公司 Time sequence data anomaly detection method based on channel fusion self-attention mechanism
CN117292243B (en) * 2023-11-24 2024-02-20 合肥工业大学 Method, equipment and medium for predicting magnetocardiogram signal space-time image based on deep learning
CN117435918B (en) * 2023-12-20 2024-03-15 杭州市特种设备检测研究院(杭州市特种设备应急处置中心) Elevator risk early warning method based on spatial attention network and feature division
CN117851897A (en) * 2024-03-08 2024-04-09 国网山西省电力公司晋城供电公司 Multi-dimensional feature fusion oil immersed transformer online fault diagnosis method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297885B (en) * 2019-05-27 2021-08-17 中国科学院深圳先进技术研究院 Method, device and equipment for generating real-time event abstract and storage medium
CN111080032B (en) * 2019-12-30 2023-08-29 成都数之联科技股份有限公司 Load prediction method based on transducer structure
CN112288595A (en) * 2020-10-30 2021-01-29 腾讯科技(深圳)有限公司 Power grid load prediction method, related device, equipment and storage medium

Also Published As

Publication number Publication date
CN115622047A (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN115622047B (en) Power Transformer load prediction method based on Transformer model
CN108711847B (en) A short-term wind power forecasting method based on an encoder-decoder long short-term memory network
CN109492823B (en) Method for predicting icing thickness of power transmission line
CN112633604B (en) Short-term power consumption prediction method based on I-LSTM
CN113554466B (en) Short-term electricity consumption prediction model construction method, prediction method and device
CN104951836A (en) Posting predication system based on nerual network technique
CN113379164B (en) Load prediction method and system based on deep self-attention network
Liu et al. Heating load forecasting for combined heat and power plants via strand-based LSTM
CN111160626B (en) Power load time sequence control method based on decomposition fusion
CN111553543A (en) Power load prediction method based on TPA-Seq2Seq and related assembly
CN115660161A (en) Medium-term and small-term load probability prediction method based on time sequence fusion Transformer model
CN111242351A (en) Tropical cyclone track prediction method based on self-encoder and GRU neural network
Kalogirou et al. Prediction of maximum solar radiation using artificial neural networks
CN112990597B (en) Ultra-short-term prediction method for industrial park power consumption load
CN111178585A (en) Fault reporting amount prediction method based on multi-algorithm model fusion
CN113111592A (en) Short-term wind power prediction method based on EMD-LSTM
CN114498619A (en) Wind power prediction method and device
CN116865251A (en) Short-term load probability prediction method and system
CN116014722A (en) Sub-solar photovoltaic power generation prediction method and system based on seasonal decomposition and convolution network
CN117977587A (en) Power load prediction system and method based on deep neural network
CN117277304A (en) Photovoltaic power generation ultra-short-term power prediction method and system considering sunrise and sunset time
CN116703644A (en) Attention-RNN-based short-term power load prediction method
Li et al. Electricity Sales Forecasting Based on Model Fusion and Prophet Model
CN117895473A (en) Power grid operation checking method and system based on various historical load values
Wang et al. Research of combination of electricity GM (1, 1) and seasonal time series forecasting model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant