CN112308402B

CN112308402B - Power time series data abnormity detection method based on long and short term memory network

Info

Publication number: CN112308402B
Application number: CN202011182119.7A
Authority: CN
Inventors: 沙朝锋; 耿同欣; 郑伟杰
Original assignee: Fudan University; China Electric Power Research Institute Co Ltd CEPRI; Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Current assignee: Fudan University; China Electric Power Research Institute Co Ltd CEPRI; Electric Power Research Institute of State Grid Liaoning Electric Power Co Ltd
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2022-04-12
Anticipated expiration: 2040-10-29
Also published as: CN112308402A

Abstract

The invention discloses an abnormality detection method for power time series data based on a long-term and short-term memory network. The method comprises the following steps: (1) preprocessing power time sequence data; (2) neural network model pre-training, using encoder-decoder structureCalculating the layer dynamic attention; (3) abnormal data detection in neural network models

After training is completed, the model weight W is stored locally, the model is directly loaded when new power time sequence data x are detected, and calculation is carried out

Distance from the representative point c

And obtaining the abnormal score of the user to judge whether the user is abnormal or not. The method is used for abnormality detection of the power time series data, and is simple and high in detection accuracy.

Description

Power time series data abnormity detection method based on long and short term memory network

Technical Field

The invention belongs to the technical field of data analysis and anomaly detection, and particularly relates to an anomaly detection method of power time series data based on a long-term and short-term memory network.

Background

The real-time operation data of the power system has the potential capability of reflecting the current operation state and the future development trend of the power system. Along with the rapid development of the intellectualization of the power system, the scale of embedding various sensors in the power system is continuously enlarged, so that the types of data collected by a sensing layer are more refined, and the data to be processed is increased rapidly. According to incomplete statistics, the power grid service data collected every day in a single city can reach a PB level. In general, the real-time operation data of the power system has the characteristics of multiple data acquisition devices, high acquisition frequency, large data scale, complex data types and the like. And the collected data is typical time-series data (TSD). The method has the advantages that the power time sequence data are fully utilized, the appropriate technology is adopted to carry out abnormity detection and timely find faults existing in the power system, and decision and auxiliary support can be provided for efficient and safe operation of the power system.

Disclosure of Invention

In order to achieve the above object, the present invention provides a method for detecting abnormality of power timing data based on long and short term memory network; the invention uses the long and short term memory network model in deep learning to analyze the power time sequence data and detect the abnormal data in the power time sequence data so as to help the power system to find the existing faults in time.

An abnormality detection method for power time series data based on a long-short term memory network comprises the following steps:

s1: preprocessing power time sequence data, removing unimportant features in the data, cleaning partial noise data, and taking a data preprocessing result as input of next model training;

s2: pre-training a neural network model, and calculating the layered dynamic attention by adopting an encoder-decoder structure; the basic model uses a Long short-term memory network (LSTM), the ReLU is used as an activation function, the loss function uses a custom loss function, the optimizer uses Adam, and the model is trained until convergence; this step is based on a power time series data set { x }₁,…,x_NObtaining initial parameters of the neural network model parameters of the next step and representative point representation c of the whole time sequence data set;

s3: abnormal data detection in neural network models

Distance from the representative point c

And obtaining the abnormal score of the user to judge whether the user is abnormal or not.

Compared with the prior art, the invention has the beneficial effects that:

the method comprises the steps of firstly preprocessing the existing data to obtain a group of relatively regular data sets without redundancy; the invention adopts the long-term and short-term memory network, and is more suitable for the analysis of time sequence data compared with other networks; the Adam optimizer is adopted, the Adam algorithm is a method for calculating the self-adaptive learning rate of each parameter, the algorithm needs less memory and is efficient in calculation, and the method is suitable for solving the problems of large-scale data and parameters; the method uses the custom loss function, performs specific optimization aiming at the abnormal detection, and can improve the network training efficiency and the abnormal detection accuracy; according to the invention, a ReLU activation function is used, and the single-side inhibition of the function enables neurons in a neural network to have sparse activation, so that a model after sparse realization through the ReLU can better mine relevant characteristics and fit training data; the invention uses the network pre-training technology, thereby greatly reducing the training time of the model.

Drawings

FIG. 1 is a flow chart of an anomaly detection method based on power time series data of a long-term and short-term memory network according to the present invention.

FIG. 2 is a block diagram of an encoder-decoder neural network architecture for pre-training a hierarchical attention mechanism in accordance with the present invention.

Fig. 3 shows the principle and structure of the method for detecting the abnormality of the power time series data based on the long-short term memory network according to the present invention.

Detailed Description

The following description of the embodiments of the present invention is provided in order to better understand the present invention for those skilled in the art with reference to the accompanying drawings. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.

Example 1

Although illustrative embodiments of the present invention have been described above to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited to the scope of the embodiments, and it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined and defined in the appended claims, and all changes that come within the meaning and range of equivalency of the claims are to be embraced therein.

As shown in fig. 1 and fig. 3, a method for detecting an abnormality of power time series data based on a long-short term memory network, the method comprising two parts of off-line training a model and detecting an abnormality based on the model by a power time series data analysis method based on a long-short term memory network model, comprises the following steps:

s1: and (3) preprocessing power time sequence data, removing unimportant features in the data, and cleaning part of noise data. The result of the data preprocessing is used as the input of the next model training;

s2: the method comprises the following steps of pre-training a neural network model, training the model until convergence, based on a hierarchical attention mechanism encoder-decoder neural network structure (figure 2), using a long-short-term memory network (LSTM) for a basic model, using Adam for an optimizer, using a custom loss function for a loss function, using a ReLU as an activation function, and training the model. In the training process, a back propagation algorithm is adopted, the algorithm is based on a chain rule of a complex function, a gradient descending mode is adopted, the gradient is intuitively understood to be first-order approximation, so the gradient can be understood to be a coefficient of the sensitivity of a certain variable or a certain intermediate variable to the output influence, and the understanding is more intuitive when the chain method is changed from multiplication to Jacobian matrix multiplication. To reduce training time, pre-training (pre-training) can be used to find a near-optimal solution for weights in the neural network.

The pre-trained model adopts a coder-decoder structure based on a hierarchical attention mechanism, and comprises the following modules:

2.1 encoder

Input orderColumn(s) of

For k variable sequence with length of l, the hidden state sequence of n encoder units is obtained by passing the k variable sequence through LSTM unit

Wherein X_iCorresponding encoder hidden state

2.2 decoder

Combining the previous decoder unit hidden state with the k-dimensional prediction result of the previous decoder unit

The current decoder hidden state is obtained by the LSTM unit. Thereby obtaining a sequence of hidden states of the decoder unit

Wherein

2.3 computing hierarchical dynamic attention

(1) Calculating attention weights

For each decoder unit, all encoder hidden states E obtained from step 2.1 and decoder hidden states obtained from step 2.2

Calculating attention alpha_i*The method used is a bilinear mapping method (i.e.,

) Or location-based methods (i.e. using a location-based method)

). Then to alpha_i*Use ofThe softmax is normalized to obtain the attention weight

(2) Computing dynamic attention context vectors

Using the attention weight obtained in the previous step

Set H of hidden states for all encoders and hidden states for decoders_iThe weighted sum is taken E ═ D, resulting in a dynamic attention context vector, i.e.,

(3) computing hierarchical attention hiding states

Obtained in the last step

After being connected with the hidden state of the decoder, the corresponding dynamic attention hidden state is obtained through a ReLU activation function

Then, a Max-Pooling or averaging method is used for calculating and obtaining the context vector of the encoder according to all the hidden states E of the encoder obtained in the step 2.1

Finally, the encoder context vector c_eAnd dynamic attention hidden state

By blending functions (serial affine transformation of two vectors, or pooling operations on both), we get a hierarchical attention hidden state

Specifically, the number of hidden units in the model is 128, a single-layer LSTM is used, and parameter estimation is performed by using an Adam optimization algorithm. The size of the model training batch is set to 512, the number of iterations is 500, the learning rate is set to 0.001, overfitting is prevented by using early stopping, and the time point of early stopping is judged by the error reduction trend of the model on the cross validation set.

S3: abnormal data detection in neural network models

Distance from the representative point c

Further, in step S3, a Long short-term memory (LSTM) is used, which is an excellent variant model of RNN, inherits most characteristics of RNN model, and solves the problem of gradient disappearance caused by gradual reduction in the gradient back propagation process, and the LSTM is very suitable for processing the highly time-related problems, such as the power time-series data concerned in this patent.

Specifically, the number of hidden units in the model is 96, single-layer LSTM is used, and parameter estimation is performed by using an Adam optimization algorithm. The model training batch size is set to 512, the number of iterations is 200, the learning rate is set to 0.001, and overfitting is prevented by using the advance stop

The beneficial effect of adopting the further scheme is as follows: the LSTM is more suitable for time sequence data than other networks, so that the addition of the LSTM effectively improves the effect of anomaly detection.

Further, in step S3, Adam is used as an optimization function, which is essentially RMSprop with momentum term, and it dynamically adjusts the learning rate of each parameter using the first moment estimate and the second moment estimate of the gradient. The formula is as follows:

(1)m_t＝μ*m_t-1+(1-μ)*g_t

(2)n_t＝v*n_t-1+(1-v)*g_t ²

(3)

where equations (1) (2) are the first and second order moment estimates, respectively, for the gradient, which can be considered as Eg_t|，E|g_t ²The estimate of | equation (3) (4) is a correction to the first order second moment estimate, which can be approximated as an unbiased estimate of the expectation. It can be seen that the moment estimation directly on the gradient has no additional requirements on the memory and can be dynamically adjusted according to the gradient. The last term is preceded by a dynamic constraint formed on the learning rate n and having a well-defined range.

The beneficial effect of adopting the further scheme is as follows: after offset correction, the learning rate of each iteration has a certain range, so that the parameters are relatively stable.

Further, step S3 uses a custom loss function, which is defined as follows:

herein, the

For the LSTM model, the parameters of the network are W, and a time sequence x is formed_iMapping into a hypersphere. The objective function described above requires that the raw data be mapped into a hypersphere, in order to contain as much data as possible within the hypersphere. Wherein the first term minimizes the volume of the hypersphere; the second term is a penalty term for those located outside the hypersphere,the hyper-parameter α is used to control a trade-off of hyper-sphere volume and over-boundary; the third term is a regularization term that controls the norm of the network parameter W to avoid model overfitting.

After the Adam completes the training of the model, the model is directly loaded and calculated when new power time sequence data x are detected

Distance from the representative point c

The beneficial effect of adopting the further scheme is as follows: the loss function is specifically optimized for anomaly detection, network training efficiency is effectively improved, and anomaly detection accuracy and recall rate are improved.

Further, step S3 uses ReLU as the activation function, whose formula is defined as follows:

ReLU(x)＝max{0,x}

the function changes all negative values into 0, and the positive values are unchanged, namely, unilateral inhibition is carried out, so that neurons in the neural network also have sparse activation.

The beneficial effect of adopting the further scheme is as follows: for a linear function, the expression capacity of the ReLU is stronger, and the ReLU is particularly embodied in a deep network; for the nonlinear function, because the Gradient of the non-negative interval is constant, the ReLU has no Problem of Gradient disappearance (cancellation Gradient distribution), so that the convergence rate of the model is maintained in a stable state.

Further, the step S3 initializes parameters of the LSTM model using the S2 pre-trained network parameters, the pre-trained model being a model trained on a large reference data set for solving a similar problem. Because the calculation cost for training the model is high, it is common practice to import published results and use the corresponding model, and slight adjustment of model parameters is performed on the basis of the model, so that the training process of the model is completed.

The beneficial effect of adopting the further scheme is as follows: the training speed of the model can be increased, the generated model can be stored in a weight mode and can be migrated to the solution of other problems, and the idea is adopted in the process of migration learning.

Claims

1. An abnormality detection method for power time series data based on a long-term and short-term memory network is characterized by comprising the following steps:

s2: pre-training a neural network model, and calculating the layered dynamic attention by adopting an encoder-decoder structure; the basic model uses a long-short term memory network LSTM and a ReLU as an activation function, the loss function uses a custom loss function, the optimizer uses Adam, and the model is trained until convergence; this step is based on the power time series data set { x }₁,…,x_NObtaining initial parameters of the neural network model parameters of the next step and representative point representation c of the whole time sequence data set;

s3: abnormal data detection in neural network models

After training is completed, the model weight W and the model are stored locally, the model is directly loaded when new power time sequence data x are detected, and calculation is carried out

Distance from the representative point c

Obtaining the abnormal score of the user to judge whether the user is abnormal; wherein:

in step S2, the method for calculating the hierarchical dynamic attention using the encoder-decoder structure is as follows:

(1) input sequence

Obtaining hidden state sequences of n encoder units by a k-dimensional variable sequence with the length of l through an LSTM unit

Wherein X_iCorresponding encoder hidden state

(2) Combining the previous decoder unit hidden state with the k-dimensional prediction result of the previous decoder unit

Obtaining the hidden state of the current decoder through the LSTM unit, and further obtaining the hidden state sequence of the decoder unit

Wherein

(3) Computing hierarchical dynamic attention

Calculating attention weight

For each decoder unit, according to the hidden state sequence E of all the encoder units obtained in the step (1) and the hidden state of the decoder obtained in the step (2)

Calculating attention alpha_i*The method used is a bilinear mapping method, i.e.

Or a location-based approach, the location of the user,namely, it is

Then to alpha_i*Normalization using softmax to obtain attention weight

② calculating dynamic attention context vector

Using the attention weight obtained in the previous step

computing layered attention hidden state

Obtained in the last step

And then calculating to obtain a context vector of the encoder according to the hidden state sequence E of the encoder unit obtained in the step (1) by using a Max-Pooling or averaging method

Finally, the encoder context vector c_eAnd dynamic attention hidden state

By mixed functions, i.e. affine transformations of two vectors in series, or pooling operations of bothTo do, get a layered attention hidden state

In step S2, the custom loss function is defined as follows:

herein, the

For another LSTM network, initializing the network with a parameter W from the trained parameters, and assigning a time sequence x_iMapping into a hypersphere.

2. The abnormality detection method according to claim 1, characterized in that in step S2, ReLU is selected based on the activation function of the neural network model, all negative values become 0, and positive values are unchanged, i.e., unilateral inhibition, so that neurons in the neural network also have sparse activation, and the formula is as follows:

ReLU(x)＝max{0,x}。

3. the abnormality detection method according to claim 1, characterized in that in step S2, Adam is selected as an optimization algorithm for training a neural network model, and the learning rate of each parameter is dynamically adjusted using first order moment estimation and second order moment estimation of the gradient; the formula is as follows:

m_t＝μ*m_t-1+(1-μ)*g_t (1)

wherein the formulas (1) and (2) are respectively for the gradient g_tThe first order moment estimate and the second order moment estimate of (d) are regarded as the expected value E | g_t|，E|g_t ²Estimate of | equation (3) (4) is a correction to the first order second moment estimate, approximated as an unbiased estimate of the expectation; the front part of equation (5) is a dynamic constraint formed on the learning rate η.