CN111461400B - Kmeans and T-LSTM-based load data completion method - Google Patents
- Publication number: CN111461400B
- Application number: CN202010128406.3A
- Authority: CN (China)
- Prior art keywords: data, load, day, load data, complemented
- Legal status: Active
Classifications
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
- G06F18/10—Pattern recognition: pre-processing; data cleansing
- G06F18/23213—Non-hierarchical clustering techniques with a fixed number of clusters, e.g. K-means clustering
- G06N3/044—Neural network architectures: recurrent networks, e.g. Hopfield networks
- G06N3/045—Neural network architectures: combinations of networks
- G06N3/048—Neural network architectures: activation functions
- G06N3/08—Neural networks: learning methods
- G06Q50/06—ICT specially adapted for energy or water supply
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a load data completion method based on Kmeans and T-LSTM, and relates to the field of data completion. Existing data completion methods produce large deviations from the true values and often fail to achieve the expected effect. The invention comprises the following steps: constructing a data model; training the data of the K load intervals separately to obtain K corresponding data models; collecting, at regular intervals, the load data of the day containing the data to be completed; calculating the average value of that day's load data; selecting the corresponding data model according to the average value; and inputting the load data to be completed into the selected data model to calculate the completed, complete load data. With this technical scheme, load data with similar characteristics are grouped into one class and interference from data with different characteristics is excluded, so the true load values of the missing data are reflected accurately. The method achieves accurate data completion with small error and a high convergence rate.
Description
Technical Field
The invention relates to data completion methods, and in particular to a load data completion method based on Kmeans and T-LSTM.
Background
Against the current background, the rapid development of information-industry technologies and the diversification of data-acquisition channels have caused the data volume of organizations in every industry to grow rapidly; for example, the power load data of the State Grid is already extremely large in volume and still growing quickly. Experience shows that these data contain much usable content: if the information underlying the data can be analyzed effectively and completely, extracting its latent value and applying it at the upper layers is of great interest.
However, most theoretical innovation and technical implementation in the current data-mining field assume ideal, complete data sets, whereas load data collected by real terminals are missing and incomplete for various reasons, such as terminal damage and loss of communication. Incomplete load data can distort or invalidate the results of data mining, or even lead to erroneous conclusions. Completing the missing data is therefore a particularly important, non-negligible link in the data-mining process.
Existing data completion methods include linear completion, interpolation completion, and the like. The linear completion algorithm estimates a missing value as the average of the data at the time instants immediately before and after the missing point; this method is simple, but its deviation from the true value is large and it often fails to achieve the expected effect. Moreover, many completion algorithms do not classify the historical load data, so the model is affected by abrupt changes in the load data and the error becomes too large. In addition, an LSTM (Long Short-Term Memory) network based on the time sequence completes data well when the data are continuous and the time intervals regular, but in practice the missing data are random, so completion with a plain LSTM network cannot meet the requirement.
Disclosure of Invention
The invention aims to solve the above technical problems by perfecting and improving the prior art, and provides a load data completion method based on Kmeans and T-LSTM so as to complete data accurately. To this end, the invention adopts the following technical scheme.
A load data completion method based on Kmeans and T-LSTM comprises the following steps:
1) Constructing a data model;
101) Acquiring load data in batches;
102) Randomly removing consecutive points from the load data to serve as the load data to be completed;
103) Performing Kmeans clustering on the load data;
104) Obtaining the optimal number K of classes through Kmeans clustering and dividing the total sample into K categories accordingly, each category corresponding to a different load interval, thereby obtaining K classified load intervals;
105) Calculating the load average value and normalizing the load data;
106) Determining the load interval according to the load average value, and inputting the normalized load data into the T-LSTM neural network of the corresponding load interval for training, thereby obtaining the data model of that load interval; training the data of the K load intervals separately yields K corresponding data models;
2) Collecting, at regular intervals, the load data of the day containing the data to be completed;
3) Calculating the average value of that day's load data;
4) Selecting the corresponding data model according to the average value;
5) Inputting the load data to be completed into the selected data model and calculating the completed, complete load data.
As a preferable technical means, when the data model is constructed:
in step 101), the acquired load data include, for a given unit, the load data of a given day and of the days 1 and 7 days before;
in step 102), consecutive points are randomly removed from the load data of the given day to serve as the load data to be completed;
in step 105), the average load of the given day is calculated, and the load data of the given day and of the days 1 and 7 days before are normalized.
As a preferable technical means: in step 2), in addition to the load data of the day containing the data to be completed, the load data of the day before and of the seventh day before are also collected at regular intervals;
in step 5), in addition to the data to be completed, the normalized load data of the previous day and of the seventh day before are input into the corresponding data model; the data model performs the completion according to the load data of the current day, the previous day and the seventh day before.
As a preferable technical means: in step 104), the K value used for Kmeans clustering is obtained by the elbow method.
As a preferable technical means: and when the step 1) is carried out to construct the data model, finally, a verification step is further included, the data with the missing is normalized and then is input into the corresponding data model, the historical information at the moment is supplemented, the historical data before yesterday and seven days are included, the complete sequence is finally obtained, then the complete sequence is compared with the real data to obtain an error, and after the error is converged, training is finished, and a final data model is obtained and stored.
The beneficial effects are as follows: in this technical scheme, the Kmeans method clusters the collected public-transformer load data, so load data with similar characteristics are grouped together and interference from data with different characteristics is excluded. The data of each category are then input into the T-LSTM neural network. Because the design of T-LSTM takes the missing pattern of the load data into account (some missing points are consecutive and some are not), the time interval Δt lets the network learn the interval information, so the true load values of the missing data can be reflected more accurately. The method achieves accurate data completion with small error and a high convergence rate.
Drawings
Fig. 1 is a flow chart of the present invention.
Fig. 2 is a graph of the sum of squares of cluster errors versus k of the present invention.
Fig. 3 is a diagram of the LSTM network structure of the present invention.
FIG. 4 is a diagram of the structure of the T-LSTM of the present invention.
Fig. 5 is a data model training diagram of the present invention.
FIG. 6 is a test flow chart of the present invention.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the attached drawings.
As shown in fig. 1, the present invention includes the steps of:
1) Constructing a data model;
101) Acquiring in batches, for a given unit, the load data of a given day and of the days 1 and 7 days before;
102) Randomly removing consecutive points from the load data to serve as the load data to be completed;
103) Performing Kmeans clustering on the load data;
104) Obtaining the optimal number K of classes through Kmeans clustering and dividing the total sample into K categories accordingly, each category corresponding to a different load interval, thereby obtaining K classified load intervals;
105) Calculating the average load of the given day, and normalizing the load data of the given day and of the days 1 and 7 days before;
106) Determining the load interval according to the load average value, and inputting the normalized load data into the T-LSTM neural network of the corresponding load interval for training, thereby obtaining the data model of that load interval; training the data of the K load intervals separately yields K corresponding data models;
2) Collecting, at regular intervals, the load data of the day containing the data to be completed, of the day before, and of the seventh day before;
3) Calculating the average value of that day's load data;
4) Selecting the corresponding data model according to the average value;
5) Inputting the load data to be completed, together with the normalized load data of the previous day and of the seventh day before, into the corresponding data model, and calculating the completed, complete load data.
The following further describes some of the steps:
Kmeans clustering: the K value is obtained by the elbow method; the clustering effect is best at the point of the error curve where the curvature is largest.
This technical scheme uses the elbow method to determine the number of clusters k. Its core idea is the following: while k is smaller than the true number of clusters, increasing k greatly improves the cohesion of each cluster, so the sum of squared clustering errors over all samples drops sharply; once k reaches the true number of clusters, the return on increasing k further diminishes rapidly, so the decrease of the sum of squared errors slows abruptly and then flattens as k keeps growing. The plot of the sum of squared clustering errors against k therefore has an elbow shape, and the k at the elbow (highest curvature) is the true number of clusters in the data; this property is used to determine K.
Because different public transformers have different power-supply characteristics, their daily load profiles each have their own shape and their absolute load values differ greatly. Cluster analysis is therefore used to classify the data and to eliminate interference between samples with different power-supply characteristics. The total sample is divided into several categories by Kmeans clustering, and each category serves as the training sample of its own data-completion network. Specifically: the 96 load values of one day for 4,000 public transformers of the Jinhua bureau, together with their daily load averages, are taken as sample features and input into the Kmeans clustering model; the plot of the sum of squared clustering errors (the sum of squared differences between the sample load values and the center-point load values) against k is shown in Fig. 2. Since the curve decreases relatively quickly before k = 3 and only gradually from 3 onwards, the number of Kmeans clusters can be taken as 3 (highest curvature).
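The elbow curve described above can be sketched with a minimal NumPy k-means on synthetic data. The three load levels, noise scale and deterministic farthest-point initialization below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def init_centers(X, k):
    # deterministic farthest-point initialization keeps the demo reproducible
    centers = [X[0]]
    for _ in range(1, k):
        d = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[int(d.argmax())])
    return np.array(centers)

def kmeans_sse(X, k, n_iter=30):
    """Run plain k-means and return the sum of squared clustering errors (SSE)."""
    centers = init_centers(X, k)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                         # assign to nearest center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)  # recompute centers
    return float(((X - centers[labels]) ** 2).sum())

# synthetic 96-point daily load curves around three hypothetical load levels
rng = np.random.default_rng(1)
X = np.vstack([lvl + rng.normal(0, 5, (40, 96)) for lvl in (50.0, 200.0, 800.0)])

sse = {k: kmeans_sse(X, k) for k in range(1, 7)}
# SSE drops sharply until k reaches the true cluster count (3), then flattens
```

Plotting `sse` against k reproduces the elbow shape of Fig. 2: the drop from k = 2 to k = 3 is large, while further increases in k buy almost nothing.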
T-LSTM (a variant of the long short-term memory network): the T-LSTM neural network handles the completion of missing data well, because it takes into account the uncertainty of missing load data, in which several consecutive points may be missing.
LSTM was originally proposed by Hochreiter et al. and improved by Graves; it is a modified recurrent neural network designed to address the gradient-explosion and long-term-dependency problems of the native RNN, as shown in Fig. 3. The main work of LSTM is to modify the internal structure of the RNN and to control how long information is remembered by adding several gates; for example, the forget gate filters information so that useful information can be remembered for longer.
The formulas are as follows:

g_t = tanh(W_g·x_t + U_g·h_{t-1} + b_g)
i_t = σ(W_i·x_t + U_i·h_{t-1} + b_i)
f_t = σ(W_f·x_t + U_f·h_{t-1} + b_f)
o_t = σ(W_o·x_t + U_o·h_{t-1} + b_o)
c_t = f_t·c_{t-1} + i_t·g_t
h_t = o_t·tanh(c_t)

where h_t, c_t ∈ R^H, H is the hidden-layer size, σ(·) is the sigmoid function, and i, f, o, g denote the input gate, forget gate, output gate and candidate cell state, respectively. {W_g, U_g, b_g}, {W_i, U_i, b_i}, {W_f, U_f, b_f} and {W_o, U_o, b_o} are the network parameters of the corresponding parts. More specifically, the input gate i adjusts how much of the new input enters the cell, the forget gate f adjusts how much of the history is forgotten, and the output gate o weights the different parts when computing the output.
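The gate equations above can be traced in a few lines of NumPy. This is a forward pass only, with small random parameters for illustration (not the patent's trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One forward step of the LSTM cell defined by the equations above."""
    g = np.tanh(P["Wg"] @ x_t + P["Ug"] @ h_prev + P["bg"])   # candidate cell state
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])   # input gate
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])   # forget gate
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])   # output gate
    c = f * c_prev + i * g                                    # new cell state
    h = o * np.tanh(c)                                        # new hidden state
    return h, c

# tiny demo: hidden size H=4, scalar input, random illustrative parameters
rng = np.random.default_rng(0)
H, D = 4, 1
P = {k + n: rng.normal(0.0, 0.1, dims) for n in "gifo"
     for k, dims in (("W", (H, D)), ("U", (H, H)), ("b", (H,)))}
h = c = np.zeros(H)
for x in (0.3, 0.5, 0.2):          # a short normalized load sequence
    h, c = lstm_step(np.array([x]), h, c, P)
```

Note that h is always bounded in (-1, 1) because it is a product of a sigmoid output and a tanh output.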
However, for data with missing values the input is discontinuous and the time intervals are irregular, which a plain LSTM network cannot handle well; this technical scheme therefore adopts a T-LSTM network that takes the time interval into account, as shown in Fig. 4: Δt is added to the input, with the other parameters unchanged, so that the network learns the interval information.
The portion in which T-LSTM improves on LSTM is as follows:

g(Δt) = 1 / log(e + Δt)

h_t = o_t·tanh(c_t)

where Δt is the time interval of the current input and g(Δt) is an elapsed-time weight that discounts the memory carried over from the previous step. The definitions of the input gate, output gate and forget gate are identical to those of LSTM; only the update of the cell state differs. Compared with LSTM, T-LSTM considers not only the value of the current input but also its time interval, which solves the problem of inconsistent intervals in a time series with missing values. Each T-LSTM cell takes the cell state c_{t-1} and hidden state h_{t-1} of the previous time instant, the current input value x_t and the time interval Δt, produces the cell state c_t and hidden state h_t of this cell, and passes them to the next T-LSTM cell.
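A sketch of one T-LSTM step follows. It uses a simplified variant in which the whole previous cell state is discounted by g(Δt); the published T-LSTM decays only a learned short-term component of the memory, a detail the patent text leaves to Fig. 4. The parameters and inputs are random illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_decay(dt):
    # g(Δt) = 1 / log(e + Δt): close to 1 for small gaps, shrinking as the gap grows
    return 1.0 / np.log(np.e + dt)

def tlstm_step(x_t, dt, h_prev, c_prev, P):
    c_star = time_decay(dt) * c_prev   # discount carried-over memory by elapsed time
    g = np.tanh(P["Wg"] @ x_t + P["Ug"] @ h_prev + P["bg"])
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ h_prev + P["bi"])
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ h_prev + P["bf"])
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ h_prev + P["bo"])
    c = f * c_star + i * g             # standard update, applied to the decayed state
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 1
P = {k + n: rng.normal(0.0, 0.5, dims) for n in "gifo"
     for k, dims in (("W", (H, D)), ("U", (H, H)), ("b", (H,)))}

def run(gaps, xs=(0.8, 0.6, 0.7)):
    h = c = np.zeros(H)
    for x, dt in zip(xs, gaps):
        h, c = tlstm_step(np.array([x]), dt, h, c, P)
    return h

h_dense = run([1, 1, 1])     # the same values arriving at regular intervals
h_gappy = run([1, 20, 20])   # the same values arriving after long gaps
```

The two final hidden states differ even though the input values are identical, which is exactly the interval-awareness a plain LSTM lacks.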
Calculating the average value of the load data to be completed: this determines which load-interval class the data to be completed belong to.
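A minimal sketch of this lookup follows. The interval boundaries here are hypothetical; in practice they come from the Kmeans clustering step:

```python
import numpy as np

def pick_interval(day_load, bounds):
    """Map the mean of the observed daily load to a load-interval index."""
    day_mean = float(np.nanmean(day_load))   # NaN marks the points to be completed
    for idx, (lo, hi) in enumerate(bounds):
        if lo <= day_mean < hi:
            return idx
    return len(bounds) - 1

# hypothetical K=3 load intervals (kW) produced by clustering
bounds = [(0.0, 100.0), (100.0, 400.0), (400.0, float("inf"))]
idx = pick_interval(np.array([50.0, np.nan, 55.0]), bounds)  # → 0
```

Using `nanmean` means the missing points themselves do not distort the day average used to choose the model.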
Training a data model:
as shown in fig. 5, in order to improve accuracy, in step 1), when the data model is constructed, a verification step is finally included, the data with the missing is normalized and then input into the corresponding data model, and the historical information of this moment is supplemented, including the historical data before yesterday and seven days, so as to finally obtain a complete sequence, then the complete sequence is compared with the real data to obtain an error, and when the error converges, training is finished, and a final data model is obtained and stored. The complete model training process comprises the following steps: extracting data as a training data set, performing kmeans clustering to obtain n kinds of load data and load intervals after data processing, normalizing the data with the defects, then encoding by using T-LSTM to obtain a Temporal context, then inputting the Temporal context into a decoder taking the LSTM as a unit, assisting with the historical information of the moment, including the historical data before yesterday and seven days, finally obtaining a complete decoded sequence, comparing the complete decoded sequence with real data to obtain errors, and after error convergence, finishing training to obtain K models and storing.
The model training is described below using the Jinhua bureau data as an example:
1. Public-transformer load data of the Jinhua bureau were prepared: 5,000 public transformers over the period from November 2018 to May 2019, 8 months in total.
2. The training data set is processed: consecutive missing points are dug out as the data to be completed, and the average load of each day to be completed is obtained.
3. The processed training data are input into Kmeans for clustering, yielding K classes.
4. The data of each of the K classes, augmented with the load data of 1 day and 7 days before, are normalized and input into the T-LSTM network for encoding, yielding a sample context.
5. The sample context is input into the LSTM decoder, and the output is compared with the real data to obtain the error.
6. If the error has not converged, training continues.
7. Once the error converges, training is finished; the K models are obtained and stored.
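Step 2 of the list above, digging out a run of consecutive points and normalizing the day's curve, can be sketched like this. Min-max normalization is an assumption; the patent does not spell out the normalization formula:

```python
import numpy as np

def mask_consecutive(day_load, width=5, seed=0):
    """Remove `width` consecutive points; NaN marks the gap to be completed."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, len(day_load) - width + 1))
    masked = day_load.astype(float).copy()
    masked[start:start + width] = np.nan
    return masked, start

def minmax_normalize(day_load):
    """Scale the day's curve to [0, 1] using its observed (non-NaN) points."""
    obs = day_load[~np.isnan(day_load)]
    lo, hi = obs.min(), obs.max()
    return (day_load - lo) / (hi - lo)

day = 100.0 + 30.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 96))  # synthetic 96-point day
masked, start = mask_consecutive(day)
norm = minmax_normalize(masked)
```

Keeping the gap as NaN (rather than zero) makes it easy to distinguish "missing" from "genuinely low load" in the later stages of the pipeline.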
On the basis of the K models obtained, the data-completion flow is described below, again using the Jinhua bureau data as an example:
The data set consists of 221 days of data from the Jinhua bureau starting in November 2018, covering 174 users with 96 load points per day. About 1% of the points are removed manually (approximating the real data-loss rate), always in runs of 5 consecutive points, which is close to real-world loss.
The specific steps are as follows:
1. The data were prepared: 221 days from November 2018 from the Jinhua bureau, with 174 public-transformer users and 96 load points per day.
2. The data to be completed and the load data of the previous day and of 7 days before were collected in batches.
3. Runs of 5 consecutive points to be completed were manually dug out for verification, and the average of each day's data to be completed was calculated.
4. The load-interval class to which the load average belongs was determined.
5. The data were normalized.
6. The missing values, together with the historical information of the same time instants (the historical data of the previous day and of the seventh day before), were input into the trained model, finally yielding the completed load data.
The mean absolute error and mean absolute percentage error of the test data are obtained by comparing the completed load data with the original data; the results are shown in Table 1. The left column shows the results of the present method and the right column those of the linear model (which fills a missing value with the average of the points immediately before and after it); mae is the mean absolute error and mape is the mean absolute percentage error. The method outperforms the linear model, with a percentage error of about 10% when the load values are relatively large.
Table 1: test results
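The two error metrics, together with the linear baseline described above (filling the gap with the average of the points just before and after it), can be sketched as follows. The synthetic day is illustrative, not the Jinhua data:

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def linear_fill(series, start, width):
    """Baseline: fill the gap with the average of its two neighbouring points."""
    filled = series.copy()
    filled[start:start + width] = (series[start - 1] + series[start + width]) / 2.0
    return filled

day = 100.0 + 30.0 * np.sin(np.linspace(0.0, 2.0 * np.pi, 96))  # synthetic true curve
start, width = 40, 5
filled = linear_fill(day, start, width)
gap_mae = mae(day[start:start + width], filled[start:start + width])
gap_mape = mape(day[start:start + width], filled[start:start + width])
```

Evaluating the metrics only over the masked positions, as done here, matches how a completion method is normally scored: the untouched points would otherwise dilute the error.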
The load data completion method based on Kmeans and T-LSTM shown in Figs. 1-6 is a specific embodiment of the present invention that already exhibits its essential features and improvements; it may be modified in shape, structure and other respects according to practical requirements under the teaching of the present invention, and all such modifications fall within the scope of protection of the present invention.
Claims (5)
1. A load data completion method based on Kmeans and T-LSTM, characterized by comprising the following steps:
1) Constructing a data model;
101) Acquiring load data in batches;
102) Randomly removing consecutive points from the load data to serve as the load data to be completed;
103) Performing Kmeans clustering on the load data;
104) Obtaining the optimal number K of classes through Kmeans clustering and dividing the total sample into K categories accordingly, each category corresponding to a different load interval, thereby obtaining K classified load intervals;
105) Calculating the load average value and normalizing the load data;
106) Determining the load interval according to the load average value, and inputting the normalized load data into the T-LSTM neural network of the corresponding load interval for training, thereby obtaining the data model of that load interval; training the data of the K load intervals separately yields K corresponding data models;
2) Collecting, at regular intervals, the load data of the day containing the data to be completed;
3) Calculating the average value of that day's load data;
4) Selecting the corresponding data model according to the average value;
5) Inputting the load data to be completed into the selected data model and calculating the completed, complete load data.
2. The Kmeans and T-LSTM based load data completion method of claim 1, characterized in that, when the data model is constructed:
in step 101), the acquired load data include, for a given unit, the load data of a given day and of the days 1 and 7 days before;
in step 102), consecutive points are randomly removed from the load data of the given day to serve as the load data to be completed;
in step 105), the average load of the given day is calculated, and the load data of the given day and of the days 1 and 7 days before are normalized.
3. The Kmeans and T-LSTM based load data completion method of claim 2, characterized in that: in step 2), in addition to the load data of the day containing the data to be completed, the load data of the day before and of the seventh day before are also collected at regular intervals;
in step 5), in addition to the data to be completed, the normalized load data of the previous day and of the seventh day before are input into the corresponding data model; the data model performs the completion according to the load data of the current day, the previous day and the seventh day before.
4. The Kmeans and T-LSTM based load data completion method of claim 3, characterized in that: in step 104), the K value used for Kmeans clustering is obtained by the elbow method.
5. The Kmeans and T-LSTM based load data completion method of claim 2, characterized in that: when the data model is constructed in step 1), a final verification step is also included: the data containing the missing values are normalized and input into the corresponding data model, supplemented with the historical information of the same time instants, including the historical data of the previous day and of seven days before, to finally obtain a complete sequence; the complete sequence is then compared with the real data to obtain an error, and once the error has converged, training is finished and the final data model is obtained and stored.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202010128406.3A | 2020-02-28 | 2020-02-28 | Kmeans and T-LSTM-based load data completion method
Publications (2)
Publication Number | Publication Date |
---|---|
CN111461400A CN111461400A (en) | 2020-07-28 |
CN111461400B true CN111461400B (en) | 2023-06-23 |
Family
ID=71682448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010128406.3A Active CN111461400B (en) | 2020-02-28 | 2020-02-28 | Kmeans and T-LSTM-based load data completion method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111461400B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107833153A (en) * | 2017-12-06 | 2018-03-23 | 广州供电局有限公司 | A kind of network load missing data complementing method based on k means clusters |
CN109598381A (en) * | 2018-12-05 | 2019-04-09 | 武汉理工大学 | A kind of Short-time Traffic Flow Forecasting Methods based on state frequency Memory Neural Networks |
CN109754113A (en) * | 2018-11-29 | 2019-05-14 | 南京邮电大学 | Load forecasting method based on dynamic time warping Yu length time memory |
CN109934375A (en) * | 2018-11-27 | 2019-06-25 | 电子科技大学中山学院 | Power load prediction method |
CN110245801A (en) * | 2019-06-19 | 2019-09-17 | 中国电力科学研究院有限公司 | A kind of Methods of electric load forecasting and system based on combination mining model |
CN110334726A (en) * | 2019-04-24 | 2019-10-15 | 华北电力大学 | A kind of identification of the electric load abnormal data based on Density Clustering and LSTM and restorative procedure |
CN110674999A (en) * | 2019-10-08 | 2020-01-10 | 国网河南省电力公司电力科学研究院 | Cell load prediction method based on improved clustering and long-short term memory deep learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190143517A1 (en) * | 2017-11-14 | 2019-05-16 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems and methods for collision-free trajectory planning in human-robot interaction through hand movement prediction from vision |
2020-02-28: CN202010128406.3A, patent CN111461400B (CN), status Active
Non-Patent Citations (2)
Title |
---|
T-LSTM: A Long Short-Term Memory Neural Network Enhanced by Temporal Information for Traffic Flow Prediction;LUNTIAN MOU;IEEE ACCESS;98053-98061 * |
Location prediction model based on ST-LSTM network;XU Fangfang;Computer Engineering;1-7 *
Also Published As
Publication number | Publication date |
---|---|
CN111461400A (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111738512B (en) | Short-term power load prediction method based on CNN-IPSO-GRU hybrid model | |
CN107315884B (en) | Building energy consumption modeling method based on linear regression | |
CN106022521B (en) | Short-term load prediction method of distributed BP neural network based on Hadoop architecture | |
WO2018045642A1 (en) | A bus bar load forecasting method | |
CN105488528B (en) | Neural network image classification method based on improving expert inquiry method | |
CN111563706A (en) | Multivariable logistics freight volume prediction method based on LSTM network | |
CN111814956B (en) | Multi-task learning air quality prediction method based on multi-dimensional secondary feature extraction | |
CN111178611B (en) | Method for predicting daily electric quantity | |
CN106022954B (en) | Multiple BP neural network load prediction method based on grey correlation degree | |
CN110674999A (en) | Cell load prediction method based on improved clustering and long-short term memory deep learning | |
CN106251001A (en) | A kind of based on the photovoltaic power Forecasting Methodology improving fuzzy clustering algorithm | |
CN111008726B (en) | Class picture conversion method in power load prediction | |
CN114065653A (en) | Construction method of power load prediction model and power load prediction method | |
CN114528949A (en) | Parameter optimization-based electric energy metering abnormal data identification and compensation method | |
CN112241836B (en) | Virtual load leading parameter identification method based on incremental learning | |
CN111353603A (en) | Deep learning model individual prediction interpretation method | |
CN109214444B (en) | Game anti-addiction determination system and method based on twin neural network and GMM | |
CN116542701A (en) | Carbon price prediction method and system based on CNN-LSTM combination model | |
CN113627594B (en) | One-dimensional time sequence data augmentation method based on WGAN | |
CN112766537B (en) | Short-term electric load prediction method | |
CN113762591A (en) | Short-term electric quantity prediction method and system based on GRU and multi-core SVM counterstudy | |
CN111783688B (en) | Remote sensing image scene classification method based on convolutional neural network | |
CN111461400B (en) | Kmeans and T-LSTM-based load data completion method | |
CN111311025B (en) | Load prediction method based on meteorological similar days | |
CN110288002B (en) | Image classification method based on sparse orthogonal neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||