CN114819369A - Short-term wind power prediction method based on two-stage feature selection and random forest improvement model - Google Patents

Short-term wind power prediction method based on two-stage feature selection and random forest improvement model

Info

Publication number
CN114819369A
Authority
CN
China
Prior art keywords
random forest
model
samples
prediction
training
Prior art date
Legal status
Pending
Application number
CN202210491926.XA
Other languages
Chinese (zh)
Inventor
史坤鹏
李婷
安军
周毅博
刘座铭
姜旭
郭雷
曲绍杰
蒋宪军
赵亮
Current Assignee
State Grid Jilin Electric Power Corp
Northeast Electric Power University
Original Assignee
Northeast Dianli University
State Grid Jilin Electric Power Corp
Priority date
Filing date
Publication date
Application filed by Northeast Dianli University, State Grid Jilin Electric Power Corp filed Critical Northeast Dianli University
Priority to CN202210491926.XA
Publication of CN114819369A

Classifications

    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/24323 Tree-organised classifiers
    • G06N20/20 Ensemble learning
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G06Q50/06 Energy or water supply
    • H02J3/004 Generation forecast, e.g. methods or systems for forecasting future energy generation


Abstract

The invention relates to the technical field of new-energy power generation, and in particular to a short-term wind power prediction method based on two-stage feature selection and an improved random forest model. The method comprises training-sample screening based on two-stage feature selection and wind power prediction based on an improved random forest model. Following the maximum-relevance, minimum-redundancy principle, key-feature selection and affinity-sample screening are added to the data-preprocessing stage of the training sample set, and the improved random forest model is constructed through training-sample resampling, random feature extraction and decision-tree recombination. To address the out-of-bag sample set's excessive dependence on the characteristics of the training samples, an external inspection index based on the numerical-weather-prediction (NWP) wind speed is proposed, further strengthening the random forest model's adaptability to unknown data. The method improves wind power prediction accuracy and offers high computational efficiency and strong anti-interference capability.

Description

Short-term wind power prediction method based on two-stage feature selection and random forest improvement model
Technical Field
The invention belongs to the technical field of new energy power generation, and particularly relates to a short-term wind power prediction method based on a two-stage feature selection and random forest improvement model.
Background
Short-term wind power prediction is of great significance for optimizing power-dispatch schedules and raising the level of wind power the grid can accommodate. Owing to the nature of wind energy resources, wind power is highly random and weakly regular, and the accuracy and adaptability of existing short-term wind power prediction still need improvement.
Wind power prediction methods are generally based on neural network models and their variants, which often suffer from overfitting during training and insufficient generalization capability. Current improvement methods fall into three categories. The first optimizes the training objective and controls model complexity, i.e. adds a regularization term, slack variable or structural-risk term to the objective or loss (cost) function to relax the training objective, or terminates training early; common examples include Bayesian neural networks, support vector machines (SVM), convolutional neural networks (CNN) in deep learning, and the Dropout technique for pruning network nodes. The second improves the training procedure to increase the diversity of data features, namely: 1) periodically updating training samples with measured data and applying rolling correction to the model parameters; 2) recombining input samples with cross-validation (CV) during training, e.g. k-fold cross-validation, leave-one-out, or holdout validation. In recent years, data augmentation techniques have emerged that repeatedly resample a limited sample set according to prior knowledge to increase the diversity and randomness of sample features, such as the Bagging method, ensemble learning and random forests. The third adopts combined prediction to strengthen model adaptability: by analysing the suitability of different algorithms for different feature components of the prediction target, or at each stage of the prediction process, the model combination is made applicable to multiple prediction scenarios and the prediction results are brought closer to reality.
Taking these improvement measures together, the invention provides a short-term wind power prediction method based on two-stage feature selection and an improved random forest model: 1) affinity samples are randomly resampled with the Bagging method to increase the feature diversity of the training samples, which belongs to the second category of improvement; 2) the random forest is a combined model of many decision trees, and sorting, pruning and recombining the decision trees by performance further improves the generalization capability of the random forest model, which also draws on the first and third categories.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a short-term wind power prediction method based on two-stage feature selection and an improved random forest model that has high prediction accuracy, high computational efficiency and strong anti-interference capability, and that overcomes the overfitting and insufficient generalization capability of existing wind power prediction models.
The technical scheme adopted to solve this problem is a short-term wind power prediction method based on two-stage feature selection and an improved random forest model, characterized in that it comprises the following steps:
step 1: training sample screening based on two-stage feature selection
1.1 Key feature selection
The historical data of the wind farm consist of 10 feature variables: air temperature, air pressure, humidity, wind direction, wind speed at 10 m, 30 m, 50 m (hub height), 70 m and 100 m, and historical wind power. The importance of these 10 feature variables is evaluated with the following two methods.
Method 1: evaluate the importance of each feature variable with a random forest model. The importance indices of the 10 feature variables differ markedly: the 50 m hub-height wind speed and the historical wind power have high importance and are classified as key feature variables, while the other feature variables are removed as redundant features.
Method 2: take each single feature variable in turn as the input and train a random forest model, yielding 10 prediction-error curves corresponding to the 10 feature variables. From the grouping of the error curves, the models trained on the 50 m hub-height wind speed and the historical wind power have small overall prediction errors, so these are classified as key feature variables and the others are removed as redundant features.
1.2 Affinity sample screening
Besides removing redundant feature variables, an affinity sample set strongly related to the prediction target must be screened from the massive history of the key feature variables. The calculation proceeds in two steps:
a) Construction of daily samples: convert the wind-farm historical data, including wind power P and wind speed V, into daily data samples {P_1, P_2, ..., P_N} and {V_1, V_2, ..., V_N}; if the history spans 1 year, N = 365, which satisfies the requirement of day-ahead wind power prediction;
b) Screening affinity samples: after normalizing the daily wind power data samples, compute the association-degree index between each day and the prediction-day reference sample, sort the days in descending order of association degree, and screen out the first 2M strongly correlated samples (M = 20) as the affinity sample set {P_M1, P_M2, ..., P_MM} and {V_M1, V_M2, ..., V_MM}; the input sample set of the random forest model is then {P_M1, P_M2, ..., P_MM, V_M1, V_M2, ..., V_MM}. Considering the generalized correlation between the input features and the prediction target, a mutual-information association index is adopted;
step 2: wind power prediction based on random forest improvement model
2.1 basic flow of random forests
Random forest (RF) is an ensemble learning method with a parallel combination structure, built on the Bagging method and random subspace theory. The calculation process of the random forest is as follows:
1) apply with-replacement random resampling (the Bagging method) to the original sample set to obtain several training sample subsets, each used to train one decision tree; the out-of-bag (OOB) sample set gives an unbiased estimate of the generalization error of the random forest model.
Assuming the original training sample set contains N samples, the probability that a given sample is never drawn is
P(x ∉ S_i) = (1 − 1/N)^N ≈ e^{−1} ≈ 0.368
where x is a training sample and S_i is the i-th training sample subset.
The formula shows that about 36.8% of the samples in the original sample set do not appear among the training samples; the samples that are not drawn form the out-of-bag sample set (OOB), which is used to estimate the generalization error of the random forest model and is comparable to k-fold cross-validation;
2) based on random subspace theory, randomly select a subset of the feature variables of the training samples to participate in training each decision tree, splitting the training samples from the top down until the set leaf-node size is reached;
3) repeat steps 1) and 2) T times to train T decision trees, and combine the T decision trees into the random forest model. Ensemble learning theory shows that, provided each decision tree's error is below 50%, the overall error of the random forest decreases as the number of decision trees grows and finally approaches a relatively stable lower bound;
4) apply the trained random forest to the test sample set and combine the predictions of the decision trees according to an integration rule to obtain the random forest's prediction. For classification problems the integration rule is mainly voting; for regression and prediction problems it is mainly averaging;
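The 36.8% out-of-bag fraction in step 1) follows from the formula above and can be checked numerically; a minimal sketch (not part of the patent):

```python
# Check that the probability of a sample never being drawn in N
# with-replacement draws, (1 - 1/N)^N, approaches e^{-1} ~= 0.368.
import math

for n in (10, 100, 1000, 10_000):
    p_missed = (1 - 1 / n) ** n
    print(f"N={n:>6}: P(not drawn) = {p_missed:.4f}")

print(f"limit  e^-1:  {math.exp(-1):.4f}")
```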
2.2 random forest improvement model
A random forest model integrates a large number of decision trees and involves many feature variables, which reduces model interpretability; in particular, the double randomness that is introduced makes the random forest behave like a "black box": its internal optimization process lacks observability and controllability, its physical interpretability is poor, and its reliability remains to be assessed. Moreover, the OOB samples are drawn from the training sample set and follow the same distribution as the training samples, so they can hardly go beyond the inherent characteristics of the training set; generalization-error evaluation based on OOB samples is therefore still an internal validation. In the decision-tree performance-evaluation stage, new samples should be added and an external inspection index introduced to improve the model's generalization to unknown samples. Accordingly, an improved random forest model based on an external inspection index and decision-tree recombination is proposed:
adding links of screening and recombining decision trees
The Bagging random-resampling strategy strengthens the independence of the sub-models and improves the generalization capability of the random forest model. The Bagging method is modified according to random subspace theory so as to increase the proportion of out-of-bag (OOB) samples. Based on the idea of selective ensemble learning, a decision-tree screening and recombination stage is added to the random forest model: each trained decision tree is evaluated for prediction performance, and trees with poor prediction performance are removed, thereby weakening the adverse effect of spurious samples on random forest training;
external inspection index based on NWP wind speed
The OOB error is an unbiased estimate of the model generalization error, but it can only estimate the generalization error associated with the training samples and thus remains an internal inspection index. Because the prediction-day samples of day-ahead wind power differ greatly from the training samples, the generalization error for prediction-day samples estimated from the OOB error index is invalid.
Transfer learning (TL) helps to improve model generalization, namely by increasing the feature similarity between training and prediction samples to improve model transferability from the training domain to the target domain. Therefore, in the decision-tree screening stage, an external inspection index referenced to the numerical weather prediction (NWP) is proposed: the prediction result of each decision tree is correlated with the NWP wind speed of the prediction day, a subset of decision trees strongly related to the prediction-day wind speed is screened out according to the association-degree index, and a new random forest is recombined from them to enhance the random forest's generalization on the prediction set.
Through the above design, the invention achieves the following beneficial effects:
1) after the decision trees of the random forest model are sorted, the prediction error first decreases and then increases, with an inflection point; adding the decision-tree screening and recombination stage therefore yields a random forest with smaller prediction error and lower training cost;
2) compared with the original OOB error index, the external inspection index based on the NWP wind-speed feature further improves the generalization capability of the random forest model and reduces the prediction error;
3) the method is scientific and reasonable, widely applicable, computationally efficient and robust to interference.
Drawings
FIG. 1 is a schematic view of the flow structure of the present invention;
FIG. 2(a) is a schematic diagram of the importance evaluation of all feature variables;
FIG. 2(b) is a schematic diagram of OOB error indicator estimation for each feature variable;
FIG. 3 is a basic flow diagram of a random forest;
FIG. 4(a) is a schematic diagram of the relevance index of each decision tree in a random forest;
FIG. 4(b) is a diagram illustrating the generalized error variation of the original random forest model;
FIG. 4(c) is a schematic diagram illustrating descending order of relevance indicators of each decision tree;
FIG. 4(d) is a schematic diagram of the generalized error variation of the random forest model after descending order arrangement;
FIG. 4(e) is a schematic diagram of a decision tree with a selected relevancy indicator greater than the average;
FIG. 4(f) is a schematic diagram of the generalized error variation of the random forest model after the decision tree reorganization;
FIG. 5(a) is a diagram illustrating the prediction result of the BP neural network on the training data set;
FIG. 5(b) is a diagram illustrating the prediction result of the BP neural network on the prediction data set;
FIG. 6(a) is a diagram illustrating the prediction results of a random forest model on a training data set;
fig. 6(b) is a schematic diagram of the prediction result of the random forest model on the prediction data set.
Detailed Description
The invention is further described with reference to the following figures and detailed description:
in order to make the public fully understand the technical spirit and the beneficial effects of the invention, the applicant will describe in detail the specific embodiments of the invention with reference to the attached drawings, but the description of the embodiments by the applicant is not a limitation of the technical solution, and any changes made in the form of the inventive concept rather than the essential change should be regarded as the protection scope of the invention.
Referring to fig. 1, the short-term wind power prediction method based on two-stage feature selection and an improved random forest model comprises the following steps:
step 1: training sample screening based on two-stage feature selection
1.1 Key feature selection
The data sources for wind power prediction mainly include: historical generated-power data of the wind farm in recent years, historical meteorological measurements from the wind-measurement tower, and NWP data for the coming days. A model training sample generally consists of 10 feature variables: air temperature, air pressure, humidity, wind direction, wind speed at 10 m, 30 m, 50 m (hub height), 70 m and 100 m, and historical wind power. The importance of these 10 feature variables is evaluated with the following two methods.
Method 1: evaluate the importance of each feature variable with a random forest model. As shown in fig. 2(a), the importance indices of the 10 feature variables differ markedly: the 7th and 10th feature variables (the hub-height wind speed and the historical wind power) are clearly more important and are classified as key feature variables, while the other feature variables are removed as redundant features.
Method 2: take each single feature variable in turn as the input and train a random forest model, yielding 10 prediction-error curves. As shown in fig. 2(b), from the grouping of the error curves, the models trained on the 7th and 10th feature variables (the hub-height wind speed and the historical wind power) have small overall prediction errors, so these are classified as key feature variables and the others are removed as redundant features.
The random forest model is then trained twice, once with all 10 feature variables and once with only the 2 key feature variables as inputs. Comparison shows that the key-feature-selection step reduces the dimensionality of the massive multi-source historical data and greatly improves the training efficiency of the model.
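As an illustration of Method 1, the sketch below ranks candidate features with a random forest's impurity-based importance, using scikit-learn. The synthetic data, in which power is driven by the hub-height wind speed and the lagged (historical) power, and the feature names are assumptions for the example, not the patent's data:

```python
# Illustrative sketch (not the patent's code): rank the 10 candidate
# feature variables by random-forest importance, as in Method 1.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["temperature", "pressure", "humidity", "wind_direction",
                 "ws_10m", "ws_30m", "ws_50m_hub", "ws_70m", "ws_100m",
                 "hist_power"]

# Synthetic stand-in for wind-farm history: power driven mainly by the
# hub-height wind speed (index 6) and the historical power (index 9).
n = 1000
X = rng.normal(size=(n, 10))
y = 0.8 * X[:, 6] + 0.5 * X[:, 9] + 0.05 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda t: -t[1])
for name, imp in ranked[:2]:          # expect ws_50m_hub and hist_power on top
    print(f"{name}: {imp:.3f}")
```

The redundant features can then be dropped by keeping only the top-ranked columns before training the prediction model.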
1.2 Affinity sample screening
Besides removing redundant feature variables, an affinity sample set strongly related to the prediction target must be screened from the massive history of the key feature variables. The calculation proceeds in two steps:
a) Construction of daily samples: convert the wind-farm historical data (including wind power P and wind speed V) into daily data samples {P_1, P_2, ..., P_N} and {V_1, V_2, ..., V_N} (if the history spans 1 year, N = 365), which satisfies the requirement of day-ahead wind power prediction.
b) Screening affinity samples: after normalizing the daily wind power data samples, compute the association-degree index between each day and the prediction-day reference sample, sort the days in descending order of the index, and screen out the first 2M strongly correlated samples (M = 20) as the affinity sample set {P_M1, P_M2, ..., P_MM} and {V_M1, V_M2, ..., V_MM}; the input sample set of the random forest model is then {P_M1, P_M2, ..., P_MM, V_M1, V_M2, ..., V_MM}. Considering the generalized correlation between the input features and the prediction target, association indices such as mutual information are proposed.
To verify the effect of affinity-sample screening on the random forest model, wind power prediction is carried out with models trained on all samples and on the affinity samples respectively. After affinity-sample screening, both the prediction error and the training time of the random forest model decrease.
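A minimal sketch of this screening step: normalize each daily curve, score its association with the prediction-day reference curve, and keep the top 2M days (M = 20). The synthetic daily curves and the use of scikit-learn's `mutual_info_regression` as the association index are assumptions for illustration:

```python
# Hedged sketch of affinity-sample screening via a mutual-information
# association degree; data are synthetic, not wind-farm measurements.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
N, T, M = 365, 96, 20                 # days of history, points per day, M = 20

ref = np.sin(np.linspace(0, 2 * np.pi, T))        # prediction-day reference curve
days = rng.normal(size=(N, T))                     # daily power curves (synthetic)
days[:50] = ref + 0.3 * rng.normal(size=(50, T))   # 50 days resemble the reference

def normalize(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

# Mutual-information association degree of each day with the reference day.
scores = np.array([
    mutral := 0  # placeholder removed below
    for _ in ()
]) if False else np.array([
    mutual_info_regression(normalize(d).reshape(-1, 1),
                           normalize(ref), random_state=0)[0]
    for d in days
])

affinity_idx = np.argsort(scores)[::-1][:2 * M]    # top 2M = 40 closest days
print(len(affinity_idx), "affinity days selected")
```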
Step 2: wind power prediction based on random forest improvement model
2.1 basic flow of random forests
Random forest (RF) is an ensemble learning method with a parallel combination structure, built on the Bagging method and random subspace theory. As shown in fig. 3, the calculation process of the random forest is as follows:
1) apply with-replacement random resampling (the Bagging method) to the original sample set to obtain several training sample subsets, each used to train one decision tree; the out-of-bag (OOB) sample set gives an unbiased estimate of the generalization error of the random forest model.
Assuming the original training sample set contains N samples, the probability that a given sample is never drawn is
P(x ∉ S_i) = (1 − 1/N)^N ≈ e^{−1} ≈ 0.368
where x is a training sample and S_i is the i-th training sample subset. The formula shows that about 36.8% of the samples in the original sample set do not appear among the training samples; the samples that are not drawn form the out-of-bag sample set (OOB), which is used to estimate the generalization error of the random forest model and is comparable to k-fold cross-validation;
2) based on random subspace theory, randomly select a subset of the feature variables of the training samples to participate in training each decision tree, splitting the training samples from the top down until the set leaf-node size is reached;
3) repeat steps 1) and 2) T times to obtain T decision trees, and combine the T decision trees into the random forest model. Ensemble learning theory shows that, provided each decision tree's error is below 50%, the overall error of the random forest decreases as the number of decision trees grows and finally approaches a relatively stable lower bound.
4) apply the trained random forest to the test sample set and combine the predictions of the decision trees according to an integration rule to obtain the random forest's prediction. For classification problems the integration rule is mainly voting; for regression and prediction problems it is mainly averaging.
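Steps 1)–4) can be sketched as follows, with bootstrap resampling (Bagging), a random feature subset per split (random subspace, via `max_features`), and the averaging integration rule for regression. The data, tree count and hyperparameters are illustrative assumptions, not the patent's settings:

```python
# Minimal bagging + random-subspace ensemble with the averaging rule.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))               # e.g. [hub wind speed, hist. power]
y = 0.9 * X[:, 0] + 0.4 * X[:, 1]

T = 50
trees = []
for t in range(T):
    Xb, yb = resample(X, y, random_state=t)            # 1) with-replacement resample
    tree = DecisionTreeRegressor(max_features=1,        # 2) random feature per split
                                 min_samples_leaf=5,
                                 random_state=t).fit(Xb, yb)
    trees.append(tree)                                  # 3) collect T trees

X_test = rng.normal(size=(10, 2))
y_pred = np.mean([tr.predict(X_test) for tr in trees], axis=0)  # 4) averaging rule
print(y_pred.shape)
```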
2.2 random forest improvement model
A random forest model integrates a large number of decision trees and involves many feature variables, which reduces model interpretability; in particular, the double randomness that is introduced makes the random forest behave like a "black box": its internal optimization process lacks observability and controllability, its physical interpretability is poor, and its reliability remains to be assessed. Moreover, the OOB samples are drawn from the training sample set and follow the same distribution as the training samples, so they can hardly go beyond the inherent characteristics of the training set; generalization-error evaluation based on OOB samples is therefore still an internal validation. In the decision-tree performance-evaluation stage, new samples should be added and an external inspection index introduced to improve the model's generalization to unknown samples. Accordingly, an improved random forest model based on an external inspection index and decision-tree recombination is proposed:
adding links of screening and recombining decision trees
The Bagging random-resampling strategy strengthens the independence of the sub-models and improves the generalization capability of the random forest model. The Bagging method is modified according to random subspace theory so as to increase the proportion of out-of-bag (OOB) samples. Based on the idea of selective ensemble learning, a decision-tree screening and recombination stage is added to the random forest model: each trained decision tree is evaluated for prediction performance, and trees with poor prediction performance are removed, thereby weakening the adverse effect of spurious samples on random forest training;
external inspection index based on NWP wind speed
The OOB error is an unbiased estimate of the model generalization error, but it can only estimate the generalization error with respect to the training samples and therefore remains an internal inspection index. Because the prediction-day samples of day-ahead wind power differ greatly from the training samples, the generalization error for the prediction-day samples estimated by the OOB error index is invalid.
the Transfer Learning (TL) method is adopted, which is helpful to improve the generalization ability of the model, namely: the method comprises the steps of improving the characteristic similarity between a training sample and a Prediction sample to improve the model mobility from a training domain to a target domain, and therefore, in a decision tree screening link, providing an external inspection index of a reference Numerical Weather forecast (NWP), namely, carrying out association degree analysis on a Prediction result of each decision tree and the NWP wind speed of a Prediction day, screening out a decision tree subset strongly related to the wind speed of the Prediction day according to the association degree index, and further recombining a new random forest to enhance the generalization capability of the random forest on a Prediction set.
1. Example analysis:
Assume the original random forest contains 100 decision trees and the external inspection index (degree of association) of each decision tree varies from 0.04 to 0.07, as shown in fig. 4(a). If all the decision trees are directly combined into a random forest, the generalization error of the model is about 0.02, as shown in fig. 4(b), and when the number of decision trees is small the generalization error is unstable. To address this, all decision trees are arranged in descending order of their association index, as shown in fig. 4(c); the generalization-error curve of the random forest model formed in this order is shown in fig. 4(d). It can be seen that the generalization error first drops rapidly to below 0.01, but as decision trees with smaller association indexes are added, the generalization error rises again, finally reaching about 0.02. Therefore, simply increasing the number of decision trees does not necessarily reduce the generalization error of the random forest, because the generalization error is also closely related to the external inspection index (degree of association) of each decision tree.
Therefore, the invention adds a decision-tree recombination stage to the random forest, aiming to select the decision trees with better external inspection indexes to participate in the subsequent random forest prediction. As can be seen from fig. 4(c), when a new random forest is formed by sorting and recombining the decision trees, its generalization error first decreases and then increases, so a turning point exists; if the number of decision trees at this turning point is taken as the new random forest scale, the resulting generalization error should be minimal. Accordingly, the decision trees whose external inspection index exceeds the average value are selected, as shown in fig. 4(e), and recombined into a new random forest (the number of decision trees is adjusted from 100 to 50). Wind power prediction verification then shows that the prediction error of the new random forest is greatly reduced and the prediction-error oscillation of the original random forest is eliminated, as shown in fig. 4(f).
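The above-mean selection rule can be sketched as follows, with hypothetical indexes drawn uniformly from the 0.04-0.07 band of fig. 4(a); for a roughly symmetric distribution, about half of the 100 trees survive, consistent with the 100 to 50 adjustment:

```python
import random

random.seed(1)
# Hypothetical external-inspection indexes for 100 trained decision trees.
indexes = [random.uniform(0.04, 0.07) for _ in range(100)]

# Keep only the trees whose index exceeds the forest-wide average.
threshold = sum(indexes) / len(indexes)
new_forest = [i for i, v in enumerate(indexes) if v > threshold]
print(len(new_forest))  # roughly half of the trees survive
```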
2. Anti-interference capability analysis:
the actually measured wind power historical data usually contains a large amount of abandoned wind data, and the long-time, large-amplitude and impact step change can occur, so that the historical data is seriously distorted, the fluctuation characteristic extraction of the historical data is seriously influenced, and the prediction error is larger. As can be seen from fig. 5(a) and 5 (b): if wind curtailment data (such as 1: 00-3: 00 in the morning) occurs in the valley period, the adverse effect on the prediction result of the BP neural network model is large; as can be seen from fig. 6(a) and 6 (b): if wind abandon data occurs in the valley period, the influence on the prediction result of the random forest model is small; comparing fig. 5(b) and fig. 6(b), it can be seen that the stochastic forest model has a stronger interference rejection capability than the BP neural network model in consideration of the curtailment data. The random forest model carries out random resampling on the training samples, dependence on original sample characteristics can be weakened, particularly, the random forest improvement model has stronger tolerance on influence of abandoned wind data after a Bagging random sub-sampling algorithm is adopted, and the prediction accuracy of the random forest model is less influenced by the abandoned wind data.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (1)

1. A short-term wind power prediction method based on two-stage feature selection and a random forest improvement model, characterized by comprising the following steps:
step 1: training sample screening based on two-stage feature selection
1.1 Key feature selection
The historical data of the wind power plant consist of 10 feature variables: air temperature, air pressure, humidity, wind direction, 10 m wind speed, 30 m wind speed, 50 m hub-height wind speed, 70 m wind speed, 100 m wind speed, and historical wind power. The following two methods are adopted to evaluate the importance of these 10 feature variables:
the method comprises the following steps: carrying out importance evaluation on each characteristic variable by using a random forest model, and finding out that importance indexes of the 10 characteristic variables are different, wherein the wind speed of a hub with the length of 50m and the wind power historical power have high importance and can be classified as key characteristic variables, and other characteristic variables are removed as redundant characteristics;
the method 2 comprises the following steps: respectively taking a single characteristic variable as input, training a random forest model to obtain 10 prediction error curves corresponding to the 10 characteristic variables, finding that the model prediction error trained by the hub wind speed and the wind power historical power of 50m is small in whole and can be classified as a key characteristic variable according to the grouping condition of the error curves, and rejecting other characteristic variables as redundant characteristics;
1.2 Close sample screening
Besides removing the redundant feature variables, a close sample set strongly related to the prediction target needs to be screened from the massive historical data of the key feature variables. The calculation process is divided into two steps:
a) constructing daily samples: the historical data of the wind power plant, including wind power P and wind speed V, are converted into daily data samples {P_1, P_2, ..., P_N} and {V_1, V_2, ..., V_N}; if the length of the historical data is 1 year, N = 365, meeting the requirement of day-ahead wind power prediction;
b) screening close samples: after the daily wind power data samples are normalized, the association-degree index between each daily sample and the prediction-day reference sample is calculated; the samples are sorted in descending order of association degree, and the first 2M strongly related samples are screened out, with M set to 20, as the close sample set {P_M1, P_M2, ..., P_M2M} and {V_M1, V_M2, ..., V_M2M}; the input sample set of the random forest model is then {P_M1, P_M2, ..., P_M2M, V_M1, V_M2, ..., V_M2M}; considering the generalized correlation between the input features and the prediction target, a mutual-information correlation index is adopted;
step 2: wind power prediction based on random forest improvement model
2.1 basic flow of random forests
Random Forest (RF) is an ensemble learning method with a parallel combination mode, combining the Bagging method and the random subspace theory. The calculation process of the random forest is:
1) random resampling with replacement is performed on the original sample set using the Bagging method to obtain multiple training sample subsets, which are used respectively to train each decision tree; the Out-of-Bag (OOB) sample set provides an unbiased estimate of the generalization error of the random forest model;
assuming that the total number of samples in the original training sample set S is N, the probability that a given sample is never drawn is:

P(x ∉ S_i) = (1 - 1/N)^N ≈ e^(-1) ≈ 0.368

where x is a training sample, P is the probability that x falls into the out-of-bag sample set, and S_i is the ith training sample subset;
the formula shows that about 36.8% of the samples in the original sample set do not appear among the training samples; the samples that are never drawn form the out-of-bag sample set OOB, which is used to estimate the generalization error of the random forest model and is equivalent to a k-fold cross-validation process;
2) based on the random subspace theory, a subset of the feature variables of the training samples is randomly selected to participate in the training of each decision tree, and the tree branches from top to bottom on the training samples until the set leaf-node size is reached;
3) steps 1) and 2) are repeated T times to obtain T decision trees, which are combined to form the random forest model; ensemble learning theory proves that, provided the error of each decision tree is below 50%, the overall error of the random forest decreases as the number of decision trees increases and finally tends to a relatively stable lower bound;
4) the trained random forest is applied to the test sample set, and the prediction results of the individual decision trees are combined according to an integration rule to obtain the random forest prediction value; for classification problems the integration rule is mainly a voting method, and for regression and prediction problems it is mainly an averaging method;
2.2 random forest improvement model
The random forest model integrates a large number of decision trees and involves many feature variables, which reduces model interpretability. In particular, the double randomness that is introduced makes the random forest behave like a 'black box' model: its internal optimization process lacks observability and controllability, its physical interpretability is poor, and its reliability remains to be evaluated. In addition, the OOB (out-of-bag) samples are drawn from the training sample set and follow the same distribution as the training samples, so they can hardly exceed the inherent characteristics of the training set; generalization-error evaluation based on the OOB samples therefore still belongs to an internal verification process. In the decision-tree performance evaluation stage, new samples should be added and external verification indexes introduced to improve the generalization capability of the random forest model to unknown samples. Therefore, a random forest improvement model based on external inspection indexes and decision-tree recombination is proposed:
adding links of screening and recombining decision trees
The Bagging random-resampling strategy helps enhance the independence of the sub-models and improve the generalization capability of the random forest model. Following the random subspace theory, the Bagging method is replaced with a random sub-sampling strategy so as to raise the proportion of out-of-bag (OOB) samples. Based on the selective ensemble learning idea, a decision-tree screening and recombination stage is added to the random forest model, so that each trained decision tree is evaluated for prediction performance and trees with poor prediction performance are removed, weakening the adverse effect of spurious samples on random forest training;
external inspection index based on NWP wind speed
The OOB error is an unbiased estimate of the model generalization error, but it can only estimate the generalization error with respect to the training samples and therefore remains an internal inspection index. Because the prediction-day samples of day-ahead wind power differ greatly from the training samples, the generalization error for the prediction-day samples estimated by the OOB error index will fail.
the Transfer Learning (TL) method is adopted, which is helpful to improve the generalization ability of the model, namely: the characteristic similarity between the training samples and the Prediction samples is improved to improve the model mobility from a training domain to a target domain, therefore, in a decision tree screening link, an external inspection index of a reference Numerical Weather forecast (NWP) is provided, namely, the correlation degree analysis is carried out on the Prediction result of each decision tree and the NWP wind speed of a Prediction day, a decision tree subset strongly related to the wind speed of the Prediction day is screened out according to the correlation degree index, and then a new random forest is formed again, so that the generalization capability of the random forest on a Prediction set is enhanced.
CN202210491926.XA 2022-05-05 2022-05-05 Short-term wind power prediction method based on two-stage feature selection and random forest improvement model Pending CN114819369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210491926.XA CN114819369A (en) 2022-05-05 2022-05-05 Short-term wind power prediction method based on two-stage feature selection and random forest improvement model


Publications (1)

Publication Number Publication Date
CN114819369A true CN114819369A (en) 2022-07-29

Family

ID=82510967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210491926.XA Pending CN114819369A (en) 2022-05-05 2022-05-05 Short-term wind power prediction method based on two-stage feature selection and random forest improvement model

Country Status (1)

Country Link
CN (1) CN114819369A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050666A (en) * 2023-03-20 2023-05-02 中国电建集团江西省电力建设有限公司 Photovoltaic power generation power prediction method for irradiation characteristic clustering
CN116975646A (en) * 2023-09-22 2023-10-31 长江三峡集团实业发展(北京)有限公司 Wind element data correction method and device, computer equipment and storage medium
CN117970428A (en) * 2024-04-02 2024-05-03 山东省地质科学研究院 Seismic signal identification method, device and equipment based on random forest algorithm

Similar Documents

Publication Publication Date Title
CN114819369A (en) Short-term wind power prediction method based on two-stage feature selection and random forest improvement model
Wang et al. Short-term wind power prediction based on multidimensional data cleaning and feature reconfiguration
Shi et al. An improved random forest model of short‐term wind‐power forecasting to enhance accuracy, efficiency, and robustness
CN103324980B (en) A kind of method for forecasting
CN110619360A (en) Ultra-short-term wind power prediction method considering historical sample similarity
CN110717610B (en) Wind power prediction method based on data mining
CN112288157A (en) Wind power plant power prediction method based on fuzzy clustering and deep reinforcement learning
CN114021483A (en) Ultra-short-term wind power prediction method based on time domain characteristics and XGboost
Hao et al. Wind power short-term forecasting model based on the hierarchical output power and poisson re-sampling random forest algorithm
CN116341717A (en) Wind speed prediction method based on error compensation
CN115995810A (en) Wind power prediction method and system considering weather fluctuation self-adaptive matching
CN116663393A (en) Random forest-based power distribution network continuous high-temperature fault risk level prediction method
CN110991743A (en) Wind power short-term combination prediction method based on cluster analysis and neural network optimization
CN115481788A (en) Load prediction method and system for phase change energy storage system
Lyu et al. A data-driven solar irradiance forecasting model with minimum data
CN114707684A (en) Improved LSTM-based raw tobacco stack internal temperature prediction algorithm
Ma et al. Short-Term PV Power Prediction Based on FCM-ISSA-LSTM
Wang et al. Research on House Price Forecast Based on Hyper Parameter Optimization Gradient Boosting Regression Model
Wu et al. Prediction of daily precipitation based on deep learning and broad learning techniques
CN117893030B (en) Power system risk early warning method based on big data
CN117688504B (en) Internet of things abnormality detection method and device based on graph structure learning
Jankauskas et al. Short-term wind energy forecasting with advanced recurrent neural network models: a comparative study
Song et al. An improved convolutional neural network-based approach for short-term wind speed forecast
Li et al. Short-term LOAD Forecasting Method of TPA-LSTNet Model Based on Time Series Clustering
Liu et al. Short-term PV power prediction model based on weather feature clustering and Adaboost-GA-BP

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination