CN114819369A - Short-term wind power prediction method based on two-stage feature selection and random forest improvement model - Google Patents

Short-term wind power prediction method based on two-stage feature selection and random forest improvement model

Info

Publication number
CN114819369A
Authority
CN
China
Prior art keywords
random forest
model
samples
prediction
training
Prior art date
Legal status
Pending
Application number
CN202210491926.XA
Other languages
Chinese (zh)
Inventor
史坤鹏
李婷
安军
周毅博
刘座铭
姜旭
郭雷
曲绍杰
蒋宪军
赵亮
Current Assignee
State Grid Jilin Electric Power Corp
Northeast Electric Power University
Original Assignee
Northeast Dianli University
State Grid Jilin Electric Power Corp
Priority date
Filing date
Publication date
Application filed by Northeast Dianli University, State Grid Jilin Electric Power Corp filed Critical Northeast Dianli University
Priority to CN202210491926.XA
Publication of CN114819369A

Classifications

    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/24323 Tree-organised classifiers
    • G06N20/20 Ensemble learning
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G06Q50/06 Energy or water supply
    • H02J3/004 Generation forecast, e.g. methods or systems for forecasting future energy generation


Abstract

The invention relates to the technical field of new-energy power generation, and in particular to a short-term wind power prediction method based on two-stage feature selection and an improved random forest model. The method comprises training-sample screening based on two-stage feature selection and wind power prediction based on an improved random forest model. Following the maximum-relevance, minimum-redundancy principle, key-feature selection and affinity-sample screening are added to the data-preprocessing stage of the training sample set, and the improved random forest model is constructed through training-sample resampling, random feature extraction and decision-tree recombination. To address the out-of-bag sample set's excessive dependence on the characteristics of the training samples, an external inspection index based on the numerical-weather-prediction (NWP) wind speed is proposed, further strengthening the random forest model's adaptability to unknown data. The method improves wind power prediction accuracy and offers high computational efficiency and strong anti-interference capability.

Description

Short-term wind power prediction method based on two-stage feature selection and random forest improvement model
Technical Field
The invention belongs to the technical field of new energy power generation, and particularly relates to a short-term wind power prediction method based on a two-stage feature selection and random forest improvement model.
Background
Short-term wind power prediction is of great significance for optimizing power-dispatch schedules and raising the level of wind power the grid can accommodate. Owing to the nature of wind energy resources, wind power is highly random and weakly regular, and the accuracy and adaptability of existing short-term wind power prediction still need improvement.
Wind power prediction methods are generally based on neural network models and their variants, which often suffer from overfitting during training and insufficient generalization capability. Current improvement methods fall into three categories. The first optimizes the training objective and controls model complexity, i.e. adds a regularization term, slack variable or structural-risk term to the objective or loss (cost) function to relax the training objective, or terminates training early; common examples include Bayesian neural networks, support vector machines (SVM), convolutional neural networks (CNN) in deep learning, and the Dropout technique for pruning network nodes. The second improves the training procedure to increase the diversity of data features, namely: 1) periodically updating training samples with measured data and applying rolling correction to the model parameters; 2) recombining input samples with cross-validation (CV) during training, e.g. k-fold cross-validation, leave-one-out, or holdout validation. In recent years, data augmentation techniques have emerged that repeatedly resample a limited sample set according to prior knowledge to increase the diversity and randomness of sample features, such as the Bagging method, ensemble learning and random forests. The third adopts combined prediction to strengthen model adaptability: by analysing the suitability of different algorithms for different feature components of the prediction target, or at each stage of the prediction process, the model combination is made applicable to multiple prediction scenarios and the prediction results are brought closer to reality.
Taking these improvement measures together, the invention provides a short-term wind power prediction method based on two-stage feature selection and an improved random forest model: 1) affinity samples are randomly resampled with the Bagging method to increase the feature diversity of the training samples, which belongs to the second category of improvement; 2) the random forest is a combined model of many decision trees, and sorting, pruning and recombining the decision trees by performance further improves the generalization capability of the random forest model, which also draws on the first and third categories.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a short-term wind power prediction method based on two-stage feature selection and an improved random forest model that has high prediction accuracy, high computational efficiency and strong anti-interference capability, and that overcomes the overfitting and insufficient generalization capability of existing wind power prediction models.
The technical scheme adopted to solve this problem is a short-term wind power prediction method based on two-stage feature selection and an improved random forest model, characterized in that it comprises the following steps:
step 1: training sample screening based on two-stage feature selection
1.1 Key feature selection
The historical data of the wind farm consist of 10 feature variables: air temperature, air pressure, humidity, wind direction, wind speed at 10 m, 30 m, 50 m (hub height), 70 m and 100 m, and historical wind power. The importance of these 10 feature variables is evaluated with the following two methods.
Method 1: evaluate the importance of each feature variable with a random forest model. The importance indices of the 10 feature variables differ markedly: the 50 m hub-height wind speed and the historical wind power have high importance and are classified as key feature variables, while the other feature variables are removed as redundant features.
Method 2: take each single feature variable in turn as the input and train a random forest model, yielding 10 prediction-error curves corresponding to the 10 feature variables. From the grouping of the error curves, the models trained on the 50 m hub-height wind speed and the historical wind power have small overall prediction errors, so these are classified as key feature variables and the others are removed as redundant features.
1.2 Affinity sample screening
Besides removing redundant feature variables, an affinity sample set strongly related to the prediction target must be screened from the massive history of the key feature variables. The calculation proceeds in two steps:
a) Construction of daily samples: convert the wind-farm historical data, including wind power P and wind speed V, into daily data samples {P_1, P_2, ..., P_N} and {V_1, V_2, ..., V_N}; if the history spans 1 year, N = 365, which satisfies the requirement of day-ahead wind power prediction;
b) Screening affinity samples: after normalizing the daily wind power data samples, compute the association-degree index between each day and the prediction-day reference sample, sort the days in descending order of association degree, and screen out the first 2M strongly correlated samples (M = 20) as the affinity sample set {P_M1, P_M2, ..., P_MM} and {V_M1, V_M2, ..., V_MM}; the input sample set of the random forest model is then {P_M1, P_M2, ..., P_MM, V_M1, V_M2, ..., V_MM}. Considering the generalized correlation between the input features and the prediction target, a mutual-information association index is adopted;
step 2: wind power prediction based on random forest improvement model
2.1 basic flow of random forests
Random forest (RF) is an ensemble learning method with a parallel combination structure, built on the Bagging method and random subspace theory. The calculation process of the random forest is as follows:
1) apply with-replacement random resampling (the Bagging method) to the original sample set to obtain several training sample subsets, each used to train one decision tree; the out-of-bag (OOB) sample set gives an unbiased estimate of the generalization error of the random forest model.
Assuming the original training sample set contains N samples, the probability that a given sample is never drawn is
P(x ∉ S_i) = (1 − 1/N)^N ≈ e^{−1} ≈ 0.368
where x is a training sample and S_i is the i-th training sample subset.
The formula shows that about 36.8% of the samples in the original sample set do not appear among the training samples; the samples that are not drawn form the out-of-bag sample set (OOB), which is used to estimate the generalization error of the random forest model and is comparable to k-fold cross-validation;
2) based on random subspace theory, randomly select a subset of the feature variables of the training samples to participate in training each decision tree, splitting the training samples from the top down until the set leaf-node size is reached;
3) repeat steps 1) and 2) T times to train T decision trees, and combine the T decision trees into the random forest model. Ensemble learning theory shows that, provided each decision tree's error is below 50%, the overall error of the random forest decreases as the number of decision trees grows and finally approaches a relatively stable lower bound;
4) apply the trained random forest to the test sample set and combine the predictions of the decision trees according to an integration rule to obtain the random forest's prediction. For classification problems the integration rule is mainly voting; for regression and prediction problems it is mainly averaging;
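The 36.8% out-of-bag fraction in step 1) follows from the formula above and can be checked numerically; a minimal sketch (not part of the patent):

```python
# Check that the probability of a sample never being drawn in N
# with-replacement draws, (1 - 1/N)^N, approaches e^{-1} ~= 0.368.
import math

for n in (10, 100, 1000, 10_000):
    p_missed = (1 - 1 / n) ** n
    print(f"N={n:>6}: P(not drawn) = {p_missed:.4f}")

print(f"limit  e^-1:  {math.exp(-1):.4f}")
```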
2.2 random forest improvement model
A random forest model integrates a large number of decision trees and involves many feature variables, which reduces model interpretability; in particular, the double randomness that is introduced makes the random forest behave like a "black box": its internal optimization process lacks observability and controllability, its physical interpretability is poor, and its reliability remains to be assessed. Moreover, the OOB samples are drawn from the training sample set and follow the same distribution as the training samples, so they can hardly go beyond the inherent characteristics of the training set; generalization-error evaluation based on OOB samples is therefore still an internal validation. In the decision-tree performance-evaluation stage, new samples should be added and an external inspection index introduced to improve the model's generalization to unknown samples. Accordingly, an improved random forest model based on an external inspection index and decision-tree recombination is proposed:
adding links of screening and recombining decision trees
The Bagging random-resampling strategy strengthens the independence of the sub-models and improves the generalization capability of the random forest model. The Bagging method is modified according to random subspace theory so as to increase the proportion of out-of-bag (OOB) samples. Based on the idea of selective ensemble learning, a decision-tree screening and recombination stage is added to the random forest model: each trained decision tree is evaluated for prediction performance, and trees with poor prediction performance are removed, thereby weakening the adverse effect of spurious samples on random forest training;
external inspection index based on NWP wind speed
The OOB error is an unbiased estimate of the model generalization error, but it can only estimate the generalization error associated with the training samples and thus remains an internal inspection index. Because the prediction-day samples of day-ahead wind power differ greatly from the training samples, the generalization error for prediction-day samples estimated from the OOB error index is invalid.
Transfer learning (TL) helps to improve model generalization, namely by increasing the feature similarity between training and prediction samples to improve model transferability from the training domain to the target domain. Therefore, in the decision-tree screening stage, an external inspection index referenced to the numerical weather prediction (NWP) is proposed: the prediction result of each decision tree is correlated with the NWP wind speed of the prediction day, a subset of decision trees strongly related to the prediction-day wind speed is screened out according to the association-degree index, and a new random forest is recombined from them to enhance the random forest's generalization on the prediction set.
Through the above design, the invention achieves the following beneficial effects:
1) after the decision trees of the random forest model are sorted, the prediction error first decreases and then increases, with an inflection point; adding the decision-tree screening and recombination stage therefore yields a random forest with smaller prediction error and lower training cost;
2) compared with the original OOB error index, the external inspection index based on the NWP wind-speed feature further improves the generalization capability of the random forest model and reduces the prediction error;
3) the method is scientific and reasonable, widely applicable, computationally efficient and robust to interference.
Drawings
FIG. 1 is a schematic view of the flow structure of the present invention;
FIG. 2(a) is a schematic diagram of the importance evaluation of all feature variables;
FIG. 2(b) is a schematic diagram of OOB error indicator estimation for each feature variable;
FIG. 3 is a basic flow diagram of a random forest;
FIG. 4(a) is a schematic diagram of the relevance index of each decision tree in a random forest;
FIG. 4(b) is a diagram illustrating the generalized error variation of the original random forest model;
FIG. 4(c) is a schematic diagram illustrating descending order of relevance indicators of each decision tree;
FIG. 4(d) is a schematic diagram of the generalized error variation of the random forest model after descending order arrangement;
FIG. 4(e) is a schematic diagram of a decision tree with a selected relevancy indicator greater than the average;
FIG. 4(f) is a schematic diagram of the generalized error variation of the random forest model after the decision tree reorganization;
FIG. 5(a) is a diagram illustrating the prediction result of the BP neural network on the training data set;
FIG. 5(b) is a diagram illustrating the prediction result of the BP neural network on the prediction data set;
FIG. 6(a) is a diagram illustrating the prediction results of a random forest model on a training data set;
fig. 6(b) is a schematic diagram of the prediction result of the random forest model on the prediction data set.
Detailed Description
The invention is further described with reference to the following figures and detailed description:
in order to make the public fully understand the technical spirit and the beneficial effects of the invention, the applicant will describe in detail the specific embodiments of the invention with reference to the attached drawings, but the description of the embodiments by the applicant is not a limitation of the technical solution, and any changes made in the form of the inventive concept rather than the essential change should be regarded as the protection scope of the invention.
Referring to fig. 1, the short-term wind power prediction method based on two-stage feature selection and an improved random forest model comprises the following steps:
step 1: training sample screening based on two-stage feature selection
1.1 Key feature selection
The data sources for wind power prediction mainly include: historical generated-power data of the wind farm in recent years, historical meteorological measurements from the wind-measurement tower, and NWP data for the coming days. A model training sample generally consists of 10 feature variables: air temperature, air pressure, humidity, wind direction, wind speed at 10 m, 30 m, 50 m (hub height), 70 m and 100 m, and historical wind power. The importance of these 10 feature variables is evaluated with the following two methods.
Method 1: evaluate the importance of each feature variable with a random forest model. As shown in fig. 2(a), the importance indices of the 10 feature variables differ markedly: the 7th and 10th feature variables (the hub-height wind speed and the historical wind power) are clearly more important and are classified as key feature variables, while the other feature variables are removed as redundant features.
Method 2: take each single feature variable in turn as the input and train a random forest model, yielding 10 prediction-error curves. As shown in fig. 2(b), from the grouping of the error curves, the models trained on the 7th and 10th feature variables (the hub-height wind speed and the historical wind power) have small overall prediction errors, so these are classified as key feature variables and the others are removed as redundant features.
The random forest model is then trained twice, once with all 10 feature variables and once with only the 2 key feature variables as inputs. Comparison shows that the key-feature-selection step reduces the dimensionality of the massive multi-source historical data and greatly improves the training efficiency of the model.
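As an illustration of Method 1, the sketch below ranks candidate features with a random forest's impurity-based importance, using scikit-learn. The synthetic data, in which power is driven by the hub-height wind speed and the lagged (historical) power, and the feature names are assumptions for the example, not the patent's data:

```python
# Illustrative sketch (not the patent's code): rank the 10 candidate
# feature variables by random-forest importance, as in Method 1.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["temperature", "pressure", "humidity", "wind_direction",
                 "ws_10m", "ws_30m", "ws_50m_hub", "ws_70m", "ws_100m",
                 "hist_power"]

# Synthetic stand-in for wind-farm history: power driven mainly by the
# hub-height wind speed (index 6) and the historical power (index 9).
n = 1000
X = rng.normal(size=(n, 10))
y = 0.8 * X[:, 6] + 0.5 * X[:, 9] + 0.05 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda t: -t[1])
for name, imp in ranked[:2]:          # expect ws_50m_hub and hist_power on top
    print(f"{name}: {imp:.3f}")
```

The redundant features can then be dropped by keeping only the top-ranked columns before training the prediction model.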
1.2 Affinity sample screening
Besides removing redundant feature variables, an affinity sample set strongly related to the prediction target must be screened from the massive history of the key feature variables. The calculation proceeds in two steps:
a) Construction of daily samples: convert the wind-farm historical data (including wind power P and wind speed V) into daily data samples {P_1, P_2, ..., P_N} and {V_1, V_2, ..., V_N} (if the history spans 1 year, N = 365), which satisfies the requirement of day-ahead wind power prediction.
b) Screening affinity samples: after normalizing the daily wind power data samples, compute the association-degree index between each day and the prediction-day reference sample, sort the days in descending order of the index, and screen out the first 2M strongly correlated samples (M = 20) as the affinity sample set {P_M1, P_M2, ..., P_MM} and {V_M1, V_M2, ..., V_MM}; the input sample set of the random forest model is then {P_M1, P_M2, ..., P_MM, V_M1, V_M2, ..., V_MM}. Considering the generalized correlation between the input features and the prediction target, association indices such as mutual information are proposed.
To verify the effect of affinity-sample screening on the random forest model, wind power prediction is carried out with models trained on all samples and on the affinity samples respectively. After affinity-sample screening, both the prediction error and the training time of the random forest model decrease.
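A minimal sketch of this screening step: normalize each daily curve, score its association with the prediction-day reference curve, and keep the top 2M days (M = 20). The synthetic daily curves and the use of scikit-learn's `mutual_info_regression` as the association index are assumptions for illustration:

```python
# Hedged sketch of affinity-sample screening via a mutual-information
# association degree; data are synthetic, not wind-farm measurements.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
N, T, M = 365, 96, 20                 # days of history, points per day, M = 20

ref = np.sin(np.linspace(0, 2 * np.pi, T))        # prediction-day reference curve
days = rng.normal(size=(N, T))                     # daily power curves (synthetic)
days[:50] = ref + 0.3 * rng.normal(size=(50, T))   # 50 days resemble the reference

def normalize(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

# Mutual-information association degree of each day with the reference day.
scores = np.array([
    mutral := 0  # placeholder removed below
    for _ in ()
]) if False else np.array([
    mutual_info_regression(normalize(d).reshape(-1, 1),
                           normalize(ref), random_state=0)[0]
    for d in days
])

affinity_idx = np.argsort(scores)[::-1][:2 * M]    # top 2M = 40 closest days
print(len(affinity_idx), "affinity days selected")
```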
Step 2: wind power prediction based on random forest improvement model
2.1 basic flow of random forests
Random forest (RF) is an ensemble learning method with a parallel combination structure, built on the Bagging method and random subspace theory. As shown in fig. 3, the calculation process of the random forest is as follows:
1) apply with-replacement random resampling (the Bagging method) to the original sample set to obtain several training sample subsets, each used to train one decision tree; the out-of-bag (OOB) sample set gives an unbiased estimate of the generalization error of the random forest model.
Assuming the original training sample set contains N samples, the probability that a given sample is never drawn is
P(x ∉ S_i) = (1 − 1/N)^N ≈ e^{−1} ≈ 0.368
where x is a training sample and S_i is the i-th training sample subset. The formula shows that about 36.8% of the samples in the original sample set do not appear among the training samples; the samples that are not drawn form the out-of-bag sample set (OOB), which is used to estimate the generalization error of the random forest model and is comparable to k-fold cross-validation;
2) based on random subspace theory, randomly select a subset of the feature variables of the training samples to participate in training each decision tree, splitting the training samples from the top down until the set leaf-node size is reached;
3) repeat steps 1) and 2) T times to obtain T decision trees, and combine the T decision trees into the random forest model. Ensemble learning theory shows that, provided each decision tree's error is below 50%, the overall error of the random forest decreases as the number of decision trees grows and finally approaches a relatively stable lower bound.
4) apply the trained random forest to the test sample set and combine the predictions of the decision trees according to an integration rule to obtain the random forest's prediction. For classification problems the integration rule is mainly voting; for regression and prediction problems it is mainly averaging.
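Steps 1)–4) can be sketched as follows, with bootstrap resampling (Bagging), a random feature subset per split (random subspace, via `max_features`), and the averaging integration rule for regression. The data, tree count and hyperparameters are illustrative assumptions, not the patent's settings:

```python
# Minimal bagging + random-subspace ensemble with the averaging rule.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))               # e.g. [hub wind speed, hist. power]
y = 0.9 * X[:, 0] + 0.4 * X[:, 1]

T = 50
trees = []
for t in range(T):
    Xb, yb = resample(X, y, random_state=t)            # 1) with-replacement resample
    tree = DecisionTreeRegressor(max_features=1,        # 2) random feature per split
                                 min_samples_leaf=5,
                                 random_state=t).fit(Xb, yb)
    trees.append(tree)                                  # 3) collect T trees

X_test = rng.normal(size=(10, 2))
y_pred = np.mean([tr.predict(X_test) for tr in trees], axis=0)  # 4) averaging rule
print(y_pred.shape)
```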
2.2 random forest improvement model
A random forest model integrates a large number of decision trees and involves many feature variables, which reduces model interpretability; in particular, the double randomness that is introduced makes the random forest behave like a "black box": its internal optimization process lacks observability and controllability, its physical interpretability is poor, and its reliability remains to be assessed. Moreover, the OOB samples are drawn from the training sample set and follow the same distribution as the training samples, so they can hardly go beyond the inherent characteristics of the training set; generalization-error evaluation based on OOB samples is therefore still an internal validation. In the decision-tree performance-evaluation stage, new samples should be added and an external inspection index introduced to improve the model's generalization to unknown samples. Accordingly, an improved random forest model based on an external inspection index and decision-tree recombination is proposed:
adding links of screening and recombining decision trees
The Bagging random-resampling strategy strengthens the independence of the sub-models and improves the generalization capability of the random forest model. The Bagging method is modified according to random subspace theory so as to increase the proportion of out-of-bag (OOB) samples. Based on the idea of selective ensemble learning, a decision-tree screening and recombination stage is added to the random forest model: each trained decision tree is evaluated for prediction performance, and trees with poor prediction performance are removed, thereby weakening the adverse effect of spurious samples on random forest training;
external inspection index based on NWP wind speed
The OOB error is an unbiased estimate of the model generalization error, but it can only estimate the generalization error with respect to the training samples and therefore remains an internal inspection index. Because the prediction-day samples of day-ahead wind power differ greatly from the training samples, the generalization error for the prediction-day samples estimated by the OOB error index is invalid.
the Transfer Learning (TL) method is adopted, which is helpful to improve the generalization ability of the model, namely: the method comprises the steps of improving the characteristic similarity between a training sample and a Prediction sample to improve the model mobility from a training domain to a target domain, and therefore, in a decision tree screening link, providing an external inspection index of a reference Numerical Weather forecast (NWP), namely, carrying out association degree analysis on a Prediction result of each decision tree and the NWP wind speed of a Prediction day, screening out a decision tree subset strongly related to the wind speed of the Prediction day according to the association degree index, and further recombining a new random forest to enhance the generalization capability of the random forest on a Prediction set.
1. Example analysis:
Assume the original random forest contains 100 decision trees and the external inspection index (degree of association) of each decision tree varies from 0.04 to 0.07, as shown in fig. 4(a). If all the decision trees are directly combined into a random forest, the generalization error of the model is about 0.02, as shown in fig. 4(b), and when the number of decision trees is small the generalization error is unstable. To address this, all decision trees are arranged in descending order of their association index, as shown in fig. 4(c); the generalization-error curve of the random forest model formed in this order is shown in fig. 4(d). It can be seen that the generalization error first drops rapidly to below 0.01, but as decision trees with smaller association indexes are added, the generalization error rises again, finally reaching about 0.02. Therefore, simply increasing the number of decision trees does not necessarily reduce the generalization error of the random forest, because the generalization error is also closely related to the external inspection index (degree of association) of each decision tree.
Therefore, the invention adds a decision-tree recombination stage to the random forest, aiming to select the decision trees with better external inspection indexes to participate in the subsequent random forest prediction. As can be seen from fig. 4(c), when a new random forest is formed by sorting and recombining the decision trees, its generalization error first decreases and then increases, so a turning point exists; if the number of decision trees at this turning point is taken as the new random forest scale, the resulting generalization error should be minimal. Accordingly, the decision trees whose external inspection index exceeds the average value are selected, as shown in fig. 4(e), and recombined into a new random forest (the number of decision trees is adjusted from 100 to 50). Wind power prediction verification then shows that the prediction error of the new random forest is greatly reduced and the prediction-error oscillation of the original random forest is eliminated, as shown in fig. 4(f).
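The above-mean selection rule can be sketched as follows, with hypothetical indexes drawn uniformly from the 0.04-0.07 band of fig. 4(a); for a roughly symmetric distribution, about half of the 100 trees survive, consistent with the 100 to 50 adjustment:

```python
import random

random.seed(1)
# Hypothetical external-inspection indexes for 100 trained decision trees.
indexes = [random.uniform(0.04, 0.07) for _ in range(100)]

# Keep only the trees whose index exceeds the forest-wide average.
threshold = sum(indexes) / len(indexes)
new_forest = [i for i, v in enumerate(indexes) if v > threshold]
print(len(new_forest))  # roughly half of the trees survive
```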
2. Anti-interference capability analysis:
the actually measured wind power historical data usually contains a large amount of abandoned wind data, and the long-time, large-amplitude and impact step change can occur, so that the historical data is seriously distorted, the fluctuation characteristic extraction of the historical data is seriously influenced, and the prediction error is larger. As can be seen from fig. 5(a) and 5 (b): if wind curtailment data (such as 1: 00-3: 00 in the morning) occurs in the valley period, the adverse effect on the prediction result of the BP neural network model is large; as can be seen from fig. 6(a) and 6 (b): if wind abandon data occurs in the valley period, the influence on the prediction result of the random forest model is small; comparing fig. 5(b) and fig. 6(b), it can be seen that the stochastic forest model has a stronger interference rejection capability than the BP neural network model in consideration of the curtailment data. The random forest model carries out random resampling on the training samples, dependence on original sample characteristics can be weakened, particularly, the random forest improvement model has stronger tolerance on influence of abandoned wind data after a Bagging random sub-sampling algorithm is adopted, and the prediction accuracy of the random forest model is less influenced by the abandoned wind data.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (1)

1. A short-term wind power prediction method based on two-stage feature selection and a random forest improvement model, characterized by comprising the following steps:
step 1: training sample screening based on two-stage feature selection
1.1 Key feature selection
The historical data of the wind power plant consist of 10 feature variables: air temperature, air pressure, humidity, wind direction, 10 m wind speed, 30 m wind speed, 50 m hub-height wind speed, 70 m wind speed, 100 m wind speed, and historical wind power. The following two methods are adopted to evaluate the importance of these 10 feature variables:
the method comprises the following steps: carrying out importance evaluation on each characteristic variable by using a random forest model, and finding out that importance indexes of the 10 characteristic variables are different, wherein the wind speed of a hub with the length of 50m and the wind power historical power have high importance and can be classified as key characteristic variables, and other characteristic variables are removed as redundant characteristics;
the method 2 comprises the following steps: respectively taking a single characteristic variable as input, training a random forest model to obtain 10 prediction error curves corresponding to the 10 characteristic variables, finding that the model prediction error trained by the hub wind speed and the wind power historical power of 50m is small in whole and can be classified as a key characteristic variable according to the grouping condition of the error curves, and rejecting other characteristic variables as redundant characteristics;
1.2 Close sample screening
Besides removing the redundant feature variables, a close sample set strongly related to the prediction target needs to be screened from the massive historical data of the key feature variables. The calculation process is divided into two steps:
a) constructing daily samples: the historical data of the wind power plant, including wind power P and wind speed V, are converted into daily data samples {P_1, P_2, ..., P_N} and {V_1, V_2, ..., V_N}; if the length of the historical data is 1 year, N = 365, meeting the requirement of day-ahead wind power prediction;
b) screening close samples: after the daily wind power data samples are normalized, the association-degree index between each daily sample and the prediction-day reference sample is calculated; the samples are sorted in descending order of association degree, and the first 2M strongly related samples are screened out, with M set to 20, as the close sample set {P_M1, P_M2, ..., P_M2M} and {V_M1, V_M2, ..., V_M2M}; the input sample set of the random forest model is then {P_M1, P_M2, ..., P_M2M, V_M1, V_M2, ..., V_M2M}; considering the generalized correlation between the input features and the prediction target, a mutual-information correlation index is adopted;
step 2: wind power prediction based on random forest improvement model
2.1 basic flow of random forests
Random Forest (RF) is an ensemble learning method with a parallel combination mode, combining the Bagging method and the random subspace theory. The calculation process of the random forest is:
1) random resampling with replacement is performed on the original sample set using the Bagging method to obtain multiple training sample subsets, which are used respectively to train each decision tree; the Out-of-Bag (OOB) sample set provides an unbiased estimate of the generalization error of the random forest model;
assuming that the total number of samples in the original training sample set S is N, the probability that a given sample is never drawn is:

P(x ∉ S_i) = (1 - 1/N)^N ≈ e^(-1) ≈ 0.368

where x is a training sample, P is the probability that x falls into the out-of-bag sample set, and S_i is the ith training sample subset;
the formula shows that about 36.8% of the samples in the original sample set do not appear among the training samples; the samples that are never drawn form the out-of-bag sample set OOB, which is used to estimate the generalization error of the random forest model and is equivalent to a k-fold cross-validation process;
2) based on the random subspace theory, a subset of the feature variables of the training samples is randomly selected to participate in the training of each decision tree, and the tree branches from top to bottom on the training samples until the set leaf-node size is reached;
3) steps 1) and 2) are repeated T times to obtain T decision trees, which are combined to form the random forest model; ensemble learning theory proves that, provided the error of each decision tree is below 50%, the overall error of the random forest decreases as the number of decision trees increases and finally tends to a relatively stable lower bound;
4) the trained random forest is applied to the test sample set, and the prediction results of the individual decision trees are combined according to an integration rule to obtain the random forest prediction value; for classification problems the integration rule is mainly a voting method, and for regression and prediction problems it is mainly an averaging method;
2.2 random forest improvement model
The random forest model integrates a large number of decision trees and involves many feature variables, which reduces model interpretability. In particular, the double randomness that is introduced makes the random forest behave like a 'black box' model: its internal optimization process lacks observability and controllability, its physical interpretability is poor, and its reliability remains to be evaluated. In addition, the OOB (out-of-bag) samples are drawn from the training sample set and follow the same distribution as the training samples, so they can hardly exceed the inherent characteristics of the training set; generalization-error evaluation based on the OOB samples therefore still belongs to an internal verification process. In the decision-tree performance evaluation stage, new samples should be added and external verification indexes introduced to improve the generalization capability of the random forest model to unknown samples. Therefore, a random forest improvement model based on external inspection indexes and decision-tree recombination is proposed:
adding links of screening and recombining decision trees
The Bagging random-resampling strategy helps enhance the independence of the sub-models and improve the generalization capability of the random forest model. Following the random subspace theory, the Bagging method is replaced with a random sub-sampling strategy so as to raise the proportion of out-of-bag (OOB) samples. Based on the selective ensemble learning idea, a decision-tree screening and recombination stage is added to the random forest model, so that each trained decision tree is evaluated for prediction performance and trees with poor prediction performance are removed, weakening the adverse effect of spurious samples on random forest training;
external inspection index based on NWP wind speed
The OOB error is an unbiased estimate of the model generalization error, but it can only estimate the generalization error with respect to the training samples and therefore remains an internal inspection index. Because the prediction-day samples of day-ahead wind power differ greatly from the training samples, the generalization error for the prediction-day samples estimated by the OOB error index will fail.
the Transfer Learning (TL) method is adopted, which is helpful to improve the generalization ability of the model, namely: the characteristic similarity between the training samples and the Prediction samples is improved to improve the model mobility from a training domain to a target domain, therefore, in a decision tree screening link, an external inspection index of a reference Numerical Weather forecast (NWP) is provided, namely, the correlation degree analysis is carried out on the Prediction result of each decision tree and the NWP wind speed of a Prediction day, a decision tree subset strongly related to the wind speed of the Prediction day is screened out according to the correlation degree index, and then a new random forest is formed again, so that the generalization capability of the random forest on a Prediction set is enhanced.
CN202210491926.XA 2022-05-05 2022-05-05 Short-term wind power prediction method based on two-stage feature selection and random forest improvement model Pending CN114819369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210491926.XA CN114819369A (en) 2022-05-05 2022-05-05 Short-term wind power prediction method based on two-stage feature selection and random forest improvement model


Publications (1)

Publication Number Publication Date
CN114819369A true CN114819369A (en) 2022-07-29

Family

ID=82510967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210491926.XA Pending CN114819369A (en) 2022-05-05 2022-05-05 Short-term wind power prediction method based on two-stage feature selection and random forest improvement model

Country Status (1)

Country Link
CN (1) CN114819369A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050666A (en) * 2023-03-20 2023-05-02 中国电建集团江西省电力建设有限公司 Photovoltaic power generation power prediction method for irradiation characteristic clustering
CN116975646A (en) * 2023-09-22 2023-10-31 长江三峡集团实业发展(北京)有限公司 Wind element data correction method and device, computer equipment and storage medium
CN117970428A (en) * 2024-04-02 2024-05-03 山东省地质科学研究院 Seismic signal identification method, device and equipment based on random forest algorithm

Similar Documents

Publication Publication Date Title
CN114819369A (en) Short-term wind power prediction method based on two-stage feature selection and random forest improvement model
Wang et al. Short-term wind power prediction based on multidimensional data cleaning and feature reconfiguration
Shi et al. An improved random forest model of short‐term wind‐power forecasting to enhance accuracy, efficiency, and robustness
CN103324980B (en) A kind of method for forecasting
CN110619360A (en) Ultra-short-term wind power prediction method considering historical sample similarity
CN110717610B (en) Wind power prediction method based on data mining
CN112288157A (en) Wind power plant power prediction method based on fuzzy clustering and deep reinforcement learning
CN114021483A (en) Ultra-short-term wind power prediction method based on time domain characteristics and XGboost
Hao et al. Wind power short-term forecasting model based on the hierarchical output power and poisson re-sampling random forest algorithm
CN116341717A (en) Wind speed prediction method based on error compensation
CN115995810A (en) Wind power prediction method and system considering weather fluctuation self-adaptive matching
CN116663393A (en) Random forest-based power distribution network continuous high-temperature fault risk level prediction method
CN110991743A (en) Wind power short-term combination prediction method based on cluster analysis and neural network optimization
CN115481788A (en) Load prediction method and system for phase change energy storage system
Lyu et al. A data-driven solar irradiance forecasting model with minimum data
CN114707684A (en) Improved LSTM-based raw tobacco stack internal temperature prediction algorithm
Ma et al. Short-Term PV Power Prediction Based on FCM-ISSA-LSTM
Wang et al. Research on House Price Forecast Based on Hyper Parameter Optimization Gradient Boosting Regression Model
Wu et al. Prediction of daily precipitation based on deep learning and broad learning techniques
CN117893030B (en) Power system risk early warning method based on big data
CN117688504B (en) Internet of things abnormality detection method and device based on graph structure learning
Jankauskas et al. Short-term wind energy forecasting with advanced recurrent neural network models: a comparative study
Song et al. An improved convolutional neural network-based approach for short-term wind speed forecast
Li et al. Short-term LOAD Forecasting Method of TPA-LSTNet Model Based on Time Series Clustering
Liu et al. Short-term PV power prediction model based on weather feature clustering and Adaboost-GA-BP

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination