CN117829342A - Coal-fired power plant nitrogen oxide emission prediction method based on improved random forest algorithm

Publication number
CN117829342A
CN117829342A CN202311639882.1A CN202311639882A CN117829342A CN 117829342 A CN117829342 A CN 117829342A CN 202311639882 A CN202311639882 A CN 202311639882A CN 117829342 A CN117829342 A CN 117829342A
Authority
CN
China
Prior art keywords
random forest
data
regression model
mic
power plant
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311639882.1A
Other languages
Chinese (zh)
Inventor
贺凯迅
董朕
蒋瀚
彭鑫
钟麦英
朱延正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Shandong University of Science and Technology
Original Assignee
East China University of Science and Technology
Shandong University of Science and Technology
Application filed by East China University of Science and Technology and Shandong University of Science and Technology
Priority to CN202311639882.1A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
        • G06Q10/04 (under G06Q10/00 Administration; Management) Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
        • G06Q50/06 (under G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism) Energy or water supply
        • G06Q50/26 (under G06Q50/00; G06Q50/10 Services) Government or public services
    • G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F18/10 (under G06F18/00 Pattern recognition) Pre-processing; Data cleansing
        • G06F18/211 (under G06F18/20 Analysing; G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation) Selection of the most significant subset of features
        • G06F18/217 (under G06F18/20 Analysing; G06F18/21) Validation; Performance evaluation; Active pattern learning techniques
        • G06F18/27 (under G06F18/20 Analysing) Regression, e.g. linear or logistic regression
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N20/20 (under G06N20/00 Machine learning) Ensemble learning
        • G06N5/01 (under G06N5/00 Computing arrangements using knowledge-based models) Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Abstract

The invention discloses a method for predicting nitrogen oxide (NOx) emissions of a coal-fired power plant based on an improved random forest algorithm, belonging to the technical field of flue gas denitration in coal-fired power plants. The method comprises: calculating the maximum information coefficient (MIC) between every independent variable and the dependent variable; performing a primary screening of the variables; constructing a random forest regression model and reconstructing it several times; monitoring the prediction performance of the reconstructed model in real time; comparing and evaluating its prediction performance using model performance evaluation indexes; and performing online prediction on real-time data of the power plant flue gas system. The variable-screening thresholds are adjusted automatically using the out-of-bag error of the random forest, which greatly reduces the difficulty of parameter tuning and improves the efficiency of feature selection; combining the random sampling strategy with the random subspace strategy effectively improves the robustness of the model; and the adaptive updating mechanism evaluates the applicability of the model through a monitoring algorithm and effectively guides updating.

Description

Coal-fired power plant nitrogen oxide emission prediction method based on improved random forest algorithm
Technical Field
The invention discloses a prediction method for nitrogen oxide emission of a coal-fired power plant based on an improved random forest algorithm, and belongs to the technical field of flue gas denitration of the coal-fired power plant.
Background
China's energy structure is currently diversifying. Although new energy sources are developing rapidly, their share is still small and their scope of application narrow; thermal power generation remains dominant and so far accounts for nearly 60% of China's annual energy production. Burning pulverized coal in a boiler produces large amounts of harmful substances such as nitrogen oxides (NOx), sulfur dioxide and dust, which pollute the atmosphere and harm human health. Because of the country's stringent environmental policies, NOx emissions from thermal power plants have attracted widespread attention. It is therefore important to find an effective technique that can determine NOx at the denitration reactor inlet online and support optimized control of the combustion process. In production practice, power plants are usually equipped with desulfurization and denitrification units to remove sulfur and NOx from the flue gas. At present, selective catalytic reduction (SCR) is widely used in denitration equipment; it has a simple structure and an environmentally friendly reaction process. Timely and accurate detection of NOx emissions is the key to controlling ammonia injection accurately and improving SCR efficiency. Detection of NOx emissions currently relies mainly on the traditional continuous emission monitoring system (CEMS), which involves a large amount of hardware that is complex to install and commission. In addition, the CEMS works in a harsh environment with strong electromagnetic interference, which causes frequent abnormal operating states and a heavy maintenance workload. Even under normal operating conditions, sampling and detection delays introduce a large time delay into the measurement of NOx emissions.
Disclosure of Invention
The invention aims to provide a method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm, which solves the problem that the existing CEMS cannot provide real-time detection during scheduled maintenance and equipment repair, and which reduces the time delay of NOx concentration detection.
The method for predicting nitrogen oxide emissions of a coal-fired power plant based on the improved random forest algorithm comprises the following steps:
S1. Perform data processing, including data acquisition, removal of abnormal samples, data standardization and data set division;
S2. Calculate the maximum information coefficient (MIC) between each independent variable in X and the dependent variable y;
S3. According to a given initial threshold $v_{mic}$, retain the variables whose MIC is not less than $v_{mic}$, thereby performing the primary screening and obtaining a variable set;
S4. Construct a random forest regression model using the variable set obtained in S3, and calculate the out-of-bag sample error $err_{OOB1}$ of the model;
S5. Gradually increase $v_{mic}$, repeating S3 and S4 until $err_{OOB1}$ of the random forest regression model reaches its minimum;
S6. Output the variable set $s_{mic}$ for which $err_{OOB1}$ is minimal, and reconstruct the random forest regression model;
S7. Rank the variables in $s_{mic}$ using the random forest variable importance criterion;
S8. According to the threshold $v_{RF}$, perform a secondary screening of the variables in $s_{mic}$, retaining the variables whose importance index is not less than $v_{RF}$, and output the updated $s_{mic}$;
S9. Gradually increase $v_{RF}$ and reconstruct the random forest regression model with the $s_{mic}$ output by S8, repeating S7 and S8 until the out-of-bag error $err_{OOB2}$ of the model reaches its minimum, which yields the optimal variable set $s_{RF}$;
S10. Output the optimal variable set $s_{RF}$ and reconstruct the random forest regression model;
S11. Monitor in real time the prediction performance of the random forest regression model reconstructed in S10, and update the model when its prediction performance falls below a threshold;
S12. Compare and evaluate the prediction performance of the random forest regression model using model prediction performance evaluation indexes;
S13. Perform online prediction on real-time data of the power plant flue gas system.
S1 comprises the following steps:
Acquire a time-series operation data set of the power plant flue gas system, remove abnormal samples from the data set using the Laida (3σ) criterion, resample the data set, standardize the data using the standard score (Z-score), and divide the standardized data set into a training set and a test set;
S1.1. The time-series operation data set of the power plant flue gas system is $D(X, y)$, where $X \in \mathbb{R}^{N \times m}$, $N$ is the number of samples, $m$ is the number of auxiliary variables, $y$ is the nitrogen oxide concentration, i.e. the dependent variable, and $\mathbb{R}^{N \times m}$ is the real space containing the independent variables;
S1.2. Removing abnormal samples with the Laida criterion comprises calculating the standard deviation $\sigma$ according to the Bessel formula:
$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}, \qquad v_i = y_i - \bar{y}$$
where $\bar{y}$ is the mean of $y$, $v_i$ is the $i$-th deviation, $n$ is the number of samples, and $y_i$ is the nitrogen oxide concentration of the $i$-th sample;
If the deviation $v_i$ of a sample $y_i$ satisfies $|v_i| > 3\sigma$, the sample is considered abnormal and is rejected;
S1.3. The data in the data set are standardized using the standard score:
$$x_{normalization} = \frac{x - \bar{x}}{\sigma_x}$$
where $x$ is a value of $X$ before standardization, $\bar{x}$ is the mean of $x$, $\sigma_x$ is the standard deviation of $x$, and $x_{normalization}$ is the standardized value of $X$ after abnormal samples have been removed from $D(X, y)$;
S1.4. Divide the standardized data set into a training set and a test set, taking 70% of the total data as the training set and the remaining 30% as the test set.
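The preprocessing in S1 can be illustrated with a minimal sketch using only the Python standard library. The function names are hypothetical, the resampling step is omitted, and a chronological split is assumed (as in the embodiment); columns with zero variance are not handled.

```python
import statistics

def remove_outliers_3sigma(rows, y):
    """Laida (3-sigma) criterion: drop samples whose target deviates
    from the mean of y by more than three standard deviations."""
    mean_y = statistics.fmean(y)
    sigma = statistics.stdev(y)  # Bessel-corrected (n - 1) standard deviation
    keep = [i for i, yi in enumerate(y) if abs(yi - mean_y) <= 3 * sigma]
    return [rows[i] for i in keep], [y[i] for i in keep]

def zscore_columns(rows):
    """Standard score per column; also returns the means and standard
    deviations so that new data can be scaled the same way later."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.stdev(c) for c in cols]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds

def split_train_test(rows, y, train_fraction=0.7):
    """Chronological split: the first 70% for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], y[:cut], rows[cut:], y[cut:]
```

Returning the means and standard deviations is a design choice made here so that the same scaling can be reused for online data in S13.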
S2 comprises the following steps:
S2.1. Calculate the mutual information $I(x, y)$ of $x$ and $y$:
$$I(x, y) = \iint p(x, y)\,\log_2\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy$$
where $p(x)$ is the marginal probability density of $x$, $p(y)$ is the marginal probability density of $y$, and $p(x, y)$ is the joint probability density of the two variables;
S2.2. Calculate the maximum mutual information $MI^*$ over the grid $G$:
$$MI^*(D, h, v) = \max I(D|G)$$
where $MI^*(D, h, v)$ denotes $MI^*$ as a function of $D$, $h$ and $v$; $D = \{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ is a finite set of ordered pairs; $D|G$ is the distribution of the points of $D$ over the cells of $G$; and $h$, $v$ are the grid dimensions;
S2.3. Normalize $MI^*$:
$$M(D)_{h,v} = \frac{MI^*(D, h, v)}{\log_2 \min\{h, v\}}$$
where $M(D)_{h,v}$ is the normalized $MI^*$;
S2.4. Take the highest normalized value among the $M(D)_{h,v}$ as the MIC:
$$MIC = \max\{M(D)_{h,v}\}$$
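To make S2 concrete, the following is a simplified approximation of the MIC, not the full algorithm. It assumes equal-frequency binning for each grid size instead of searching for the optimal grid partition, and it assumes the commonly used grid budget $h \cdot v \le n^{0.6}$, which the text does not state. The function names are hypothetical.

```python
import math
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 by rank.
    Ties are broken by position, so heavily tied data is not handled well."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def grid_mutual_information(xb, yb):
    """Mutual information (in bits) of two discretized variables."""
    n = len(xb)
    pxy, px, py = Counter(zip(xb, yb)), Counter(xb), Counter(yb)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def approx_mic(x, y, alpha=0.6):
    """Max over grid sizes (h, v) of MI / log2(min(h, v)),
    with h * v limited by the assumed budget n ** alpha."""
    budget = len(x) ** alpha
    best = 0.0
    h = 2
    while h * 2 <= budget:
        v = 2
        while h * v <= budget:
            mi = grid_mutual_information(equal_frequency_bins(x, h),
                                         equal_frequency_bins(y, v))
            best = max(best, mi / math.log2(min(h, v)))
            v += 1
        h += 1
    return best
```

Because the grid partition is fixed rather than optimized, this sketch can only underestimate the true MIC for a given grid size; it is meant to show the normalization and maximization steps.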
The variable set in S3 is obtained by retaining the variables that satisfy:
$$MIC(X_j, y) \ge v_{mic}$$
where $X_j$ is the $j$-th variable of $X$.
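The threshold search of S3 to S5 can be sketched as a simple sweep. This is an illustration under assumptions: `oob_error` is a hypothetical callback that fits a random forest on the selected variables and returns its out-of-bag error, and the sweep evaluates a fixed number of steps rather than using a stopping rule, which the text does not specify.

```python
def sweep_threshold(scores, oob_error, start, step, n_steps):
    """Raise the threshold step by step, keep variables with score >= threshold,
    and return (threshold, selected indices, error) at the lowest OOB error."""
    best = None
    for i in range(n_steps):
        threshold = start + i * step
        selected = [j for j, s in enumerate(scores) if s >= threshold]
        if not selected:
            break  # no variables left to build a model with
        err = oob_error(selected)  # hypothetical: fit RF, return OOB error
        if best is None or err < best[2]:
            best = (threshold, selected, err)
    return best
```

The same function applies to the secondary screening of S8 and S9 if `scores` holds the importance values $IMP_j$ and the threshold is $v_{RF}$.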
The random forest regression model in S4 is $\{f(x, \Theta_k)\}$, where $k$ indexes the trees and $\Theta_k$ is a random variable. Each tree model outputs a numerical value, and the average over the trees is the prediction of the random forest regression model. 10-fold cross-validation is performed on the training set data to tune the parameters of the random forest regression model. Calculating $err_{OOB1}$ comprises:
S4.1. Draw $n_{tree}$ bootstrap sample sets from the data corresponding to the variables satisfying $MIC(X_j, y) \ge v_{mic}$; each bootstrap sample set contains two-thirds of that data;
S4.2. Grow an unpruned regression tree from each bootstrap sample; at each node, randomly sample $m_{tree}$ predictor variables from the variables retained in $X$ and select the optimal split point among them;
S4.3. Compute the predicted value by aggregating the predictions of the $n_{tree}$ trees;
S4.4. Calculate $err_{OOB1}$:
$$err_{OOB1} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $\hat{y}_i$ is the predicted value of the $i$-th sample.
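As a hedged illustration of the out-of-bag error in S4.4, the sketch below assumes a mean-squared-error form and assumes that per-tree predictions and in-bag masks are already available from training. The argument layout and function name are hypothetical.

```python
def oob_mse(y, tree_predictions, in_bag):
    """Out-of-bag mean squared error.

    tree_predictions[k][i]: prediction of tree k for sample i
    in_bag[k][i]: True if sample i was in tree k's bootstrap sample
    Each sample is predicted only by the trees that did not train on it.
    """
    squared_errors = []
    for i, yi in enumerate(y):
        votes = [tree_predictions[k][i]
                 for k in range(len(tree_predictions)) if not in_bag[k][i]]
        if votes:  # skip samples that were in-bag for every tree
            y_hat = sum(votes) / len(votes)
            squared_errors.append((yi - y_hat) ** 2)
    return sum(squared_errors) / len(squared_errors)
```

Since every tree leaves out part of the data, this error estimate needs no separate validation set, which is what makes it cheap to recompute for each threshold value.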
S7 comprises the following steps:
S7.1. For each decision tree $t_k$, input the out-of-bag data $D_k^{OOB}$ and obtain the mean square error $\mathrm{OOB\_MSE}_k$ between the predicted and true values:
$$\mathrm{OOB\_MSE}_k = \frac{1}{n_k^{OOB}}\sum_{i=1}^{n_k^{OOB}}\left(y_i - \hat{y}_{i,k}\right)^2$$
where $n_k^{OOB}$ is the number of out-of-bag samples of the $k$-th decision tree and $\hat{y}_{i,k}$ is the predicted value of the $k$-th decision tree;
S7.2. Remove variable $X_j$ from $D_k^{OOB}$ and, using the remaining variables, calculate the mean square error $\mathrm{OOB\_MSE}_{k,j}$ between the predicted and true values;
S7.3. The effect of $X_j$ on the prediction mean square error of decision tree $t_k$ is:
$$\mathrm{OOB\_MSE}_{k,j} - \mathrm{OOB\_MSE}_k$$
S7.4. Traverse every decision tree of the random forest regression model to obtain the effect of $X_j$ on the mean square error of all decision trees; the importance $IMP_j$ of $X_j$ is:
$$IMP_j = \frac{1}{K}\sum_{k=1}^{K}\left(\mathrm{OOB\_MSE}_{k,j} - \mathrm{OOB\_MSE}_k\right)$$
where $K$ is the number of decision trees in the random forest.
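The importance calculation of S7 and the screening of S8 can be sketched as follows, assuming the per-tree out-of-bag errors have already been computed. The function names and argument layout are hypothetical, and the importance is taken as the mean increase in out-of-bag MSE over all trees.

```python
def variable_importance(oob_mse_base, oob_mse_without):
    """IMP_j: mean over trees k of (oob_mse_without[k][j] - oob_mse_base[k]).

    oob_mse_base[k]: OOB MSE of tree k with all variables
    oob_mse_without[k][j]: OOB MSE of tree k with variable j removed
    """
    n_trees = len(oob_mse_base)
    n_vars = len(oob_mse_without[0])
    return [
        sum(oob_mse_without[k][j] - oob_mse_base[k] for k in range(n_trees)) / n_trees
        for j in range(n_vars)
    ]

def rank_and_screen(importance, v_rf):
    """Sort variables by decreasing importance and keep those >= v_rf."""
    ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    return [j for j in ranked if importance[j] >= v_rf]
```

A variable whose removal barely changes the error receives an importance near zero and is dropped first as $v_{RF}$ rises.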
The secondary screening in S8 retains the variables that satisfy:
$$IMP_j \ge v_{RF}$$
S11 comprises the following steps:
S11.1. Define a monitoring window for model performance, with the window size initialized to $M_w$;
S11.2. For each sample, calculate the prediction error using the random forest regression model:
$$err = \left| y_u - \hat{y}_u \right|$$
where $err$ is the prediction error, $y_u$ is the true nitrogen oxide value at the power plant, and $\hat{y}_u$ is the corresponding model prediction;
S11.3. Calculate the median prediction error within the window, and issue an update alarm signal if the median error is greater than the threshold $\delta$;
S11.4. Update the training data set with the new data of the latest window, and delete the earliest $M_w$ samples of the previous training data set;
S11.5. Reconstruct the random forest regression model using the new training data set.
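The monitoring and update loop of S11 can be sketched as below. This is a minimal illustration under assumptions: the per-sample error is taken as an absolute error, windows are non-overlapping, and `fit` is a hypothetical callback that trains a model on the current training list and returns a prediction function. The defaults of 50 and 3.0 are the values used in the embodiment.

```python
from statistics import median

def monitor_and_update(train, stream, fit, window=50, delta=3.0):
    """Check the median absolute error of each window of (x, y) pairs.
    When it exceeds delta, drop the oldest `window` training samples,
    append the newest window, and refit. Returns the final training
    list and the stream positions at which an update was triggered."""
    train = list(train)
    predict = fit(train)  # hypothetical: returns a callable x -> y_hat
    updates = []
    for start in range(0, len(stream) - window + 1, window):
        batch = list(stream[start:start + window])
        errors = [abs(y - predict(x)) for x, y in batch]
        if median(errors) > delta:
            train = train[window:] + batch  # S11.4: slide the training set
            predict = fit(train)            # S11.5: rebuild the model
            updates.append(start + window)
    return train, updates
```

Using the median rather than the mean makes the alarm less sensitive to a few isolated bad measurements within a window.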
S12 includes:
Evaluate model performance using performance metrics including the root mean square error (RMSE), the mean absolute error (MAE) and the coefficient of determination $R^2$.
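For reference, the three metrics named in S12 can be computed as in the sketch below. The text gives no formulas, so the standard definitions are assumed here, with $R^2$ taken as one minus the ratio of residual to total sum of squares.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) using the standard definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    mean_t = sum(y_true) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if y_true is constant
    return rmse, mae, r2
```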
S13 comprises the following steps: acquire real-time data $X_q$, standardize the data, complete the prediction of the dependent variable $y$ using the random forest regression model corresponding to $s_{RF}$, and output the prediction result $\hat{y}_q$.
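The online step of S13 can be sketched as follows. It assumes that the means and standard deviations come from the training data, that `selected` holds the column indices of the optimal variable set, and that `model` is a hypothetical callable wrapping the trained random forest.

```python
def predict_online(x_q, means, stds, selected, model):
    """Standardize one real-time sample with the training statistics,
    keep only the selected variables, and return the model prediction."""
    z = [(x_q[j] - means[j]) / stds[j] for j in selected]
    return model(z)  # hypothetical: trained regressor as a callable
```

Reusing the training statistics, instead of recomputing them on incoming data, keeps the online inputs on the same scale the model was trained on.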
Compared with the prior art, the invention has the following beneficial effects: the variable-screening thresholds are adjusted automatically using the out-of-bag error of the random forest, which greatly reduces the difficulty of parameter tuning and improves the efficiency of feature selection; combining the random sampling strategy with the random subspace strategy effectively improves the robustness of the model; the adaptive updating mechanism evaluates the applicability of the model through a monitoring algorithm and effectively guides updating; the method can quickly screen useful variables out of massive historical data; the parameter setting of the prediction model is simple, efficient and convenient for field personnel to operate; and the model updating strategy facilitates long-term application of the prediction model.
Drawings
FIG. 1 is a technical flow chart of the present invention;
FIG. 2 is a static random forest test set prediction curve;
FIG. 3 is a static random forest test set error curve.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the present invention will be clearly and completely described below, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical flow of the invention is shown in FIG. 1. The experimental object of this embodiment is a 1030 MW supercritical and subcritical thermal power plant in China. In the basic process, raw coal is first ground into powder by a coal mill. After drying, the pulverized coal is blown into the once-through boiler through the coal pipes by the primary air, while the secondary air provides sufficient oxygen and tangential momentum for boiler combustion. Burning the pulverized coal converts the fuel into heat, which directly turns water into supercritical steam that drives a turbine to generate electricity. 390 auxiliary variables were collected from the plant. 5184 samples were collected over a sampling period of 72 h at a sampling interval of 50 s, and 5012 samples remained after outliers were removed. The unit load varied from 600 MW to 1000 MW. Steady-state, load-increasing, load-decreasing and other operating conditions were used for performance verification. In time order, the first 3456 samples (about 70% of the historical data) were selected as the training set, and the remaining 1556 samples (about 30%) were used as the test set. All auxiliary variables of the historical data were normalized using the Z-score. MIC was used for the preliminary selection of important variables: the threshold is first initialized empirically to a small value and then raised in steps of 0.01. A random forest regression model was established, and 10-fold cross-validation was performed on the training data set to tune its parameters. The number of iterations is 20, the number of trees is 50, and the number of leaf nodes is 5. Because the calculation of the OOB error involves some randomness, each iteration is run independently 10 times to obtain the best result. Table 1 gives the OOB errors and the RMSE on the training set obtained during the main iterations.
TABLE 1 Results for different $v_{mic}$ values
v_mic    OOB error    RMSE      Number of retained variables
0.1      13.3224      1.8031    376
0.15     12.9798      1.7614    359
0.2      12.7679      1.7954    345
0.25     13.495       1.8544    326
0.3      12.8646      1.7862    312
0.31     12.6948      1.8033    310
0.32     12.5396      1.7969    305
0.33     12.7246      1.7996    299
0.34     15.1845      1.9675    290
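The selection rule applied to Table 1 can be checked with a short sketch: the rows below are copied from the table, and the row with the smallest OOB error is picked. The variable names are illustrative only.

```python
# (v_mic, OOB error, RMSE, number of retained variables), copied from Table 1
table1 = [
    (0.10, 13.3224, 1.8031, 376),
    (0.15, 12.9798, 1.7614, 359),
    (0.20, 12.7679, 1.7954, 345),
    (0.25, 13.4950, 1.8544, 326),
    (0.30, 12.8646, 1.7862, 312),
    (0.31, 12.6948, 1.8033, 310),
    (0.32, 12.5396, 1.7969, 305),
    (0.33, 12.7246, 1.7996, 299),
    (0.34, 15.1845, 1.9675, 290),
]

# Pick the threshold with the lowest OOB error.
best = min(table1, key=lambda row: row[1])
print(best)  # (0.32, 12.5396, 1.7969, 305)
```

The minimum lies at $v_{mic} = 0.32$ with 305 retained variables, while the OOB error at 0.25 is higher than at 0.2, so the error is not monotonic in the threshold.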
As shown in Table 1, when the model yields the minimum OOB error, the RMSE on the training set is also relatively small. Neither the OOB error nor the RMSE changes linearly with the threshold during the iterations. For example, the OOB error at $v_{mic} = 0.25$ is much larger than at $v_{mic} = 0.2$. This suggests that it is difficult to obtain the optimal variable set for regression modeling from the MIC threshold alone. The main reason is that the MIC only considers the correlation between each auxiliary variable and the target variable and does not fully account for the influence of the auxiliary variable within the regression model. This finding is consistent with the previous analysis. Based on these results, this step retains 305 variables. To eliminate variables that are redundant for regression modeling, the second step re-selects the variables according to the variable importance index of the random forest. In this operation, the threshold starts at 0.1 and is increased in steps of 0.02. The parameters of the random forest regression model are the same as in the first step. The OOB error calculated in each iteration and the RMSE on the training set are shown in Table 2. As the threshold increases, the number of important variables retained decreases, but the OOB error does not decrease linearly, because the performance of the regression model depends not only on the relationship between each auxiliary variable and the target variable but also on the combined effect of all variables in the regression model.
TABLE 2 Results for different $v_{RF}$ values
According to Table 2, 61 important variables are finally selected to construct the NOx prediction model. Based on the variable selection result, a static random forest regression model can be constructed to predict NOx emissions. Figures 2 and 3 show the performance of the static random forest regression model on the test data set: Figure 2 shows the prediction curve and Figure 3 shows the error curve. For sample points 1-1400, the model shows good prediction performance, with small and relatively stable prediction errors. After point 1400, the prediction error trends upward. The slope of the median-error curve, plotted from the median error of every 50 points, increases significantly between points 1400 and 1450. Specifically, the median error at point 1450 is 3.403 and the median error at point 1500 is 4.410. This phenomenon is caused by an inherent drawback of the static model: with the training samples and model parameters fixed, performance degrades as the operating conditions change. Thus, when static models are used on an industrial scale, manual maintenance is required, which increases the effort of using the model over the long term.
In the embodiment, $M_w$ is set to 50 and the threshold δ is set empirically to 3.0. When the median error is greater than or equal to δ, the training set is updated and the random forest regression model is reconstructed. According to the median error curve of the static model, the first update occurs at point 1400 and the second at point 1500. Specifically, the median error values after updating are 2.9054 at point 1450, 2.9345 at point 1500 and 2.1542 at point 1550. The results show that the predictions of the updated model have a good linear relation with the test set values, and that RMSE and MAE are significantly reduced. The $R^2$ value of the updated model also increases slightly, indicating that the strategy improves the model's ability to explain the test data.
To compare the performance of the proposed variable selection and NOx emission prediction methods, this section builds comparison models based on the MIC variable selection results. Because of the randomness of the random forest, each experiment was run 10 times independently to obtain the optimal result. Five traditional methods, namely SVR, BP network, PLS, GPR and DT regression, are compared using the selected 61 important variables. Table 3 describes the static random forest performance for different variable sets selected by MIC. The performance of the static random forest fluctuates slightly for $v_{mic}$ in the range 0.1-0.45, and the RF performs best at $v_{mic}$ = 0.1. When $v_{mic}$ is increased to 0.5, the size of the variable set decreases substantially, but the random forest performance does not change much. The results show that the best set of modeling variables cannot be obtained by means of MIC alone, because of the strong redundancy among the variables.
TABLE 3 Static random forest with different variable sets selected by MIC
$v_{mic}$ $R^2$ RMSE MAE Number of retained variables
0.1 0.9285 3.8745 2.8235 376
0.15 0.927 3.914 2.8872 359
0.2 0.9237 4.0039 2.8839 345
0.25 0.927 3.9154 2.8571 326
0.3 0.9245 3.981 2.934 312
0.31 0.9271 3.9126 2.8803 310
0.32 0.9275 3.9016 2.8327 305
0.33 0.9271 3.9117 2.8525 299
0.34 0.9216 4.0581 2.9755 290
0.4 0.9227 4.0292 2.9362 215
0.45 0.9238 3.9991 2.9271 159
0.5 0.9228 4.0236 2.9571 65
0.55 0.9232 4.0157 2.9617 30
SVR is implemented using the LIBSVM toolbox, while the other algorithms (GPR, BP, DT and PLS) are implemented using MATLAB toolboxes. The number of features extracted by PLS is 40, chosen so that the variance contribution exceeds 80%. The GPR and SVR parameters are left at their defaults. The number of leaf nodes of the DT algorithm is set to 9 by the toolbox's default optimization algorithm. BP adopts a three-layer structure; the hidden layer activation function is "sigmoid" and the output layer activation function is "purelin". The number of hidden layer neurons is 90, tuned by five-fold cross-validation. When evaluating prediction performance, RMSE describes the accuracy of the prediction: the accuracy of the model increases as the RMSE decreases. MAE measures the mean absolute error between the predicted values and the label values of the experimental data set, and the accuracy of the prediction model increases as the MAE decreases. The correlation coefficient $R^2$ describes explanatory power: the closer its value is to 1, the stronger the explanatory power of the model.
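The three evaluation indices can be sketched in a few lines. This example assumes that $R^2$ is computed as the coefficient of determination (one minus the ratio of residual to total sum of squares); the text calls it a correlation coefficient and does not give its formula, so that choice, the function names and the toy data are assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values chosen only to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Lower RMSE and MAE and an $R^2$ closer to 1 indicate a better model, which is how the comparison table that follows is read.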
TABLE 4 Results of different models
Model $R^2$ RMSE MAE
Update RF 0.9596 2.9042 2.0185
Static RF 0.9282 3.874 2.9295
GPR 0.8987 4.6006 3.4278
DT 0.8783 5.0419 3.6766
BP 0.8384 5.8107 4.3982
PLS 0.7948 6.5472 4.7151
SVR 0.784 6.7167 4.6343
Table 4 shows the performance of all comparison models. Among the static methods, random forest performs best, followed by GPR, DT and BP, while PLS and SVR perform worst. The result shows that the ensemble learning method performs better than traditional single-model methods when processing multi-condition data. The model based on the update strategy exhibits the best performance, with the smallest MAE (2.0185) and RMSE (2.9042), indicating that timely model updates are necessary in long-term industrial applications. Experimental analysis shows that, compared with the traditional methods, the proposed method has higher prediction accuracy and a wider prediction range. Practical application of the method helps improve the efficiency of NOx emission prediction.
The above embodiments are only for illustrating the technical aspects of the present invention, not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may be modified or some or all of the technical features may be replaced with other technical solutions, which do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm, characterized by comprising the following steps:
S1. Performing data processing, including data acquisition, abnormal sample removal, data standardization and data set division;
S2. Calculating the maximum information coefficient MIC between each independent variable in X and the dependent variable y;
S3. According to a given initial threshold $v_{mic}$, retaining the variables whose MIC is greater than $v_{mic}$, thereby realizing the primary screening of the variables and obtaining a variable set;
S4. Constructing a random forest regression model using the variable set obtained in S3, and calculating the out-of-bag sample error $err_{OOB1}$ of the random forest regression model;
S5. Gradually increasing $v_{mic}$ and repeating S3 and S4 until $err_{OOB1}$ of the random forest regression model reaches its minimum;
S6. Outputting the variable set $s_{mic}$ of the random forest regression model at which $err_{OOB1}$ reaches its minimum, and reconstructing the random forest regression model;
S7. Ranking the variables in $s_{mic}$ using the random forest variable importance criterion;
S8. According to the threshold $v_{RF}$, performing secondary screening on the variables in $s_{mic}$, retaining the variables whose importance index is greater than $v_{RF}$, and re-outputting $s_{mic}$;
S9. Gradually increasing $v_{RF}$, reconstructing the random forest regression model with the $s_{mic}$ output by S8, and repeating S7 and S8 until the out-of-bag error $err_{OOB2}$ of the random forest regression model reaches its minimum, obtaining the optimal variable set $s_{RF}$;
S10. Outputting the optimal variable set $s_{RF}$ and reconstructing the random forest regression model;
S11. Monitoring in real time the prediction performance of the random forest regression model reconstructed in S10, and updating the model when its prediction performance falls below a threshold;
S12. Comparing and evaluating the prediction performance of the random forest regression model using model prediction performance evaluation indices;
S13. Carrying out online prediction on real-time data of the power plant flue gas system.
2. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 1, wherein S1 comprises:
acquiring a time-series operation data set of the power plant flue gas system, removing abnormal samples from the data set using the Laida (3σ) criterion, resampling the data set, standardizing the data in the data set using the standard score (Z-score), and dividing the standardized data set into a training set and a test set;
S1.1. The time-series operation data set of the power plant flue gas system is $D(X, y)$, where $X \in R^{N \times m}$, $N$ is the number of samples, $m$ is the number of auxiliary variables, $y$ is the nitrogen oxide concentration, i.e. the dependent variable, and $R$ denotes the set (real space) of the independent variables;
S1.2. Removing abnormal samples from the data set using the Laida criterion comprises: calculating the standard deviation $\sigma$ according to the Bessel formula:
$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^2}, \qquad v_i = y_i - \bar{y}$$
where $\bar{y}$ is the average value of $y$, $v_i$ is the $i$-th deviation, $n$ is the number of samples, and $y_i$ is the nitrogen oxide concentration value of the $i$-th sample;
if the deviation $v_i$ of a sample $y_i$ satisfies $|v_i| > 3\sigma$, the sample is regarded as abnormal data and is rejected;
S1.3. Standardizing the data in the data set using the standard score, with the formula:
$$x_{normalization} = \frac{x - \bar{x}}{\sigma_x}$$
where $x$ is a value of $X$ before standardization, $\bar{x}$ is the mean of $x$, $\sigma_x$ is its standard deviation, and $x_{normalization}$ is the standardized value of $X$ after the abnormal samples have been removed from $D(X, y)$;
S1.4. Dividing the standardized data set into a training set and a test set, taking 70% of the total data as the training set and the remaining 30% as the test set.
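Steps S1.2 and S1.3 can be sketched with the standard library alone. The function names below are hypothetical, and using the Bessel-corrected (n-1) standard deviation for the Z-score as well as for the 3σ test is an assumption; the claim does not say which estimator the standardization uses.

```python
import math


def bessel_std(values):
    """Return (mean, standard deviation with the n-1 Bessel correction)."""
    n = len(values)
    mean = sum(values) / n
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def keep_indices_3sigma(y):
    """Indices of samples whose deviation from the mean is at most 3 sigma."""
    mean, sigma = bessel_std(y)
    return [i for i, yi in enumerate(y) if abs(yi - mean) <= 3 * sigma]


def z_score(column):
    """Standardize one variable; assumes the column is not constant."""
    mean, sigma = bessel_std(column)
    return [(x - mean) / sigma for x in column]


# Toy target series: 19 ordinary readings and one gross outlier.
y = [10.0 + 0.1 * (i % 3) for i in range(19)] + [100.0]
kept = keep_indices_3sigma(y)

# Standardize a toy auxiliary variable.
z = z_score([1.0, 2.0, 3.0, 4.0, 5.0])
print(kept, z)
```

Note that with very few samples a single outlier cannot exceed 3σ under the n-1 estimator, so the criterion is only meaningful on reasonably long series such as the one described in the embodiment.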
3. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 2, wherein S2 comprises:
S2.1. Calculating the mutual information $I(X, y)$ of $X$ and $y$:
$$I(X, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy$$
where $p(x)$ is the marginal probability density of $x$, $p(y)$ is the marginal probability density of $y$, and $p(x, y)$ is the joint probability density of the two variables;
S2.2. Calculating the maximum mutual information $MI^*$ on the grid $G$:
$$MI^*(D, h, v) = \max I(D|G)$$
where $MI^*(D, h, v)$ denotes $MI^*$ as a function of $D$, $h$ and $v$; $D = \{(x_i, y_i), i = 1, 2, \ldots, n\}$ is a finite set of ordered pairs; $D|G$ denotes the distribution of the points of $D$ over the cells of $G$; and $h$, $v$ give the grid size;
S2.3. Normalizing $MI^*$:
$$M(D)_{h,v} = \frac{MI^*(D, h, v)}{\log \min\{h, v\}}$$
where $M(D)_{h,v}$ is the normalized $MI^*$;
S2.4. Taking the highest normalized value obtained in $M(D)_{h,v}$ as the MIC:
$$MIC = \max\{M(D)_{h,v}\}$$
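The grid search behind S2.2 to S2.4 can be illustrated with a deliberately simplified sketch. It is not the full MIC algorithm: true MIC optimizes the placement of the grid lines for each grid size, whereas this version only tries equal-width grids, so it gives a lower bound on the value a real MIC implementation would report. The function names and the grid budget of $n^{0.6}$ cells (a commonly quoted default) are assumptions.

```python
import math
import random
from collections import Counter


def _equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    return [min(int((val - lo) / width), k - 1) for val in values]


def grid_mutual_information(x, y, h, v):
    """Mutual information (in nats) of x and y on an h-by-v equal-width grid."""
    n = len(x)
    bx, by = _equal_width_bins(x, h), _equal_width_bins(y, v)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


def mic_approx(x, y, exponent=0.6):
    """Largest normalized grid MI over grids with h * v <= n ** exponent."""
    budget = max(4, int(len(x) ** exponent))
    best = 0.0
    for h in range(2, budget // 2 + 1):
        for v in range(2, budget // h + 1):
            score = grid_mutual_information(x, y, h, v) / math.log(min(h, v))
            best = max(best, score)
    return best


x = [i / 99 for i in range(100)]
rng = random.Random(0)
noise = [rng.random() for _ in range(100)]

mic_linear = mic_approx(x, x)       # perfectly dependent pair
mic_noise = mic_approx(x, noise)    # unrelated pair
print(mic_linear, mic_noise)
```

A perfectly dependent pair scores close to 1 and an unrelated pair scores much lower, which is the property the primary screening relies on when it compares each auxiliary variable's MIC against a threshold.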
4. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 3, wherein obtaining the variable set in S3 comprises:
$$MIC(X_j, y) \ge v_{mic}$$
where $X_j$ is the $j$-th variable of $X$.
5. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 4, wherein the random forest regression model in S4 is $\{f(x, \Theta_k)\}$, where $k$ indexes the trees, $\Theta_k$ is a random variable, each tree model outputs a numerical value, and the average of the trees' outputs is the prediction result of the random forest regression model; 10-fold cross-validation is performed on the training set data to tune the parameters of the random forest regression model; and calculating $err_{OOB1}$ comprises:
S4.1. Drawing $n_{tree}$ bootstrap sample sets from the data corresponding to $MIC(X_j, y)$, each bootstrap sample set containing about two-thirds of that data;
S4.2. Growing an unpruned regression tree from each bootstrap sample, randomly sampling $m_{tree}$ predictor variables and selecting the optimal split point from among these variables of $X$;
S4.3. Calculating the predicted value by combining the predictions of the $m_{tree}$ trees;
S4.4. Calculating $err_{OOB1}$:
where $\hat{y}$ is the predicted value.
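The out-of-bag mechanics of S4.1 to S4.4 can be sketched without a tree library. In this sketch a 1-nearest-neighbour regressor stands in for the regression tree, and the OOB error is taken to be the mean squared difference between each sample's true value and the average prediction of the learners that did not see it; both choices, and every name in the code, are assumptions, since the source does not reproduce the formula.

```python
import random


def fit_1nn(X, y):
    """Stand-in base learner (not a regression tree): 1-nearest neighbour."""
    def predict(row):
        nearest = min(range(len(X)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], row)))
        return y[nearest]
    return predict


def oob_error(X, y, fit, n_trees=50, seed=0):
    """Mean squared error of averaged out-of-bag predictions (assumed form)."""
    rng = random.Random(seed)
    n = len(X)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        in_bag = set(boot)
        model = fit([X[i] for i in boot], [y[i] for i in boot])
        for i in range(n):
            if i not in in_bag:                       # sample i is out of bag
                sums[i] += model(X[i])
                counts[i] += 1
    scored = [i for i in range(n) if counts[i] > 0]
    return sum((y[i] - sums[i] / counts[i]) ** 2 for i in scored) / len(scored)


# Toy data: one feature, exactly linear target.
X = [[i / 10] for i in range(30)]
y = [2.0 * row[0] for row in X]
err = oob_error(X, y, fit_1nn)
print(err)
```

Sampling n rows with replacement leaves roughly 63% of the distinct samples in each bag, which is consistent with the "about two-thirds" stated in S4.1; the remaining samples provide the error estimate without a separate validation set.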
6. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 5, wherein S7 comprises:
S7.1. For each decision tree $t_k$, inputting its out-of-bag data and obtaining the mean square error $OOB\_MSE_k$ between the predicted values and the true values:
where the symbols in the formula denote, respectively, the number of out-of-bag samples of the $k$-th decision tree and the predicted value of the $k$-th decision tree;
S7.2. Removing the variable $X_j$ from the out-of-bag data and, using the remaining variables, calculating the mean square error $OOB\_MSE_{k,j}$ between the predicted values and the true values;
S7.3. The effect of $X_j$ on the prediction mean square error of decision tree $t_k$ is:
S7.4. Traversing each decision tree of the random forest regression model to obtain the mean square error effect of $X_j$ for all decision trees, the importance result $IMP_j$ of $X_j$ being:
where $K$ is the number of decision trees used to construct the random forest.
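One common way to realize an importance index of this kind is permutation importance: for each tree, the out-of-bag error is recomputed after the values of one variable are shuffled, and the increase in error is averaged over the trees. The sketch below uses that swapped-in technique; the claim speaks of removing the variable, and whether it means permutation is an assumption, as are the function names and the hand-written stand-in "trees".

```python
import random


def mse(predict, rows, targets):
    """Mean square error of a predictor on the given rows."""
    return sum((t - predict(r)) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(trees, X, y, j, seed=0):
    """Average increase in OOB MSE over trees when column j is shuffled.

    `trees` is a list of (predict, oob_indices) pairs (hypothetical layout).
    """
    rng = random.Random(seed)
    total = 0.0
    for predict, oob in trees:
        rows = [X[i] for i in oob]
        targets = [y[i] for i in oob]
        base = mse(predict, rows, targets)
        column = [r[j] for r in rows]
        rng.shuffle(column)                      # break the link between X_j and y
        shuffled = [r[:j] + [c] + r[j + 1:] for r, c in zip(rows, column)]
        total += mse(predict, shuffled, targets) - base
    return total / len(trees)


# Toy data: the target depends on column 0 only.
X = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * row[0] for row in X]


def stand_in_tree(row):
    """Hand-written predictor standing in for a fitted regression tree."""
    return 2.0 * row[0]


trees = [(stand_in_tree, list(range(0, 10))), (stand_in_tree, list(range(10, 20)))]
imp_0 = permutation_importance(trees, X, y, 0)
imp_1 = permutation_importance(trees, X, y, 1)
print(imp_0, imp_1)
```

A variable the model never uses gets an importance of zero, while a variable the predictions depend on gets a positive value, so thresholding the index removes variables that contribute nothing inside the regression model even if their MIC with the target is high.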
7. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 6, wherein the secondary screening in S8 comprises:
$$IMP_j \ge v_{RF}$$
8. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 7, wherein S11 comprises:
S11.1. Defining a monitoring window for model performance, the window size being initialized to $M_w$;
S11.2. For each sample, calculating the prediction error using the random forest regression model, the prediction error being calculated as:
where err is the prediction error and $y_u$ is the true nitrogen oxide value of the power plant;
S11.3. Calculating the median prediction error within the window, and issuing an update alarm signal if the median error is greater than the threshold δ;
S11.4. Updating the training data set with the new data of the latest window, and deleting the earliest samples of the initial training data set;
S11.5. Reconstructing the random forest regression model using the new training data set.
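The monitoring and update steps can be sketched as follows. The sketch assumes non-overlapping windows, absolute prediction errors, and that as many of the oldest training samples are dropped as new samples are added; none of these details is fixed by the claim, and the function names are hypothetical.

```python
from statistics import median


def windows_needing_update(abs_errors, window=50, delta=3.0):
    """End positions of windows whose median absolute error exceeds delta."""
    alarms = []
    for end in range(window, len(abs_errors) + 1, window):
        if median(abs_errors[end - window:end]) > delta:
            alarms.append(end)
    return alarms


def update_training_set(train, new_window):
    """Append the newest window and drop the same number of oldest samples."""
    return train[len(new_window):] + list(new_window)


# Toy error series: stable, then a drift, then recovery after an update.
abs_errors = [1.0] * 100 + [4.0] * 50 + [2.0] * 50
alarms = windows_needing_update(abs_errors)

train = list(range(10))                      # stand-in for training samples
updated = update_training_set(train, [100, 101])
print(alarms, updated)
```

Using the median rather than the mean keeps a few isolated large errors from triggering a rebuild, so the model is only retrained when most of the window is predicted poorly. After each alarm the model would be refitted on the updated training set as in S11.5.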
9. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 8, wherein S12 comprises:
evaluating model performance using model performance indices including the root mean square error RMSE, the mean absolute error MAE, and the correlation coefficient $R^2$.
10. The method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm according to claim 9, wherein S13 comprises: acquiring real-time data $X_q$, standardizing the data, completing the prediction of the dependent variable $y$ using the random forest regression model corresponding to $s_{RF}$, and outputting the prediction result.
CN202311639882.1A 2023-12-04 2023-12-04 Coal-fired power plant nitrogen oxide emission prediction method based on improved random forest algorithm Pending CN117829342A (en)

Publications (1)

Publication Number Publication Date
CN117829342A true CN117829342A (en) 2024-04-05

Family

ID=90510498

Country Status (1)

Country Link
CN (1) CN117829342A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination