CN117829342A

CN117829342A - Coal-fired power plant nitrogen oxide emission prediction method based on improved random forest algorithm

Info

Publication number: CN117829342A
Application number: CN202311639882.1A
Authority: CN
Inventors: 贺凯迅; 董朕; 蒋瀚; 彭鑫; 钟麦英; 朱延正
Original assignee: East China University of Science and Technology; Shandong University of Science and Technology
Current assignee: East China University of Science and Technology; Shandong University of Science and Technology
Priority date: 2023-12-04
Filing date: 2023-12-04
Publication date: 2024-04-05

Abstract

The invention discloses a method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm, belonging to the technical field of flue gas denitration in coal-fired power plants. The method comprises: calculating the maximum information coefficient between each independent variable and the dependent variable; performing a primary screening of the variables; constructing a random forest regression model and reconstructing it several times; monitoring the prediction performance of the reconstructed model in real time; comparing and evaluating the prediction performance using model performance evaluation indexes; and performing online prediction on real-time data of the power plant flue gas system. The variable-screening thresholds are adjusted automatically using the out-of-bag error of the random forest, which greatly reduces the difficulty of parameter tuning and improves the efficiency of feature selection. Combining the random sample (bootstrap) sampling strategy with the random subspace strategy effectively improves the robustness of the model. The adaptive updating mechanism evaluates the applicability of the model through a monitoring algorithm and effectively guides updating.

Description

Coal-fired power plant nitrogen oxide emission prediction method based on improved random forest algorithm

Technical Field

The invention discloses a prediction method for nitrogen oxide emission of a coal-fired power plant based on an improved random forest algorithm, and belongs to the technical field of flue gas denitration of the coal-fired power plant.

Background

China's energy structure is currently diversifying. Although new energy sources are developing quickly, their share is still small and their range of application narrow, so thermal power generation remains dominant and so far accounts for nearly 60% of China's annual energy production. Pulverized coal burned in the boiler produces large amounts of harmful substances such as nitrogen oxides (NOx), sulfur dioxide and dust, which pollute the atmosphere and harm human health. Because of the country's stringent environmental policies, NOx emissions from thermal power plants have attracted widespread attention. It is therefore important to find an effective technique that can estimate NOx at the denitration reactor inlet online and optimize combustion process control.

In production practice, power plants are usually equipped with desulfurization and denitration units to remove sulfur and NOx from the flue gas. At present, selective catalytic reduction (SCR) is widely used in denitration equipment because of its simple structure and environmentally friendly reaction process. Timely and accurate detection of NOx emissions is the key to controlling ammonia injection accurately and improving SCR efficiency. Currently, NOx detection relies mainly on the traditional continuous emission monitoring system (CEMS), which involves a large amount of hardware that is complex to install and commission. In addition, the CEMS works in a harsh environment with strong electromagnetic interference, which causes frequent abnormal operation and a heavy maintenance workload. Even under normal operating conditions, sampling and detection delays introduce a large time delay into the NOx measurement.

Disclosure of Invention

The invention aims to provide a method for predicting nitrogen oxide emissions of a coal-fired power plant based on an improved random forest algorithm, which solves the problem that the existing CEMS cannot provide real-time detection during scheduled overhaul and equipment maintenance, and which reduces the time delay of NOx concentration detection.

The method for predicting nitrogen oxide emissions of a coal-fired power plant based on the improved random forest algorithm comprises the following steps:

S1, performing data processing, including data acquisition, abnormal sample removal, data standardization and data set division;

S2, calculating the maximum information coefficient (MIC) between each independent variable in X and the dependent variable y;

S3, given an initial threshold v_mic, retaining the variables whose MIC is greater than v_mic, thereby performing the primary screening of the variables and obtaining a variable set;

S4, constructing a random forest regression model from the variable set obtained in S3, and calculating the out-of-bag error err_OOB1 of the random forest regression model;

S5, gradually increasing v_mic and repeating S3 and S4 until err_OOB1 of the random forest regression model reaches its minimum;

S6, outputting the variable set s_mic for which err_OOB1 is minimal, and reconstructing the random forest regression model with it;

S7, ranking the variables in s_mic using the random forest variable importance criterion;

S8, performing a secondary screening of the variables in s_mic according to a threshold v_RF, retaining the variables whose importance index is greater than v_RF, and re-outputting s_mic;

S9, gradually increasing v_RF, reconstructing the random forest regression model with the s_mic output by S8, and repeating S7 and S8 until the out-of-bag error err_OOB2 of the random forest regression model reaches its minimum, thereby obtaining the optimal variable set s_RF;

S10, outputting the optimal variable set s_RF and reconstructing the random forest regression model;

S11, monitoring in real time the prediction performance of the random forest regression model reconstructed in S10, and updating the model when its prediction performance falls below a threshold;

S12, comparing and evaluating the prediction performance of the random forest regression model using model prediction performance evaluation indexes;

S13, performing online prediction on real-time data of the power plant flue gas system.
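The two threshold sweeps (S3 to S5 for v_mic, S7 to S9 for v_RF) follow the same pattern: raise a threshold step by step, rebuild the model on the surviving variables, and keep the variable set with the lowest out-of-bag error. The following is a minimal sketch of that pattern only. The function name `sweep_threshold`, its parameters, and the callback `oob_error_fn` are hypothetical and not taken from the patent, and scanning a fixed number of iterations while remembering the minimum is an assumed way of implementing "until the error reaches its minimum".

```python
def sweep_threshold(scores, oob_error_fn, start, step, max_iter=20):
    """Raise a screening threshold stepwise and keep the best variable set.

    scores:       one score per variable (MIC for S3-S5, importance for S7-S9)
    oob_error_fn: hypothetical callback; trains a model on the given variable
                  indices and returns its out-of-bag error
    Returns (lowest_error, threshold, selected_indices), or None if no
    variable passes the starting threshold.
    """
    best = None
    for it in range(max_iter):
        threshold = start + it * step
        selected = [j for j, s in enumerate(scores) if s >= threshold]
        if not selected:  # nothing left to model, stop the sweep
            break
        err = oob_error_fn(selected)
        if best is None or err < best[0]:
            best = (err, threshold, selected)
    return best
```

Because the out-of-bag error is noisy, a caller might average `oob_error_fn` over several independent runs before comparing thresholds.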

S1 comprises the following steps:

acquiring a time-series operating data set of the power plant flue gas system, removing abnormal samples from the data set using the 3σ (Pauta) criterion, resampling the data set, standardizing the data using the standard score (Z-score), and dividing the standardized data set into a training set and a test set;

S1.1, the time-series operating data set of the power plant flue gas system is D(X, y), wherein X ∈ R^(N×m), N is the number of samples, m is the number of auxiliary variables (the independent variables), y is the nitrogen oxide concentration, namely the dependent variable, and R denotes the set of real numbers;

S1.2, removing abnormal samples from the data set using the 3σ (Pauta) criterion comprises calculating the standard deviation σ according to the Bessel formula:

$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_i^{2}}, \qquad v_i = y_i - \bar{y}$$

wherein $\bar{y}$ is the mean of y, v_i is the i-th deviation, n is the number of samples, and y_i is the nitrogen oxide concentration of the i-th sample;

if the deviation v_i of a sample y_i satisfies |v_i| > 3σ, the sample is considered abnormal and is rejected;

S1.3, standardizing the data in the data set using the standard score, according to the formula:

$$x_{normalization} = \frac{x - \bar{x}}{\sigma_x}$$

wherein x is one value of X before standardization, $\bar{x}$ is the mean of x, $\sigma_x$ is the standard deviation of x, and x_normalization is the standardized value of X after the abnormal samples have been removed from D(X, y);

S1.4, dividing the standardized data set into a training set and a test set, taking 70% of the total data as the training set and the remaining 30% as the test set.
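The preprocessing in S1.2 to S1.4 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the split is taken in time order (as the embodiment describes), and the resampling step mentioned in S1 is omitted.

```python
import statistics


def remove_outliers_3sigma(rows, targets):
    """Drop samples whose target deviates from the mean by more than 3 sigma."""
    mean_y = statistics.fmean(targets)
    sigma = statistics.stdev(targets)  # Bessel-corrected (n - 1) estimate
    keep = [i for i, y in enumerate(targets) if abs(y - mean_y) <= 3 * sigma]
    return [rows[i] for i in keep], [targets[i] for i in keep]


def zscore_columns(rows):
    """Standardize every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Guard: a constant column has stdev 0, so divide by 1 instead.
    stds = [statistics.stdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def chronological_split(rows, targets, train_fraction=0.7):
    """Use the first part of the series for training, the rest for testing."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], targets[:cut], rows[cut:], targets[cut:]
```

In a deployed model the means and standard deviations would be computed on the training set only and then reused for new data, which this short sketch does not enforce.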

S2 comprises the following steps:

S2.1, calculating the mutual information I(X, y) of X and y:

$$I(X, y) = \iint p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy$$

wherein p(x) is the marginal probability density of x, p(y) is the marginal probability density of y, and p(x, y) is the joint probability density of the two variables;

S2.2, calculating the maximum mutual information MI* on the grid G:

$$MI^{*}(D, h, v) = \max I(D|_G)$$

wherein MI*(D, h, v) denotes MI* as a function of D, h and v; D = {(x_i, y_i), i = 1, 2, ..., n} is a finite set of ordered pairs; D|_G is the distribution of the points of D over the cells of G; and h and v are the grid sizes;

S2.3, normalizing MI*:

$$M(D)_{h,v} = \frac{MI^{*}(D, h, v)}{\log \min\{h, v\}}$$

wherein M(D)_{h,v} is the normalized MI*;

S2.4, taking the highest normalized value in M(D)_{h,v} as the MIC:

$$MIC = \max\{M(D)_{h,v}\}$$

The variable set obtained in S3 consists of the variables satisfying:

$$MIC(X^{j}, y) \ge v_{mic}$$

wherein X^j is the j-th variable of X.
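To make S2 and S3 concrete, the sketch below approximates the MIC with a deliberately simplified search: instead of optimizing the grid placement as the full MIC algorithm does, it only tries equal-frequency grids of size h x v and keeps the largest normalized mutual information. The grid budget h·v ≤ n^0.6 is an assumption borrowed from common MIC practice, not a value stated in the patent, and all function names are hypothetical. Tied values are split by rank, so the sketch is only meaningful for continuous-valued data.

```python
import math
from collections import Counter


def _equal_frequency_bins(values, n_bins):
    """Give each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def _mutual_information(bx, by):
    """Mutual information (in nats) of two discrete label sequences."""
    n = len(bx)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log((c * n) / (px[a] * py[b]))
               for (a, b), c in joint.items())


def approx_mic(x, y, alpha=0.6):
    """Max over h x v grids (h * v <= n**alpha) of I / log(min(h, v))."""
    budget = len(x) ** alpha
    best = 0.0
    h = 2
    while h * 2 <= budget:
        bx = _equal_frequency_bins(x, h)
        v = 2
        while h * v <= budget:
            by = _equal_frequency_bins(y, v)
            score = _mutual_information(bx, by) / math.log(min(h, v))
            best = max(best, score)
            v += 1
        h += 1
    return best


def screen_by_mic(columns, y, v_mic):
    """Primary screening (S3): keep column indices with MIC >= v_mic."""
    return [j for j, col in enumerate(columns) if approx_mic(col, y) >= v_mic]
```

Because the grids are fixed rather than optimized, this sketch tends to underestimate the true MIC; a production implementation would use a dedicated MIC library.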

The random forest regression model in S4 is {f(x, Θ_k)}, wherein k indexes the trees and Θ_k is a random variable. Each tree outputs a numerical value, and the average over the trees is the prediction of the random forest regression model. 10-fold cross-validation is performed on the training set to tune the parameters of the random forest regression model. Calculating err_OOB1 comprises:

S4.1, drawing n_tree bootstrap sample sets from the data of the variables retained by MIC(X^j, y) ≥ v_mic, each bootstrap sample set containing two-thirds of the corresponding data;

S4.2, growing an unpruned regression tree from each bootstrap sample set, wherein m_tree predictor variables are sampled at random and the optimal split point is selected among these variables of X;

S4.3, calculating the predicted value by aggregating the predictions of the n_tree trees;

S4.4, calculating err_OOB1:

$$err_{OOB1} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}$$

wherein $\hat{y}_i$ is the predicted value.
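The out-of-bag idea in S4 can be illustrated without any library: each tree is trained on a bootstrap sample, and every sample is scored only by the trees that did not see it. The sketch below is a toy stand-in, not the patent's model. It uses one-split trees (stumps) instead of unpruned trees, draws n samples with replacement (which leaves roughly 63% unique samples in each bag, close to the two-thirds mentioned above), picks the random feature subset once per tree, and assumes the mean squared error as the out-of-bag error. The names `fit_stump` and `oob_error` are hypothetical.

```python
import random


def fit_stump(X, y, feature_ids):
    """Fit a one-split regression tree using only the given feature indices."""
    best = None
    for j in feature_ids:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= cut]
            right = [t for row, t in zip(X, y) if row[j] > cut]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((t - ml) ** 2 for t in left)
                   + sum((t - mr) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, cut, ml, mr)
    if best is None:  # no valid split: fall back to the mean
        mean = sum(y) / len(y)
        return lambda row: mean
    _, j, cut, ml, mr = best
    return lambda row: ml if row[j] <= cut else mr


def oob_error(X, y, n_tree=50, m_tree=1, seed=0):
    """Mean squared error of out-of-bag predictions of a bagged stump ensemble."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_tree):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(bag)
        features = rng.sample(range(m), m_tree)     # random subspace
        tree = fit_stump([X[i] for i in bag], [y[i] for i in bag], features)
        for i in range(n):
            if i not in in_bag:                     # score unseen samples only
                sums[i] += tree(X[i])
                counts[i] += 1
    scored = [i for i in range(n) if counts[i] > 0]
    return sum((y[i] - sums[i] / counts[i]) ** 2 for i in scored) / len(scored)
```

The out-of-bag error behaves like a built-in validation score, which is why it can steer the threshold search without a separate hold-out set.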

S7 comprises the following steps:

S7.1, for each decision tree t_k, inputting its out-of-bag data and obtaining the mean square error OOB_MSE_k between the predicted values and the true values:

$$OOB\_MSE_k = \frac{1}{n_k^{OOB}}\sum_{i=1}^{n_k^{OOB}}\left(y_i - \hat{y}_i^{k}\right)^{2}$$

wherein $n_k^{OOB}$ is the number of out-of-bag samples of the k-th decision tree and $\hat{y}_i^{k}$ is the predicted value of the k-th decision tree;

S7.2, removing the variable X_j from the out-of-bag data and, using the remaining variables, calculating the mean square error OOB_MSE_{k,j} between the j-th predicted values and the true values:

$$OOB\_MSE_{k,j} = \frac{1}{n_k^{OOB}}\sum_{i=1}^{n_k^{OOB}}\left(y_i - \hat{y}_{i,j}^{k}\right)^{2}$$

S7.3, the effect of X_j on the prediction mean square error of decision tree t_k is:

$$IMP_k^{j} = OOB\_MSE_{k,j} - OOB\_MSE_k$$

S7.4, traversing each decision tree of the random forest regression model to obtain the effect of X_j on the mean square error of all decision trees; the importance IMP^j of X_j is:

$$IMP^{j} = \frac{1}{K}\sum_{k=1}^{K} IMP_k^{j}$$

wherein K is the number of decision trees in the random forest.

The secondary screening in S8 retains the variables satisfying:

$$IMP^{j} \ge v_{RF}$$
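Once the per-tree out-of-bag errors are available, the importance score of S7 and the screening of S8 reduce to simple arithmetic. The sketch below assumes those errors have already been computed and are passed in as plain lists. It takes the importance of a variable as the average increase in out-of-bag mean square error when that variable is removed, which is one reading of S7.3 and S7.4; the function names are hypothetical.

```python
def variable_importance(oob_mse, oob_mse_without):
    """Average increase in OOB error per variable.

    oob_mse:         length-K list, baseline OOB MSE of each tree k
    oob_mse_without: K x m nested list, OOB MSE of tree k with variable j removed
    Returns a length-m list of importance scores.
    """
    n_trees = len(oob_mse)
    n_vars = len(oob_mse_without[0])
    return [
        sum(oob_mse_without[k][j] - oob_mse[k] for k in range(n_trees)) / n_trees
        for j in range(n_vars)
    ]


def secondary_screen(importance, v_rf):
    """Keep variable indices with importance >= v_rf, most important first."""
    ranked = sorted(range(len(importance)),
                    key=lambda j: importance[j], reverse=True)
    return [j for j in ranked if importance[j] >= v_rf]
```

A variable whose removal leaves the error unchanged gets an importance of zero and is the first to be dropped as v_RF rises.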

S11 comprises the following steps:

S11.1, defining a monitoring window for model performance, the window size being initialized to Mw;

S11.2, for each sample, calculating the prediction error of the random forest regression model according to:

$$err = \left|y_u - \hat{y}_u\right|$$

wherein err is the prediction error, y_u is the true nitrogen oxide value of the power plant, and $\hat{y}_u$ is the value predicted by the model;

S11.3, calculating the median prediction error within the window, and raising an update alarm if the median error is greater than a threshold δ;

S11.4, updating the training data set with the new data of the latest window, and deleting the oldest samples from the previous training data set;

S11.5, reconstructing the random forest regression model using the new training data set.
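The monitoring loop of S11 can be sketched as follows. Several details are assumptions rather than statements from the patent: the per-sample error is taken as the absolute error, windows are non-overlapping, and on each update the training set drops as many of its oldest samples as it gains new ones. The callables `predict_fn` and `retrain_fn` are hypothetical placeholders for the current model and the model-building routine.

```python
from statistics import median


def monitor_and_update(train, stream, predict_fn, retrain_fn,
                       window=50, delta=3.0):
    """Retrain whenever the median absolute error of a window exceeds delta.

    train:      list of (x, y) samples currently used for training
    stream:     iterable of (x, y) samples arriving in time order
    predict_fn: hypothetical callable, model(x) -> prediction
    retrain_fn: hypothetical callable, builds a new model from a training list
    Returns (current_model, training_set, sample_indices_of_updates).
    """
    model, buffer, updates = predict_fn, [], []
    for t, (x, y) in enumerate(stream, start=1):
        buffer.append((x, y, abs(y - model(x))))
        if len(buffer) == window:
            if median(e for _, _, e in buffer) > delta:
                new = [(bx, by) for bx, by, _ in buffer]
                # Assumed policy: drop the oldest samples, append the newest.
                train = train[len(new):] + new
                model = retrain_fn(train)
                updates.append(t)
            buffer = []
    return model, train, updates
```

Using the median rather than the mean makes the alarm less sensitive to a few isolated spikes inside a window.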

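S12 comprises:

evaluating model performance using performance metrics including the root mean square error RMSE, the mean absolute error MAE and the correlation coefficient R² (coefficient of determination):

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}, \qquad MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}$$

wherein y_i is the true value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the mean of the true values, and n is the number of samples.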
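The three evaluation metrics named in S12 can be computed as in the short sketch below. It assumes that R² means the coefficient of determination, which is the usual reading when it is reported next to RMSE and MAE; the function name is hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R^2 for paired true and predicted values."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mean_y = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)     # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }
```

Lower RMSE and MAE and an R² closer to 1 indicate a better fit.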

S13 comprises the following steps: acquiring real-time data X_q, standardizing the data, predicting the dependent variable y using the random forest regression model corresponding to s_RF, and outputting the prediction result.
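A minimal sketch of the online step in S13 follows. It assumes that a new sample is standardized with the means and standard deviations stored from the training data and that only the variables in the optimal set are passed to the model; neither detail is spelled out in the text, and the names `predict_online`, `means`, `stds`, `selected` and `model` are hypothetical.

```python
def predict_online(x_q, means, stds, selected, model):
    """Standardize one real-time sample and predict with the selected variables.

    x_q:      raw real-time measurement vector
    means:    per-variable means stored from the training data
    stds:     per-variable standard deviations stored from the training data
    selected: indices of the variables in the optimal set
    model:    hypothetical callable mapping a standardized vector to a prediction
    """
    z = [(x_q[j] - means[j]) / stds[j] for j in selected]
    return model(z)
```

Reusing the training statistics keeps the online inputs on the same scale the model was fitted on.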

Compared with the prior art, the invention has the following beneficial effects. The variable-screening thresholds are adjusted automatically using the out-of-bag error of the random forest, which greatly reduces the difficulty of parameter tuning and improves the efficiency of feature selection. Combining the random sample (bootstrap) sampling strategy with the random subspace strategy effectively improves the robustness of the model. The adaptive updating mechanism evaluates the applicability of the model through a monitoring algorithm and effectively guides updating. The method can quickly screen useful variables out of massive historical data, the parameter setting of the prediction model is simple and efficient and is convenient for field personnel to operate, and the model updating strategy facilitates long-term application of the prediction model.

Drawings

FIG. 1 is a technical flow chart of the present invention;

FIG. 2 is a static random forest test set prediction curve;

FIG. 3 is a static random forest test set error curve.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the present invention will be clearly and completely described below, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The technical flow of the invention is shown in FIG. 1. The experimental object of this embodiment is a 1030 MW supercritical and subcritical thermal power plant in China. In the basic process, raw coal is first ground into powder by a coal mill. After drying, the pulverized coal is blown by the primary air through the coal pipes into the once-through boiler, and the secondary air provides sufficient oxygen and tangential momentum for boiler combustion. The burning pulverized coal converts the fuel into heat, which directly turns water into supercritical steam that drives a turbine to generate electricity.

390 auxiliary variables were collected from the thermal power plant. 5184 samples were collected over a period of 72 h at a sampling interval of 50 s, and 5012 samples remained after abnormal values were removed. The unit load varied between 600 MW and 1000 MW, and stable, load-increasing, load-decreasing and other operating conditions were used for performance verification. In time order, the first 3456 samples (about 70% of the historical data) were selected as the training set and the remaining 1556 samples (about 30%) as the test set. All auxiliary variables of the historical data were standardized using the Z-score.

MIC was used for the preliminary selection of important variables. The threshold was first initialized empirically to a small value and then raised in steps of 0.01. A random forest regression model was built, and 10-fold cross-validation on the training data set was used to tune its parameters. The number of iterations was 20, the number of trees 50 and the number of leaf nodes 5. Because the calculation of the OOB error involves some randomness, each iteration was run independently 10 times to obtain the best result. Table 1 gives the OOB error and the training-set RMSE obtained during the main iterations.

TABLE 1 Results for different v_mic values

v_mic	OOB error	RMSE	Number of retained variables
0.1	13.3224	1.8031	376
0.15	12.9798	1.7614	359
0.2	12.7679	1.7954	345
0.25	13.495	1.8544	326
0.3	12.8646	1.7862	312
0.31	12.6948	1.8033	310
0.32	12.5396	1.7969	305
0.33	12.7246	1.7996	299
0.34	15.1845	1.9675	290

As shown in Table 1, when the model yields the minimum OOB error, the RMSE on the training set is also relatively small. During the iteration, the OOB error and the RMSE do not change linearly as the threshold increases. For example, the OOB error at v_mic = 0.25 is much larger than at v_mic = 0.2. This suggests that it is difficult to obtain the optimal variable set for regression modeling from the MIC threshold alone. The main reason is that the MIC considers only the correlation between each auxiliary variable and the target variable and does not fully account for the role of the auxiliary variable within the regression model. This finding is consistent with the previous analysis. Based on these results, this step retains 305 variables (v_mic = 0.32, the minimum OOB error in Table 1).

To eliminate variables that are redundant for regression modeling, in the second step the variables are re-selected according to the variable importance index of the random forest. In this step, the threshold starts at 0.1 and is increased in steps of 0.02. The parameters of the random forest regression model are the same as in the first step. The OOB error calculated in each iteration and the RMSE on the training set are shown in Table 2. As the importance threshold increases, the number of retained important variables decreases, but the OOB error does not decrease linearly, because the performance of the regression model depends not only on the dependence between each auxiliary variable and the target variable but also on the combined effect of all variables in the regression model.

TABLE 2 Results for different v_RF values

According to Table 2, 61 important variables are finally selected to construct the NOx prediction model. Based on this variable selection result, a static random forest regression model can be constructed to predict NOx emissions. FIGS. 2 and 3 show the performance of the static random forest regression model on the test data set: FIG. 2 shows the prediction curve and FIG. 3 the error curve. For points 1 to 1400, the model shows good prediction performance, with small prediction errors and relatively stable error variation. After point 1400, the prediction error trends upward. The slope of the median-error curve, computed from the median error of every 50 points, increases significantly between points 1400 and 1450. Specifically, the median error at point 1450 is 3.403 and the median error at point 1500 is 4.410. This is caused by an inherent drawback of the static model: with the training samples and model parameters fixed, performance degrades as the operating conditions change. Thus, when static models are used at industrial scale, manual maintenance is required, which increases the effort of using the model over the long term.

In the embodiment, Mw is set to 50 and the threshold δ is set empirically to 3.0. When the median error is greater than or equal to δ, the training set is updated and the random forest regression model is reconstructed. According to the median-error curve of the static model, the first update occurs at point 1400 and the second at point 1500. After updating, the median errors are 2.9054 at point 1450, 2.9345 at point 1500 and 2.1542 at point 1550. The results show that the predictions of the updated model agree well, in a linear sense, with the test set, and that RMSE and MAE are significantly reduced. The R² value of the updated model increases slightly, indicating that the strategy improves the model's ability to explain the test data.

To evaluate the proposed variable selection and NOx emission prediction methods, this section builds comparison models based on the MIC variable selection results. Because of the randomness of the random forest, each experiment was run independently 10 times to obtain the best result. Five traditional methods, namely SVR, BP network, PLS, GPR and DT regression, are compared using the selected 61 important variables. Table 3 reports the performance of the static random forest for the different variable sets selected by MIC. The performance of the static random forest fluctuates slightly for v_mic in the range 0.1 to 0.45, with the best result at v_mic = 0.1. When v_mic is increased to 0.5, the size of the variable set decreases markedly, but the random forest performance changes little. The results show that, because of the strong redundancy among the variables, the best set of modeling variables cannot be obtained by MIC alone.

TABLE 3 Static random forest with different variable sets selected by MIC

v_mic	R²	RMSE	MAE	Number of retained variables
0.1	0.9285	3.8745	2.8235	376
0.15	0.927	3.914	2.8872	359
0.2	0.9237	4.0039	2.8839	345
0.25	0.927	3.9154	2.8571	326
0.3	0.9245	3.981	2.934	312
0.31	0.9271	3.9126	2.8803	310
0.32	0.9275	3.9016	2.8327	305
0.33	0.9271	3.9117	2.8525	299
0.34	0.9216	4.0581	2.9755	290
0.4	0.9227	4.0292	2.9362	215
0.45	0.9238	3.9991	2.9271	159
0.5	0.9228	4.0236	2.9571	65
0.55	0.9232	4.0157	2.9617	30

SVR is implemented using the Libsvm toolbox, while the other algorithms (GPR, BP, DT and PLS) are implemented using MATLAB toolboxes. The number of features extracted by PLS is 40, chosen so that the variance contribution is greater than 80%. The GPR and SVR parameters are left at their defaults. The number of leaf nodes of the DT algorithm is set to 9 by the default optimization algorithm of the toolbox. The BP network has a three-layer structure, with "sigmoid" as the hidden-layer activation function and "purelin" as the output-layer activation function. The number of hidden-layer neurons is 90, tuned by five-fold cross-validation. For evaluating prediction performance, RMSE describes the accuracy of the prediction, and the accuracy of the model increases as the RMSE decreases. MAE measures the mean absolute error between the predicted values and the measured values of the experimental data set, and the accuracy of the prediction model increases as the MAE decreases. R² describes the explanatory power of the model: the closer its value is to 1, the stronger the explanatory power.

TABLE 4 Results of different models

Model	R²	RMSE	MAE
Update RF	0.9596	2.9042	2.0185
Static RF	0.9282	3.874	2.9295
GPR	0.8987	4.6006	3.4278
DT	0.8783	5.0419	3.6766
BP	0.8384	5.8107	4.3982
PLS	0.7948	6.5472	4.7151
SVR	0.784	6.7167	4.6343

Table 4 shows the performance of all compared models. Among the static methods, the random forest performs best, followed by GPR, DT and BP, while PLS and SVR perform worst. The results show that the ensemble learning method performs better than traditional single-model methods when processing multi-condition data. The model based on the update strategy performs best overall, with the smallest MAE (2.0185) and RMSE (2.9042), indicating that timely model updates are necessary in long-term industrial applications. The experimental analysis shows that, compared with the traditional methods, the proposed method has higher prediction accuracy and a wider prediction range. Applying the method in practice helps improve the efficiency of NOx emission prediction.

The above embodiments are only for illustrating the technical aspects of the present invention, not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may be modified or some or all of the technical features may be replaced with other technical solutions, which do not depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. The method for predicting the emission of the nitrogen oxides in the coal-fired power plant based on the improved random forest algorithm is characterized by comprising the following steps of:

2. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm of claim 1, wherein S1 comprises:

3. The coal-fired power plant nitrogen oxide emission prediction method based on the improved random forest algorithm as claimed in claim 2, wherein S2 comprises:

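S2.1, calculating the mutual information I(X, y) of X and y:

$$I(X, y) = \iint p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy$$

S2.2, calculating the maximum mutual information MI* on the grid G:

$$MI^{*}(D, h, v) = \max I(D|_G)$$

S2.3, normalizing MI*:

$$M(D)_{h,v} = \frac{MI^{*}(D, h, v)}{\log \min\{h, v\}}$$

wherein M(D)_{h,v} is the normalized MI*;

S2.4, taking the highest normalized value in M(D)_{h,v} as the MIC:

$$MIC = \max\{M(D)_{h,v}\}$$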

4. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm as claimed in claim 3, wherein the variable set obtained in S3 consists of the variables satisfying:

$$MIC(X^{j}, y) \ge v_{mic}$$

wherein X^j is the j-th variable of X.

5. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm as claimed in claim 4, wherein the random forest regression model in S4 is {f(x, Θ_k)}, wherein k indexes the trees and Θ_k is a random variable; each tree outputs a numerical value, and the average over the trees is the prediction of the random forest regression model; 10-fold cross-validation is performed on the training set to tune the parameters of the random forest regression model; and calculating err_OOB1 comprises:

S4.3, calculating the predicted value by aggregating the predictions of the n_tree trees;

S4.4, calculating err_OOB1:

$$err_{OOB1} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}$$

wherein $\hat{y}_i$ is the predicted value.

6. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm of claim 5, wherein S7 comprises:

wherein K is the number of decision trees for constructing a random forest.

7. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm of claim 6, wherein the secondary screening in S8 comprises:

$$IMP^{j} \ge v_{RF}$$

8. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm as claimed in claim 7, wherein S11 comprises:

9. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm of claim 8, wherein S12 comprises:

10. The method for predicting nitrogen oxide emissions in a coal-fired power plant based on an improved random forest algorithm as claimed in claim 9, wherein S13 comprises: acquiring real-time data X_q, standardizing the data, predicting the dependent variable y using the random forest regression model corresponding to s_RF, and outputting the prediction result.