CN115728463A

CN115728463A - Interpretable water quality prediction method based on semi-embedded feature selection

Info

Publication number: CN115728463A
Application number: CN202211533758.2A
Authority: CN
Inventors: 田禹; 孟一鸣; 张浩然; 王树鹏; 孙会航; 黎彦良
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2022-12-01
Filing date: 2022-12-01
Publication date: 2023-03-03

Abstract

An interpretable water quality prediction method based on semi-embedded feature selection, belonging to the technical field of environmental engineering. The invention solves the problem that existing water quality prediction methods cannot quantitatively and accurately explain the influence of each watershed feature on changes in river water quality. The method selects an interpreter matching the machine learning regression algorithm, interprets the model from the SHAP values output by the SHAP package, compares the contribution of each watershed feature to river water quality, identifies the core water quality driving factors, and analyzes the mechanism by which each core driving factor acts on river water quality. The invention is suitable for river water quality prediction.

Description

Interpretable water quality prediction method based on semi-embedded feature selection

Technical Field

The invention belongs to the technical field of environmental engineering, and relates to an environmental system simulation prediction technology.

Background

With the continuous development of human society, large amounts of pollutants enter rivers, and water pollution in river basins is becoming increasingly serious. Water pollution control is now being accelerated and comprehensive management of the basin environment is being implemented. However, the natural and human characteristics of a river basin are complex and changeable, the factors influencing river water quality are numerous and interact in complicated ways, changes in river water quality are difficult to simulate accurately by conventional means, and the influence of each environmental element on river water quality cannot be evaluated quantitatively. Building an accurate, efficient and interpretable river water quality prediction model, predicting the concentration trend of pollutants after diffusion, migration and transformation, and analyzing the contribution of each watershed feature to river water quality change and the mechanism by which it acts are therefore of important guiding significance for coping with water environment crises and for achieving precise basin pollution prevention and control and scientific decision-making.

The diversity of factors influencing water quality, the complexity of the influencing mechanisms and the spatial heterogeneity of watershed features are the key constraints on the application of river water quality models. At present, river water quality prediction models fall mainly into three types: conceptual models, mechanism models and machine learning models. Conceptual models rest mainly on runoff generation and concentration theory, water quality diffusion and transformation theory and the related mathematics; they are simple in structure and easy to build, but their theory is narrow and their functions are outdated. Mechanism models consist of a series of continuity equations and dynamic equations and can dynamically simulate the generation, migration and transformation of pollutants in a water body with high simulation accuracy; however, their parameters are difficult to calibrate, they cannot handle spatial heterogeneity in large basins, and their application scale is limited. Machine learning models are based on probability and statistics; they do not need to consider the complex migration and transformation of pollutants in a water body, address environmental problems computationally, are suitable for water quality prediction at large basin scales, with large sample sizes and over long time scales, and handle nonlinear data well. However, because of the inherent black-box nature of machine learning algorithms, water quality prediction models based on machine learning suffer from poor interpretability. Existing research, in China and abroad, that uses machine learning models to predict river water quality can only simulate and predict water quality change; it cannot quantitatively and accurately explain the contribution of each watershed feature to river water quality change or the mechanism by which it acts.

Disclosure of Invention

The invention aims to solve the problem that existing water quality prediction methods cannot quantitatively and accurately explain the influence of each watershed feature on river water quality change, and provides an interpretable water quality prediction method based on semi-embedded feature selection.

The invention relates to an interpretable water quality prediction method based on semi-embedded feature selection, which specifically comprises the following steps:

step one, collecting a river course distribution map of the target river, determining the basin range covered by the river, acquiring a DEM (Digital Elevation Model) map of the basin range, extracting the basin river network from the DEM map, and dividing the basin into catchment areas;

step two, constructing a target river water quality prediction characteristic system according to influence factors of the target river water quality, acquiring characteristics influencing the target river water quality, collecting historical data of the characteristics of the target river and corresponding water quality data by taking a river basin catchment area as a unit, cleaning the characteristic data and the corresponding water quality data, and generating a preliminary input data set of a machine learning model;

step three, performing autocorrelation analysis on the historical data of every pair of features in the preliminary input data set in turn to obtain strong autocorrelation feature pairs, a strong autocorrelation feature pair being one whose Spearman correlation coefficient is greater than 0.8;

step four, establishing a preliminary machine learning model, calculating, based on this model and using the SHAP interpretation framework, the global feature importance value of every feature in the preliminary input data set, deleting from each strong autocorrelation feature pair the feature with the lower global feature importance value, thereby completing feature selection and obtaining the final input data set of the machine learning model;

step five, dividing the final input data set into a training set and a test set and, using the training set, optimizing the parameters of the machine learning regression model with the coefficient of determination R², the mean squared error MSE, the mean absolute error MAE and the mean absolute percentage error MAPE as model evaluation indices, to obtain the optimal parameter combination;

step six, taking the optimal parameter combination as the machine learning regression model parameters, and utilizing the training set and the test set to respectively train and verify the machine learning regression model to obtain the optimal machine learning regression model;

step seven, predicting the water quality of the target river with the optimal machine learning regression model obtained in step six, using the current feature data of the target river.

Further, in the present invention, after step seven, the method further includes: decomposing the water quality predicted value based on the SHAP interpretation framework of the optimal machine learning regression model, and obtaining SHAP values representing the influence of each feature on the water quality predicted value.

Further, in the present invention, in the first step, the specific method for dividing the catchment area of the drainage basin comprises:

collecting DEM (Digital Elevation Model) data within the target basin range, and performing spatial interpolation on the collected DEM data to obtain a DEM map of the basin range; and extracting the basin river network and dividing the basin into catchment areas from the DEM map of the basin range by using the hydrological analysis tools of ArcGIS.

Further, in step two of the present invention, the historical data of the features of the target river within n years and the corresponding water quality data are collected, with the catchment area as the unit, as follows:

based on the water quality data provided by the statistical yearbooks, environmental bulletins and public data platforms for the target river, collecting watershed feature data with the catchment area as the unit;

the collecting of the watershed feature data comprises: collecting raster data and point element data;

for the raster data, summarizing the data to each catchment area by using the zonal statistics function of the ArcGIS spatial analysis tools;

and for the point element data, performing spatial discretization on the point element data and then summarizing them to each catchment area by zonal statistics.
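The zonal summary of raster data described above can be illustrated with a small, dependency-free sketch. It is not the ArcGIS tool itself: the function name `zonal_mean`, the toy grids and the choice of the mean as the summary statistic are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def zonal_mean(zone_grid, value_grid, nodata=None):
    """Average the raster cell values that fall inside each catchment ID.

    zone_grid and value_grid are equally shaped lists of rows; a zone ID of
    None marks a cell outside every catchment (hypothetical convention).
    """
    cells = defaultdict(list)
    for zone_row, value_row in zip(zone_grid, value_grid):
        for zone_id, value in zip(zone_row, value_row):
            if zone_id is None:
                continue  # cell lies outside the basin
            if nodata is not None and value == nodata:
                continue  # skip missing raster cells
            cells[zone_id].append(value)
    return {zone_id: mean(values) for zone_id, values in cells.items()}


# Toy 3x3 rasters: catchment IDs and a precipitation-like value per cell.
zones = [[1, 1, 2],
         [1, 2, 2],
         [3, 3, 2]]
precip = [[10.0, 12.0, 20.0],
          [14.0, 22.0, 24.0],
          [30.0, 34.0, 26.0]]

catchment_precip = zonal_mean(zones, precip)
```

Each catchment ends up with one value per feature, which is the row format the later steps assume.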

Further, in step two of the present invention, the specific method for cleaning the watershed feature data is as follows:

searching the collected watershed feature data for error values, missing values and outliers and processing them: error values are uniformly deleted; missing values in the watershed feature data are filled by the same-period mean filling method; and, on the basis of the box plot, high outliers in the watershed feature data are replaced with the third quartile and low outliers with the first quartile, thereby cleaning the data.
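The cleaning rules above can be sketched as follows. This is only an illustration under stated assumptions: the 1.5 × IQR whisker is the usual box plot convention and is not specified in the text, and the helper names `fill_missing` and `replace_outliers`, the sample values and the same-period mean of 11.0 are hypothetical.

```python
from statistics import mean, quantiles


def fill_missing(values, period_mean=None):
    """Fill None entries with the same-period mean.

    If no same-period mean is supplied, the mean of the observed values is
    used as a stand-in (an assumption of this sketch).
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed) if period_mean is None else period_mean
    return [fill if v is None else v for v in values]


def replace_outliers(values, whisker=1.5):
    """Replace box plot outliers: high ones with Q3, low ones with Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [q3 if v > high else q1 if v < low else v for v in values]


# Hypothetical COD series with one missing value and one high outlier.
cod = [10.0, 12.0, None, 13.0, 12.0, 100.0]
filled = fill_missing(cod, period_mean=11.0)
cleaned = replace_outliers(filled)
```

Filling before outlier replacement follows the order in which the text lists the operations.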

Further, in step three of the present invention, the specific method for performing autocorrelation analysis on the historical data of every pair of features in the preliminary input data set in turn and obtaining the strong autocorrelation feature pairs is as follows:

calling SciPy in Python, calculating the Spearman correlation coefficient between the features in turn, and screening out the feature pairs whose Spearman correlation coefficient is greater than 0.8 as strong autocorrelation feature pairs.
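The pairwise screening can be sketched without third-party packages as below. The text calls SciPy for this step; the hand-written rank correlation here is a dependency-free stand-in, not the patent's code, and the feature names and values are hypothetical. Only coefficients above the threshold are kept, as stated in the text, so strong negative correlations are not flagged.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average rank of the tied block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def strong_pairs(features, threshold=0.8):
    """List (name_a, name_b, rho) for every pair with rho above threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        rho = spearman(col_a, col_b)
        if rho > threshold:
            pairs.append((name_a, name_b, rho))
    return pairs


# Hypothetical feature columns, one value per catchment record.
features = {
    "precipitation": [50.0, 80.0, 120.0, 160.0, 200.0],
    "river_flow": [5.0, 9.0, 20.0, 41.0, 90.0],
    "gdp": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = strong_pairs(features)
```

In practice `scipy.stats.spearmanr` would replace the two helper functions; the screening loop stays the same.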

Further, in the present invention, in step four, the SHAP interpretation framework includes a plurality of interpreters: one applicable to any machine learning model, one for interpreting tree-based models, and one for interpreting deep learning models.

Further, in the sixth step of the present invention, the specific method for decomposing the water quality predicted value of each water quality prediction sample based on the SHAP framework is as follows:

$$f(x_j) = \phi_0 + \sum_{i=1}^{M} \phi_i^{(j)}$$

wherein $f(x_j)$ is the water quality predicted value of the jth sample; $\phi_0$ is the mean water quality prediction of the machine learning regression model; $M$ is the number of features; and $\phi_i^{(j)}$ is the SHAP value of the ith feature for the jth water quality predicted value. The magnitude of $\phi_i^{(j)}$ represents the degree of contribution of the ith feature to the jth sample, and its sign indicates whether the contribution is positive or negative.
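The additive decomposition described above can be checked numerically on a toy model. The sketch below computes exact Shapley values by enumerating feature orderings, which is feasible only for a handful of features; it is not the SHAP package and not the patent's implementation. The model, the sample and the background data are hypothetical.

```python
from itertools import permutations
from statistics import mean


def model(x):
    """Hypothetical regression model with one additive and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


def coalition_value(x, present, background):
    """Mean prediction when only the features in `present` take the sample's values."""
    return mean([
        model([x[i] if i in present else b[i] for i in range(len(x))])
        for b in background
    ])


def shapley_values(x, background):
    """Exact Shapley values, averaging marginal contributions over all orderings."""
    m = len(x)
    phi = [0.0] * m
    orderings = list(permutations(range(m)))
    for order in orderings:
        present = set()
        previous = coalition_value(x, present, background)
        for i in order:
            present.add(i)
            current = coalition_value(x, present, background)
            phi[i] += (current - previous) / len(orderings)
            previous = current
    return phi


background = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]   # stands in for training data
sample = [3.0, 2.0, 4.0]

base_value = mean([model(b) for b in background])  # mean model prediction
phi = shapley_values(sample, background)
reconstructed = base_value + sum(phi)              # should equal model(sample)
```

The sign of each entry of `phi` gives the direction of that feature's contribution, and the entries sum, together with the base value, to the prediction for the sample.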

The method selects an interpreter matching the machine learning regression algorithm, interprets the model from the SHAP values output by the SHAP package, compares the contribution of each watershed feature to river water quality, identifies the core water quality driving factors, and analyzes the mechanism by which each core driving factor acts on river water quality. Compared with a machine learning prediction model on its own, the interpretability of the water quality prediction model is greatly improved, its capacity for analyzing water quality driving factors is enhanced, and it offers more practical guidance.

The method builds a machine learning water quality prediction model through data cleaning, feature selection, hyperparameter optimization and model verification, and uses the SHAP framework to strengthen the interpretability of the machine learning model, achieving efficient simulation of river water quality, identification of the core water quality driving factors and analysis of their driving mechanisms; it is therefore suitable for water quality simulation and pollution factor analysis in basins of any scale.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a functional block diagram of a method of the present invention;

FIG. 3 is a Spearman correlation heat map of the features;

FIG. 4 is a bar chart of global feature importance;

FIG. 5 is a diagram of the distribution of measured and fitted data for the SVM-based COD concentration prediction model;

FIG. 6 is a SHAP summary plot of the COD prediction model.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

Embodiment 1: this embodiment is described below with reference to FIG. 1 and FIG. 2. The interpretable water quality prediction method based on semi-embedded feature selection in this embodiment specifically includes:

step two, constructing a target river water quality prediction characteristic system according to the influence factors of the target river water quality, acquiring the characteristics influencing the target river water quality, collecting historical data of the target river characteristics and corresponding water quality data by taking a river basin catchment area as a unit, and cleaning the characteristic data and the corresponding water quality data to generate a preliminary input data set of a machine learning model;

step three, performing autocorrelation analysis on the historical data of every pair of features in the preliminary input data set in turn to obtain strong autocorrelation feature pairs, a strong autocorrelation feature pair being one whose Spearman correlation coefficient is greater than 0.8;

for example, suppose there are four features A, B, C and D, which form six feature pairs AB, AC, AD, BC, BD and CD. If autocorrelation analysis gives Spearman correlation coefficients of 0.9, 0.82, 0.3, 0.5, 0.6 and 0.83 for these six pairs respectively, then the pairs AB, AC and CD, whose Spearman correlation coefficients are greater than 0.8, are strong autocorrelation feature pairs;

for example, suppose the global feature importance values calculated with the SHAP interpretation framework for the features of step three are A: 0.92, B: 0.83, C: 0.76 and D: 0.94. For the strong autocorrelation feature pair AB, the global feature importance of A is greater than that of B, so feature A is retained and feature B is removed; similarly, for the pair AC, feature A is retained and feature C is removed; and for the pair CD, feature D is retained and feature C is removed. Feature selection is then complete and features A and D are kept. Any feature that does not belong to a strong autocorrelation feature pair is retained directly;

step five, dividing the final input data set into a training set and a test set and, using the training set, optimizing the parameters of the machine learning regression model with the coefficient of determination R², the mean squared error MSE, the mean absolute error MAE and the mean absolute percentage error MAPE as model evaluation indices, to obtain the optimal parameter combination;

step seven, adopting the optimal machine learning regression model in the step six, and predicting the target river water quality by utilizing the characteristic data of the current target river water quality;
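The worked example for steps three and four above (features A to D) can be reproduced with a few lines of Python. This is a minimal sketch of the selection rule only; the function name `select_features` is hypothetical, and breaking ties by removing the second feature of a pair is an assumption, since the text does not cover equal importance values.

```python
def select_features(importance, strong_pairs):
    """Drop the less important feature of every strong autocorrelation pair.

    importance maps feature name -> global feature importance value;
    strong_pairs lists the pairs whose Spearman coefficient exceeds 0.8.
    Features outside every pair are kept unchanged.
    """
    removed = set()
    for first, second in strong_pairs:
        # On a tie the second feature is dropped (assumption of this sketch).
        weaker = first if importance[first] < importance[second] else second
        removed.add(weaker)
    return [name for name in importance if name not in removed]


# Values taken from the example in the text.
importance = {"A": 0.92, "B": 0.83, "C": 0.76, "D": 0.94}
pairs_above_threshold = [("A", "B"), ("A", "C"), ("C", "D")]

kept = select_features(importance, pairs_above_threshold)
```

With the example values, B and C are removed and A and D remain, matching the outcome stated in the text.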

in step two, watershed feature data are collected, with the catchment area as the unit, based on the water quality data provided by the statistical yearbooks, environmental bulletins and public data platforms for the target river;

in the decomposition of the water quality predicted value, each SHAP value is that of the ith feature for the jth water quality predicted value.

The invention first applies semi-embedded feature selection and then an interpretation framework: the former improves model accuracy, the latter enhances model interpretability, and together they improve model performance.

The method is a fast, accurate, interpretable and broadly applicable way to build water quality prediction models. The feature selection method based on SHAP values efficiently and automatically simplifies the water quality prediction index system, greatly improving model speed and accuracy; building the river water quality prediction model with a machine learning regression algorithm overcomes the spatial heterogeneity of watershed features in large-basin water quality prediction and extends the application scale of water quality prediction models; and the interpretability enhancement method overcomes the poor interpretability of machine learning regression algorithms in river water quality prediction and is applicable to any machine learning regression model. The method helps to identify accurately the key factors controlling river water quality and is of great significance for precise pollution control and scientific decision-making in river basins.

The specific embodiment is as follows:

With reference to FIG. 2 to FIG. 6, the construction of a river COD concentration prediction model for the Minjiang Dongyuan basin in Sichuan Province is taken as an example. The specific implementation process is as follows:

(1) Determining a basin range and dividing a catchment area;

The Minjiang River is a first-order tributary of the Yangtze River with a total length of 658.8 km, and is a key tributary in the campaign to protect and restore the Yangtze River. This case takes the Minjiang Dongyuan as an example; the Minjiang Dongyuan basin has an area of 5.16 km² and lies in the eastern region of Sichuan. 250 m DEM data for Sichuan Province were collected from the Resource and Environment Science and Data Platform of the Chinese Academy of Sciences, and the hydrological analysis tools of ArcGIS were used to extract the basin river network and divide the catchment areas from the basin DEM map.

(2) Collecting and cleaning watershed feature data, and constructing an input data set;

Considering the various natural and human factors that influence river COD concentration, watershed features are selected from the aspects of climate and meteorology, soil and landform, land use, landscape pattern, society and economy, hydrology and water quality to build a water quality prediction index system. Data are collected, with the catchment area as the unit, from sources such as the 'Sichuan Statistical Yearbook' and the 'Sichuan Province Environmental Statistics Bulletin'; the raster data are summarized to each catchment area with the zonal statistics function of ArcGIS, and the point element data are spatially discretized and then summarized to each catchment area by zonal statistics. Error values are deleted, and missing values are filled with the mean of the feature or of the water quality data for the same period in the same basin. Outliers are corrected on the basis of the box plot, with high and low outliers replaced by the third and first quartiles respectively, and the cleaned data are arranged in the input format required by the machine learning regression model to generate the input data set.

(3) Selecting features based on SHAP values;

The corr('spearman') function in Python is called to calculate the Spearman correlation coefficient between every pair of features, identify highly autocorrelated features and screen out the feature pairs with a Spearman correlation coefficient greater than 0.8, as shown in FIG. 3. In this case, given the measured COD concentration data available for the river, a Support Vector Machine (SVM) algorithm is selected to build the river water quality prediction model. The SVM is a classic machine learning algorithm that can achieve good predictions even when the number of features exceeds the number of samples. SVR in the sklearn.svm module in Python is called and a regression model is built directly with default parameters. The SHAP model interpretation package in Python is then called, the SHAP value of each feature is calculated with the kernel explainer as the interpreter, and the global feature importance ranking of the features is generated, as shown in FIG. 4. For each feature pair with a Spearman correlation coefficient greater than 0.8, the global feature importance of the two features is compared and the feature with the higher importance is retained. For example, FIG. 3 shows that the correlation coefficient between precipitation and river flow is 0.82, so the two form a strong autocorrelation feature pair, while FIG. 4 shows that the global feature importance of precipitation (0.6) is higher than that of river flow (0.26); precipitation is therefore retained and the river flow index is deleted. All features with a Spearman correlation coefficient greater than 0.8 are compared and screened in this way to complete feature selection.

(4) Training, adjusting and verifying a machine learning regression model;

The train_test_split function in sklearn is called to divide the whole input data set into a training set and a test set in a ratio of 3:

TABLE 1 SVM parameter ranges and optimal parameter values

The SVM regression model is trained with the optimal parameter combination and evaluated on the training set and the test set respectively, the evaluation indices being the coefficient of determination R², the mean squared error MSE, the mean absolute error MAE and the mean absolute percentage error MAPE. The evaluation indices of the SVM regression model are shown in Table 2. The SVM regression model predicts COD concentration with high accuracy: R² reaches 0.982 on the training set and 0.78 on the test set.

TABLE 2 COD concentration prediction model evaluation based on SVM

The data to be predicted are substituted into the SVM water quality prediction model built with the optimal parameter combination in step (4) to predict the COD concentration. FIG. 5 shows the distribution of the COD concentrations simulated by the model against the measured data; the simulated values lie close to the measured values, showing that the SVM-based COD prediction model simulates river COD concentration stably.
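The four evaluation indices used in step (4) follow their standard definitions and can be computed as in the sketch below. The function name `regression_metrics` and the measured and simulated values are hypothetical; they are not the data behind Table 2.

```python
def regression_metrics(y_true, y_pred):
    """Return R², MSE, MAE and MAPE for measured and simulated values.

    MAPE is returned as a fraction (0.13 means 13 %) and assumes that no
    measured value is zero.
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mae": sum(abs(r) for r in residuals) / n,
        "mape": sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
    }


# Hypothetical measured and simulated COD concentrations (mg/L).
measured = [2.0, 4.0, 6.0, 8.0]
simulated = [2.5, 3.5, 6.5, 7.5]
scores = regression_metrics(measured, simulated)
```

Computing the same dictionary on the training set and on the test set gives the two columns that a table such as Table 2 would report.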

(6) Explaining the water quality model based on the SHAP framework;

The SHAP model interpretation package in Python is called, the kernel explainer is selected as the interpreter, the SVM model built with the optimal parameter combination in step (4) is interpreted, the SHAP value of each feature is calculated, and the SHAP summary plot of the COD prediction model is generated, as shown in FIG. 6. As the figure shows, the three watershed features with the greatest influence on river COD concentration within the basin are the urban permanent population, GDP and TDE (the proportion of land used for urban and rural construction), and all three core water quality driving factors are positively correlated with COD concentration; the month and the investment in the treatment of old industrial pollution sources are negatively correlated with river COD concentration. Pollution from urban production and daily life is therefore the main cause of the increase in COD concentration in the study area: the stronger the urbanization and the faster the economic development, the poorer the water quality of the area; COD concentration decreases in winter, and investment in industrial pollution treatment effectively limits the rise in COD concentration. These conclusions suggest that city managers should start with domestic sewage treatment, improve sewage treatment efficiency, and reflect seasonal differences in COD management and control.

The invention provides a feature selection method based on SHAP values: the Spearman correlation coefficient is used to identify strong autocorrelation feature pairs, and the global feature importance of each feature, derived from its SHAP values, serves as the basis for deciding which feature of a strong autocorrelation pair to retain. This solves the loss of model speed and accuracy caused by multicollinearity among redundant features in basin modeling. Compared with the currently widespread feature selection approach of autocorrelation analysis followed by subjective screening, the method takes into account, through SHAP values, the global contribution of each watershed feature to the water quality prediction model, and therefore has a more scientific basis for retaining or discarding strongly autocorrelated features; comparing and selecting features requires no subjective human judgment, so the degree of automation is higher; and once features are selected, irrelevant and redundant features are removed, feature dimensionality is controlled, training time is shortened, and the generalization capability of the model and the understanding of its features are enhanced.

The invention also provides an interpretability enhancement method for river water quality prediction models. To address the poor interpretability of black-box models, the SHAP framework is used to identify the core water quality driving factors and to analyze their contribution to water quality and the mechanism by which they act. A machine learning model on its own can only report, through correlation analysis between features and the prediction target, how strongly the two are correlated; it cannot explain the quantitative response relationship between a feature and the prediction target or judge how strongly the feature contributes to it, which limits the value of the water quality prediction model as guidance for practical work. Compared with the qualitative post hoc attribution analysis, based on correlation analysis, available for a machine learning model on its own, the method quantitatively describes the overall magnitude and direction of the contribution of watershed features to water quality, explains how the model makes predictions on the basis of the whole feature space, the model structure and the parameters, identifies the core driving factors of water quality, and greatly improves interpretability.

although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims

1. An interpretable water quality prediction method based on semi-embedded feature selection is characterized by comprising the following steps:

step one, collecting a river course distribution map of the target river, determining the basin range covered by the river, acquiring a DEM (Digital Elevation Model) map of the basin range, extracting the basin river network from the DEM map, and dividing the basin into catchment areas;

step four, establishing a preliminary machine learning model, calculating, based on this model and using the SHAP interpretation framework, the global feature importance value of every feature in the preliminary input data set, deleting from each strong autocorrelation feature pair the feature with the lower global feature importance value, thereby completing feature selection and obtaining the final input data set of the machine learning model;

2. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, further comprising, after step seven: decomposing the water quality predicted value based on the SHAP interpretation framework of the optimal machine learning regression model, and obtaining SHAP values representing the influence of each feature on the water quality predicted value.

3. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in the step one, a specific method for dividing a basin catchment area is as follows:

collecting DEM data in a target drainage basin range, and performing spatial interpolation processing on the collected DEM data to obtain a DEM image in the drainage basin range; and extracting a river network of the drainage basin based on the DEM image of the drainage basin range by utilizing a hydrological analysis tool of ArcGIS, and dividing a drainage basin catchment area.

4. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in step two, the specific method for cleaning the watershed feature data is as follows:

5. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in the third step, autocorrelation analysis is performed on historical data of every two features in the preliminary input data set in sequence, and a specific method for acquiring a strong autocorrelation feature pair is as follows:

calling SciPy in Python, calculating in turn the Spearman correlation coefficient between the features in the preliminary input data set, and screening out the feature pairs whose Spearman correlation coefficient is greater than 0.8 as strong autocorrelation feature pairs.

6. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in step four, the SHAP interpretation framework comprises a plurality of interpreters: one applicable to any machine learning model, one for interpreting tree-based models, and one for interpreting deep learning models.

7. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in the sixth step, the specific method for decomposing the water quality predicted value based on the SHAP interpretation framework of the optimal machine learning regression model and obtaining the SHAP values representing the influence of each feature on the water quality predicted value is as follows:

wherein each SHAP value is that of the ith feature for the jth water quality predicted value.

8. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 7, wherein the magnitude of the SHAP value of the water quality predicted value represents the degree of contribution of the ith feature to the jth sample, and its sign indicates whether the contribution is positive or negative.