CN115728463A - Interpretable water quality prediction method based on semi-embedded feature selection - Google Patents

Info

Publication number
CN115728463A
Authority
CN
China
Prior art keywords
water quality
river
model
machine learning
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211533758.2A
Other languages
Chinese (zh)
Inventor
田禹
孟一鸣
张浩然
王树鹏
孙会航
黎彦良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202211533758.2A
Publication of CN115728463A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A20/00: Water conservation; Efficient water supply; Efficient water use
    • Y02A20/152: Water filtration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An interpretable water quality prediction method based on semi-embedded feature selection, belonging to the technical field of environmental engineering. The invention solves the problem that existing water quality prediction methods cannot quantitatively and accurately explain the influence of each watershed feature on changes in river water quality. The method selects an explainer that matches the machine learning regression algorithm, interprets the model on the basis of the SHAP values output by the SHAP package, compares the contribution of each watershed feature to river water quality, identifies the core water quality driving factors, and analyzes the mechanism by which each core driving factor acts on river water quality. The invention is suitable for river water quality prediction.

Description

Interpretable water quality prediction method based on semi-embedded feature selection
Technical Field
The invention belongs to the technical field of environmental engineering, and relates to an environmental system simulation prediction technology.
Background
With the continuous development of human society, large quantities of pollutants enter rivers, and water pollution in river basins is becoming increasingly serious. Water pollution control is now being accelerated and comprehensive management of the basin environment is being implemented. However, the natural and human characteristics of a river basin are complex and variable, and the factors influencing river water quality are numerous and interact in complicated ways, so conventional means can neither simulate changes in river water quality accurately nor quantitatively evaluate the influence of each environmental element on river water quality. Building an accurate, efficient and interpretable river water quality prediction model, predicting the concentration trends of pollutants after diffusion, migration and transformation, and analyzing the contribution and action mechanism of the various watershed features to changes in river water quality are of great guiding significance for responding to water environment crises and for achieving precise watershed pollution prevention and control and scientific decision-making.
The diversity of water quality influencing factors, the complexity of the influencing mechanisms and the spatial heterogeneity of watershed features are the key constraints on the application of river water quality models. At present, river water quality prediction models fall into three main types: conceptual models, mechanism models and machine learning models. Conceptual models mainly draw on runoff generation and concentration theory, water quality diffusion and transformation theory and related mathematical theory; they are simple in structure and easy to build, but rest on a single theory and offer limited functionality. Mechanism models consist of a series of continuity equations and dynamic equations and can dynamically simulate the generation, migration and transformation of pollutants in a water body with high simulation accuracy, but their parameters are difficult to calibrate, they cannot handle spatial heterogeneity in large watersheds, and their application scale is limited. Machine learning models are based on probability and statistics; they do not need to consider the complex migration and transformation of pollutants in a water body, they approach the environmental problem computationally, they are suitable for water quality prediction at large watershed scales, with large sample sizes and over long time scales, and they handle nonlinear data well. However, because of the inherent black-box nature of machine learning algorithms, water quality prediction models based on machine learning suffer from poor interpretability. In current research at home and abroad on river water quality prediction with machine learning models, only the simulation and prediction of water quality changes can be achieved; the contribution and action mechanism of each watershed feature to changes in river water quality cannot be explained quantitatively and accurately.
Disclosure of Invention
The invention aims to solve the problem that existing water quality prediction methods cannot quantitatively and accurately explain the influence of each watershed feature on changes in river water quality, and provides an interpretable water quality prediction method based on semi-embedded feature selection.
The invention relates to an interpretable water quality prediction method based on semi-embedded feature selection, which specifically comprises the following steps:
step one, collecting a river course distribution map of a target river, determining the watershed range covered by the river, acquiring a DEM (Digital Elevation Model) map of the watershed range, extracting the watershed river network from the DEM map, and dividing the watershed into catchment areas;
step two, constructing a target river water quality prediction feature system according to the factors influencing the target river water quality, acquiring the features influencing the target river water quality, collecting historical data of the features of the target river and the corresponding water quality data by taking the catchment area as a unit, cleaning the feature data and the corresponding water quality data, and generating a preliminary input data set for a machine learning model;
step three, performing autocorrelation analysis on the historical data of every two features in the preliminary input data set in sequence to obtain strong autocorrelation feature pairs; the Spearman correlation coefficient of a strong autocorrelation feature pair is greater than 0.8;
step four, establishing a preliminary machine learning model, calculating the global feature importance values of all features in the preliminary input data set by using the SHAP interpretation framework based on the model, deleting from each strong autocorrelation feature pair the feature with the lower global feature importance value, completing feature selection, and acquiring the final input data set of the machine learning model;
step five, dividing the final input data set into a training set and a test set, and, using the training set, optimizing the parameters of the machine learning regression model with the coefficient of determination $r^2$, the mean square error (MSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE) as model evaluation indexes, to obtain the optimal parameter combination;
step six, taking the optimal parameter combination as the parameters of the machine learning regression model, and using the training set and the test set to train and verify the machine learning regression model respectively, to obtain the optimal machine learning regression model;
step seven, predicting the target river water quality with the optimal machine learning regression model of step six, using the current feature data of the target river.
Further, in the present invention, after step seven, the method further includes: decomposing the water quality predicted value based on the SHAP interpretation framework of the optimal machine learning regression model, and acquiring SHAP values representing the influence of each feature on the water quality predicted value.
Further, in the present invention, in step one, the specific method for dividing the watershed into catchment areas is as follows:
collecting DEM (Digital Elevation Model) data within the target watershed range, and performing spatial interpolation on the collected DEM data to obtain a DEM map of the watershed range; and, based on the DEM map of the watershed range, extracting the watershed river network and dividing the watershed into catchment areas by using the hydrological analysis tools of ArcGIS.
Further, in step two of the present invention, collecting the historical data of the features of the target river within n years and the corresponding water quality data by taking the catchment area as a unit is performed as follows:
collecting watershed feature data by taking the catchment area as a unit, based on the statistical yearbooks, environmental bulletins and public data platforms covering the target river, together with the water quality data they provide;
the collecting of the watershed feature data comprises: collecting raster data and point element data;
for the raster data, summarizing the data to each catchment area by using the zonal statistics function of the ArcGIS spatial analysis tools;
and for the point element data, performing spatial discretization on the point element data and then summarizing them to each catchment area through zonal statistics.
Further, in step two of the present invention, the specific method for cleaning the watershed feature data is as follows:
searching the collected watershed feature data for error values, missing values and outliers and processing them: error values are uniformly deleted; missing values in the watershed feature data are filled by the same-period mean filling method; and, on the basis of a box plot, high outliers in the watershed feature data are replaced with the third quartile and low outliers with the first quartile, thereby completing data cleaning.
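The cleaning rules above can be illustrated with a minimal, dependency-free Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the grouping key for "same period" is taken to be a simple label such as a month, and the outlier fences use the conventional box-plot rule of 1.5 times the interquartile range, which the text does not specify.

```python
from statistics import mean, quantiles


def fill_missing_with_period_mean(periods, values):
    """Replace None with the mean of the observed values of the same period.

    `periods` and `values` are parallel lists; the period label (for example
    a month) is an assumed grouping key. A period with no observed value at
    all would raise KeyError and needs separate handling.
    """
    observed = {}
    for period, value in zip(periods, values):
        if value is not None:
            observed.setdefault(period, []).append(value)
    return [
        mean(observed[period]) if value is None else value
        for period, value in zip(periods, values)
    ]


def replace_outliers_with_quartiles(values, whisker=1.5):
    """Replace high outliers with Q3 and low outliers with Q1.

    Outliers are values outside [Q1 - whisker*IQR, Q3 + whisker*IQR];
    whisker=1.5 is the usual box-plot convention (an assumption here).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    return [
        q3 if v > high_fence else q1 if v < low_fence else v
        for v in values
    ]


filled = fill_missing_with_period_mean(
    ["Jan", "Jan", "Jan", "Feb", "Feb"], [4.0, None, 6.0, 1.0, None]
)
cleaned = replace_outliers_with_quartiles([10, 12, 11, 13, 12, 100, 11, -50])
```

Error values are assumed to have been deleted (set to missing) before these two functions are applied, so that they are filled in the same way as other missing values.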
Further, in step three of the present invention, the specific method for performing autocorrelation analysis on the historical data of every two features in the preliminary input data set in sequence and obtaining the strong autocorrelation feature pairs is as follows:
calling SciPy in Python, calculating the Spearman correlation coefficients between the features in sequence, and screening out the feature pairs whose Spearman correlation coefficient is greater than 0.8 as strong autocorrelation feature pairs.
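As a rough illustration of this screening step, the sketch below computes the Spearman coefficient as the Pearson correlation of average ranks and keeps the pairs above the 0.8 threshold. It is written without third-party dependencies so that it is self-contained; in practice `scipy.stats.spearmanr` would normally be used instead. The function names and the sample feature names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strong_autocorrelation_pairs(features, threshold=0.8):
    """Return (name_a, name_b, rho) for every pair with rho > threshold.

    The comparison follows the text literally (coefficient greater than 0.8);
    screening on the absolute value would be a variation, not what is stated.
    """
    pairs = []
    for name_a, name_b in combinations(features, 2):
        rho = spearman(features[name_a], features[name_b])
        if rho > threshold:
            pairs.append((name_a, name_b, rho))
    return pairs


# Hypothetical sample data for three watershed features.
sample_features = {
    "precipitation": [1, 2, 3, 4, 5],
    "river_flow": [2, 4, 5, 9, 20],
    "gdp": [5, 3, 4, 1, 2],
}
strong_pairs = strong_autocorrelation_pairs(sample_features)
```

Here only the precipitation and river flow series are monotonically related, so they form the single strong autocorrelation feature pair in the sample.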
Further, in the present invention, in step four, the SHAP interpretation framework includes a plurality of explainers, for example the Kernel explainer, which is applicable to any machine learning model, the Tree explainer, which is used for interpreting tree-based models, and the Deep explainer, which is used for interpreting deep learning models.
Further, in step six of the present invention, the specific method for decomposing the water quality predicted value of each water quality prediction sample based on the SHAP framework is as follows:
$$f(x_j) = \bar{y} + \sum_{i=1}^{M} \phi_i^{(j)}$$
where $f(x_j)$ is the water quality predicted value of the jth sample; $\bar{y}$ is the mean water quality predicted value of the machine learning regression model; $M$ is the number of features; $\phi_i^{(j)}$ is the SHAP value of the ith feature for the jth water quality predicted value; its magnitude $|\phi_i^{(j)}|$ represents the degree of contribution of the ith feature to the jth sample, and its sign indicates whether the contribution is positive or negative.
The method selects an explainer that matches the machine learning regression algorithm, interprets the model on the basis of the SHAP values output by the SHAP package, compares the contribution of each watershed feature to river water quality, identifies the core water quality driving factors, and analyzes the mechanism by which each core driving factor acts on river water quality. Compared with a machine learning prediction model used on its own, the interpretability of the water quality prediction model is greatly improved, its ability to analyze water quality driving factors is strengthened, and it offers more practical guidance.
The method constructs a machine learning water quality prediction model through data cleaning, feature selection, hyperparameter optimization and model verification, and uses the SHAP framework to strengthen the interpretability of the machine learning model, so as to achieve efficient simulation of river water quality, identification of the core water quality driving factors and analysis of the driving mechanism. It is therefore suitable for water quality simulation and pollution factor analysis in watersheds of any scale.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a functional block diagram of a method of the present invention;
FIG. 3 is a Spearman correlation heat map of the features;
FIG. 4 is a histogram of global feature importance;
FIG. 5 is a diagram of the distribution of measured and fitted data for the SVM-based COD concentration prediction model;
FIG. 6 is a SHAP summary plot of the COD prediction model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
Embodiment one: the present embodiment is described below with reference to FIG. 1 and FIG. 2. The interpretable water quality prediction method based on semi-embedded feature selection in the present embodiment specifically includes:
step one, collecting a river course distribution map of a target river, determining the watershed range covered by the river, acquiring a DEM (Digital Elevation Model) map of the watershed range, extracting the watershed river network from the DEM map, and dividing the watershed into catchment areas;
step two, constructing a target river water quality prediction feature system according to the factors influencing the target river water quality, acquiring the features influencing the target river water quality, collecting historical data of the features of the target river and the corresponding water quality data by taking the catchment area as a unit, and cleaning the feature data and the corresponding water quality data to generate a preliminary input data set for a machine learning model;
step three, performing autocorrelation analysis on the historical data of every two features in the preliminary input data set in sequence to obtain strong autocorrelation feature pairs; the Spearman correlation coefficient of a strong autocorrelation feature pair is greater than 0.8;
for example, four features A, B, C and D give six feature pair combinations, AB, AC, AD, BC, BD and CD; if autocorrelation analysis finds the Spearman correlation coefficients of these six pairs to be 0.9, 0.82, 0.3, 0.5, 0.6 and 0.83 respectively, then the pairs AB, AC and CD, whose Spearman correlation coefficients are greater than 0.8, are strong autocorrelation feature pairs;
step four, establishing a preliminary machine learning model, calculating the global feature importance values of all features in the preliminary input data set by using the SHAP interpretation framework based on the model, deleting from each strong autocorrelation feature pair the feature with the lower global feature importance value, completing feature selection, and acquiring the final input data set of the machine learning model;
for example, suppose the global feature importance values of the features in step three, calculated with the SHAP interpretation framework, are A: 0.92, B: 0.83, C: 0.76 and D: 0.94. For the strong autocorrelation feature pair AB, the global feature importance of A is greater than that of B, so feature A is retained and feature B is removed; similarly, for the strong autocorrelation feature pair AC, feature A is retained and feature C is removed; for the strong autocorrelation feature pair CD, feature D is retained and feature C is removed. Feature selection is then complete and features A and D are kept; any feature that does not belong to a strong autocorrelation feature pair is retained directly;
step five, dividing the final input data set into a training set and a test set, and, using the training set, optimizing the parameters of the machine learning regression model with the coefficient of determination $r^2$, the mean square error (MSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE) as model evaluation indexes, to obtain the optimal parameter combination;
step six, taking the optimal parameter combination as the parameters of the machine learning regression model, and using the training set and the test set to train and verify the machine learning regression model respectively, to obtain the optimal machine learning regression model;
step seven, predicting the target river water quality with the optimal machine learning regression model of step six, using the current feature data of the target river;
further, in the present invention, after step seven, the method further includes: decomposing the water quality predicted value based on the SHAP interpretation framework of the optimal machine learning regression model, and acquiring SHAP values representing the influence of each feature on the water quality predicted value.
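The selection rule of steps three and four can be sketched in a few lines of Python using the A, B, C, D example given above. The function name is hypothetical, and the tie-breaking behaviour (dropping the second feature of a pair when both importances are equal) is an assumption, since the text does not address ties.

```python
def select_features(features, strong_pairs, importance):
    """Drop the less important feature of every strongly autocorrelated pair.

    `features` is an ordered list of names, `strong_pairs` a list of
    (name_a, name_b) tuples and `importance` maps each name to its global
    feature importance. Each pair is judged independently, as in the text.
    On a tie the second feature of the pair is dropped (an assumption).
    """
    removed = set()
    for name_a, name_b in strong_pairs:
        weaker = name_a if importance[name_a] < importance[name_b] else name_b
        removed.add(weaker)
    # Features that appear in no strong pair are never added to `removed`,
    # so they are retained directly.
    return [name for name in features if name not in removed]


# Values taken from the worked example in steps three and four.
example_features = ["A", "B", "C", "D"]
example_pairs = [("A", "B"), ("A", "C"), ("C", "D")]
example_importance = {"A": 0.92, "B": 0.83, "C": 0.76, "D": 0.94}

kept = select_features(example_features, example_pairs, example_importance)
print(kept)
```

With the example values, B is removed because of pair AB and C is removed because of pairs AC and CD, which leaves features A and D.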
Further, in the present invention, in step one, the specific method for dividing the watershed into catchment areas is as follows:
collecting DEM (Digital Elevation Model) data within the target watershed range, and performing spatial interpolation on the collected DEM data to obtain a DEM map of the watershed range; and, based on the DEM map of the watershed range, extracting the watershed river network and dividing the watershed into catchment areas by using the hydrological analysis tools of ArcGIS.
Further, in step two of the present invention, collecting the historical data of the features of the target river within n years and the corresponding water quality data by taking the catchment area as a unit is performed as follows:
collecting watershed feature data by taking the catchment area as a unit, based on the statistical yearbooks, environmental bulletins and public data platforms covering the target river, together with the water quality data they provide;
the collecting of the watershed feature data comprises: collecting raster data and point element data;
for the raster data, summarizing the data to each catchment area by using the zonal statistics function of the ArcGIS spatial analysis tools;
and for the point element data, performing spatial discretization on the point element data and then summarizing them to each catchment area through zonal statistics.
Further, in step two of the present invention, the specific method for cleaning the watershed feature data is as follows:
searching the collected watershed feature data for error values, missing values and outliers and processing them: error values are uniformly deleted; missing values in the watershed feature data are filled by the same-period mean filling method; and, on the basis of a box plot, high outliers in the watershed feature data are replaced with the third quartile and low outliers with the first quartile, thereby completing data cleaning.
Further, in step three of the present invention, the specific method for performing autocorrelation analysis on the historical data of every two features in the preliminary input data set in sequence and obtaining the strong autocorrelation feature pairs is as follows:
calling SciPy in Python, calculating the Spearman correlation coefficients between the features in sequence, and screening out the feature pairs whose Spearman correlation coefficient is greater than 0.8 as strong autocorrelation feature pairs.
Further, in the present invention, in step four, the SHAP interpretation framework includes a plurality of explainers, for example the Kernel explainer, which is applicable to any machine learning model, the Tree explainer, which is used for interpreting tree-based models, and the Deep explainer, which is used for interpreting deep learning models.
Further, in step six of the present invention, the specific method for decomposing the water quality predicted value of each water quality prediction sample based on the SHAP framework is as follows:
$$f(x_j) = \bar{y} + \sum_{i=1}^{M} \phi_i^{(j)}$$
where $f(x_j)$ is the water quality predicted value of the jth sample; $\bar{y}$ is the mean water quality predicted value of the machine learning regression model; $M$ is the number of features; $\phi_i^{(j)}$ is the SHAP value of the ith feature for the jth water quality predicted value; its magnitude $|\phi_i^{(j)}|$ represents the degree of contribution of the ith feature to the jth sample, and its sign indicates whether the contribution is positive or negative.
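The additive decomposition described above (a mean predicted value plus one SHAP value per feature) can be checked on a small case by computing Shapley values exactly. The sketch below enumerates every feature coalition and averages the model output over a background data set for the features left out of the coalition. This brute-force calculation is only practical for a handful of features; sampling-based explainers are generally understood to approximate the same quantity. The toy model, the background rows and all function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def coalition_value(model, x, background, subset):
    """Mean model output when features in `subset` come from x and the
    remaining features come from each background row in turn."""
    total = 0.0
    for row in background:
        mixed = [x[i] if i in subset else row[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)


def exact_shapley_values(model, x, background):
    """Return (base_value, [phi_1, ..., phi_M]) for one sample x."""
    m = len(x)
    phi = []
    for i in range(m):
        others = [k for k in range(m) if k != i]
        contribution = 0.0
        for size in range(m):
            # Shapley weight of a coalition of this size that excludes i.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                without_i = set(subset)
                with_i = without_i | {i}
                contribution += weight * (
                    coalition_value(model, x, background, with_i)
                    - coalition_value(model, x, background, without_i)
                )
        phi.append(contribution)
    base_value = coalition_value(model, x, background, set())
    return base_value, phi


def toy_model(features):
    """Hypothetical stand-in for a fitted water quality regressor."""
    return 2.0 * features[0] + features[1] * features[2]


background_rows = [[1.0, 0.0, 1.0], [3.0, 2.0, 0.0], [2.0, 1.0, 2.0]]
sample = [4.0, 3.0, 1.0]

base_value, shap_values = exact_shapley_values(toy_model, sample, background_rows)
reconstructed = base_value + sum(shap_values)
```

The reconstructed value equals the model prediction for the sample, which is the additivity property that the decomposition relies on; the sign of each entry of `shap_values` gives the direction of that feature's contribution.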
The invention first adopts semi-embedded feature selection and then applies an interpretation framework; the former improves model accuracy and the latter enhances model interpretability, and together they improve model performance.
The method is a rapid, accurate, interpretable and widely applicable method for constructing water quality prediction models. The feature selection method based on SHAP values simplifies the water quality prediction index system efficiently and automatically, which improves the running speed and accuracy of the model. A river water quality prediction model is constructed with a machine learning regression algorithm, which addresses the spatial heterogeneity of watershed features in large-scale watershed water quality prediction and extends the application scale of the water quality prediction model. The interpretability strengthening method overcomes the poor interpretability of machine learning regression algorithms in river water quality prediction and is applicable to any machine learning regression model. The method helps to identify accurately the key control factors of river water quality and is of great significance for precise pollution control and scientific decision-making in a river basin.
The specific embodiment is as follows:
With reference to FIG. 2 to FIG. 6, the construction of a river COD concentration prediction model for the Minjiang Dongyuan (eastern source of the Minjiang River) watershed in Sichuan Province is taken as an example, and the specific implementation process is as follows:
(1) Determining the watershed range and dividing catchment areas;
The Minjiang River is a first-order tributary of the Yangtze River with a total length of 658.8 km, and is a key tributary in the campaign to protect and restore the Yangtze River. This case takes the Minjiang Dongyuan as an example; the Minjiang Dongyuan watershed has an area of 5.16 km² and lies in eastern Sichuan. DEM data of Sichuan Province at 250 m resolution are collected from the resource and environment science and data platform of the Chinese Academy of Sciences, and, based on the watershed DEM map, the watershed river network is extracted and the watershed is divided into catchment areas by using the hydrological analysis tools of ArcGIS.
(2) Collecting and cleaning watershed feature data, and constructing an input data set;
Considering the various natural and human factors that influence the COD concentration of the river, watershed features are selected from the aspects of climate and meteorology, soil and landform, land use, landscape pattern, social economy, hydrology and water quality, and a water quality prediction index system is constructed. Data are collected by taking the catchment area as a unit from sources such as the "Sichuan Province Statistical Yearbook" and the "Sichuan Province Environmental Statistics Bulletin"; raster data are summarized to each catchment area by using the zonal statistics function of ArcGIS, and point element data are spatially discretized and then summarized to each catchment area through zonal statistics. Error values are deleted, and missing values are filled with the same-period mean of the feature or of the water quality data for the catchment area concerned. Outliers are corrected on the basis of a box plot, with high outliers and low outliers replaced by the third quartile and the first quartile respectively, and the cleaned data are arranged according to the input data format required by the machine learning regression model to generate the input data set.
(3) Selecting features based on SHAP values;
The `corr('spearman')` function in Python is called to calculate the Spearman correlation coefficient between every two features; as shown in FIG. 3, highly autocorrelated features are identified and the feature pairs with a Spearman correlation coefficient greater than 0.8 are screened out. In this case, given the available measured COD concentration data for the river, a Support Vector Machine (SVM) algorithm is selected to construct the river water quality prediction model. The SVM is a classic machine learning algorithm that can achieve good prediction performance even when the number of features exceeds the number of samples. The SVR regressor in the `sklearn.svm` module in Python is called, and a regression model is constructed directly with default parameters. The SHAP model interpretation package in Python is then called, the SHAP value of each feature is calculated with the Kernel explainer as the explainer, and the global feature importance ranking of the features is generated, as shown in FIG. 4. For each feature pair with a Spearman correlation coefficient greater than 0.8, the global feature importance values of the two features are compared and the feature with the higher value is retained. For example, FIG. 3 shows that the correlation coefficient between precipitation and river flow is 0.82, so the two form a strong autocorrelation feature pair, while FIG. 4 shows that the global feature importance of precipitation (0.6) is higher than that of river flow (0.26); precipitation is therefore retained and the river flow index is deleted. All feature pairs with a Spearman correlation coefficient greater than 0.8 are compared and screened in this way to complete feature selection.
(4) Training, tuning and verifying the machine learning regression model;
The `train_test_split` function in sklearn is called to divide the whole input data set into a training set and a test set in the proportion of 3:
TABLE 1 SVM parameter ranges and optimal parameter values
Figure BDA0003975416820000081
The SVM regression model is trained with the optimal parameter combination and evaluated on the training set and the test set respectively. The evaluation indexes are the coefficient of determination $r^2$, the mean square error (MSE), the mean absolute error (MAE) and the mean absolute percentage error (MAPE). The evaluation indexes of the SVM regression model are shown in Table 2. The SVM regression model predicts COD concentration with high accuracy: the training-set $r^2$ reaches 0.982 and the test-set $r^2$ reaches 0.78.
TABLE 2 Evaluation of the SVM-based COD concentration prediction model
Figure BDA0003975416820000082
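The four evaluation indexes named above can be computed as in the following dependency-free sketch. The function name and the sample values are hypothetical and are not taken from Table 2; MAPE is returned as a fraction rather than a percentage, and it is undefined when any measured value is zero.

```python
def regression_metrics(y_true, y_pred):
    """Return r2, MSE, MAE and MAPE for paired measured/predicted values.

    MAPE is a fraction (multiply by 100 for percent) and requires every
    measured value to be non-zero; r2 requires y_true not to be constant.
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)           # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mse": ss_res / n,
        "mae": sum(abs(r) for r in residuals) / n,
        "mape": sum(abs(r / t) for r, t in zip(residuals, y_true)) / n,
    }


# Hypothetical measured and predicted COD concentrations (mg/L).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(measured, predicted)
```

Computing the same dictionary once on the training set and once on the test set gives the two columns of figures that an evaluation table of this kind reports.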
The data for prediction are substituted into the SVM water quality prediction model constructed with the optimal parameter combination in step (4) to predict the COD concentration. FIG. 5 shows the distribution of the COD concentrations simulated by the model against the measured data; the simulated values lie close to the measured values, and the SVM-based COD prediction model simulates the COD concentration of the river stably.
(6) Explaining the water quality model based on the SHAP framework;
The SHAP model interpretation package in Python is called, the Kernel explainer is selected as the explainer, the SVM model with the optimal parameter combination constructed in step (4) is interpreted, the SHAP value of each feature is calculated, and the SHAP summary plot of the COD prediction model is generated, as shown in FIG. 6. As the figure shows, the three watershed features with the greatest influence on the COD concentration of the river within the watershed are the urban permanent population, GDP and TDE (the proportion of urban and rural construction land), and all three core water quality driving factors are positively correlated with the COD concentration; the month and the investment in the treatment of old industrial pollution sources are negatively correlated with the river COD concentration. Pollution generated by urban production and daily life is therefore the main cause of the increase in COD concentration in the study area: the stronger the urbanization and the faster the economic development, the poorer the water quality of the area; the COD concentration decreases in winter; and investment in industrial pollution treatment effectively limits the increase in COD concentration. These conclusions encourage city managers to start with domestic sewage treatment, improve sewage treatment efficiency, and take seasonal differences into account in COD management and control.
The invention provides a feature selection method based on SHAP values. The Spearman correlation coefficient is used to identify strongly autocorrelated feature pairs, and the global feature importance of each feature, generated from its SHAP values, serves as the basis for deciding which feature of each strongly autocorrelated pair to retain. This solves the problem of reduced model running speed and accuracy caused by multicollinearity among redundant features in watershed modeling. Compared with the currently widespread feature selection approach of autocorrelation analysis combined with subjective screening, the method takes into account the global contribution of watershed features to the water quality prediction model on the basis of SHAP values, and therefore has a more scientific basis for retaining and selecting strongly autocorrelated features; the feature selection omits manual subjective judgment and is more highly automated; and after feature selection, irrelevant and redundant features are removed, the feature dimension is controlled, training time is shortened, and the generalization ability of the model and the understanding of its features are enhanced.
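The selection rule described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the global importance of a feature is taken as the mean absolute SHAP value, the 0.8 threshold is applied to the signed coefficient as in the claims, and the helper names and the sample data are hypothetical. In practice the coefficient would come from SciPy and the SHAP values from the SHAP package.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the ranks.

    Assumes neither column is constant (otherwise the variance is zero).
    """
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / sqrt(var_a * var_b)


def select_features(data, shap_values, threshold=0.8):
    """Drop the less important feature of each strongly correlated pair."""
    importance = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in shap_values.items()
    }
    dropped = set()
    for f1, f2 in combinations(data, 2):
        if spearman(data[f1], data[f2]) > threshold:
            dropped.add(f1 if importance[f1] < importance[f2] else f2)
    return [name for name in data if name not in dropped]


# Hypothetical feature histories and per-sample SHAP values.
data = {
    "population": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gdp": [2.0, 4.0, 5.0, 8.0, 10.0],
    "rainfall": [3.0, 1.0, 4.0, 1.0, 5.0],
}
shap_values = {
    "population": [0.5, -0.4, 0.6, 0.3, -0.5],
    "gdp": [0.1, -0.2, 0.1, 0.1, -0.1],
    "rainfall": [0.2, 0.2, -0.1, 0.3, 0.1],
}
selected = select_features(data, shap_values)
print(selected)
```

Here `population` and `gdp` form the only strongly correlated pair, and `gdp` is removed because its mean absolute SHAP value is lower.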
The invention also provides a method for strengthening the interpretability of the river water quality prediction model. To address the poor interpretability of black-box models, the SHAP framework is used to identify the core water quality driving factors and to analyze their contribution to water quality and their mechanism of action. A machine learning model on its own can only report, through correlation analysis between features and the predicted index, whether the correlation is strong or weak; it cannot explain the quantitative response relationship between a feature and the predicted index or judge how strongly the feature contributes to it, which limits the guidance the water quality prediction model can give to practical work. Compared with qualitative post hoc attribution based on correlation analysis of a single machine learning model, the method quantitatively describes the magnitude and direction of the overall contribution of watershed features to water quality, explains how the model makes predictions based on the entire feature space, the model structure, and the parameters, identifies the core driving factors of water quality, and greatly improves interpretability.
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that features described in different dependent claims and herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (8)

1. An interpretable water quality prediction method based on semi-embedded feature selection is characterized by comprising the following steps:
step one, collecting a river course distribution map of a target river, determining the river basin range covered by the river, obtaining a DEM (digital elevation model) map of the river basin range, extracting the river basin and the river network from the DEM map, and dividing the river basin into catchment areas;
step two, constructing a target river water quality prediction feature system according to the factors influencing the target river water quality, acquiring the features influencing the target river water quality, collecting historical data of the target river features and the corresponding water quality data with the river basin catchment area as the unit, and cleaning the feature data and the corresponding water quality data to generate a preliminary input data set for a machine learning model;
step three, performing autocorrelation analysis on the historical data of every two features in the preliminary input data set in sequence to obtain strong autocorrelation feature pairs, the Spearman correlation coefficient of a strong autocorrelation feature pair being greater than 0.8;
step four, establishing a preliminary machine learning model, calculating the global feature importance values of all features in the preliminary input data set using the SHAP interpretation framework based on that model, deleting the feature with the lower global feature importance value in each strong autocorrelation feature pair, completing feature selection, and obtaining the final input data set of the machine learning model;
step five, dividing the final input data set into a training set and a test set, and, using the training set and taking the coefficient of determination R², the mean square error MSE, the mean absolute error MAE, and the mean absolute percentage error MAPE as model evaluation indexes, optimizing the parameters of the machine learning regression model to obtain an optimal parameter combination;
step six, taking the optimal parameter combination as the parameters of the machine learning regression model, and using the training set and the test set to train and verify the machine learning regression model respectively, to obtain the optimal machine learning regression model;
step seven, predicting the target river water quality from the current feature data of the target river using the optimal machine learning regression model obtained in step six.
2. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, further comprising, after step seven: decomposing the predicted water quality value based on the SHAP interpretation framework of the optimal machine learning regression model, and obtaining the SHAP value representing the influence of each feature on the predicted water quality value.
3. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in the step one, a specific method for dividing a basin catchment area is as follows:
collecting DEM data in a target drainage basin range, and performing spatial interpolation processing on the collected DEM data to obtain a DEM image in the drainage basin range; and extracting a river network of the drainage basin based on the DEM image of the drainage basin range by utilizing a hydrological analysis tool of ArcGIS, and dividing a drainage basin catchment area.
4. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in the second step, the specific method for cleaning the watershed feature data is as follows:
searching the collected watershed feature data for error values, missing values, and outliers and processing them: error values are uniformly deleted, missing values are filled using the contemporaneous mean filling method, and, on the basis of a box plot, high outliers are replaced with the third quartile and low outliers with the first quartile, thereby completing data cleaning.
5. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in the third step, autocorrelation analysis is performed on historical data of every two features in the preliminary input data set in sequence, and a specific method for acquiring a strong autocorrelation feature pair is as follows:
invoking SciPy in Python, sequentially calculating the Spearman correlation coefficient between the features in the preliminary input data set, and screening out feature pairs with a Spearman correlation coefficient greater than 0.8 as strong autocorrelation feature pairs.
6. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in step four, the SHAP interpretation framework comprises a plurality of explainers: an explainer applicable to any machine learning model, an explainer for interpreting tree-based models, and an explainer for interpreting deep learning models.
7. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 1, wherein in the sixth step, the specific method for decomposing the predicted water quality value based on the SHAP interpretation framework of the optimal machine learning regression model and obtaining the SHAP value representing the influence of each feature on the predicted water quality value is as follows:
$$f(x_j) = \bar{y} + \sum_{i=1}^{M} \phi_i^{(j)}$$
wherein $f(x_j)$ is the predicted water quality value of the j-th sample; $\bar{y}$ is the mean of the water quality values predicted by the machine learning regression model; $M$ is the number of features; and $\phi_i^{(j)}$ is the SHAP value of the i-th feature for the j-th predicted water quality value.
8. The interpretable water quality prediction method based on semi-embedded feature selection according to claim 7, wherein the magnitude of the SHAP value $\phi_i^{(j)}$ of the predicted water quality value represents the degree of contribution of the i-th feature to the j-th sample, and its sign indicates whether the contribution is positive or negative.
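The decomposition described in claims 7 and 8 can be illustrated with a short sketch: a prediction is rebuilt as the mean model prediction plus the SHAP values of all features, and each SHAP value is split into a magnitude (degree of contribution) and a sign (direction). The function name `decompose_prediction`, the feature names, and the numbers are hypothetical and are not taken from the embodiment.

```python
def decompose_prediction(base_value, shap_row, feature_names):
    """Rebuild one prediction from its SHAP values and list each contribution.

    base_value: mean prediction of the regression model.
    shap_row:   SHAP values of all features for one sample.
    """
    prediction = base_value + sum(shap_row)
    contributions = []
    for name, phi in zip(feature_names, shap_row):
        direction = "positive" if phi > 0 else "negative" if phi < 0 else "none"
        contributions.append((name, abs(phi), direction))
    return prediction, contributions


# Hypothetical values in mg/L of COD, for illustration only.
feature_names = ["population", "gdp", "treatment_investment"]
prediction, contributions = decompose_prediction(18.0, [2.5, 1.0, -1.5], feature_names)
print(prediction)
for name, magnitude, direction in contributions:
    print(name, magnitude, direction)
```

In this example the sample is predicted at 20.0, with population contributing most strongly in the positive direction and treatment investment contributing in the negative direction.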
CN202211533758.2A 2022-12-01 2022-12-01 Interpretable water quality prediction method based on semi-embedded feature selection Pending CN115728463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211533758.2A CN115728463A (en) 2022-12-01 2022-12-01 Interpretable water quality prediction method based on semi-embedded feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211533758.2A CN115728463A (en) 2022-12-01 2022-12-01 Interpretable water quality prediction method based on semi-embedded feature selection

Publications (1)

Publication Number Publication Date
CN115728463A true CN115728463A (en) 2023-03-03

Family

ID=85301230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211533758.2A Pending CN115728463A (en) 2022-12-01 2022-12-01 Interpretable water quality prediction method based on semi-embedded feature selection

Country Status (1)

Country Link
CN (1) CN115728463A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117094123A (en) * 2023-07-12 2023-11-21 广东省科学院生态环境与土壤研究所 Soil carbon fixation driving force identification method, device and medium based on interpretable model
CN117875769A (en) * 2023-12-29 2024-04-12 广州大学 Analysis method and system for influence intensity of visual element on riding environment evaluation

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120081544A (en) * 2011-01-11 2012-07-19 한국전자통신연구원 Method for measuring total phosphorus using general elements of water quality
CN106525762A (en) * 2016-11-07 2017-03-22 航天恒星科技有限公司 Water quality monitoring method and water quality monitoring device based on adaptive model
CN109242203A (en) * 2018-09-30 2019-01-18 中冶华天南京工程技术有限公司 A kind of water quality prediction of river and water quality impact factors assessment method
CN109614570A (en) * 2018-11-15 2019-04-12 北京英视睿达科技有限公司 Predict the method and device of section water quality parameter data
CN110070144A (en) * 2019-04-30 2019-07-30 云南师范大学 A kind of lake water quality prediction technique and system
AU2020103356A4 (en) * 2020-02-26 2021-01-21 Chinese Research Academy Of Environmental Sciences Method and device for building river diatom bloom warning model
CN112381292A (en) * 2020-11-13 2021-02-19 福州大学 River water quality prediction method considering space-time correlation and meteorological factors
CN112784395A (en) * 2019-11-08 2021-05-11 天津大学 Method for predicting and simulating total phosphorus concentration of river water body
CN113433086A (en) * 2021-06-28 2021-09-24 淮阴工学院 Method for predicting water quality COD (chemical oxygen demand) by combining fuzzy neural network with spectrophotometry
CN114219370A (en) * 2022-01-29 2022-03-22 哈尔滨工业大学 Social network-based multidimensional influence factor weight analysis method for river water quality
CN114220549A (en) * 2021-12-16 2022-03-22 无锡中盾科技有限公司 Effective physiological feature selection and medical causal reasoning method based on interpretable machine learning
CN114267422A (en) * 2021-12-24 2022-04-01 中国电建集团中南勘测设计研究院有限公司 Method and system for predicting surface water quality parameters, computer equipment and storage medium
CN114325454A (en) * 2021-12-30 2022-04-12 东软睿驰汽车技术(沈阳)有限公司 Method, device, equipment and medium for determining influence of multiple characteristics on battery health degree
CN114330929A (en) * 2022-01-21 2022-04-12 广州虎牙科技有限公司 Content contribution degree evaluation method and device, electronic equipment and readable storage medium
CN114330904A (en) * 2021-12-31 2022-04-12 广东长天思源环保科技股份有限公司 River water quality prediction method based on characteristic engineering
WO2022101515A1 (en) * 2020-11-16 2022-05-19 UMNAI Limited Method for an explainable autoencoder and an explainable generative adversarial network
KR102409155B1 (en) * 2021-10-08 2022-06-16 주식회사 다올 System for Predicting Groundwater Level based on LSTM
CN115242458A (en) * 2022-06-28 2022-10-25 南京邮电大学 Interpretable method of 1D-CNN network traffic classification model based on SHAP
CN115390450A (en) * 2022-08-30 2022-11-25 中国矿业大学 Coal flotation intelligent dosing method capable of being interpreted in segmented mode

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
丁彦蕊;孙小妹;王文超;孙培冬;陈蓓;: "基于支持向量机的太湖入湖河流水质影响因素的研究", 水资源与水工程学报, no. 05, 15 October 2011 (2011-10-15) *
何同弟;李见为;黄鸿;: "PSO优选参数的SVR水质评价方法", 计算机工程与应用, no. 24, 21 August 2010 (2010-08-21) *
刘芳芳;黄升谋;陈伟亚;杨永涛;: "GIS在水污染控制规划中的辅助应用――以湖北省某市河网为例", 绿色科技, no. 07, 25 July 2013 (2013-07-25) *
宓云;王晓萍;金鑫;: "基于机器学习的水质COD预测方法", 浙江大学学报(工学版), no. 05, 15 May 2008 (2008-05-15) *
贾永利: "基于集成学习与可解释性方法的空气污染物浓度预测模型研究", 《中国优秀硕士学位论文全文数据库 (工程科技Ⅰ辑)》, 15 September 2021 (2021-09-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117094123A (en) * 2023-07-12 2023-11-21 广东省科学院生态环境与土壤研究所 Soil carbon fixation driving force identification method, device and medium based on interpretable model
CN117094123B (en) * 2023-07-12 2024-06-11 广东省科学院生态环境与土壤研究所 Soil carbon fixation driving force identification method, device and medium based on interpretable model
CN117875769A (en) * 2023-12-29 2024-04-12 广州大学 Analysis method and system for influence intensity of visual element on riding environment evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination