CN109214592B

CN109214592B - Multi-model-fused deep learning air quality prediction method

Info

Publication number: CN109214592B
Application number: CN201811210072.3A
Authority: CN
Inventors: 陈红倩; 陈晚林
Original assignee: Beijing Technology and Business University
Current assignee: Dragon Totem Technology Hefei Co ltd
Priority date: 2018-10-17
Filing date: 2018-10-17
Publication date: 2022-03-08
Anticipated expiration: 2038-10-17
Also published as: CN109214592A

Abstract

The invention discloses a multi-model-fused deep learning air quality prediction method comprising the following steps: first, acquiring historical air quality data and meteorological data; second, performing missing value interpolation and normalization on the historical air quality data; third, constructing a seq2seq-based deep learning model from the historical air quality data to serve as a single-factor prediction model; fourth, constructing a seq2seq deep learning model based on a double-attention mechanism from the historical air quality data and meteorological data to serve as a multi-factor prediction model; and fifth, feeding the prediction result of the single-factor prediction model, the prediction result of the multi-factor prediction model and the current meteorological data into an xgboost tree for regression calculation to obtain the final predicted value of the air quality data.

Description

Multi-model-fused deep learning air quality prediction method

Technical Field

The invention belongs to the interdisciplinary field of computer science and environmental science, and particularly relates to a multi-model-fused deep learning air quality prediction method.

Background

In recent years, social progress and rapid industrial development have caused many serious environmental pollution problems. Air pollution in particular has continued to intensify: the quantity of inhalable particles in the air has increased, and the quality of the air on which people depend has gradually declined. Among the inhalable particles, PM2.5 and PM10 are the most serious; they not only strongly affect air quality but also cause great harm to the human body. To reduce the harm of air pollutants to the human body, analyzing and predicting air quality is of important practical significance.

Many researchers have contributed to the problem of air quality prediction in recent years. They have combined the latest ideas and new technologies to carry out quantitative research, making it possible to identify the causes of air pollution and the basic trends of air quality. Air quality, however, is affected by many factors and involves considerable uncertainty and complexity. Conventional artificial neural networks, back propagation (BP) neural networks and regression prediction methods do not predict accurately enough and have difficulty learning the underlying rules.

Wangxin et al. used a long short-term memory (LSTM) recurrent neural network to predict fault time series; Dalikey et al. used machine learning to dynamically forecast the short-term concentration of PM2.5; Nepheline et al. used a genetic algorithm and a back propagation (BP) neural network for air quality prediction; Vancouxiang et al. established a spatio-temporal air pollution forecasting model using a recurrent neural network (RNN); Zhengyi et al. established a PM2.5 prediction model based on a deep belief network; Zhangtian et al. used a BP neural network for air quality prediction.

When a monitoring station acquires air quality data, the data generally contain many missing values because of equipment failure, network delay, congestion and similar problems. In conventional data preprocessing, missing values are mostly handled by deletion, mean filling, nearest-neighbor filling and similar methods. The filling precision of these methods is poor, and because air quality data carry time-sequence information, the precision of the sequence data used in training suffers. When a traditional single model is chosen as the prediction model, problems such as overfitting or poor precision can occur. Compared with machine learning models such as artificial neural networks and traditional support vector machines, prediction models based on deep learning perform better at capturing data features and in prediction precision. Conventional machine learning methods have difficulty extracting the time-sequence information in sequence data, whereas deep learning can extract time-sequence features and thereby improve prediction accuracy.

Disclosure of Invention

In view of this, the invention provides a multi-model fusion deep learning air quality prediction method, which can improve the prediction accuracy of air quality data.

The technical scheme for realizing the invention is as follows:

A multi-model-fused deep learning air quality prediction method comprises the following steps:

step one, acquiring historical air quality data and meteorological data;

step two, performing missing value interpolation and normalization on the historical air quality data;

step three, constructing a deep learning model based on seq2seq (Sequence to Sequence) as a single-factor prediction model by using the historical air quality data;

step four, constructing a seq2seq deep learning model based on a double-attention mechanism as a multi-factor prediction model by using the historical air quality data and the meteorological data;

step five, inputting the historical air quality data into the single-factor prediction model to obtain the prediction result of the single-factor prediction model, inputting the historical air quality data and the meteorological data into the multi-factor prediction model to obtain the prediction result of the multi-factor prediction model, and feeding the prediction result of the single-factor prediction model, the prediction result of the multi-factor prediction model and the current meteorological data into an xgboost tree for regression calculation to obtain the final predicted value of the air quality data.

Further, the air quality data includes concentration data of one or more of PM2.5, PM10, NO₂, CO, O₃ and SO₂.

Further, when the air quality data comprise two or more types of concentration data, in step five the types are predicted one at a time: for each type, the prediction result of the single-factor prediction model, the prediction result of the multi-factor prediction model and the current meteorological data are fed into the xgboost tree for regression calculation to obtain the predicted value of that type of air quality data.

Further, meteorological data includes temperature, barometric pressure, humidity, wind direction, wind speed, and weather indicators.

Further, the missing value interpolation is performed by using an expectation maximization method.

Furthermore, the input layer of the single-factor prediction model is historical air quality data; the hidden layer is a seq2seq model with an internal encoding-decoding structure whose encoding and decoding units are long short-term memory (LSTM) neural networks; and the output layer is the air quality prediction value.

Furthermore, the input layer of the multi-factor prediction model is historical air quality data and meteorological data; the hidden layer is a seq2seq model with a double-attention mechanism and an internal encoding-decoding (encoder-decoder) structure, in which one attention layer is added before the encoder and one before the decoder; and the output layer is the air quality data prediction value.

Beneficial effects:

1. In constructing the single-factor prediction model and the multi-factor prediction model, the method analyzes the relevant factors influencing air quality and then extracts the important features as the input of the models.

2. For model selection, the invention uses a seq2seq model with a double-attention mechanism, which extends the recurrent neural network (RNN), handles sequence data well, and can predict future trends from the rules the neural network learns by itself.

3. To address the complexity of air quality prediction, the method fuses the prediction results of multiple models, which cooperate to find the optimal solution, so that the optimal predicted value is obtained and the air quality is predicted accurately.

4. Taking the particular properties of air quality data into account, the method uses the expectation maximization (EM) algorithm to fill missing values in the air quality data, which fits the distribution of the data well.

Drawings

FIG. 1 is a schematic diagram of the model of the present invention.

FIG. 2 is a diagram of a single-factor prediction model according to the present invention.

FIG. 3 is a diagram of a multi-factor predictive model of the present invention.

Detailed Description

The invention is described in detail below by way of example with reference to the accompanying drawings.

The invention requires modeling of air data, and selecting the feature data before modeling is very important. Air quality data in particular are subject to both temporal and spatial influences. Temporal influences include daily commuting and working hours, the weekends of each week, the beginning and end of each month, and the seasons (the degree of air pollution differs between summer and winter). Spatial influences, to which every monitoring station is subject, include weather factors, wind direction and the humidity of surrounding areas. Models are therefore established separately in the time dimension and the space dimension. First, a single-factor model is established in the time dimension: a single factor (such as PM2.5) is selected as the input of the single-factor model, the trend over time is found, and a local rule is obtained. Then a multi-factor model is established in the spatial dimension: multiple factors (weather conditions of surrounding areas and the like) are selected as the input of the multi-factor model to obtain a global rule. Finally, a fusion model takes the outputs of the single-factor model and the multi-factor model as input and calculates the weights in combination with the weather conditions at the current time to obtain the optimal predicted value. As shown in fig. 1, the specific steps are as follows:

Step one, acquiring historical data, comprising hourly PM2.5, PM10, NO₂, CO, O₃ and SO₂ concentrations, temperature, air pressure, humidity, wind direction, wind speed and weather index, with the following units: temperature, degrees Celsius (°C); air pressure, hectopascals (hPa); humidity, percent (%); wind direction, angle measured clockwise from north; wind speed, m/s; weather index: sunny, rainy, cloudy, etc.

Step two, preprocessing historical data

2.1, because the air quality data are obtained through outdoor sensors, under special or human-caused circumstances the sensors may fail to acquire data accurately or may stop collecting data, so the air quality data contain missing values. Therefore, before air quality prediction is performed, missing values in the historical data need to be interpolated or completed.

Missing values are completed using the expectation maximization (EM) method. The EM algorithm estimates the maximum likelihood or maximum a posteriori parameters of a probabilistic model containing hidden variables. Here the air quality data are treated as the hidden variables: the likelihood is maximized to obtain an expression for the model parameters, the parameters are then solved iteratively on the real data, and finally the missing values are calculated.

2.2, normalizing the historical data:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where X is the original data, X' is the normalized data, X_min is the minimum value in the original data, and X_max is the maximum value in the original data.

2.3, dividing the processed historical data into a training set (80%) and a test set (20%).
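The normalization and 80/20 division of steps 2.2 and 2.3 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names are hypothetical, scaling is done per column, and the split is assumed to be chronological (the text gives only the proportion).

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column to [0, 1] with X' = (X - X_min) / (X_max - X_min)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    # Guard against division by zero when a column is constant.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (data - col_min) / span, col_min, col_max

def split_train_test(data, train_fraction=0.8):
    """Assumption of this sketch: the first 80% of rows (oldest) form the training set."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

# Toy readings: five hourly rows, two columns (e.g. PM2.5 and PM10).
readings = np.array([[35.0, 60.0], [80.0, 120.0], [50.0, 90.0],
                     [20.0, 40.0], [65.0, 100.0]])
scaled, col_min, col_max = min_max_normalize(readings)
train, test = split_train_test(scaled)
```

Keeping `col_min` and `col_max` makes it possible to map predictions back to physical units afterwards.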

The invention uses the expectation maximization (EM) algorithm to interpolate the missing values; the EM algorithm fits time-series data well, and the interpolated values are very close to the distribution of the original data.

Step three, constructing a single-factor prediction model: the model structure is divided into an input layer, a hidden layer and an output layer. The input layer takes the six concentration values PM2.5, PM10, NO₂, CO, O₃ and SO₂ as training data, and the dimension of the input layer is K × 6. The output layer gives the predicted concentration values of PM2.5, PM10, NO₂, CO, O₃ and SO₂, and the dimension of the output layer is 6 × T.

Here K represents the length of the selected time series, that is, K time steps of air quality data (PM2.5, PM10, NO₂, CO, O₃, SO₂), for example K = 72 h (hours). T represents the future time span to be predicted, that is, the air quality data for the future T time steps are predicted, for example T = 48 h (hours).
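To make the K and T dimensions concrete, the following sketch cuts an hourly series into training pairs of K input hours and T target hours. It is illustrative only: the helper name `make_windows` is hypothetical, the stride of one hour is an assumption, and the targets are stored as T × 6 (the transpose of the 6 × T layout stated above).

```python
import numpy as np

def make_windows(series, k=72, t=48):
    """Slice an (N, F) hourly series into inputs (M, k, F) and targets (M, t, F)."""
    series = np.asarray(series, dtype=float)
    count = series.shape[0] - k - t + 1
    if count <= 0:
        raise ValueError("series is too short for the requested window lengths")
    # Each sample uses k consecutive hours as input and the next t hours as target.
    inputs = np.stack([series[i:i + k] for i in range(count)])
    targets = np.stack([series[i + k:i + k + t] for i in range(count)])
    return inputs, targets

# Toy data: 200 hours of six pollutant concentrations.
hourly = np.arange(200 * 6, dtype=float).reshape(200, 6)
x, y = make_windows(hourly)
```

With 200 hours of data, K = 72 and T = 48, this yields 81 overlapping samples.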

Hidden layer structure of the single-factor model: the hidden layer adopts a Sequence to Sequence (seq2seq) model with an encoding-decoding (encoder-decoder) structure. The encoder is built from multi-layer long short-term memory (LSTM) neural units; after encoding, a context vector is calculated to express the important characteristics of the input data. The decoder, which performs the decoding, is likewise built from multi-layer LSTM units.

Step four, constructing a multi-factor prediction model: the model structure is divided into an input layer, a hidden layer and an output layer. The input layer takes the whole of the historical data, comprising 12 kinds of data: PM2.5, PM10, NO₂, CO, O₃, SO₂, temperature, air pressure, humidity, wind direction, wind speed and weather index. The dimension of the input layer is K × 12, where K represents the length of the selected time series, for example K = 72 h (hours). The output layer gives the air quality data prediction values with dimension 6 × T, where T represents the future time span to be predicted and the six concentration data are PM2.5, PM10, NO₂, CO, O₃ and SO₂, for example T = 48 h (hours).

Hidden layer structure of the multi-factor model: the model is a Sequence to Sequence (seq2seq) model with a double-attention mechanism and an internal encoding-decoding (encoder-decoder) structure. One attention layer is added before the encoder to acquire context information before encoding. The encoder is built from multi-layer long short-term memory (LSTM) neural units; after encoding, a context vector is calculated to express the important characteristics of the input data. A second attention layer is added before the decoder. The decoder, which outputs the final result state, is built from multi-layer LSTM units.

Step five, fusing the prediction models: an xgboost (eXtreme Gradient Boosting) boosted tree is used as the training model, and one of the six air quality concentration data PM2.5, PM10, NO₂, CO, O₃ and SO₂ is trained at a time. In each training run, the inputs are the single-factor model prediction and the multi-factor model prediction for that concentration, together with the weather conditions at the latest moment (temperature, air pressure, humidity, wind direction, wind speed and weather index); the corresponding labeled data are the current air quality data (PM2.5, PM10, NO₂, CO, O₃, SO₂). For example, when the training inputs are the eight features PM2.5 (single-factor model result), PM2.5 (multi-factor model result), temperature, air pressure, humidity, wind direction, wind speed and weather index (the last six being the current weather conditions), the labeled data is the current PM2.5 concentration. The input data and labeled data of the remaining five types (PM10, NO₂, CO, O₃, SO₂) are obtained in the same way, and the input data and labeled data of each type are input into the xgboost tree.
The xgboost model establishes n regression subtrees during training, that is, n tree nodes, each of which has several leaf nodes. The leaf nodes correspond to the eight input feature values, namely the predicted values of the single-factor and multi-factor prediction models and the meteorological data at the latest moment, and which leaf nodes belong to each tree node is determined by the weights of the eight input feature values. Nodes are split by selecting the feature with the maximum information gain as the split point, so the leaf nodes of each subtree carry an importance (weight value). In each training round the weight values of the leaf nodes are adjusted according to the objective function to form a new subtree, and the optimal subtrees are reached after multiple iterations. Each subtree node yields a predicted value, and the predicted values of all subtree nodes are accumulated to obtain the air quality predicted value. After the input data and labeled data are fed into xgboost for training, six air quality predicted values are obtained, one each for PM2.5, PM10, NO₂, CO, O₃ and SO₂.

Prediction result evaluation

To evaluate and analyze the final prediction result, the obtained prediction model is tested and evaluated on the test set data, with the root mean square error (RMSE) as an evaluation index. The root mean square error is given by:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the original value, and n is the total number of original values.

Meanwhile, to avoid relying on a single evaluation index, R-square is also used as an evaluation index for the regression analysis. R-square is given by:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the original value, $\bar{y}$ is the average of the original values, and n is the total number of original values.
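The two evaluation indices can be computed directly from their definitions. The sketch below uses the standard forms of RMSE and R-square; the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between original and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def r_squared(y_true, y_pred):
    """R-square: 1 minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

observed = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]   # every prediction is off by exactly 1
error = rmse(observed, shifted)
fit = r_squared(observed, shifted)
```

A perfect prediction gives RMSE 0 and R-square 1; a constant offset of 1 on this toy series gives RMSE 1 and R-square 0.2.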

Example 1

To better explain the technical scheme of the invention, data from 36 air quality monitoring stations and 18 meteorological stations in Beijing were selected so as to obtain sufficient sample data. The specific implementation steps of the invention are as follows:

Step 1: Obtaining data

The air quality data used by the method are hourly data collected in Beijing from January 2017 to January 2018. The air quality data mainly comprise the following important air pollutants: PM2.5 (μg/m³), PM10 (μg/m³), NO₂ (μg/m³), CO (mg/m³), O₃ (μg/m³) and SO₂ (μg/m³). The weather and meteorological data mainly comprise weather, temperature, air pressure, humidity, wind speed and wind direction. The weather mainly includes sunny, snowy, cloudy, light rain, heavy rain, blowing sand and the like; the temperature is the value monitored by the meteorological station, in degrees Celsius; air pressure refers to atmospheric pressure, in hectopascals (hPa); humidity refers to the amount of water vapor in the air, in percent (%); wind speed refers to the speed of the wind monitored by the weather station, in meters per second (m/s); wind direction refers to the direction from which the wind comes (for example, a north wind blows from north to south) and is defined as an angle measured clockwise from north. For example, a wind from the south has a direction of 180 degrees, and a wind from the east has a direction of 90 degrees.
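As a small illustration of the wind direction convention (angle measured clockwise from north, naming the direction the wind comes from), the following hypothetical helper maps an angle to one of four compass sectors. The four-sector granularity is an assumption of this sketch; the data themselves are continuous angles.

```python
def compass_label(angle_deg):
    """Map a wind direction angle (degrees clockwise from north) to a compass sector."""
    names = ["north", "east", "south", "west"]
    # Shift by 45 degrees so that each 90-degree sector is centered on its name.
    sector = int(((angle_deg % 360) + 45) // 90) % 4
    return names[sector]

south_wind = compass_label(180)
east_wind = compass_label(90)
```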

Step 2: data pre-processing

For historical air quality data, problems with the monitoring instruments are inevitable and lead to missing data, so the missing values are interpolated before the data are input into the prediction model. The interpolation method adopted by the invention is the expectation maximization (EM) method, which consists of an E step, which computes the expectation of the likelihood over the samples, and an M step, which maximizes that expectation. The given training samples {x⁽¹⁾, …, x⁽ᵐ⁾} are independent, and the goal is to find the hidden class z of each sample so as to maximize the likelihood function p(x, z).

In the E step, the expectation of the log-likelihood function is calculated, and the posterior distribution of the hidden variable is estimated:

$$Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$$

where Q_i is the distribution of z⁽ⁱ⁾ given x⁽ⁱ⁾ and the parameter θ. This answers the question of how to choose Q(z) so that the likelihood function equals its lower bound: Q_i(z⁽ⁱ⁾) is set to the posterior probability of z⁽ⁱ⁾.

In the M step, the expected likelihood from the E step is maximized to obtain new parameter estimates, and the latest estimates are applied to the missing values to fill them. With Q_i chosen in the E step, the M step maximizes the lower bound of the log-likelihood function:

$$\theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$$

The E step and the M step are repeated in a loop until convergence.
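The E and M steps can be illustrated with a short imputation routine. The text does not state which probabilistic model is used, so this sketch assumes that the rows are independent samples from a multivariate Gaussian; the function name `em_impute`, the small ridge term and the stopping tolerance are likewise assumptions of the sketch.

```python
import numpy as np

def em_impute(data, n_iter=100, tol=1e-6):
    """Fill NaN entries by EM under a multivariate Gaussian assumption."""
    x = np.array(data, dtype=float)
    missing = np.isnan(x)
    n, d = x.shape
    mu = np.nanmean(x, axis=0)
    x[missing] = mu[np.nonzero(missing)[1]]        # start from column means
    sigma = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    for _ in range(n_iter):
        previous = x.copy()
        extra = np.zeros((d, d))                   # conditional covariance of missing parts
        # E step: expected value of each missing block given the observed block.
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                        # nothing observed in this row
                x[i] = mu
                extra += sigma
                continue
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            gain = np.linalg.solve(s_oo, s_mo.T).T
            x[i, m] = mu[m] + gain @ (x[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - gain @ s_mo.T
        # M step: re-estimate the mean and covariance from the completed data.
        mu = x.mean(axis=0)
        centered = x - mu
        sigma = (centered.T @ centered + extra) / n + 1e-6 * np.eye(d)
        if np.max(np.abs(x - previous)) < tol:
            break
    return x

# Toy data: the second column is twice the first, with one missing reading.
raw = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [5.0, 10.0]])
filled = em_impute(raw)
```

On this toy series the missing reading converges to a value close to 8, the value implied by the linear relation between the two columns, while the observed entries are left unchanged.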

The interpolated data are normalized by the min-max normalization method. Normalization scales the data into a fixed range, removes the dimensions (units) of the data, and accelerates the convergence of the model solution. The invention scales the data into [0, 1], and the normalization formula is as follows:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where X is the original weather data, X' is the normalized data, X_min is the minimum value in the original weather data, and X_max is the maximum value in the original weather data.

The preprocessed data are divided into a training data set and a testing data set in the proportion 8:2.

Step 3: Constructing a single-factor prediction model

The input data of the single-factor prediction model are the six concentrations PM2.5, PM10, NO₂, CO, O₃ and SO₂, used as training data; the dimension of the input layer is K × 6, where K represents the length of the selected time sequence.

The single-factor model adopts a seq2seq model whose internal structure is encoding-decoding (encoder-decoder). Given a sequence x = {x₁, x₂, …, x_T}, the encoding part is computed as:

$$h_t = f(h_{t-1}, x_t)$$

where t is the current time, h_t is the hidden state at time t, and f is the LSTM encoder.

After the encoder finishes encoding, all h_t are combined into the vector c, which is used as the input of the decoder. The decoding part is computed as:

$$h_{(t)} = f(h_{(t-1)}, y_{(t-1)}, c)$$

$$p(y_t \mid y_{(t-1)}, y_{(t-2)}, \ldots, y_{(1)}, c) = g(h_{(t)}, y_{(t-1)}, c)$$

where h_(t) is the hidden state of the decoder at time t, c is the output of the encoder, taken as the last state in the encoder, and f and g are nonlinear activation functions, set here as LSTM neural network units.

The model diagram is shown in fig. 2. The encoding (encoder) part consists of a 2-layer recurrent neural network (RNN) with 64 LSTM neural units in each layer. The decoding (decoder) part likewise consists of a 2-layer recurrent neural network (RNN) with 64 LSTM neural units in each layer.
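The data flow through the encoder and decoder can be traced with a forward-only NumPy sketch. It is not the trained model: the weights are random, the class and function names are hypothetical, and two simplifications are assumed, namely that the context c is passed to the decoder only as its initial state and that the decoder is seeded with the last observed value.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class LSTMCell:
    """One LSTM layer with randomly initialised (untrained) weights."""
    def __init__(self, input_size, hidden_size, rng):
        scale = 1.0 / np.sqrt(hidden_size)
        self.w = rng.uniform(-scale, scale, (4 * hidden_size, input_size + hidden_size))
        self.b = np.zeros(4 * hidden_size)

    def step(self, x, h, c):
        gates = self.w @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(gates, 4)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

def run_stack(cells, x, states):
    """Push one time step through the stacked layers; return top output and new states."""
    new_states = []
    for cell, (h, c) in zip(cells, states):
        h, c = cell.step(x, h, c)
        new_states.append((h, c))
        x = h                                   # output of this layer feeds the next
    return x, new_states

def seq2seq_forward(history, horizon, hidden=64, layers=2, seed=0):
    rng = np.random.default_rng(seed)
    n_features = history.shape[1]
    sizes = [n_features] + [hidden] * (layers - 1)
    encoder = [LSTMCell(size, hidden, rng) for size in sizes]
    decoder = [LSTMCell(size, hidden, rng) for size in sizes]
    w_out = rng.uniform(-0.1, 0.1, (n_features, hidden))
    states = [(np.zeros(hidden), np.zeros(hidden)) for _ in range(layers)]
    for x_t in history:                         # encoder: h_t = f(h_{t-1}, x_t)
        _, states = run_stack(encoder, x_t, states)
    y_prev = history[-1]                        # assumption: seed with last observation
    outputs = []
    for _ in range(horizon):                    # decoder starts from the encoder's final state
        top, states = run_stack(decoder, y_prev, states)
        y_prev = w_out @ top                    # linear read-out to six concentrations
        outputs.append(y_prev)
    return np.stack(outputs)

history = np.random.default_rng(1).random((72, 6))   # K = 72 hours, six pollutants
forecast = seq2seq_forward(history, horizon=48)      # T = 48 hours
```

The result has one row per forecast hour and one column per pollutant, i.e. the T × 6 transpose of the 6 × T output layer.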

The training parameter settings for the model are shown in table 1:

TABLE 1 model training parameters

Step 4: Constructing a multi-factor prediction model

The input to the multi-factor prediction model is the whole of the historical data, comprising 12 kinds of data: PM2.5, PM10, NO₂, CO, O₃, SO₂, temperature, air pressure, humidity, wind direction, wind speed and weather index. The dimension of the input layer is K × 12, where K represents the length of the selected time sequence.

The multi-factor model adopts a seq2seq model with a double-attention mechanism, which is an improvement on the seq2seq model. The improved model can better learn the influence of meteorological factors on air quality and can capture key information, so it learns the time-sequence relations in the data better and is more robust.

In the encoder part, given K sequence data $x^1, x^2, \ldots, x^K$, a feedforward neural network with an attention mechanism is constructed. For each sequence $x^k$ it computes an attention score $e_t^k$ at time t from the previous hidden state $h_{t-1}$, the previous neuron state $s_{t-1}$ and $x^k$, where w and u are learned parameters.

A softmax function is used to ensure that the attention weights sum to 1:

$$\alpha_t^k = \frac{\exp(e_t^k)}{\sum_{j=1}^{K} \exp(e_t^j)}$$

where $\alpha_t^k$ is the attention weight of the k-th sequence at time t. With the attention weights, the input at time t can be extracted adaptively:

$$\tilde{x}_t = (\alpha_t^1 x_t^1, \alpha_t^2 x_t^2, \ldots, \alpha_t^K x_t^K)$$

Encoding then uses a multi-layer LSTM neural network:

$$h_t = f(h_{t-1}, \tilde{x}_t)$$

where f is the LSTM neural network unit, $\tilde{x}_t$ is the new sequence with attention weights, and $h_t$ is the hidden state at time t.
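The attention layer placed before the encoder can be sketched as follows. The exact score function is not recoverable from the text, so an additive tanh score is assumed here; the projection vector `v`, the function names and all array shapes are assumptions of this sketch, while `w` and `u` stand for the learned parameters mentioned above (random here).

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def input_attention(window, t, h_prev, s_prev, w, u, v):
    """Weight the input series at time t.

    window: (n_steps, n_series) input window; h_prev, s_prev: previous hidden
    and neuron states. Returns the attention weights and the reweighted x_t.
    """
    n_series = window.shape[1]
    state = np.concatenate([h_prev, s_prev])
    # One score per input series, computed from the state and the whole series.
    scores = np.array([v @ np.tanh(w @ state + u @ window[:, k])
                       for k in range(n_series)])
    alpha = softmax(scores)
    return alpha, alpha * window[t]

rng = np.random.default_rng(0)
n_steps, n_series, hidden = 72, 12, 64
window = rng.random((n_steps, n_series))
w = rng.normal(scale=0.1, size=(n_steps, 2 * hidden))
u = rng.normal(scale=0.1, size=(n_steps, n_steps))
v = rng.normal(scale=0.1, size=n_steps)
alpha, x_tilde = input_attention(window, 0, np.zeros(hidden), np.zeros(hidden), w, u, v)
```

The reweighted vector `x_tilde` is what the LSTM encoder would consume at that time step in place of the raw input.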

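For the states $h_t$ obtained after the encoder has performed encoding, an attention mechanism is added: at each decoding time t, a score $l_t^i$ is computed for every encoder hidden state $h_i$.

A softmax function is used to ensure that the weights sum to 1, as follows:

$$\beta_t^i = \frac{\exp(l_t^i)}{\sum_{j=1}^{T} \exp(l_t^j)}$$

where $\beta_t^i$ is the weight of the encoder output state at time i.

The weighted sum of the attention weights and the encoder hidden states $\{h_1, h_2, \ldots, h_T\}$ is taken as the context vector $c_t$:

$$c_t = \sum_{i=1}^{T} \beta_t^i h_i$$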
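The context vector computation can be sketched in the same way. As before, the additive tanh score, the parameter names `w`, `u`, `v` and the shapes are assumptions of this sketch; only the softmax normalization and the weighted sum follow directly from the description.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)
    weights = np.exp(shifted)
    return weights / weights.sum()

def temporal_attention(encoder_states, d_prev, w, u, v):
    """Return weights over encoder time steps and the resulting context vector.

    encoder_states: (n_steps, hidden) hidden states h_1..h_T of the encoder;
    d_prev: previous decoder state.
    """
    scores = np.array([v @ np.tanh(w @ d_prev + u @ h_i) for h_i in encoder_states])
    beta = softmax(scores)
    context = beta @ encoder_states        # weighted sum of the encoder states
    return beta, context

rng = np.random.default_rng(0)
n_steps, hidden = 72, 64
encoder_states = rng.normal(size=(n_steps, hidden))
w = rng.normal(scale=0.1, size=(hidden, hidden))
u = rng.normal(scale=0.1, size=(hidden, hidden))
v = rng.normal(scale=0.1, size=hidden)
beta, context = temporal_attention(encoder_states, np.zeros(hidden), w, u, v)
```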

In the decoder part, a multi-layer LSTM is used as the decoder. Once the weights and the context vectors c are obtained, they are combined with the given target sequence $(y_1, y_2, \ldots, y_{T-1})$ through a learned linear mapping whose parameters match the size of the decoder input. The combined value is the new decoder input, which is fed into the decoder as follows:

$$d_t = g(d_{t-1}, \tilde{y}_{t-1})$$

where $\tilde{y}_{t-1}$ is the combined value, g is an LSTM neural network unit, and $d_t$ is the decoder state, which serves as the final output state.

The model diagram is shown in fig. 3. An attention mechanism (attention) is added before the encoding (encoder) part, which consists of a 2-layer recurrent neural network (RNN) with 64 LSTM neural units in each layer. An attention mechanism (attention) is also added before the decoding (decoder) part, which consists of a 2-layer recurrent neural network (RNN) with 64 LSTM neural units in each layer.

Step 5: Fusing the prediction models

An xgboost (eXtreme Gradient Boosting) boosted tree is used for regression prediction. The inputs of the boosted tree are the result of the single-factor model, the result of the multi-factor model and the meteorological conditions at the current moment. Specifically, the single-factor model prediction is p = {p₁, p₂, …, p₆} and the multi-factor model prediction is z = {z₁, z₂, …, z₆}, where p and z each represent the six concentration data PM2.5, PM10, NO₂, CO, O₃ and SO₂; the meteorological conditions at the current moment are g = {g₁, g₂, …, g₆}, where g represents the six data temperature, air pressure, humidity, wind direction, wind speed and weather index at the latest moment. During training or prediction, the data input into the xgboost tree is x_i = {p_i, z_i, g}, where i is any one of 1 to 6. During training, y_i is the labeled current air quality data (PM2.5, PM10, NO₂, CO, O₃ or SO₂); during testing or prediction, y_i is the predicted air quality data. For example, x₁ = {PM2.5 (single-factor model result), PM2.5 (multi-factor model result), temperature, air pressure, humidity, wind direction, wind speed, weather index (the six weather conditions at the latest moment)} and y₁ = {PM2.5 (current value in the training set)}; the other x_i and y_i follow by analogy. Because the current meteorological conditions are added to the input data when the xgboost tree is trained, the weights of the air quality data are adjusted flexibly, and an accurate predicted value is obtained.
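Assembling the eight-feature input x_i = {p_i, z_i, g} for each pollutant can be sketched as below. The helper name and the toy numbers are hypothetical, and the weather index is assumed to have already been encoded as a number (the text lists categories such as sunny or rainy but does not specify an encoding).

```python
import numpy as np

POLLUTANTS = ["PM2.5", "PM10", "NO2", "CO", "O3", "SO2"]

def fusion_features(single_pred, multi_pred, weather):
    """Build one 8-feature vector per pollutant: [p_i, z_i, g_1, ..., g_6]."""
    p = np.asarray(single_pred, dtype=float)
    z = np.asarray(multi_pred, dtype=float)
    g = np.asarray(weather, dtype=float)
    if p.shape != (6,) or z.shape != (6,) or g.shape != (6,):
        raise ValueError("expected six values each for p, z and g")
    return {name: np.concatenate([[p[i], z[i]], g])
            for i, name in enumerate(POLLUTANTS)}

# Toy values (not measured data): predictions of both models and current weather.
p = [41.0, 70.0, 35.0, 0.9, 60.0, 8.0]          # single-factor model results
z = [44.0, 66.0, 33.0, 1.1, 58.0, 9.0]          # multi-factor model results
g = [12.5, 1013.0, 40.0, 180.0, 3.2, 0.0]       # temp, pressure, humidity, wind dir, wind speed, weather index
features = fusion_features(p, z, g)
```

Each vector in `features` would then serve as one input row for the regression tree of the corresponding pollutant, with the current measured concentration as its label.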

The regression calculation of xgboost is as follows:

If there are K trees in xgboost, then

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where F represents the function space of all trees in the regression forest, $\hat{y}_i$ is the model predicted value, and $f_k(x_i)$ is the predicted value of the i-th sample in the k-th tree.

The parameters to be determined are the structure of each tree and the weight of each leaf; put simply, the subtrees f_k are to be learned. Let:

Θ＝{f₁,f₂,...f_k}

The objective function obj(Θ) is used to find good parameters Θ:

obj(Θ)＝L(Θ)+Ω(Θ)

where L(Θ) is the error function, which expresses how well the model fits the data, and Ω(Θ) is a regularization term that penalizes complex models.

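The model is trained by adding one tree at a time until all k trees have been built. The process is as follows:

$$\hat{y}_i^{(0)} = 0$$

$$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)$$

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$

where $\hat{y}_i^{(t)}$ represents the predicted value for $x_i$ after the t-th round, and t denotes the t-th round of model training.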

The training error is as follows:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

where $y_i$ is the corresponding labeled data, $\hat{y}_i$ is the predicted data, and l is the loss function used to calculate the loss; square loss and logistic loss are common choices.

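The penalty term during training is as follows:

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Expanding the objective function with Taylor's formula up to second order gives the simplified form:

$$obj^{(t)} \approx \sum_{i=1}^{n}\left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C$$

where

$$g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$$

C is a constant, and

$$f_t(x) = w_{q(x)}, \quad w \in R^T, \quad q: R^d \to \{1, 2, \ldots, T\}$$

Here f_t(x) denotes the node prediction value of each subtree, w denotes the leaf weights, q denotes the structure of the tree, T denotes the number of leaves of the tree, λ is the regularization coefficient on the leaf weights, and R^d represents a data set with d features.

Further, the problem is converted into finding the minimum of the above quadratic function. With $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, where $I_j$ is the set of samples assigned to leaf j:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \quad obj^* = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

The information gain (Gain) of a split is calculated as follows:

$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

where L denotes the left subtree, R denotes the right subtree, $G_L$ and $H_L$ are the sums of $g_i$ and $h_i$ over the left subtree ($H_L$ representing the left subtree weight), $G_R$ and $H_R$ are the corresponding sums over the right subtree ($H_R$ representing the right subtree weight), and γ represents the complexity cost introduced by adding a new leaf node.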
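The leaf weight and information gain can be checked numerically with a small sketch. It uses the standard gradient boosting expressions with an L2 coefficient `lam` on the leaf weights and a per-leaf cost `gamma`; the function names, the toy data and the choice of square loss (for which the first derivative is prediction minus label and the second derivative is 1) are assumptions of this illustration.

```python
import numpy as np

def leaf_weight(g_sum, h_sum, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -g_sum / (h_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting one leaf into a left and a right leaf."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def best_split(feature, grad, hess, lam=1.0, gamma=0.0):
    """Scan the thresholds of one feature; return (best gain, threshold)."""
    order = np.argsort(feature)
    feature, grad, hess = feature[order], grad[order], hess[order]
    g_total, h_total = grad.sum(), hess.sum()
    best_gain, best_threshold = -np.inf, None
    g_left = h_left = 0.0
    for i in range(len(feature) - 1):
        g_left += grad[i]
        h_left += hess[i]
        if feature[i] == feature[i + 1]:
            continue                       # cannot split between equal values
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (feature[i] + feature[i + 1])
    return best_gain, best_threshold

# Toy example with square loss l = (y - y_hat)^2 / 2 and initial prediction 0.
x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([1.0, 1.0, 5.0, 5.0])
grad = 0.0 - y                             # first derivative: y_hat - y
hess = np.ones_like(y)                     # second derivative: 1
gain, threshold = best_split(x, grad, hess)
```

On this toy data the best split falls between the second and third samples, which is where the labels jump from 1 to 5.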

With this training, the xgboost tree is split according to the information gain (Gain) to create each subtree f_t(x) = w_q(x). Each data feature vector, x_i = {p_i, z_i, g₁, g₂, g₃, g₄, g₅, g₆}, comprises the air quality data p_i and z_i obtained from the single-factor and multi-factor models and the current meteorological data g, and is mapped to the leaf nodes of each tree. Over multiple rounds of training, the leaf nodes of each subtree are adjusted according to the weights w and all subtrees are updated; each subtree node yields a predicted value, and the weights are adjusted according to the objective function obj to obtain the optimal w. Finally, the predicted values of all subtrees f_k are accumulated to obtain the final predicted air quality data.

In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A multi-model fusion deep learning air quality prediction method is characterized by comprising the following steps:

step one, acquiring historical air quality data and meteorological data;

step three, establishing a seq2seq-based deep learning model as a single-factor prediction model by using the historical air quality data;

step four, establishing a seq2seq deep learning model based on a double-attention mechanism as a multi-factor prediction model by using the historical air quality data and the meteorological data;

step five, inputting the historical air quality data into the single-factor prediction model to obtain the prediction result of the single-factor prediction model, inputting the historical air quality data and the meteorological data into the multi-factor prediction model to obtain the prediction result of the multi-factor prediction model, and feeding the prediction result of the single-factor prediction model, the prediction result of the multi-factor prediction model and the current meteorological data into an xgboost tree for regression calculation to obtain the final predicted value of the air quality data;

when the air quality data comprise two or more types of concentration data, the types are predicted one at a time: for each type, the prediction result of the single-factor prediction model, the prediction result of the multi-factor prediction model and the current meteorological data are fed into the xgboost tree for regression calculation to obtain the predicted value of that type of air quality data;

the method comprises the following specific steps:

S1, using the xgboost tree as the training model, one of the six air quality concentration data PM2.5, PM10, NO₂, CO, O₃ and SO₂ is trained at a time; in each training run, the inputs are the single-factor prediction model result and the multi-factor prediction model result for that concentration, together with the weather conditions at the latest moment, comprising temperature, air pressure, humidity, wind direction, wind speed and weather index; the corresponding labeled data during training are the current air quality data, namely PM2.5, PM10, NO₂, CO, O₃ and SO₂; and the input data and labeled data of each type are input into the xgboost tree;

S2, the xgboost model establishes n regression subtrees during training, that is, n tree nodes, each of which has several leaf nodes, wherein the leaf nodes correspond to the eight input feature values, namely the predicted values of the single-factor and multi-factor prediction models and the meteorological data at the latest moment; which leaf nodes belong to each tree node is determined by the weights of the eight input feature values; nodes are split by selecting the feature with the largest information gain as the split point, so that the leaf nodes of each subtree have weight values; in each training round the weight values of the leaf nodes are adjusted according to the objective function to form a new subtree, and the optimal subtrees are reached after multiple iterations; each subtree node yields a predicted value, and the predicted values of all subtree nodes are accumulated to obtain the air quality predicted value; after the input data and labeled data are fed into xgboost for training, six air quality predicted values are obtained, one each for PM2.5, PM10, NO₂, CO, O₃ and SO₂.

2. The method of claim 1, wherein the air quality data comprises concentration data of one or more of PM2.5, PM10, NO₂, CO, O₃ and SO₂.

3. The method of claim 1, wherein the meteorological data includes temperature, barometric pressure, humidity, wind direction, wind speed, and weather indicators.

4. The method of claim 1, wherein the missing value interpolation is performed by using an expectation maximization method.

5. The method as claimed in claim 1, wherein the input layer of the single-factor prediction model is historical air quality data, the hidden layer is a seq2seq model with an internal encoding-decoding structure whose encoding and decoding units are long short-term memory (LSTM) neural networks, and the output layer is the air quality prediction value.

6. The method as claimed in claim 1, wherein the input layer of the multi-factor prediction model is historical air quality data and meteorological data, the hidden layer is a seq2seq model with a double-attention mechanism and an internal encoding-decoding structure, one attention layer is added before the encoder and one before the decoder, and the output layer is the air quality data prediction value.