CN114997506A

CN114997506A - Atmospheric pollution propagation path prediction method based on link prediction

Info

Publication number: CN114997506A
Application number: CN202210690966.7A
Authority: CN
Inventors: 李勇; 吴京鹏
Original assignee: Lanzhou Qidu Digital Polymer Technology Co ltd; Northwest Normal University
Current assignee: Lanzhou Qidu Digital Polymer Technology Co ltd; Northwest Normal University
Priority date: 2022-06-17
Filing date: 2022-06-17
Publication date: 2022-09-02
Anticipated expiration: 2042-06-17
Also published as: CN114997506B

Abstract

The invention discloses an atmospheric pollution propagation path prediction method based on link prediction. The method calculates the amount of the atmospheric pollutant PM2.5 propagated between different monitoring stations based on transfer entropy; converts complex time series data into network data that is easy to model and analyze, thereby emphasizing the pollutant propagation relations among different monitoring stations; represents network nodes as vectors through network representation learning, and obtains node vector representations with low data noise based on an attention mechanism and network neighbor aggregation; and obtains the vector representation of each connecting edge in the network through a Hadamard product, converting the atmospheric pollutant propagation path prediction problem into a binary classification problem. The method effectively addresses the difficulty of collecting data on atmospheric pollution propagation paths, the lag in prediction results, and the difficulty of transferring models, and shows high adaptability and stability in predicting atmospheric pollution propagation paths.

Description

Atmospheric pollution propagation path prediction method based on link prediction

Technical Field

The invention relates to a method for constructing a network from time series data based on transfer entropy and a link prediction method based on network representation learning. The method has significant application value in predicting the propagation path of the atmospheric pollutant PM2.5.

Background

With the progress of industry, science and technology, atmospheric pollution has become an important factor affecting human health and daily life, and accurately predicting the propagation path of pollutants is an important means of preventing atmospheric pollution. Existing atmospheric pollution propagation path prediction models are mainly mathematical models based on probability theory and image models based on deep learning. Building a mathematical model often relies on a large amount of prior data, such as building density, population density, pedestrian volume and road width. Collecting these data consumes a large amount of manpower and material resources, the model is difficult to solve, and the model accuracy is easily degraded when urban infrastructure is rebuilt. Building an image model usually depends on real-time photographs of a certain area of a city or high-altitude images taken by satellite. When atmospheric conditions change severely, the perception range of an image model is limited, and different images lack correlation, so the prediction results of image models lag noticeably. For example, if serious pollution occurs 3 km away from the current position, an image taken at the current position cannot detect it in time.

Link prediction is a common data analysis method in network science that aims to infer the probability that a connecting edge forms between any two nodes. Building the PM2.5 propagation path prediction model from the perspective of network science better describes the transfer relationship of atmospheric pollutants between different locations in a city, and improves the generality of the model while reducing its dependence on prior data.

Disclosure of Invention

To overcome the shortcomings of the prior art, the technical scheme of the invention denotes the collected PM2.5 time series data matrix by X. The matrix has N rows, one per monitoring station, and M columns, one per PM2.5 concentration value recorded by that station. The atmospheric pollutant PM2.5 propagation network constructed from the transfer entropy and the matrix X is G = (V, E), where V is the node set of the network and each node represents a monitoring station, and E is the connecting-edge set of the network and each connecting edge represents a PM2.5 propagation relation between two monitoring stations. Random walks from given initial nodes in the network yield a vector representation matrix H that captures the local topology of each node. H is then optimized with the node neighbor aggregation technique of graph neural networks to obtain the final node vector representation matrix H'. Existing connecting edges and non-existent connecting edges are randomly selected from the network to construct a sample space, in which each connecting-edge vector is the Hadamard product of the representation vectors of the two nodes forming the edge. A logistic regression classifier then produces the overall prediction output of the model, namely the matrix Y of probabilities that PM2.5 propagates between monitoring stations.

The invention mainly comprises five parts: (1) For the PM2.5 time series data matrix X, K time windows are selected along the time axis, and transfer entropy is calculated between every ordered pair of rows, yielding N(N-1) groups of data, each containing K transfer entropy values. (2) With the monitoring stations as nodes, the mean and standard deviation of each group of K transfer entropy values are computed, connecting edges are determined from the mean and standard deviation, and the atmospheric pollutant propagation network G = (V, E) is constructed. (3) Initial nodes are given in turn and random walks are performed in the network to obtain the matrix H of vectorized representations of each node's local topology. (4) m important nodes are selected from the node set V, the similarity matrix Sim between the important nodes and all nodes is calculated, an attention coefficient matrix A is obtained from Sim, and the final node vector representation matrix H' is obtained through the attention coefficients and the neighbor aggregation process of the graph neural network. (5) A training set and a test set of connecting edges are constructed, the Hadamard product of the two node vectors is calculated to obtain the corresponding connecting-edge vector, and the binary classification problem is solved with a logistic regression classifier. The five parts are described below:

1. For the PM2.5 time series data matrix X, K time windows are selected along the time axis, and transfer entropy is calculated between every ordered pair of rows, yielding N(N-1) groups of data, each containing K transfer entropy values. The magnitude of each transfer entropy represents the amount of PM2.5 propagated from one monitoring station to another within the specified time window.

2. With the monitoring stations as nodes, the mean and standard deviation of each group of K transfer entropy values are computed. The mean measures the amount of pollutant propagation: the larger the mean, the more pollutant propagates between the two monitoring stations. The standard deviation measures the stability of the propagation relation: the smaller the standard deviation, the more stable the pollutant propagation relation between the two stations. Node pairs with a large mean and a small standard deviation are connected by an edge, and the atmospheric pollutant PM2.5 propagation network G = (V, E) is constructed.

3. Initial nodes are given in turn and random walks are performed in the PM2.5 propagation network G to obtain N random walk paths. The paths are represented with a word-vectorization method to obtain a feature vector matrix H that represents the local network structure.

4. m important nodes are selected from the node set V of network G, and their similarity with all nodes is calculated to obtain the similarity matrix Sim. The attention coefficient matrix A is obtained as the product of Sim and an amplification factor matrix L, and row normalization maps the elements of A into the interval [0,1]. Combined with the node neighbor aggregation technique of graph neural networks, the elements of H are optimized, finally yielding the node vector representation matrix H' used for link prediction.

5. All elements of the connecting-edge set of network G are added to the sample space as positive samples, and an equal number of negative samples are selected by negative sampling and added to the sample space. A training set and a test set are randomly drawn from the sample space, the vector of each connecting edge is computed as the Hadamard product of the vectors of its two nodes, and the binary classification problem is solved with a logistic regression classifier.

The atmospheric pollution propagation path prediction method based on link prediction comprises the following steps:

Step 1: Calculate the transfer entropy through steps 1.1, 1.2 and 1.3, describing the amount of PM2.5 transferred between different monitoring stations by the transfer entropy, then go to step 2. The matrix X denotes the PM2.5 time series data matrix. It has N rows, each representing a monitoring station, and M columns, each representing an acquisition time; each element is the PM2.5 concentration value collected by that monitoring station at that acquisition time. Let X_i denote the row vector formed by the PM2.5 concentration values collected by monitoring station i at the M acquisition times. Steps 1.1, 1.2 and 1.3 are described in detail below:

Step 1.1: Data acquisition and cleaning. Air quality monitoring stations densely distributed across the city record the PM2.5 concentration of their area at 1-hour intervals. Missing values that could not be recorded because of equipment faults are filled with the mean of the PM2.5 concentrations of the previous hour and the next hour, giving the PM2.5 time series data matrix X. Then go to step 1.2.
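The gap-filling rule of step 1.1 can be illustrated with a minimal sketch. The function name `fill_missing`, the use of `None` to mark a missing reading, and the choice to leave a gap untouched when either neighbouring hour is also missing are assumptions made for illustration; the text does not say how consecutive gaps are handled.

```python
from typing import List, Optional


def fill_missing(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing reading (None) with the mean of the previous and next hour."""
    filled = list(series)
    for t, value in enumerate(series):
        if value is None and 0 < t < len(series) - 1:
            prev_value, next_value = series[t - 1], series[t + 1]
            # Assumption: only fill when both neighbouring hours were recorded.
            if prev_value is not None and next_value is not None:
                filled[t] = (prev_value + next_value) / 2.0
    return filled


# Hypothetical hourly PM2.5 readings for one station.
hourly_pm25 = [35.0, None, 41.0, 44.0, None, None, 50.0]
print(fill_missing(hourly_pm25))
```

Applying the function to every row of readings would give the cleaned matrix X used in the following steps.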

Step 1.2: For the PM2.5 time series data matrix X, take the row vector X_1 of the first row and the row vectors X_2, X_3, …, X_i of the remaining rows. Denote X_1 by Y and any X_i by X. By the formula

the transfer entropy from X_1 to X_i can then be calculated, where the subscript n of x and y denotes the dimension of the row vector, and the superscripts k and l of x and y denote the time window sizes specified for calculating the transfer entropy. Analyzing the amount of PM2.5 propagated within 4 hours meets practical needs, so for the four cases k = l = 1, 2, 3, 4, any two monitoring stations yield 4 transfer entropies, which are treated as one group. Then go to step 1.3.
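The formula itself is not reproduced in this text. As a point of reference only, and assuming the method uses the standard definition of transfer entropy (an assumption, although it matches the symbols described here: a subscript n and superscripts k and l on x and y), the transfer entropy from Y to X is commonly written as:

```latex
T_{Y \to X} = \sum p\left(x_{n+1},\, x_n^{(k)},\, y_n^{(l)}\right)
\log \frac{p\left(x_{n+1} \mid x_n^{(k)},\, y_n^{(l)}\right)}{p\left(x_{n+1} \mid x_n^{(k)}\right)}
```

In this standard form, x_n^{(k)} and y_n^{(l)} denote the k most recent values of x and the l most recent values of y up to index n, and the sum runs over all observed combinations of x_{n+1}, x_n^{(k)} and y_n^{(l)}.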

Step 1.3: Repeat step 1.2 to calculate the transfer entropy between every row vector of X and every other row vector, giving N(N-1) groups of transfer entropies. Each group represents the amount of PM2.5 transferred between two monitoring stations within 1, 2, 3 and 4 hours respectively. Then go to step 2.
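A rough sketch of how one group of 4 transfer entropies might be estimated for a pair of stations is given below. It assumes the standard transfer entropy definition and a simple plug-in estimate over equal-width bins; the binning step, the number of bins, and the names `discretize`, `transfer_entropy` and `transfer_entropy_group` are illustrative assumptions and are not taken from the text.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def discretize(series: Sequence[float], bins: int = 3) -> List[int]:
    """Map each value to an equal-width bin index (an assumed preprocessing step)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant series
    return [min(int((v - lo) / width), bins - 1) for v in series]


def transfer_entropy(source: Sequence[int], target: Sequence[int], k: int = 1, l: int = 1) -> float:
    """Plug-in estimate (in bits) of the transfer entropy from source to target."""
    triples: Counter = Counter()
    for t in range(max(k, l), len(target)):
        x_hist = tuple(target[t - k:t])   # last k values of the target
        y_hist = tuple(source[t - l:t])   # last l values of the source
        triples[(target[t], x_hist, y_hist)] += 1
    total = sum(triples.values())
    hist_pairs: Counter = Counter()    # counts of (x_hist, y_hist)
    next_pairs: Counter = Counter()    # counts of (x_next, x_hist)
    hist_single: Counter = Counter()   # counts of x_hist
    for (x_next, x_hist, y_hist), count in triples.items():
        hist_pairs[(x_hist, y_hist)] += count
        next_pairs[(x_next, x_hist)] += count
        hist_single[x_hist] += count
    te = 0.0
    for (x_next, x_hist, y_hist), count in triples.items():
        cond_full = count / hist_pairs[(x_hist, y_hist)]              # p(x_next | x_hist, y_hist)
        cond_self = next_pairs[(x_next, x_hist)] / hist_single[x_hist]  # p(x_next | x_hist)
        te += (count / total) * math.log2(cond_full / cond_self)
    return te


def transfer_entropy_group(source: Sequence[int], target: Sequence[int]) -> List[float]:
    """One group: transfer entropies for time windows k = l = 1, 2, 3, 4."""
    return [transfer_entropy(source, target, k=w, l=w) for w in (1, 2, 3, 4)]


# Synthetic example: the target station repeats the source station one hour later.
rng = random.Random(0)
source_bins = discretize([rng.uniform(10.0, 90.0) for _ in range(300)])
target_bins = [source_bins[0]] + source_bins[:-1]
group = transfer_entropy_group(source_bins, target_bins)
print(group)
```

Calling `transfer_entropy_group` for every ordered pair of stations would produce the N(N-1) groups described in step 1.3. With only a few hundred samples, the estimates for the larger windows are based on sparse counts, so a real implementation would likely need more data or a different estimator.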

Step 2: Construct the atmospheric pollutant propagation network through steps 2.1, 2.2 and 2.3, then go to step 3. Let G = (V, E) denote the atmospheric pollutant propagation network, where V is the node set of the network and E is the set of connecting edges in the network. Let

denote the transfer entropy from monitoring station i to monitoring station j calculated with a time window of 1 hour. Steps 2.1, 2.2 and 2.3 are described in detail as follows:

Step 2.1: For any two monitoring stations i and j, 4 transfer entropies can be obtained:

Using the formula

the mean of this group of data can be calculated, and using the formula

the standard deviation of this group of data can be calculated. Applying this step to the N(N-1) groups of transfer entropies yields N(N-1) means and standard deviations. Then go to step 2.2.

Step 2.2: With the mean as the x-axis and the standard deviation as the y-axis, a planar rectangular coordinate system xOy is constructed, and the N(N-1) pairs of mean and standard deviation calculated in step 2.1 are plotted in xOy. Using

the average AVG of the means over the whole sample is calculated, and using

the average STDEV of the standard deviations over the whole sample is calculated. In xOy, the line y = STDEV parallel to the x-axis and the line x = AVG parallel to the y-axis are drawn, dividing xOy into four regions. Then go to step 2.3.

Step 2.3: For points in the lower-right region of xOy, the mean is relatively large and the standard deviation relatively small within the sample space. A larger transfer entropy mean indicates a larger amount of PM2.5 propagation, and a smaller transfer entropy standard deviation indicates a more stable propagation relation between the two stations. Therefore, all monitoring stations associated with points in the lower-right region of xOy are added to the node set V of network G, and a connecting edge is constructed between the two monitoring stations forming each such point and added to the connecting-edge set E of network G. Then go to step 3.
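Steps 2.1 to 2.3 can be sketched as follows. The sketch assumes the population standard deviation (`statistics.pstdev`), since the text does not say whether the population or sample form is used, and it keeps edges as ordered station pairs; the function name `build_network` and the station labels are hypothetical.

```python
import statistics
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def build_network(te_groups: Dict[Pair, List[float]]) -> Tuple[Set[str], Set[Pair]]:
    """Select station pairs in the lower-right region (mean > AVG, std < STDEV)."""
    # Step 2.1: mean and standard deviation of each group of transfer entropies.
    stats = {
        pair: (statistics.mean(values), statistics.pstdev(values))
        for pair, values in te_groups.items()
    }
    # Step 2.2: thresholds are the averages of all means and of all standard deviations.
    avg = statistics.mean(mean for mean, _ in stats.values())
    stdev = statistics.mean(std for _, std in stats.values())
    # Step 2.3: keep pairs with a large mean and a small standard deviation.
    edges = {pair for pair, (mean, std) in stats.items() if mean > avg and std < stdev}
    nodes = {station for pair in edges for station in pair}
    return nodes, edges


# Hypothetical groups of 4 transfer entropies (time windows of 1 to 4 hours).
te_groups = {
    ("A", "B"): [0.80, 0.82, 0.79, 0.81],
    ("B", "A"): [0.10, 0.50, 0.05, 0.30],
    ("A", "C"): [0.20, 0.21, 0.20, 0.19],
    ("C", "A"): [0.90, 0.10, 0.60, 0.20],
}
nodes, edges = build_network(te_groups)
print(nodes, edges)
```

In this example only the pair (A, B) has both an above-average mean and a below-average standard deviation, so it is the only edge kept; the pair (C, A) has a large mean but is rejected because its transfer entropies vary too much.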

Step 3: For the atmospheric pollutant PM2.5 propagation network G = (V, E) obtained in step 2, a feature vector matrix H representing the local network structure of each node is obtained through random walks and a vectorization technique. H has N rows, where N is the number of monitoring stations, i.e. the number of elements in V, and dim columns, where dim is the output vector dimension of the vectorization technique. Step 3 is described in detail as follows:

Step 3.1: For the atmospheric pollutant propagation network G = (V, E), given a node v_i ∈ V, a node v_j is randomly selected from the first-order neighbor nodes of v_i and the random walk continues from it. Given a random walk length k, this process yields the node access sequence

Then go to step 3.2.

Step 3.2: For the process described in step 3.1, the Node2vec technique introduces a depth random walk parameter d and a breadth random walk parameter b to calculate, at each step, the probability of each node that the walk may visit next, and walks according to these probabilities, so that the resulting node access sequence is controllable to some degree. The node access sequence is then expressed as a vector of floating-point numbers with values in the range [0,1]. This vector expresses the local topology of node v_i in network G, and its dimension can be specified by the dim parameter, which is usually 128. The matrix formed by all node vectors is denoted H. Then go to step 4.
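The walk generation of step 3.1 can be sketched as below. This is a plain uniform random walk: the depth and breadth parameters d and b of Node2vec and the subsequent vectorization of the walks are deliberately not modeled. The function names and the adjacency-list representation are assumptions for illustration.

```python
import random
from typing import Dict, List


def random_walk(adjacency: Dict[str, List[str]], start: str, length: int,
                rng: random.Random) -> List[str]:
    """Walk from `start`, moving to a uniformly chosen first-order neighbour each step."""
    walk = [start]
    while len(walk) < length:
        neighbours = adjacency.get(walk[-1], [])
        if not neighbours:
            break  # dead end: no neighbour to move to
        walk.append(rng.choice(neighbours))
    return walk


def generate_walks(adjacency: Dict[str, List[str]], length: int, seed: int = 0) -> List[List[str]]:
    """Give each node in turn as the initial node, producing one walk per node."""
    rng = random.Random(seed)
    return [random_walk(adjacency, node, length, rng) for node in adjacency]


# Hypothetical propagation network with three monitoring stations.
adjacency = {"S1": ["S2"], "S2": ["S1", "S3"], "S3": ["S2"]}
walks = generate_walks(adjacency, length=5)
print(walks)
```

Each walk plays the role of a sentence of node identifiers, which a word-vectorization method can then turn into one vector per node.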

Step 4: Considering the adverse effects of the data noise produced by the random walk process of the Node2vec technique and of its unsupervised training mode, the node vector matrix H obtained in step 3 is optimized by introducing a graph attention mechanism and a node neighbor aggregation technique. The node vector matrix H' finally used for link prediction is obtained through steps 4.1, 4.2 and 4.3, which are described in detail as follows:

Step 4.1: Select the first m important nodes with the largest degree from the atmospheric pollution propagation network G and calculate the similarity matrix Sim_{n×m} = Similarity(W H_{n×dim}, W H_{m×dim}), where Similarity is the cosine similarity function, W is a deep learning parameter matrix to be learned, H_{n×dim} is the feature vector matrix of all nodes, and H_{m×dim} is the feature vector matrix of the m important nodes. Then go to step 4.2.

Step 4.2: Calculate the attention coefficient matrix A_{n×n} = Sim_{n×m} L_{m×n}, where Sim_{n×m} is the similarity matrix, L_{m×n} is a deep learning parameter matrix to be learned, n is the number of nodes in network G, and m is the number of selected important nodes. Each element a_ij ∈ A_{n×n} of the attention matrix is normalized by row as follows:

Then go to step 4.3.

Step 4.3: Compute

where K is the number of stacked hidden layers, N_i is the set of neighbor nodes of node i,

denotes the attention strength of node i to node j in the k-th hidden layer, W^k is the parameter matrix to be learned for the k-th hidden layer, and

denotes the feature vector of node j. This yields the node vector representation matrix H' for link prediction. Then go to step 5.
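A single forward pass through steps 4.1 to 4.3 can be sketched as below under several simplifying assumptions: the parameter matrix W is taken as the identity, the matrix L is passed in as a fixed argument rather than learned, the row normalization is a softmax over each node's neighbours, only one aggregation layer is applied, and LeakyReLU with slope 0.2 serves as the activation. None of these choices is confirmed by the text; the function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attention_aggregate(H: List[Vector], neighbours: Dict[int, List[int]],
                        L: List[List[float]], m: int) -> List[Vector]:
    """One attention-weighted neighbour aggregation pass over node vectors H."""
    n = len(H)
    # Step 4.1: the m nodes with the largest degree act as the important nodes.
    important = sorted(range(n), key=lambda i: len(neighbours.get(i, [])), reverse=True)[:m]
    sim = [[cosine(H[i], H[j]) for j in important] for i in range(n)]  # n x m
    # Step 4.2: A = Sim . L gives an n x n matrix of raw attention coefficients.
    A = [[sum(sim[i][c] * L[c][j] for c in range(m)) for j in range(n)] for i in range(n)]
    H_new: List[Vector] = []
    for i in range(n):
        nbrs = neighbours.get(i, [])
        if not nbrs:
            H_new.append(list(H[i]))  # assumption: isolated nodes keep their vector
            continue
        # Softmax over the neighbours of node i (shifted by the max for stability).
        peak = max(A[i][j] for j in nbrs)
        weights = {j: math.exp(A[i][j] - peak) for j in nbrs}
        total = sum(weights.values())
        # Step 4.3: attention-weighted sum of neighbour vectors, then LeakyReLU(0.2).
        agg = [sum(weights[j] / total * H[j][d] for j in nbrs) for d in range(len(H[i]))]
        H_new.append([x if x > 0 else 0.2 * x for x in agg])
    return H_new


# Hypothetical 3-node example with m = 1 and L set to zeros (uniform attention).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = {0: [1, 2], 1: [0], 2: [0]}
L = [[0.0, 0.0, 0.0]]
H_prime = attention_aggregate(H, neighbours, L, m=1)
print(H_prime)
```

With L set to zeros every neighbour receives the same weight, so node 0 ends up with the average of the vectors of nodes 1 and 2. In a trained model, W and L would be learned jointly with the classifier of step 5.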

Step 5: After the last hidden layer that produces the node vector representation matrix H' for link prediction in step 4, an output layer containing two neuron nodes, i.e. a logistic regression classifier, is added to output the probability Y that a connecting edge exists. Steps 5.1, 5.2 and 5.3 are described in detail as follows:

Step 5.1: For the atmospheric pollutant PM2.5 propagation network G = (V, E) obtained in step 2, all elements of its connecting-edge set E are added to the sample space as positive samples. Using negative sampling, an equal number of node pairs that are not in E are randomly selected and added to the sample space as negative samples, and the sample space is then shuffled. For the connecting edge e_ij formed by any nodes i and j, the corresponding feature vector can be expressed as

where ⊙ denotes the Hadamard product of vectors. Then go to step 5.2.
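The sample construction of step 5.1 can be sketched as follows. Edges are treated as ordered station pairs and negatives are drawn uniformly from the pairs not in E; both choices, along with the names `hadamard` and `build_samples`, are assumptions for illustration.

```python
import random
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]
Sample = Tuple[List[float], int]


def hadamard(u: List[float], v: List[float]) -> List[float]:
    """Element-wise (Hadamard) product of two node vectors."""
    return [a * b for a, b in zip(u, v)]


def build_samples(vectors: Dict[str, List[float]], edges: Set[Edge], seed: int = 0) -> List[Sample]:
    """Return shuffled (edge feature vector, label) pairs with equal positives and negatives."""
    rng = random.Random(seed)
    nodes = sorted(vectors)
    # Candidate negatives: ordered pairs of distinct nodes that are not connecting edges.
    candidates = [(a, b) for a in nodes for b in nodes if a != b and (a, b) not in edges]
    negatives = rng.sample(candidates, min(len(edges), len(candidates)))
    samples: List[Sample] = [(hadamard(vectors[a], vectors[b]), 1) for a, b in sorted(edges)]
    samples += [(hadamard(vectors[a], vectors[b]), 0) for a, b in negatives]
    rng.shuffle(samples)
    return samples


# Hypothetical node vectors (rows of H') and a single connecting edge.
vectors = {"S1": [1.0, 2.0], "S2": [3.0, 4.0], "S3": [0.5, 0.0]}
edges = {("S1", "S2")}
samples = build_samples(vectors, edges)
print(samples)
```

The resulting list can be split into a training set and a test set and passed to a binary classifier.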

Step 5.2: An output layer containing two neuron nodes is added after the hidden layers described in step 4.3 to output the probability that a connecting edge exists. The output layer uses Softmax as its activation function, the loss function is the binary cross-entropy loss, and the optimizer is Adam. LeakyReLU with a parameter of 0.2 is selected as the hidden-layer activation function. The input layer has 128 neuron nodes, the neighbor aggregation layer has 64 neuron nodes, the neighbor aggregation layer is stacked into 32 layers, the learning rate is 0.001, the dropout parameter is 0.4, and the number of model training epochs is 100. Then go to step 5.3.

Step 5.3: After the training in step 5.2 is completed, the PM2.5 time series data X_i and X_j collected by any two air quality monitoring stations i and j are taken as model input, and the probability Y_ij that a PM2.5 propagation relation occurs between air quality monitoring stations i and j is obtained.

Compared with the prior art, the invention has the following advantages:

The invention provides an atmospheric pollution propagation path prediction method based on link prediction, which has the following characteristics. First, the method quantifies the amount of pollutant propagated between different monitoring stations through transfer entropy, which better supports precise prevention and control of pollutant diffusion. Second, it models the problem from the perspective of network science and focuses on the pollutant propagation relations among different monitoring stations. Third, it does not depend on manual data collection; the data used as model input can be collected automatically with existing air quality monitoring stations. Fourth, it does not depend on image data, is more sensitive to atmospheric changes, and has a wider field of perception. Fifth, it does not depend on parameters such as the geographic structure or building density of a specific city, so the model can be conveniently migrated to and rebuilt in other cities and has higher generality.

Drawings

FIG. 1 is a diagram of an example data fragment used in the present invention

FIG. 2 is a flow chart of the construction of the propagation relationship of atmospheric pollutants among air quality monitoring stations according to the present invention

FIG. 3 is a two-dimensional representation of the mean and standard deviation of each group of 4 transfer entropies for any two stations, constructed from Lanzhou air quality monitoring station data

FIG. 4 is a flow chart of an atmospheric pollution propagation path prediction model based on link prediction according to the present invention

FIG. 5 is a graph comparing the AUC (unit: %) of the present invention and existing link prediction methods during training

FIG. 6 is a graph comparing the average AUC and average relative error (unit: %) of the present invention and existing link prediction methods over 10 repeated random experiments

FIG. 7 is a graph comparing the Precision of the present invention and existing link prediction methods under different L values

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The raw data used by the invention come from 111 air quality monitoring stations densely distributed in Lanzhou; a data segment from one monitoring station is shown in FIG. 1. Each acquisition time corresponds to one PM2.5 concentration value, and the acquisition interval is 1 hour. Missing values that were not collected because of mechanical faults are replaced by the mean of the PM2.5 concentration values of the previous hour and the next hour.

The transfer entropy calculation process of the invention is shown in FIG. 2, where S_1, S_2, …, S_n denote n air quality monitoring micro-stations, t_1, t_2, …, t_m denote m acquisition times, and each element of the raw data matrix is a PM2.5 concentration value. For the 4 cases with time windows k = l = 1, 2, 3, 4, any two monitoring stations yield 4 transfer entropies, which form 1 group. Through step 1, this operation yields 111 × (111 − 1) groups of transfer entropies.

The network construction process of the invention is shown in FIG. 2. After the 111 × (111 − 1) groups of transfer entropies are obtained, 1 standard deviation and 1 mean are calculated for each group. A planar rectangular coordinate system xOy is established with the mean as the horizontal axis and the standard deviation as the vertical axis, in which each point is associated with 2 monitoring stations. The mean of the transfer entropy standard deviations of all groups is denoted STDEV, and the mean of the transfer entropy means of all groups is denoted AVG. The line y = STDEV parallel to the x-axis and the line x = AVG parallel to the y-axis divide xOy into 4 regions. For a point in the lower-right region, the transfer entropy between the two monitoring stations forming the point has a relatively large mean and a relatively small standard deviation within the whole sample space; a larger transfer entropy mean indicates a larger amount of pollutant propagation, and a smaller standard deviation indicates a more stable propagation relation. The points in the lower-right region are selected, the monitoring stations are taken as nodes, and connecting edges are constructed between the corresponding stations to obtain the atmospheric pollutant PM2.5 propagation network. The two-dimensional representation of the transfer entropy mean and standard deviation constructed from the data of the 111 monitoring stations in Lanzhou is shown in FIG. 3. This operation yields the atmospheric pollutant PM2.5 propagation network G = (V, E) through step 2, and step 3 then yields the vector matrix H representing the local topology of each node.

The flow chart of the atmospheric pollution propagation path prediction model based on link prediction is shown in FIG. 4. After the node feature vector matrix H is obtained in step 3, the vector representations of the m nodes with the largest degree are selected, and the attention coefficient matrix A is obtained by combining the deep learning parameters W and L to be learned. Softmax normalization is applied to A over each node's neighbors, and the updated node feature vector representation matrix H' is finally obtained through H' = σ(AWH). This operation optimizes the node feature vector matrix H through step 4 to obtain H'. Step 5 then constructs the sample set and sample features and trains the classification structure jointly with the neighbor aggregation structure of step 4, giving the probability that a connecting edge forms between any two nodes, i.e. the probability that a PM2.5 propagation relation exists between any two monitoring stations.

FIGS. 5, 6 and 7 compare the invention with other similar models under different evaluation metrics. The closer the AUC value is to 1, the better the prediction performance of the model; the more slowly the Precision curve declines, the better the model performs. FIG. 5 shows that the model (FALP) not only achieves higher prediction performance but also converges faster. The comparison of AUC values and relative errors over 10 repeated random experiments in FIG. 6 shows that the model has the highest average AUC and the lowest average relative error, indicating both high prediction performance and better stability. FIG. 7 shows that the Precision curve of the model declines the most slowly, indicating higher fault tolerance.

Claims

1. An atmospheric pollution propagation path prediction method based on link prediction, characterized in that: a processing method from the perspective of network science is adopted when constructing the atmospheric pollution propagation path prediction model.

2. The atmospheric pollution propagation path prediction method based on link prediction as claimed in claim 1, characterized in that: according to the PM2.5 time series data vectors collected by different monitoring stations, transfer entropy values are calculated under different window conditions by sliding time windows, and the amount of PM2.5 propagated between different monitoring stations is quantified by the transfer entropy values.

3. The atmospheric pollution propagation path prediction method based on link prediction as claimed in claim 1, characterized in that: a transfer entropy mean and standard deviation are calculated from the transfer entropy values obtained under different time window conditions; a two-dimensional representation of the PM2.5 propagation relations among monitoring stations is constructed with the transfer entropy mean as the horizontal axis and the transfer entropy standard deviation as the vertical axis; the sample points with a larger transfer entropy mean and a smaller standard deviation within the whole sample space are determined; and the monitoring stations associated with these sample points are selected to construct connecting edges, thereby constructing the atmospheric pollutant propagation network.

4. The atmospheric pollution propagation path prediction method based on link prediction as claimed in claim 1, characterized in that: in the atmospheric pollutant propagation network, an initial node is given and a random walk is performed through a network representation learning technique to obtain a node access sequence, and a vector representation matrix of the nodes in the network is finally obtained through a vectorization technique.

5. The atmospheric pollution propagation path prediction method based on link prediction as claimed in claim 1, characterized in that: important nodes are selected in the atmospheric pollutant propagation network, their similarity with all other nodes is calculated, and an attention coefficient matrix is obtained in combination with deep learning parameters to be learned.

6. The atmospheric pollution propagation path prediction method based on link prediction as claimed in claim 1, characterized in that: the attention coefficients are normalized over the node neighbors of the network, and an optimized node vector representation matrix is obtained from the attention coefficient matrix combined with the network node neighbor aggregation structure of graph deep learning.

7. The atmospheric pollution propagation path prediction method based on link prediction as claimed in claim 1, characterized in that: all connecting edges are used as positive samples, and negative samples are constructed by negative sampling to obtain a sample space; the connecting-edge vectors in the sample space are represented through the node vector representation matrix and the Hadamard product; and a logistic regression structure is added after the last deep learning hidden layer, converting the atmospheric pollution propagation path prediction problem into a link prediction problem from the perspective of network science, which is solved as a binary classification problem, effectively improving the performance of the atmospheric pollution propagation path prediction model.