CN110245692B - Hierarchical clustering method for ensemble numerical weather forecast members


Info

Publication number
CN110245692B
Authority
CN
China
Prior art keywords
minimum distance
matrix
connected graph
distance connected
vertex
Prior art date
Legal status
Active
Application number
CN201910444986.4A
Other languages
Chinese (zh)
Other versions
CN110245692A (en)
Inventor
樊仲欣
王兴
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
2019-05-27
Filing date
2019-05-27
Publication date
2022-03-18
Application filed by Nanjing University of Information Science and Technology
Priority to CN201910444986.4A
Publication of CN110245692A
Application granted
Publication of CN110245692B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram


Abstract

The invention discloses a hierarchical clustering method for ensemble numerical weather forecast members, which comprises: establishing a minimum distance connected graph according to the data characteristics of the ensemble numerical weather forecast members, splitting the data into clusters layer by layer using the maximum difference values of the minimum distance connected graph, eliminating noise points, finding the representative members of the ensemble forecast, and completing the clustering. Compared with the Ward clustering method, the time complexity is lower and the method can remove noise points, which Ward cannot. Compared with the tubing method and the anomaly correlation coefficient clustering method, the method can generate multi-level clustering results, can select the most appropriate number of clusters at each level, and requires no core parameters to be set.

Description

Hierarchical clustering method for ensemble numerical weather forecast members
Technical Field
The invention relates to a data cluster analysis method in the field of information technology, and in particular to a hierarchical clustering method for ensemble numerical weather forecast members.
Background
Ensemble numerical weather forecasting not only gives a single best-guess forecast but also quantitatively estimates the uncertainty of the weather forecast. A deterministic forecast performs only one numerical integration, whereas an ensemble forecasting system performs multiple numerical integrations from different initial fields, so the multiple numerical forecast results produced by the ensemble members can be used to estimate the uncertainty of the weather forecast and, at the same time, to place more confidence in the deterministic forecast. Since the uncertainty of weather forecasts varies from day to day with the weather situation, ensemble forecasting provides an estimate of this day-to-day uncertainty; an ensemble forecasting system can be used to sample the probability distribution function of possible forecast outcomes and is typically used to generate probability forecasts that assess the likelihood of a given outcome. The table below lists the three global ensemble forecasting systems ECMWF, NCEP and T639 distributed by the National Meteorological Information Center:
[Table: the three global ensemble forecasting systems ECMWF, NCEP and T639.]
Some products of an ensemble forecasting system can be used to improve deterministic forecasting; the key is to make full use of the uncertainty revealed by the ensemble spread, so that it is clear how much confidence the deterministic forecast deserves when it is issued. The ensemble spread can be analyzed directly from ensemble forecast products, and clustering is a common way to do this. Cluster analysis merges similar members of the ensemble forecast into one class and gives the relative frequency of that class; in particular, for atmospheric states with multiple equilibria, clustering can provide clear forecast guidance with several typical regimes, which makes it especially useful for less experienced forecasters. Cluster analysis can therefore be used to find representative forecast members in the ensemble and to attach a credibility to each of them.
At present, the ensemble-member cluster analysis methods commonly used at home and abroad mainly include the Ward clustering method, the tubing method (tubing clustering), the anomaly correlation coefficient clustering method, dynamic fuzzy methods, neural-network clustering methods, the central clustering method, and so on. All of these cluster analysis methods aim to find similar members, or similar weather patterns, in the ensemble forecast and group them into one class. Among them the Ward method, the tubing method and the anomaly correlation coefficient method are applied most often: of the global ensemble systems in the table above, ECMWF uses the tubing method, NCEP uses the anomaly correlation coefficient method, and T213, the predecessor of T639, used the Ward method. Although each method has its own characteristics, several common problems remain to be solved.
1. The methods are not optimized for the application requirements or for computational efficiency
To improve the reference value and credibility of the representative ensemble members screened out by a clustering algorithm, the clustering result usually needs to have a multi-layer structure and to exclude non-representative members; that is, the clustering should both merge similar clusters into multi-layer results and eliminate noise points. However, the Ward method cannot remove noise, and the tubing method and the anomaly correlation coefficient clustering method cannot produce multi-layer results. In addition, because of the real-time nature of numerical weather forecasting (12-hour cycles) and the latency of data transmission, forecast products usually have to be generated and issued within a few hours, so the computational efficiency of ensemble-forecast clustering also matters; yet the time complexity of existing hierarchical clustering (the Ward method) is generally high, O(n²) to O(n³), where n denotes the amount of data (the total number of vertices of the minimum distance connected graph, i.e. the total number of ensemble numerical forecast members in the ensemble forecast product).
2. Setting the core parameters requires experience and is difficult
The tubing method requires a radius to be set, the anomaly correlation coefficient method requires a correlation coefficient threshold, and so on. These are core parameters whose values directly determine how the clusters are partitioned and how efficient the clustering is, yet there is no clear objective basis for choosing them; one can only rely on subjective experience, which makes them difficult to set.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the drawbacks discussed in the background, a hierarchical clustering method for ensemble numerical weather forecast members that can remove noise points and whose time complexity is optimized for the data characteristics of ensemble forecast members, so that the classification and screening of ensemble forecast members are more efficient and accurate.
To solve this technical problem, the invention adopts the following technical scheme:
A hierarchical clustering method for ensemble numerical weather forecast members comprises the following steps:
Step 1), establish a minimum distance connected graph according to the data characteristics of the ensemble numerical weather forecast members; the minimum distance connected graph has no direction and no loop, and consists of the unique identifiers of n vertices and the n-1 edges connecting them, the n-1 edges being formed by joining the n vertices at nearest distance, as shown in FIG. 1;
Step 1.1), let
X = [ [id_1, x_1], [id_2, x_2], …, [id_n, x_n] ],
where x_i = {x_i1, x_i2, …, x_im} is the data of the i-th vertex, 1 ≤ i ≤ n, n is the total number of vertices, i.e. the total number of ensemble numerical weather forecast members, m is the data dimension of an ensemble numerical weather forecast member, id_i is the unique identifier of the data of the i-th ensemble numerical weather forecast member, and vertex id_i is the i-th vertex; and let the nearest edge matrix E11 initially be an empty matrix;
randomly select the i-th vertex of X, i.e. vertex id_i, and compute the Euclidean distances from vertex id_i to every other vertex to generate the distance matrix
XD = [ [id_i, id_j, d_ij] ]  (one row for each j, 1 ≤ j ≤ n, j ≠ i),
where d_ij is the Euclidean distance from vertex id_i to vertex id_j; and let the set EX = {id_i};
Step 1.2), find the minimum distance d_i_min in the distance matrix XD, append the corresponding row [id_i, id_j, d_i_min] of the distance matrix XD to the nearest edge matrix E11, and then delete that row from the distance matrix XD;
Step 1.3), add vertex id_j to the set EX, and compute the distances from id_j to the vertices outside the set EX to generate the distance matrix
XD2 = [ [id_j, id_p, d_jp] ]  (one row for each vertex id_p outside EX),
where p is the number of vertices outside the set EX and d_jp is the Euclidean distance from vertex id_j to vertex id_p;
Step 1.4), merge the distance matrix XD and the distance matrix XD2 into a new distance matrix XD;
Step 1.5), repeat steps 1.2) to 1.4) until the number of vertices in the set EX equals n;
Step 1.6), generate the minimum distance connected graph MDG[ID11, E11] from the set ID11 = [id_1, …, id_n] and the nearest edge matrix E11;
Step 2), split the data into clusters layer by layer using the maximum difference values of the minimum distance connected graph and eliminate noise points:
Step 2.1), take the minimum distance connected graph MDG[ID11, E11] as the first-layer minimum distance connected graph;
Step 2.2), split the minimum distance connected graph MDG[ID11, E11] into several second-layer minimum distance connected graphs;
Step 2.2.1), for every row of the nearest edge matrix E11 except the first, compute the difference between its third-column value and the third-column value of the previous row, and take the maximum difference dd1; in the nearest edge matrix E11, average the third-column value of the row corresponding to dd1 and the third-column value of the next row to obtain the mean ddt1;
Step 2.2.2), split the nearest edge matrix E11 of the minimum distance connected graph MDG[ID11, E11] according to the mean ddt1 to form a set of minimum distance connected graphs, and take this set as the second-layer minimum distance connected graph set; the specific steps of splitting the nearest edge matrix of a minimum distance connected graph according to a mean value and forming a set of minimum distance connected graphs are as follows;
Step 2.2.2.1), denote the mean value by T and the nearest edge matrix of the minimum distance connected graph by EE, and split the nearest edge matrix EE according to the mean value T to obtain several split nearest edge matrices;
Step 2.2.2.1.1), delete from the nearest edge matrix EE all rows whose third-column value is greater than the mean value T, and create a new matrix EA;
Step 2.2.2.1.2), set the matrix EA to an empty matrix, place the first row of the nearest edge matrix EE at the end of the matrix EA, and delete it from the nearest edge matrix EE;
Step 2.2.2.1.3), for each row of the matrix EA, search the first and second columns of the nearest edge matrix EE for a value equal to that row's first-column or second-column value; if such a row exists, place it at the end of the matrix EA and delete it from the nearest edge matrix EE;
Step 2.2.2.1.4), repeat step 2.2.2.1.3) until the first and second columns of the nearest edge matrix EE share no value with the first and second columns of the matrix EA;
Step 2.2.2.1.5), create an empty matrix and assign the values of the matrix EA to it, obtaining one split nearest edge matrix;
Step 2.2.2.1.6), repeat steps 2.2.2.1.2) to 2.2.2.1.5) until EE is an empty matrix, obtaining several split nearest edge matrices that form the matrix set BB;
Step 2.2.2.2), for each nearest edge matrix in the matrix set BB, extract the unique identifiers of the vertices of its edges to obtain the corresponding vertex set, and generate the minimum distance connected graph corresponding to that vertex set;
Step 2.2.2.3), generate the second-layer minimum distance connected graph set from the minimum distance connected graphs corresponding to the nearest edge matrices in the matrix set BB;
Step 2.2.3), mark the noise points in the second-layer minimum distance connected graph set, where the noise points in a set of minimum distance connected graphs are marked as follows: for the vertex set of each minimum distance connected graph in the set, check in turn whether the number of vertices it contains is less than or equal to a preset proportional threshold multiplied by n; if so, that minimum distance connected graph is a sparse cluster and is marked as a noise point;
Step 2.2.4), mark the natural clusters in the second-layer minimum distance connected graph set, where the natural clusters in a set of minimum distance connected graphs are marked as follows: for each minimum distance connected graph in the set, check whether its nearest edge matrix passes the normality and exponential distribution tests; if so, mark that minimum distance connected graph as a natural cluster;
Step 2.3), take the second-layer minimum distance connected graph set as the current-layer minimum distance connected graph set;
Step 2.4), split the current-layer minimum distance connected graph set;
Step 2.4.1), for each nearest edge matrix corresponding to a minimum distance connected graph of the current layer that is neither a noise point nor a natural cluster, compute the difference between the third-column value of each row except the first and the third-column value of the previous row, and take the maximum difference dd2 over these nearest edge matrices;
Step 2.4.2), for the nearest edge matrix containing dd2, average the third-column value of the row corresponding to dd2 and the third-column value of the next row to obtain the mean ddt2;
Step 2.4.3), split the nearest edge matrix containing dd2 according to the mean ddt2 to form the next-layer minimum distance connected graph set;
Step 2.4.4), mark the noise points and natural clusters in the next-layer minimum distance connected graph set;
Step 2.4.5), add the minimum distance connected graphs of the current layer other than the one corresponding to dd2 to the next-layer minimum distance connected graph set;
Step 2.5), take the next-layer minimum distance connected graph set as the current-layer minimum distance connected graph set;
Step 2.6), repeat steps 2.4) to 2.5) until the current-layer minimum distance connected graph set contains no minimum distance connected graph other than noise points and natural clusters;
Step 3), find the representative ensemble forecast members and finish the clustering:
Step 3.1), let the number of layers of the current layer be L; for each layer of minimum distance connected graphs, screen out the non-noise minimum distance connected graphs as that layer's set of clusters to be screened;
Step 3.2), compare the number of clusters to be screened, from layer L up to the first layer, with a preset number threshold range until the number of clusters to be screened of some layer falls within the preset number threshold range, and combine that layer's clusters to be screened into the final set of clusters to be screened;
Step 3.3), for each minimum distance connected graph in the final set of clusters to be screened, screen out the vertex closest to the cluster center as the representative vertex of that minimum distance connected graph, obtaining the representative vertex of every minimum distance connected graph in the final set of clusters to be screened;
Step 3.4), take the ensemble numerical weather forecast member corresponding to the representative vertex of each minimum distance connected graph in the set of clusters to be screened as a representative member.
As a further optimization of the hierarchical clustering method for ensemble numerical weather forecast members, the preset proportional threshold is preferably set to 10%.
As a further optimization of the hierarchical clustering method for ensemble numerical weather forecast members, the preset number threshold range is 3 to 5.
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
1. Compared with the Ward clustering method, the time complexity of the method is lower than the O(n³) of Ward clustering, and the invention can remove noise points, which Ward cannot. Compared with the tubing method and the anomaly correlation coefficient clustering method, the method can generate multi-level clustering results, can select the most appropriate number of clusters at each level, and requires no core parameters to be set.
2. Compared with existing hierarchical clustering algorithms, the method also retains advantages in time complexity, core-parameter setting and noise removal: the nearest neighbor chain optimization of agglomerative hierarchical clustering has time complexity O(n²) but cannot remove noise; Chameleon has time complexity O(n²) and requires the k value (a core parameter) of the k-nearest-neighbor graph to be set; and BIRCH (balanced iterative reducing and clustering using hierarchies) requires a cluster diameter threshold T (a core parameter) to be set and its clustering result is partly random.
Drawings
FIG. 1 is a schematic diagram of the minimum distance connected graph MDG[X, E];
FIG. 2 is a schematic diagram of the clustering process for the ECMWF global ensemble forecast product.
Detailed Description
The technical solution of the invention is explained in further detail below with reference to the accompanying drawings:
the invention discloses a hierarchical clustering method for collecting numerical weather forecast members, which comprises the following specific steps:
Step 1), establish a minimum distance connected graph according to the data characteristics of the ensemble numerical weather forecast members; the minimum distance connected graph has no direction and no loop, and consists of the unique identifiers of n vertices and the n-1 edges connecting them, the n-1 edges being formed by joining the n vertices at nearest distance, as shown in FIG. 1;
Step 1.1), let
X = [ [id_1, x_1], [id_2, x_2], …, [id_n, x_n] ],
where x_i = {x_i1, x_i2, …, x_im} is the data of the i-th vertex, 1 ≤ i ≤ n, n is the total number of vertices, i.e. the total number of ensemble numerical weather forecast members, m is the data dimension of an ensemble numerical weather forecast member, id_i is the unique identifier of the data of the i-th ensemble numerical weather forecast member, and vertex id_i is the i-th vertex; and let the nearest edge matrix E11 initially be an empty matrix;
randomly select the i-th vertex of X, i.e. vertex id_i, and compute the Euclidean distances from vertex id_i to every other vertex to generate the distance matrix
XD = [ [id_i, id_j, d_ij] ]  (one row for each j, 1 ≤ j ≤ n, j ≠ i),
where d_ij is the Euclidean distance from vertex id_i to vertex id_j; and let the set EX = {id_i};
Step 1.2), find the minimum distance d_i_min in the distance matrix XD, append the corresponding row [id_i, id_j, d_i_min] of the distance matrix XD to the nearest edge matrix E11, and then delete that row from the distance matrix XD;
Step 1.3), add vertex id_j to the set EX, and compute the distances from id_j to the vertices outside the set EX to generate the distance matrix
XD2 = [ [id_j, id_p, d_jp] ]  (one row for each vertex id_p outside EX),
where p is the number of vertices outside the set EX and d_jp is the Euclidean distance from vertex id_j to vertex id_p;
Step 1.4), merge the distance matrix XD and the distance matrix XD2 into a new distance matrix XD;
Step 1.5), repeat steps 1.2) to 1.4) until the number of vertices in the set EX equals n;
Step 1.6), generate the minimum distance connected graph MDG[ID11, E11] from the set ID11 = [id_1, …, id_n] and the nearest edge matrix E11.
The space complexity of this step is S = n(m+1) + 3n(n-1)/2, where n represents the amount of data and m the data dimension. Its time complexity is the sum of two terms: the total cost of generating the distance matrices XD and XD2, which is positively correlated with the data dimension m (on the order of m·n(n-1)/2, since each pair of vertices has its Euclidean distance computed once), and the total cost of traversing XD to find the minimum distance values, which is positively correlated with the data amount n. For clustering the ECMWF global ensemble forecast members the data dimension is 33600 (the number of surface grid points) or 12600 (the number of upper-air grid points) while the data amount is only 50, so the data dimension is far larger than the data amount; the same holds for the NCEP and T639 global ensemble forecasts. The time complexity can therefore be simplified to the dimension-dependent term, on the order of m·n(n-1)/2.
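As a concrete illustration of step 1), the following Python sketch builds the nearest edge matrix by the nearest-distance expansion described above. It is a minimal reading of the procedure under stated assumptions, not the patented implementation; the function name build_min_distance_graph, the use of integer row indices as vertex identifiers, and the use of NumPy are assumptions made for the example.

import numpy as np

def build_min_distance_graph(X):
    """X: (n, m) array, one row of member data per vertex.
    Returns E11, an (n-1, 3) array of rows [id_i, id_j, d] in the order
    in which the vertices were attached to the graph."""
    n = X.shape[0]
    in_graph = {0}                       # step 1.1): start from an arbitrary vertex, id 0
    edges = []                           # the nearest edge matrix E11, built row by row
    # rows [id_i, id_j, d_ij] from vertices already in the graph to vertices
    # outside it: this plays the role of the distance matrix XD
    xd = [(0, j, float(np.linalg.norm(X[0] - X[j]))) for j in range(1, n)]
    while len(in_graph) < n:
        # step 1.2): take the globally smallest remaining distance
        k = min(range(len(xd)), key=lambda t: xd[t][2])
        i, j, d = xd.pop(k)
        edges.append((i, j, d))
        in_graph.add(j)
        # steps 1.3)-1.4): drop candidates that now end inside the graph and
        # merge in the new distances XD2 from the vertex just added
        xd = [row for row in xd if row[1] != j]
        xd += [(j, p, float(np.linalg.norm(X[j] - X[p])))
               for p in range(n) if p not in in_graph]
    return np.array(edges)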
Step 2), split the data into clusters layer by layer using the maximum difference values of the minimum distance connected graph and eliminate noise points. The edge matrix E of the minimum distance connected graph MDG[X, E] stores a path formed by chaining together the edges between vertices; because this path is generated by repeatedly extending outward from a random starting vertex at the nearest distance, the order of the edges along the path necessarily covers the nearest local cluster first before crossing into other clusters. The path stored in the edge matrix E can therefore be cut into clusters at the maximum difference values, as follows (a short code sketch of this splitting operation is given after step 2.6) below):
Step 2.1), take the minimum distance connected graph MDG[ID11, E11] as the first-layer minimum distance connected graph;
Step 2.2), split the minimum distance connected graph MDG[ID11, E11] into several second-layer minimum distance connected graphs;
Step 2.2.1), for every row of the nearest edge matrix E11 except the first, compute the difference between its third-column value and the third-column value of the previous row, and take the maximum difference dd1; in the nearest edge matrix E11, average the third-column value of the row corresponding to dd1 and the third-column value of the next row to obtain the mean ddt1;
Step 2.2.2), split the nearest edge matrix E11 of the minimum distance connected graph MDG[ID11, E11] according to the mean ddt1 to form a set of minimum distance connected graphs, and take this set as the second-layer minimum distance connected graph set; the specific steps of splitting the nearest edge matrix of a minimum distance connected graph according to a mean value and forming a set of minimum distance connected graphs are as follows;
Step 2.2.2.1), denote the mean value by T and the nearest edge matrix of the minimum distance connected graph by EE, and split the nearest edge matrix EE according to the mean value T to obtain several split nearest edge matrices;
Step 2.2.2.1.1), delete from the nearest edge matrix EE all rows whose third-column value is greater than the mean value T, and create a new matrix EA;
Step 2.2.2.1.2), set the matrix EA to an empty matrix, place the first row of the nearest edge matrix EE at the end of the matrix EA, and delete it from the nearest edge matrix EE;
Step 2.2.2.1.3), for each row of the matrix EA, search the first and second columns of the nearest edge matrix EE for a value equal to that row's first-column or second-column value; if such a row exists, place it at the end of the matrix EA and delete it from the nearest edge matrix EE;
Step 2.2.2.1.4), repeat step 2.2.2.1.3) until the first and second columns of the nearest edge matrix EE share no value with the first and second columns of the matrix EA;
Step 2.2.2.1.5), create an empty matrix and assign the values of the matrix EA to it, obtaining one split nearest edge matrix;
Step 2.2.2.1.6), repeat steps 2.2.2.1.2) to 2.2.2.1.5) until EE is an empty matrix, obtaining several split nearest edge matrices that form the matrix set BB;
Step 2.2.2.2), for each nearest edge matrix in the matrix set BB, extract the unique identifiers of the vertices of its edges to obtain the corresponding vertex set, and generate the minimum distance connected graph corresponding to that vertex set;
Step 2.2.2.3), generate the second-layer minimum distance connected graph set from the minimum distance connected graphs corresponding to the nearest edge matrices in the matrix set BB;
Step 2.2.3), mark the noise points in the second-layer minimum distance connected graph set, where the noise points in a set of minimum distance connected graphs are marked as follows: for the vertex set of each minimum distance connected graph in the set, check in turn whether the number of vertices it contains is less than or equal to a preset proportional threshold multiplied by n; if so, that minimum distance connected graph is a sparse cluster and is marked as a noise point;
Step 2.2.4), mark the natural clusters in the second-layer minimum distance connected graph set, where the natural clusters in a set of minimum distance connected graphs are marked as follows: for each minimum distance connected graph in the set, check whether its nearest edge matrix passes the normality and exponential distribution tests; if so, mark that minimum distance connected graph as a natural cluster;
Step 2.3), take the second-layer minimum distance connected graph set as the current-layer minimum distance connected graph set;
Step 2.4), split the current-layer minimum distance connected graph set;
Step 2.4.1), for each nearest edge matrix corresponding to a minimum distance connected graph of the current layer that is neither a noise point nor a natural cluster, compute the difference between the third-column value of each row except the first and the third-column value of the previous row, and take the maximum difference dd2 over these nearest edge matrices;
Step 2.4.2), for the nearest edge matrix containing dd2, average the third-column value of the row corresponding to dd2 and the third-column value of the next row to obtain the mean ddt2;
Step 2.4.3), split the nearest edge matrix containing dd2 according to the mean ddt2 to form the next-layer minimum distance connected graph set;
Step 2.4.4), mark the noise points and natural clusters in the next-layer minimum distance connected graph set;
Step 2.4.5), add the minimum distance connected graphs of the current layer other than the one corresponding to dd2 to the next-layer minimum distance connected graph set;
Step 2.5), take the next-layer minimum distance connected graph set as the current-layer minimum distance connected graph set;
Step 2.6), repeat steps 2.4) to 2.5) until the current-layer minimum distance connected graph set contains no minimum distance connected graph other than noise points and natural clusters.
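The following Python sketch illustrates one split operation of step 2) on a single nearest edge matrix: finding the maximum difference dd, forming the threshold T as the mean of the two distances around it, dropping the longer edges, and regrouping the remaining edges into connected sub-matrices (steps 2.2.1) to 2.2.2.1.6)). It is a simplified reading of the procedure; split_by_max_gap is an assumed name, and taking T as the mid-point of the two distances that straddle the largest gap is one plausible reading of the text.

import numpy as np

def split_by_max_gap(E):
    """E: (k, 3) array of rows [id_i, id_j, d] in insertion order.
    Returns (threshold T, list of split nearest edge matrices)."""
    if len(E) < 2:
        return None, [E]
    d = E[:, 2]
    gaps = np.diff(d)                    # step 2.2.1): differences between consecutive rows
    g = int(np.argmax(gaps))             # position of the maximum difference dd
    T = (d[g] + d[g + 1]) / 2.0          # mean of the two distances around the gap
    kept = E[d <= T]                     # step 2.2.2.1.1): drop rows whose distance exceeds T
    # steps 2.2.2.1.2)-2.2.2.1.6): grow connected groups over the remaining edges
    groups, remaining = [], [tuple(r) for r in kept]
    while remaining:
        ea = [remaining.pop(0)]          # matrix EA seeded with the first remaining row
        ids = {ea[0][0], ea[0][1]}
        changed = True
        while changed:
            changed = False
            for row in list(remaining):  # move every row sharing a vertex with EA
                if row[0] in ids or row[1] in ids:
                    ea.append(row)
                    ids.update((row[0], row[1]))
                    remaining.remove(row)
                    changed = True
        groups.append(np.array(ea))      # one split nearest edge matrix of the set BB
    return T, groups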
While splitting the minimum distance connected graph layer by layer, besides removing sparse clusters (noise points), performing a normality test and an exponential distribution test on the edge matrix of every cluster of the same layer reduces the number of splitting operations in step 2.4) and identifies natural clusters. As the layer-by-layer splitting keeps removing the distances with large difference values, the distribution of the distance values of each cluster's edge matrix tends more and more towards a normal distribution (the X axis of the distribution is the distance value, the Y axis is its frequency of occurrence), and the Lilliefors normality test is used to judge whether the distance values conform to a normal distribution. The purpose of the exponential distribution test is to judge whether the data inside a cluster are distributed as a hypersphere and whether there are data genuinely close to the cluster center (the per-dimension mean of the data in the cluster). Because the edges of the minimum distance connected graph join vertices at nearest distance, the number of times each vertex occurs among the edges indicates whether it lies at the edge of the cluster or near its center. The vertex occurrence counts of the cluster are therefore collected and sorted in ascending order to give the vector xv; after the exponential transform yv = 1 - xv/xv_max (where xv_max is the maximum value of the vector xv), if yv tends to an exponential distribution (the X axis of the distribution is the vertex occurrence count, the Y axis is the yv value), the cluster is essentially a hypersphere with a cluster center, and the Lilliefors exponential distribution test is used to judge whether the distribution of the vertex occurrence counts conforms to the exponential distribution.
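A sketch of the natural-cluster check just described, assuming the Lilliefors tests from the statsmodels package (statsmodels.stats.diagnostic.lilliefors) as the normality and exponential tests; the helper name is_natural_cluster, the minimum-size guard and the 0.05 significance level are assumptions made for illustration.

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

def is_natural_cluster(E, alpha=0.05):
    """E: (k, 3) nearest edge matrix of one cluster. True if the edge
    distances look normal and the transformed vertex occurrence counts
    look exponential, i.e. the cluster is treated as a natural cluster."""
    d = E[:, 2]
    if len(d) < 5:                        # too few edges for a meaningful test
        return False
    _, p_norm = lilliefors(d, dist='norm')            # normality of the distance values
    ids, counts = np.unique(E[:, :2], return_counts=True)
    if counts.max() == counts.min():                  # degenerate occurrence pattern
        return False
    xv = np.sort(counts).astype(float)                # occurrence counts, ascending
    yv = 1.0 - xv / xv.max()                          # transform yv = 1 - xv/xv_max
    _, p_exp = lilliefors(yv, dist='exp')             # exponential-shape test on yv
    return p_norm > alpha and p_exp > alpha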
Step 3), find the representative ensemble forecast members and finish the clustering. The method is well suited to ensemble forecast products that are continuously updated on a rolling basis: by exploiting the ultra-high dimensionality and small data volume of ensemble member data, splitting the minimum distance connected graph produces multi-level clustering results while identifying natural clusters and noise points at the same time. The specific steps for selecting the representative ensemble forecast members are as follows:
Step 3.1), let the number of layers of the current layer be L; for each layer of minimum distance connected graphs, screen out the non-noise minimum distance connected graphs as that layer's set of clusters to be screened;
Step 3.2), compare the number of clusters to be screened, from layer L up to the first layer, with the preset number threshold range until the number of clusters to be screened of some layer falls within the preset number threshold range, and combine that layer's clusters to be screened into the final set of clusters to be screened. The preset number threshold range is preferably 3 to 5 (in general 3 to 5 clusters works best: too many clusters lower the credibility of each representative member, while too few make the representative members less informative). If no layer satisfies this condition, the noise is considered too heavy and the ensemble members are regarded as non-clusterable.
Step 3.3), for each minimum distance connected graph in the final set of clusters to be screened, screen out the vertex closest to the cluster center as the representative vertex of that minimum distance connected graph, obtaining the representative vertex of every minimum distance connected graph in the final set of clusters to be screened;
Step 3.4), take the ensemble numerical weather forecast member corresponding to the representative vertex of each minimum distance connected graph in the set of clusters to be screened as a representative member.
Finally, the percentage of forecast members in the cluster containing each representative member relative to the total number of forecast members (i.e. the percentage of vertices of the minimum distance connected graph containing the representative vertex relative to the total number of vertices) is output as the credibility of that representative member.
The preset proportional threshold is preferably set to 10%.
The preset number threshold ranges from 3 to 5.
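The selection of representative members and the credibility output of step 3) can be sketched as follows; pick_representatives is an assumed helper name, and the cluster center is taken, as in the text, to be the per-dimension mean of the cluster data.

import numpy as np

def pick_representatives(X, clusters, n_total):
    """X: (n, m) member data; clusters: list of integer id arrays, one per
    final cluster. Returns a list of (representative_id, credibility_percent)."""
    reps = []
    for ids in clusters:
        data = X[ids]                                 # members of this cluster
        center = data.mean(axis=0)                    # cluster center: per-dimension mean
        rep = ids[int(np.argmin(np.linalg.norm(data - center, axis=1)))]
        credibility = 100.0 * len(ids) / n_total      # share of all forecast members
        reps.append((rep, credibility))
    return reps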
To illustrate a practical implementation of the invention (the MDG hierarchical clustering method), the ECMWF global ensemble forecast product is taken as an example; the system flow chart is shown in FIG. 2.
As can be seen from FIG. 2, the forecast product generated by the ECMWF global ensemble forecasting system is clustered by the MDG hierarchical clustering method, taking the number of grid points as the dimension and the total number of members as the data amount; at every forecast time and time resolution a clustering result with 3 to 5 representative members is produced. If there are too many noise points for the members to be clustered, the system reports that the members cannot be clustered and returns all 50 ensemble members as the clustering result. Finally the results are rendered graphically and presented to the forecaster.
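Put together, a hypothetical end-to-end call for one such case (50 members, grid points flattened into one row per member) might look like the sketch below; it performs only the first split layer for brevity and reuses the assumed helper names from the earlier sketches, with random data standing in for the real forecast fields.

import numpy as np

members = np.random.rand(50, 33600)            # stand-in for 50 gridded ensemble members
E = build_min_distance_graph(members)          # step 1): minimum distance connected graph
T, subgraphs = split_by_max_gap(E)             # step 2): one split layer only
clusters = [np.unique(g[:, :2]).astype(int) for g in subgraphs]
clusters = [ids for ids in clusters if len(ids) > 0.10 * len(members)]  # drop sparse clusters (10% rule)
reps = pick_representatives(members, clusters, len(members))
for rep, cred in reps:
    print(f"representative member {rep}: credibility {cred:.0f}%")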
Take as an example the surface air temperature forecast initialized at 00 UTC on 31 August 2016 with a 6-hour lead time. At this clustering time the 50 ensemble members undergo one MDG hierarchical clustering, and the surface grid data of each representative member are then bilinearly interpolated to the Nanjing station to examine the credibility of the ensemble forecast of the 6-hour surface air temperature at Nanjing. The result of the MDG hierarchical clustering method is as follows:
[Table: representative members selected for this case, with their credibility and their forecast surface air temperature at the Nanjing station.]
The credibility is the percentage of forecast members in the cluster containing the representative member relative to the total number of forecast members (50).
According to the TS scoring method, an air temperature forecast is counted as a hit when its absolute error is within 2 °C (see the article 'Forecast of the summer maximum air temperature in the Jiangsu region based on Kalman filtering and the MOS method'). The table shows that the representative ensemble members numbered 6, 15 and 39 are hits, so the clustering result for this case consists of ensemble members 6, 15 and 39 with credibilities of 36%, 28% and 16% respectively, and the accuracy is hit credibility / total credibility = 80% / 92% ≈ 87%.
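The accuracy arithmetic can be reproduced directly from the figures above:

credibilities = {6: 36, 15: 28, 39: 16}   # hit representative members, in percent
total_credibility = 92                    # sum over all clustered members, in percent
hits = sum(credibilities.values())        # 80
print(hits / total_credibility)           # ≈ 0.87, i.e. the 87% accuracy quoted above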
Finally, the clustering results of the table above, which refer to a single time, are extended to a time period. Data: the 6-hour forecasts of the 00Z and 12Z runs from June to August 2016; locations: Nanjing, Xuzhou and Sheyang; clustered element: the surface air temperature forecasts of the 50 ensemble members. The three-month MDG hierarchical clustering results are compared with the Ward method and the tubing method as follows:
Location | Tubing method accuracy Cmean | Ward accuracy Cmean | MDG hierarchical clustering accuracy Cmean
Nanjing | 67% | 70% | 80%
Xuzhou | 73% | 75% | 85%
Sheyang | 76% | 78% | 86%
Here Ci is the accuracy of the ensemble-member clustering result of a single forecast run, runs that could not be clustered being excluded, and
Cmean = (sum of Ci) / n, with n = 92 days × 2 runs/day − the number of non-clusterable runs.
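A minimal sketch of how Cmean could be computed from the per-run accuracies under the definition above; the function name and arguments are assumptions.

def c_mean(per_run_accuracy, days=92, runs_per_day=2, non_clusterable=0):
    """Average the per-run accuracies Ci over the runs that produced a clustering result."""
    n = days * runs_per_day - non_clusterable
    return sum(per_run_accuracy) / n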
The MDG hierarchical clustering results in the table are better than those of Ward clustering and the tubing method, with accuracies above 80%. This is because the MDG hierarchical clustering method can remove noise, so the interference of sparse clusters is screened out and only the clustering accuracy of the natural clusters is considered, and because it avoids the unsuitability for dynamically changing time-series data that fixed core-parameter settings cause. Geographically, the clustering accuracy at Nanjing is generally lower: the Nanjing area is affected by the surrounding mountains and rivers, and its weather situation is more complicated and changeable than at the other two locations, which lowers the overall accuracy of the ensemble numerical forecast and therefore also of the MDG hierarchical clustering result based on it.
Appendix: all of the above results and conclusions were obtained with the following computer hardware and software environment:
[Table: hardware and software environment used for the experiments.]
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The above embodiments further describe the objects, technical solutions and advantages of the present invention in detail. It should be understood that the above is only a specific embodiment of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (3)

1. A hierarchical clustering method for ensemble numerical weather forecast members, comprising the following steps:
Step 1), establish a minimum distance connected graph according to the data characteristics of the ensemble numerical weather forecast members; the minimum distance connected graph has no direction and no loop, and consists of the unique identifiers of n vertices and the n-1 edges connecting them, the n-1 edges being formed by joining the n vertices at nearest distance;
Step 1.1), let
X = [ [id_1, x_1], [id_2, x_2], …, [id_n, x_n] ],
where x_i = {x_i1, x_i2, …, x_im} is the data of the i-th vertex, 1 ≤ i ≤ n, n is the total number of vertices, i.e. the total number of ensemble numerical weather forecast members, m is the data dimension of an ensemble numerical weather forecast member, id_i is the unique identifier of the data of the i-th ensemble numerical weather forecast member, and vertex id_i is the i-th vertex; and let the nearest edge matrix E11 initially be an empty matrix;
randomly select the i-th vertex of X, i.e. vertex id_i, and compute the Euclidean distances from vertex id_i to every other vertex to generate the distance matrix
XD = [ [id_i, id_j, d_ij] ]  (one row for each j, 1 ≤ j ≤ n, j ≠ i),
where d_ij is the Euclidean distance from vertex id_i to vertex id_j; and let the set EX = {id_i};
Step 1.2), find the minimum distance d_i_min in the distance matrix XD, append the corresponding row [id_i, id_j, d_i_min] of the distance matrix XD to the nearest edge matrix E11, and then delete that row from the distance matrix XD;
Step 1.3), add vertex id_j to the set EX, and compute the distances from id_j to the vertices outside the set EX to generate the distance matrix
XD2 = [ [id_j, id_p, d_jp] ]  (one row for each vertex id_p outside EX),
where p is the number of vertices outside the set EX and d_jp is the Euclidean distance from vertex id_j to vertex id_p;
Step 1.4), merge the distance matrix XD and the distance matrix XD2 into a new distance matrix XD;
Step 1.5), repeat steps 1.2) to 1.4) until the number of vertices in the set EX equals n;
Step 1.6), generate the minimum distance connected graph MDG[ID11, E11] from the set ID11 = [id_1, …, id_n] and the nearest edge matrix E11;
Step 2), split the data into clusters layer by layer using the maximum difference values of the minimum distance connected graph and eliminate noise points:
Step 2.1), take the minimum distance connected graph MDG[ID11, E11] as the first-layer minimum distance connected graph;
Step 2.2), split the minimum distance connected graph MDG[ID11, E11] into several second-layer minimum distance connected graphs;
Step 2.2.1), for every row of the nearest edge matrix E11 except the first, compute the difference between its third-column value and the third-column value of the previous row, and take the maximum difference dd1; in the nearest edge matrix E11, average the third-column value of the row corresponding to dd1 and the third-column value of the next row to obtain the mean ddt1;
Step 2.2.2), split the nearest edge matrix E11 of the minimum distance connected graph MDG[ID11, E11] according to the mean ddt1 to form a set of minimum distance connected graphs, and take this set as the second-layer minimum distance connected graph set; the specific steps of splitting the nearest edge matrix of a minimum distance connected graph according to a mean value and forming a set of minimum distance connected graphs are as follows;
Step 2.2.2.1), denote the mean value by T and the nearest edge matrix of the minimum distance connected graph by EE, and split the nearest edge matrix EE according to the mean value T to obtain several split nearest edge matrices;
Step 2.2.2.1.1), delete from the nearest edge matrix EE all rows whose third-column value is greater than the mean value T, and create a new matrix EA;
Step 2.2.2.1.2), set the matrix EA to an empty matrix, place the first row of the nearest edge matrix EE at the end of the matrix EA, and delete it from the nearest edge matrix EE;
Step 2.2.2.1.3), for each row of the matrix EA, search the first and second columns of the nearest edge matrix EE for a value equal to that row's first-column or second-column value; if such a row exists, place it at the end of the matrix EA and delete it from the nearest edge matrix EE;
Step 2.2.2.1.4), repeat step 2.2.2.1.3) until the first and second columns of the nearest edge matrix EE share no value with the first and second columns of the matrix EA;
Step 2.2.2.1.5), create an empty matrix and assign the values of the matrix EA to it, obtaining one split nearest edge matrix;
Step 2.2.2.1.6), repeat steps 2.2.2.1.2) to 2.2.2.1.5) until EE is an empty matrix, obtaining several split nearest edge matrices that form the matrix set BB;
Step 2.2.2.2), for each nearest edge matrix in the matrix set BB, extract the unique identifiers of the vertices of its edges to obtain the corresponding vertex set, and generate the minimum distance connected graph corresponding to that vertex set;
Step 2.2.2.3), generate the second-layer minimum distance connected graph set from the minimum distance connected graphs corresponding to the nearest edge matrices in the matrix set BB;
Step 2.2.3), mark the noise points in the second-layer minimum distance connected graph set, where the noise points in a set of minimum distance connected graphs are marked as follows: for the vertex set of each minimum distance connected graph in the set, check in turn whether the number of vertices it contains is less than or equal to a preset proportional threshold multiplied by n; if so, that minimum distance connected graph is a sparse cluster and is marked as a noise point;
Step 2.2.4), mark the natural clusters in the second-layer minimum distance connected graph set, where the natural clusters in a set of minimum distance connected graphs are marked as follows: for each minimum distance connected graph in the set, check whether its nearest edge matrix passes the normality and exponential distribution tests; if so, mark that minimum distance connected graph as a natural cluster;
Step 2.3), take the second-layer minimum distance connected graph set as the current-layer minimum distance connected graph set;
Step 2.4), split the current-layer minimum distance connected graph set;
Step 2.4.1), for each nearest edge matrix corresponding to a minimum distance connected graph of the current layer that is neither a noise point nor a natural cluster, compute the difference between the third-column value of each row except the first and the third-column value of the previous row, and take the maximum difference dd2 over these nearest edge matrices;
Step 2.4.2), for the nearest edge matrix containing dd2, average the third-column value of the row corresponding to dd2 and the third-column value of the next row to obtain the mean ddt2;
Step 2.4.3), split the nearest edge matrix containing dd2 according to the mean ddt2 to form the next-layer minimum distance connected graph set;
Step 2.4.4), mark the noise points and natural clusters in the next-layer minimum distance connected graph set;
Step 2.4.5), add the minimum distance connected graphs of the current layer other than the one corresponding to dd2 to the next-layer minimum distance connected graph set;
Step 2.5), take the next-layer minimum distance connected graph set as the current-layer minimum distance connected graph set;
Step 2.6), repeat steps 2.4) to 2.5) until the current-layer minimum distance connected graph set contains no minimum distance connected graph other than noise points and natural clusters;
Step 3), find the representative ensemble forecast members and finish the clustering:
Step 3.1), let the number of layers of the current layer be L; for each layer of minimum distance connected graphs, screen out the non-noise minimum distance connected graphs as that layer's set of clusters to be screened;
Step 3.2), compare the number of clusters to be screened, from layer L up to the first layer, with a preset number threshold range until the number of clusters to be screened of some layer falls within the preset number threshold range, and combine that layer's clusters to be screened into the final set of clusters to be screened;
Step 3.3), for each minimum distance connected graph in the final set of clusters to be screened, screen out the vertex closest to the cluster center as the representative vertex of that minimum distance connected graph, obtaining the representative vertex of every minimum distance connected graph in the final set of clusters to be screened;
Step 3.4), take the ensemble numerical weather forecast member corresponding to the representative vertex of each minimum distance connected graph in the set of clusters to be screened as a representative member.
2. The hierarchical clustering method for ensemble numerical weather forecast members according to claim 1, characterized in that the preset proportional threshold is preferably set to 10%.
3. The hierarchical clustering method for ensemble numerical weather forecast members according to claim 1, characterized in that the preset number threshold range is 3 to 5.