CN109117861A

CN109117861A - A multi-level cluster analysis method for point sets that takes spatial position into account

Info

Publication number: CN109117861A
Application number: CN201810696862.0A
Authority: CN
Inventors: 虞昌彬; 郭仁忠; 庞超逸; 杨建刚; 赵志刚; 贺彪
Original assignee: Ningbo Institute of Technology of ZJU
Current assignee: Ningbo Institute of Technology of ZJU
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2019-01-01
Anticipated expiration: 2038-06-29
Also published as: CN109117861B

Abstract

The present invention relates to a multi-level cluster analysis method for point sets that takes spatial position into account, comprising the following six steps: (1) preliminary judgment of the existence of spatial clustering based on graded statistical maps; (2) accurate judgment of the existence of spatial clustering based on spatial autocorrelation; (3) accurate judgment of the type of spatial clustering based on spatial autocorrelation; (4) accurate delineation of the spatial clustering regions based on spatial autocorrelation; (5) accurate delineation of the spatial distribution of clustering anomalies based on spatial autocorrelation; (6) accurate location of the points contained in the spatial clustering regions based on a clustering algorithm. The method uses a progressive multi-level judgment structure, so that if one level is not satisfied the next level is not entered. The levels are closely related and progressive, which matches people's cognitive needs and habits. The method therefore computes not only correctly and accurately, but also quickly and well.

Description

A multi-level cluster analysis method for point sets that takes spatial position into account

Technical field

The present invention relates to the technical fields of computer science and geographic information science, and more particularly to a multi-level cluster analysis method for point sets that takes spatial position into account.

Background art

In the real world, things often follow the objective law that "birds of a feather flock together". A series of models, methods, and algorithms for data clustering has therefore emerged, most notably in the field of computer science and technology. The ten major algorithms of machine learning include the C4.5 algorithm, the K-means algorithm, the support vector machine (SVM) algorithm, the Apriori association algorithm, the expectation-maximization (EM) algorithm, the PageRank algorithm, the AdaBoost iterative algorithm, the K-nearest neighbor (KNN) algorithm, the naive Bayes (NB) algorithm, and the classification and regression trees (CART) algorithm. Of these, K-means and EM are clustering algorithms. Clustering is a typical unsupervised machine learning method that works on unlabeled data. It includes partition-based clustering (typically the K-means algorithm), hierarchy-based clustering (typically the DIANA and AGNES algorithms), density-based clustering (typically the DBSCAN algorithm), and probability-based clustering (typically the EM algorithm). New clustering methods for such traditional data (often multidimensional or high-dimensional data) continue to be designed, and are gradually being widely studied and applied.

It is worth noting that 80% of the data in the real world is closely related to the spatial domain (spatial position or geographic location). In other words, most real-life data and information are tied to spatial position. Tens of thousands of data items are generated every hour or even every minute, and all of them carry spatial position features; analyzing their clustering from a spatial perspective (spatial clustering for short) is therefore very valuable work.

Following the clustering methods and algorithms for traditional data in computer science and technology, existing spatial clustering methods and algorithms in geographic information science are likewise roughly classified as partition-based, hierarchy-based, density-based, and grid-based spatial clustering methods, among others. Typically, two-dimensional spatial clustering can be regarded as cluster analysis of data with only two dimensions (only the two attribute columns X and Y coordinates, or longitude and latitude), and its results can be displayed intuitively on a two-dimensional spatial domain (typically a planar map). Similarly, three-dimensional spatial clustering can be regarded as cluster analysis of data with only three dimensions (only the three attribute columns X, Y, and Z coordinates, or longitude, latitude, and height), and its results can be displayed intuitively on a three-dimensional spatial domain (typically a three-dimensional globe).

Besides spatial clustering models and methods, spatial aggregation can also be analyzed by spatial autocorrelation judgment. Spatial autocorrelation arises from the First Law of Geography (Tobler's First Law, or Tobler's First Law of Geography): everything is related to everything else, but things that are closer in space are more closely related (original text: "Everything is related to everything else, but near things are more related to each other"). Based on this law, the correlation of things or attributes in their spatial distribution has the following possible outcomes (as shown in Fig. 1): (1) Positive spatial correlation: neighboring regions have the same or similar attribute values, as shown in Fig. 1(a). In other words, if the attribute value of a variable is high around high places and low around low places, the variable is positively spatially correlated, which indicates that its attribute value has a spatial diffusion characteristic. (2) Negative spatial correlation: neighboring regions have different attribute values, as shown in Fig. 1(b). In other words, if the attribute value is low around high places and high around low places, the variable is negatively spatially correlated, which indicates that its attribute value has a spatial polarization characteristic. (3) No spatial correlation: the attribute of the variable is randomly distributed in space, so spatial autocorrelation is not evident, as shown in Fig. 1(c).

Although both spatial clustering methods and spatial autocorrelation models can assist spatial aggregation analysis, spatial clustering methods are still the most frequently used, so spatial clustering is explained here before the content of the present invention is set out.

A spatial clustering method can be expressed in the following general form, as shown in formula (1):

SDCA=(S, m, d, Dz, Ag, q) (1)

where SDCA is the abbreviation of Spatial Data Clustering Analysis;

S (the initial of Spatial) denotes the spatial data set, S={ O1, O2 ..., On };

m (short for number) denotes the total number of data objects in the spatial data set;

d (the initial of dimension) denotes the dimensionality of the spatial data set;

Dz denotes the similarity measure used for the specific clustering. In spatial clustering, similarity is usually measured by spatial distance, which can be visualized; in traditional clustering, distances across hundreds or thousands of dimensions usually cannot be visualized;

Ag (short for Algorithm) denotes the specific algorithm used for clustering, described in detail later;

q (short for Convergence) denotes the termination condition (completion or convergence condition) of the spatial clustering algorithm. Some clustering algorithms obtain the cluster result directly after a finite number of operations, while others iterate continuously until convergence to obtain the final cluster result.

Thousands of research papers on clustering have been published so far, and the traditional clustering algorithm system essentially revolves around the parameter Ag above. In general, Ag can be divided into the following six classes: (1) Partition-based methods: the data set is randomly divided into k subsets, and an iterative relocation technique then tries to move data objects from one cluster to another so as to keep improving the quality of the clustering, for example the K-means algorithm. (2) Hierarchy-based methods: the given set of data objects is decomposed hierarchically; according to how the hierarchy is formed, these methods fall into two major classes, agglomerative and divisive, for example the agglomerative algorithm based on minimum distance. (3) Density-based methods: clusters are generated according to the density of neighboring objects or some density function, so that each class must contain at least a certain number of points within a given region, for example the DBSCAN algorithm. (4) Grid-based methods: the object space is quantized into a finite number of cells that form a grid structure, and all clustering operations are carried out on the grid, which greatly increases clustering speed, for example the STING algorithm. (5) Model-based methods: a model is assumed for each class, and the data that best fit the given model are sought, for example the COBWEB algorithm. (6) Probability-based methods: clustering methods based on probability estimation, for example the EM algorithm and kernel density estimation. In addition, many methods combine these approaches, such as clustering algorithms that combine grid-based and density-based ideas. The clustering algorithms mentioned above are shown in Fig. 2.

Each of the above types of clustering algorithm has its own characteristics in terms of time complexity, space complexity, scalability, cluster shape, independence from input order, noise handling, and the data types it can process. Specifically:

(1) Algorithm efficiency: many clustering algorithms cluster well on relatively small data sets of fewer than 200 data objects, but a large data set may contain millions, tens of millions, or even more objects and records. Sampling can reduce the amount of data to be processed, but it can distort the cluster result or even produce wrong results. Highly scalable clustering algorithms are therefore needed;

(2) Ability to handle different attribute types: many clustering methods can only cluster numerical input data. In practical data mining, however, input data types are diverse and far from ideal, so the ability of different clustering algorithms to handle different data types has to be considered;

(3) Ability to handle noisy data: most real-world databases contain isolated points, missing values, unknown data, or erroneous data. Some clustering methods are sensitive to such data, which may lead to low-quality cluster results;

(4) Ability to handle high-dimensional data: a database or data warehouse may contain many dimensions or attributes. Many clustering methods are good at low-dimensional data involving perhaps only two or three dimensions. Clustering data objects in a high-dimensional space is very challenging, especially because such input data may be very sparse and highly skewed;

(5) Interpretability and usability: users want cluster results that can be explained, understood, and used. In other words, clustering may need to be linked with specific semantic interpretations and applications;

(6) Domain knowledge needed to determine input parameters: many clustering methods require the user to input certain parameters, such as the desired number of classes, and the cluster result is very sensitive to these parameters. In practice the parameters are usually hard to determine, especially for data sets that contain high-dimensional objects;

(7) Sensitivity to input order: some clustering algorithms are very sensitive to the order of the input data. For example, the same data set fed to an algorithm in different orders may produce completely different cluster results, which is usually undesirable;

(8) Ability to find clusters of arbitrary shape: many clustering methods determine the cluster result from distance, and algorithms based on distance measures tend to find spherical classes of similar scale and density. A cluster, however, may take many shapes, such as linear, ring-shaped, concave, and other complex irregular shapes.

Unlike judging spatial aggregation with a single spatial clustering algorithm, and unlike judging it with a single spatial autocorrelation model, the present invention proposes a multi-level spatial clustering method that takes spatial position into account. The method has multiple levels that are progressive, and it is suited to judging the spatial aggregation of point set objects. Its judgment process is intuitive and progressive and matches people's cognitive needs, as explained below.

Summary of the invention

The technical problem to be solved by the present invention is to provide a multi-level cluster analysis method for point sets that takes spatial position into account. The method uses a progressive multi-level judgment structure, so that if one level is not satisfied the next level need not be entered, which makes it "compute quickly". The multi-level judgment mode also matches people's habit and need of understanding things from the superficial to the deep, which makes it "compute well". The levels are progressive and match people's cognitive needs and habits. The method therefore not only "computes correctly" and "computes accurately", but also "computes quickly" and "computes well", and it offers a new approach to judging spatial aggregation.

The technical solution adopted by the present invention is a multi-level cluster analysis method for point sets that takes spatial position into account, comprising the following steps:

Step 1: preliminary judgment of the existence of spatial clustering based on graded statistical maps: the graded statistical mapping method is used to judge whether spatial aggregation is suspected to exist;

Step 2: accurate judgment of the existence of spatial clustering based on spatial autocorrelation: the global Moran's I coefficient, one of the spatial autocorrelation coefficients, is used to determine whether spatial aggregation really exists;

Step 3: accurate judgment of the type of spatial clustering based on spatial autocorrelation: if spatial aggregation really exists, the global Getis-Ord coefficient, one of the spatial autocorrelation coefficients, is used to judge whether the clustering is high-value clustering or low-value clustering;

Step 4: accurate delineation of the spatial clustering regions based on spatial autocorrelation: for high-value clustering, the local Getis-Ord coefficient, one of the spatial autocorrelation coefficients, is used to delineate the specific regions of high-value clustering; for low-value clustering, the same local Getis-Ord coefficient is used to delineate the specific regions of low-value clustering;

Step 5: accurate delineation of the spatial distribution of clustering anomalies based on spatial autocorrelation: the local Moran's I coefficient, one of the spatial autocorrelation coefficients, is used to mark off the anomalous regions other than high-value or low-value aggregation;

Step 6: accurate location of the points contained in the spatial clustering regions based on a clustering algorithm: a spatial clustering algorithm is used to locate the points that the spatial clustering regions contain.
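The progressive structure of the six steps, in which a failed level stops the analysis before the later levels run, can be sketched as follows. This is a minimal illustration only: the function name `run_levels` and the toy checks are hypothetical stand-ins, not part of the described method.

```python
def run_levels(levels):
    """Run (name, check) pairs in order and stop at the first level that fails.

    Each check receives the results gathered so far and returns
    (passed, result). All names here are hypothetical.
    """
    results = {}
    for name, check in levels:
        passed, result = check(results)
        results[name] = result
        if not passed:
            break  # later, more expensive levels are skipped
    return results


# Toy stand-ins for the real analyses (invented values, for illustration only).
levels = [
    ("suspected", lambda r: (True, "graded map shows a dense patch")),
    ("exists", lambda r: (False, "global Moran's I not significant")),
    ("type", lambda r: (True, "high-value clustering")),
]
outcome = run_levels(levels)
```

In this toy run the second level fails, so the third level is never evaluated, which is the behavior the text calls "compute quickly".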

The beneficial effects of the present invention are as follows. The above multi-level cluster analysis method for point sets that takes spatial position into account differs from judging spatial aggregation with a single spatial clustering algorithm, and also from judging it with a single spatial autocorrelation model. In this method the levels are closely related and progressive, which matches people's cognitive needs and habits. The progressive levels answer, in order, the questions "does clustering exist (answer: yes or no)", "if clustering exists, what type is it (answer: high-value clustering or low-value clustering)", "how are the specific clustering regions delineated (answer: the specific aggregation regions)", and "which points do the specific clustering regions contain (answer: the points contained in the specific aggregation regions)". If one level is not satisfied the next level need not be entered, which makes the method "compute quickly"; the multi-level judgment mode also matches people's habit and need of understanding things from the superficial to the deep, which makes it "compute well". In summary, the proposed method not only "computes correctly" and "computes accurately", but also "computes quickly" and "computes well".

Preferably, in step 1, the graded statistical mapping method is as follows: based on the statistics of each regional division unit, grades are divided according to the density, intensity, or development level of the phenomenon; then, according to the grade, each region on the map is filled with a color of a different depth or with hatching of a different density, so as to show the differences between the regional division units.

Preferably, in step 1, the graded statistical mapping method uses three indices: PN, PD, and HR, where PN denotes the number of people in a specific group, PD denotes the density of the specific group, and HR denotes the proportion of the specific group in the whole population. The three indices are calculated as shown in formula (1) and formula (2):

$$PD_{(i)} = \frac{PN_{(i)}}{aa_{(i)}} \tag{1}$$

$$HR_{(i)} = \frac{PN_{(i)}}{cn_{(i)}} \tag{2}$$

where aa_(i) denotes the area of the i-th administrative division, cn_(i) denotes the population base of the i-th administrative division, and PN_(i) is the value of PN in the i-th administrative division.
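The indices and the grading step can be sketched in Python. The sketch assumes that PD is PN divided by the area aa and that HR is PN divided by the population base cn; the equal-interval grading rule and both function names are illustrative assumptions, since the text does not fix a particular classification scheme.

```python
def grade_indices(pn, area, population):
    """Return (PD, HR) for one administrative division.

    Assumes PD = PN / area and HR = PN / population.
    """
    if area <= 0 or population <= 0:
        raise ValueError("area and population must be positive")
    return pn / area, pn / population


def equal_interval_grades(values, k):
    """Assign each value a grade 0..k-1 using equal-width intervals.

    Equal intervals are one common choice for graded maps; other schemes
    (quantiles, natural breaks) would serve equally well here.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value falls on the upper edge, so clamp it to grade k-1.
    return [min(int((v - lo) / width), k - 1) for v in values]
```

Each grade would then be mapped to a color depth or hatching density when the map is drawn.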

Preferably, in step 2, the global Moran's I coefficient is calculated as shown in formula (3) and formula (4):

$$I = \frac{n}{S_0}\cdot\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j}\, z_i z_j}{\sum_{i=1}^{n} z_i^{2}} \tag{3}$$

$$S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j} \tag{4}$$

where z_i is the deviation of the attribute value of feature i from its mean, w_{i,j} is the spatial weight between feature i and feature j, n is the total number of features, and S_0 is the sum of all spatial weights.
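A minimal pure-Python sketch of the global Moran's I statistic in its standard form, I = (n / S0) · Σ_i Σ_j w_ij z_i z_j / Σ_i z_i², is given below. The function name and the list-of-lists weight matrix are illustrative choices; a significance test (z-score or permutation) would still be needed before concluding that aggregation exists.

```python
def global_morans_i(x, w):
    """Global Moran's I for attribute values x and spatial weight matrix w.

    x: list of n attribute values; w: n-by-n list of lists with w[i][j]
    the weight between feature i and feature j.
    """
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]                       # deviations from the mean
    s0 = sum(w[i][j] for i in range(n) for j in range(n))
    cross = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(v * v for v in z)


# Four features on a line; each is a neighbor of the next one.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
trend = global_morans_i([1, 2, 3, 4], chain)        # similar values adjacent
checker = global_morans_i([1, 0, 1, 0], chain)      # alternating values
```

Here `trend` is positive and `checker` is negative, matching the positive and negative spatial correlation cases described in the background.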

Preferably, in step 3, the global Getis-Ord coefficient is calculated as shown in formula (5):

$$G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, x_i x_j}{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j},\quad \forall j \neq i \tag{5}$$

where x_i and x_j are the attribute values of feature i and feature j, w_ij is the spatial weight between feature i and feature j, n is the total number of features in the data set, and ∀j≠i indicates that feature i and feature j cannot be the same feature.
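The global Getis-Ord statistic (General G) in its standard form, G = Σ_i Σ_{j≠i} w_ij x_i x_j / Σ_i Σ_{j≠i} x_i x_j, can be sketched as follows. The helper `expected_general_g` returns the usual expected value under spatial randomness, Σ w_ij / (n(n−1)); comparing G with it only hints at the clustering type, and a z-score is still required for a firm judgment. Both function names are illustrative.

```python
def general_g(x, w):
    """Getis-Ord General G for attribute values x and weight matrix w."""
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    num = sum(w[i][j] * x[i] * x[j] for i, j in pairs)
    den = sum(x[i] * x[j] for i, j in pairs)
    return num / den


def expected_general_g(w):
    """Expected General G when attribute values are spatially random."""
    n = len(w)
    total = sum(w[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
g = general_g([1, 2, 3, 4], chain)
g_expected = expected_general_g(chain)
```

An observed G above the expected value points toward high-value clustering and one below it toward low-value clustering.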

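Preferably, in step 4, the local Getis-Ord coefficient is calculated as shown in formulas (6) to (8):

$$G_i^{*} = \frac{\sum_{j=1}^{n} w_{i,j}\, x_j - \bar{X}\sum_{j=1}^{n} w_{i,j}}{S\sqrt{\dfrac{n\sum_{j=1}^{n} w_{i,j}^{2} - \left(\sum_{j=1}^{n} w_{i,j}\right)^{2}}{n-1}}} \tag{6}$$

$$\bar{X} = \frac{\sum_{j=1}^{n} x_j}{n} \tag{7}$$

$$S = \sqrt{\frac{\sum_{j=1}^{n} x_j^{2}}{n} - \bar{X}^{2}} \tag{8}$$

where x_j is the attribute value of feature j, w_{i,j} is the spatial weight between feature i and feature j, and n is the total number of features.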
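The local Getis-Ord statistic can be sketched with the standard Gi* form, in which each feature counts as its own neighbor (w_ii > 0) and the result is already a z-score. The function name and the toy data are illustrative assumptions.

```python
import math


def getis_ord_gi_star(x, w):
    """Gi* z-score for every feature.

    w[i][j] is the spatial weight between i and j; the diagonal is expected
    to be non-zero because Gi* includes the feature itself. A row in which
    every weight is equal would make the denominator zero.
    """
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum(v * v for v in x) / n - mean * mean)
    scores = []
    for i in range(n):
        w_sum = sum(w[i])
        w_sq = sum(v * v for v in w[i])
        num = sum(w[i][j] * x[j] for j in range(n)) - mean * w_sum
        den = s * math.sqrt((n * w_sq - w_sum * w_sum) / (n - 1))
        scores.append(num / den)
    return scores


# Five features on a line, each linked to itself and its direct neighbors.
band = [[1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1]]
scores = getis_ord_gi_star([10, 10, 1, 1, 1], band)
```

Large positive scores mark hot spots (high-value regions) and large negative scores mark cold spots (low-value regions); the significance cut-off is a separate choice.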

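Preferably, in step 5, the local Moran's I coefficient is calculated as shown in formula (9):

$$I_i = \frac{x_i - \bar{X}}{S_i^{2}}\sum_{j=1,\, j\neq i}^{n} w_{i,j}\,(x_j - \bar{X}), \qquad S_i^{2} = \frac{\sum_{j=1,\, j\neq i}^{n} (x_j - \bar{X})^{2}}{n-1} \tag{9}$$

where x_i is the attribute value of feature i, \bar{X} is the mean of the corresponding attribute, w_{i,j} is the spatial weight between feature i and feature j, and n is the total number of features.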
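A sketch of the local Moran's I statistic follows, using the commonly documented form I_i = (x_i − X̄) / S_i² · Σ_{j≠i} w_ij (x_j − X̄) with S_i² = Σ_{j≠i} (x_j − X̄)² / (n − 1). That this is exactly the form intended by formula (9) is an assumption, and the function name is illustrative.

```python
def local_morans_i(x, w):
    """Local Moran's I for every feature (standard Anselin form, assumed)."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        s2 = sum(z[j] ** 2 for j in others) / (n - 1)
        lag = sum(w[i][j] * z[j] for j in others)   # weighted neighbor deviation
        values.append(z[i] / s2 * lag)
    return values


chain = [[0, 1, 0, 0, 0],
         [1, 0, 1, 0, 0],
         [0, 1, 0, 1, 0],
         [0, 0, 1, 0, 1],
         [0, 0, 0, 1, 0]]
local_i = local_morans_i([10, 10, 1, 1, 1], chain)
```

A positive I_i means the feature resembles its neighbors (part of a high-high or low-low cluster); a negative I_i flags a feature that differs from its neighbors, which is the kind of anomalous region step 5 looks for.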

Preferably, the global Moran's I coefficient, the global Getis-Ord coefficient, the local Getis-Ord coefficient, and the local Moran's I coefficient used in the above steps all require a spatial weight matrix. The strategies that can be used to compute the spatial weight matrix include: inverse distance, inverse distance squared, fixed distance band, zone of indifference, K nearest neighbors, edge contiguity, edge-and-corner contiguity, and Delaunay triangulation.
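Two of the listed strategies, inverse distance and K nearest neighbors, can be sketched as below. The function names are illustrative, coincident points are given weight 0 as an assumption, and `math.dist` requires Python 3.8 or later.

```python
import math


def inverse_distance_weights(points, power=1):
    """w[i][j] = 1 / d(i, j) ** power; power=2 gives inverse distance squared."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.dist(points[i], points[j])
                if d > 0:                  # coincident points keep weight 0
                    w[i][j] = 1.0 / d ** power
    return w


def knn_weights(points, k):
    """w[i][j] = 1 if j is one of the k nearest neighbors of i, else 0."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            w[i][j] = 1.0
    return w
```

Note that a K-nearest-neighbor matrix is generally not symmetric, because i may be among the nearest neighbors of j without the reverse being true.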

Preferably, in step 6, the spatial clustering algorithm is a bottom-up merging algorithm, which comprises the following steps:

(1) Regard the n spatial points as n subgroups, so that each subgroup contains exactly one spatial point, and then compute the relationships among these n subgroups according to the selected cluster merging criterion;

(2) Merge two subgroups into one new subgroup according to the cluster subgroup merging criterion, which leaves n-1 subgroups;

(3) Recompute the pairwise cluster statistics among the n-1 subgroups and merge subgroups again according to the same criterion, which leaves n-2 subgroups;

(4) Repeat the above steps until all subgroups have been merged into one large subgroup.
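The four steps above can be sketched as a naive bottom-up merge loop. The sketch assumes Euclidean distance and single linkage as the merging criterion, and adds an optional `n_clusters` stop so the loop can halt before everything is merged; both the function name and that parameter are illustrative additions.

```python
import math


def agglomerate(points, n_clusters=1):
    """Merge the two closest subgroups repeatedly (single linkage assumed).

    Returns the final subgroups (lists of point indices) and the merge
    history as (subgroup_r, subgroup_s, distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]    # step (1): n singletons
    merges = []

    def single_link(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:               # steps (2)-(4)
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                d = single_link(clusters[r], clusters[s])
                if best is None or d < best[0]:
                    best = (d, r, s)
        d, r, s = best
        merges.append((clusters[r], clusters[s], d))
        merged = clusters[r] + clusters[s]
        clusters = [c for k, c in enumerate(clusters) if k not in (r, s)]
        clusters.append(merged)
    return clusters, merges
```

This brute-force version recomputes all pairwise distances on every pass and is meant only to mirror the steps; it is not efficient for large point sets.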

Preferably, the similarity between spatial points in the bottom-up merging algorithm is measured with a distance metric criterion, defined as follows: let each vector x have J dimensions, indexed by j; the various distances between two different vectors x_s and x_t are then calculated as follows:

(1) Minkowski distance, calculated as shown in formula (11):

$$d_{st} = \left(\sum_{j=1}^{J} \left|x_{sj} - x_{tj}\right|^{p}\right)^{1/p} \tag{11}$$

where p denotes the Minkowski exponent and J is the number of dimensions;

(2) City block distance: when p is 1, the Minkowski distance reduces to the city block distance, calculated as shown in formula (12):

$$d_{st} = \sum_{j=1}^{J} \left|x_{sj} - x_{tj}\right| \tag{12}$$

In two-dimensional space, the city block distance is calculated by formula (13):

$$d_{st} = |x_s - x_t| + |y_s - y_t| \tag{13}$$

In three-dimensional space, the city block distance is calculated by formula (14):

$$d_{st} = |x_s - x_t| + |y_s - y_t| + |z_s - z_t| \tag{14}$$

(3) Euclidean distance: when p is 2, the Minkowski distance reduces to the familiar Euclidean distance, calculated as shown in formula (15):

$$d_{st} = \sqrt{\sum_{j=1}^{J} \left(x_{sj} - x_{tj}\right)^{2}} \tag{15}$$

In two-dimensional space, the Euclidean distance is calculated by formula (16):

$$d_{st} = \sqrt{(x_s - x_t)^{2} + (y_s - y_t)^{2}} \tag{16}$$

In three-dimensional space, the Euclidean distance is calculated by formula (17):

$$d_{st} = \sqrt{(x_s - x_t)^{2} + (y_s - y_t)^{2} + (z_s - z_t)^{2}} \tag{17}$$

(4) Chebyshev distance: when p tends to infinity, the Minkowski distance reduces to the Chebyshev distance, calculated as shown in formula (18):

$$d_{st} = \max_{j} \left|x_{sj} - x_{tj}\right| \tag{18}$$

In two-dimensional space, the Chebyshev distance is calculated by formula (19) (as shown in Fig. 6(c)):

$$d_{st} = \max\left(|x_s - x_t|,\ |y_s - y_t|\right) \tag{19}$$

In three-dimensional space, the Chebyshev distance is calculated by formula (20):

$$d_{st} = \max\left(|x_s - x_t|,\ |y_s - y_t|,\ |z_s - z_t|\right) \tag{20}$$
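The four distances can be covered by one small function, since city block, Euclidean, and Chebyshev distance are the p = 1, p = 2, and p → ∞ cases of the Minkowski distance. The function name is illustrative.

```python
def minkowski(a, b, p):
    """Minkowski distance between two equal-length coordinate tuples.

    p=1 gives city block distance, p=2 Euclidean distance, and
    p=float("inf") Chebyshev distance.
    """
    diffs = [abs(u - v) for u, v in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


s, t = (0, 0), (3, 4)
city_block = minkowski(s, t, 1)             # |3| + |4|
euclidean = minkowski(s, t, 2)              # sqrt(9 + 16)
chebyshev = minkowski(s, t, float("inf"))   # max(|3|, |4|)
```

The same function works unchanged for three-dimensional points such as (x, y, z) tuples.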

Preferably, the above cluster subgroup merging criterion means that the distance between two subgroups decides whether they should be merged; if they can be merged, the two subgroups are selected and merged. Let the two subgroups be subgroup r and subgroup s, containing n_r and n_s objects respectively. The cluster merging criterion can then be set as follows:

Single linkage uses the similarity matrix or distance matrix of the data and defines the between-class distance as the minimum distance between data of the two classes, expressed by formula (21):

D (r, s)=min (dist (x_ri,x_sj)),i∈(1,...,n_r),j∈(1,...,n_s) (21)

Complete linkage uses the similarity matrix or distance matrix of the data and defines the between-class distance as the maximum distance between data of the two classes, expressed by formula (22):

D (r, s)=max (dist (x_ri,x_sj)),i∈(1,...,n_r),j∈(1,...,n_s) (22)

Group average linkage uses the similarity matrix or distance matrix of the data and defines the between-class distance as the average of all pairwise distances between data of the two classes, expressed by formula (23):

$$D(r,s) = \frac{1}{n_r n_s}\sum_{i=1}^{n_r}\sum_{j=1}^{n_s} \mathrm{dist}(x_{ri}, x_{sj}) \tag{23}$$

Centroid distance works from the distance matrix and the raw data and defines the distance as the Euclidean distance (2-norm) between an individual and the centroid of a group, or between the centroids of two groups, expressed by formula (24):

$$D(r,s) = \left\lVert \bar{x}_r - \bar{x}_s \right\rVert_2, \qquad \bar{x}_r = \frac{1}{n_r}\sum_{i=1}^{n_r} x_{ri} \tag{24}$$

Ward linkage aims to minimize, at each step, the increase in the within-group sum of squared deviations; its simplified expression is given by formula (25):

Median linkage computes the centroid of a group by giving equal weight to the two parts that were merged to form it, so the computed centroid is actually the average of the centroids of those two parts; median linkage is expressed by formula (26):

Weighted average linkage, when computing the between-class distance, applies to each distance a weight equal to the reciprocal of the number of members in the class; it is expressed by formula (27):
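Four of the merging criteria (single, complete, group average, and centroid) can be written directly from their verbal definitions, as sketched below with Euclidean distance. The function names are illustrative; the Ward, median, and weighted average criteria are left out because their exact expressions are not reproduced in the text.

```python
import math


def _pairwise(r, s):
    """All Euclidean distances between a point of r and a point of s."""
    return [math.dist(a, b) for a in r for b in s]


def single_linkage(r, s):
    return min(_pairwise(r, s))          # closest pair


def complete_linkage(r, s):
    return max(_pairwise(r, s))          # farthest pair


def average_linkage(r, s):
    d = _pairwise(r, s)
    return sum(d) / len(d)               # mean over all n_r * n_s pairs


def centroid(group):
    n = len(group)
    return tuple(sum(c) / n for c in zip(*group))


def centroid_linkage(r, s):
    return math.dist(centroid(r), centroid(s))


r = [(0, 0), (0, 2)]
s = [(3, 0), (3, 2)]
```

For these two subgroups the single and centroid criteria give the same distance, while the complete and average criteria give larger values because they also account for the diagonal pairs.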

Brief description of the drawings

Fig. 1 is an intuitive schematic diagram of spatial correlation, in which (a) is positive correlation, (b) is negative correlation, and (c) is no correlation;

Fig. 2 is a classification diagram of the various spatial clustering algorithms;

Fig. 3 is a comparison of the advantages, disadvantages, and other characteristics of the various spatial clustering algorithms;

Fig. 4 is the overall technical roadmap;

Fig. 5 is a schematic diagram of the decay trends of the various weights used to compute the spatial weight matrix involved in the spatial autocorrelation coefficients;

Fig. 6 is a schematic diagram of the distance metric criteria involved in hierarchical clustering;

Fig. 7 is a schematic diagram of the subgroup merging criteria involved in hierarchical clustering;

Fig. 8 is a map of the spatial position of Ningbo City and the administrative divisions it contains;

Fig. 9 is a hierarchy chart of the administrative divisions of China;

Fig. 10 is a detailed composition chart of the administrative divisions of Ningbo City (11 districts/counties/county-level cities and 153 subdistricts/towns/townships in total);

Fig. 11 is a chart of the types of infectious disease in China (2 Class A, 26 Class B, 11 Class C);

Fig. 12 shows the graded statistical maps of each index at the district/county/county-level city level (11 in total) of Ningbo City;

Fig. 13 shows the graded statistical maps of each index at the subdistrict/town/township level (153 in total) of Ningbo City;

Fig. 14 shows the hot spot and cold spot analysis results of each index at the subdistrict/town/township level (153 in total) of Ningbo City;

Fig. 15 shows the aggregation and anomalous region analysis results of each index at the subdistrict/town/township level (153 in total) of Ningbo City;

Fig. 16 shows the hierarchical spatial clustering results for infectious disease patients in Ningbo City;

Fig. 17 shows the spatial aggregation analysis results based on kernel density estimation.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and specific embodiments, so that those skilled in the art can implement it by referring to the text of the specification; the scope of the present invention is not limited to the specific embodiments.

The present invention relates to a multi-level cluster analysis method for point sets that takes spatial position into account. The method differs from judging spatial aggregation with a single spatial clustering algorithm, and also from judging it with a single spatial autocorrelation model. It is a multi-level cluster analysis method whose levels are closely related and progressive, which matches people's cognitive needs and habits. It is a spatial aggregation judgment method for point sets.

Specifically, in the proposed method the progressive levels answer the following key questions in order: "does clustering exist (answer: yes or no)", "if clustering exists, what type is it (answer: high-value clustering or low-value clustering)", "how are the specific clustering regions delineated (answer: the specific aggregation regions)", and "which points do the specific clustering regions contain (answer: the points contained in the specific aggregation regions)".

It is worth noting that for any algorithm one generally has to consider whether it "computes correctly (effectiveness)", "computes accurately (efficiency)", "computes quickly (quick)", and "computes well (satisfaction)". The proposed method has a solid mathematical foundation and mathematical proof, which guarantees that it "computes correctly" and "computes accurately". It uses a progressive multi-level judgment structure, so that if one level is not satisfied the next level need not be entered, which makes it "compute quickly". The multi-level judgment mode also matches people's habit and need of understanding things from the superficial to the deep, which makes it "compute well". In summary, the proposed method not only "computes correctly" and "computes accurately", but also "computes quickly" and "computes well", and it offers a new approach to judging spatial aggregation.

To implement the multi-level cluster analysis method for point sets that takes spatial position into account, the six major steps set out above are carried out in order.

Each step is described in detail below.

Step 1: preliminary judgment of the existence of spatial clustering based on graded statistical maps

The content of this step is "judging whether spatial aggregation is suspected to exist", its goal is "a preliminary judgment of whether spatial clustering exists", and its means is the "graded statistical mapping method".

The graded statistical mapping method uses the following three indices: PN, PD, and HR. Specifically, PN denotes the number of people in a specific group and is the abbreviation of Number of Particular people; PD denotes the density of the specific group and is the abbreviation of Density of Particular people; HR denotes the proportion of the specific group in the whole population and is the abbreviation of Rate of ad-Hoc people.

In the calculating process of three above index PN, PD, HR, it is also necessary to by following parameter: setting i-th of administrative division Face domain size be aa_(i)(it is the abbreviation of Administrative Area), the population base of i-th of administrative division are cn_(i) (it is the abbreviation of Census Number).The specific calculating of three above index is as shown in following formula (1) and formula (2):

Particularly, it is believed that this absolute index of PD and HR the two relative indicatrixes ratio PN more has reliably, because being directed to Identical specific crowd number, administrative division unit be to be located in center and economy is relatively flourishing so cause " although area Less, but populous, specific crowd is concentrated " feature, and some administrative divisions are that address is remote and economy falls behind relatively so Cause " although vast in territory, sparse population, specific crowd is very few " feature.For three above index, using ground Hierarchical statistics drawing method in figure drawing, which is given, to be showed, and typically can set color method using classification.Classification based on the above index Statistical chart observation can be realized the preliminary judgement with the presence or absence of spatial aggregation.
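Formulas (1) and (2) are not reproduced in this text. The sketch below assumes the definitions that follow directly from the index descriptions, namely PD_i = PN_i / aa_i and HR_i = PN_i / cn_i; the function name `graded_indices` and the sample numbers are hypothetical and serve only as an illustration.

```python
def graded_indices(pn, aa, cn):
    """Return (PD, HR): density per unit area and rate per capita, per division.

    Assumed definitions: PD_i = PN_i / aa_i and HR_i = PN_i / cn_i.
    """
    pd = [p / a for p, a in zip(pn, aa)]
    hr = [p / c for p, c in zip(pn, cn)]
    return pd, hr


# Hypothetical data: two divisions with the same PN but very different context.
pn = [120, 120]          # specific-population counts
aa = [30.0, 600.0]       # areas of the divisions
cn = [400_000, 80_000]   # census populations
pd, hr = graded_indices(pn, aa, cn)
```

With identical PN, the first division has the higher density while the second has the higher rate, which is why the relative indices carry information that PN alone does not.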

Step 2: Accurate judgment of spatial aggregation existence based on spatial autocorrelation

The content of this step is "accurately judging whether spatial aggregation exists"; the purpose of this step is "an accurate judgment of the existence of spatial aggregation"; the means used in this step is "the global Moran's I coefficient among the spatial autocorrelation coefficients".

In other words, the global Moran's I coefficient is used to judge whether spatial autocorrelation exists; that is, the answer is Yes or No (Yes indicates that spatial autocorrelation exists, and No indicates that it does not).

The global Moran's I coefficient is computed as shown in formula (3) and formula (4):
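Formulas (3) and (4) are not reproduced in this text. As a minimal sketch, the code below uses the widely used textbook definition of global Moran's I, I = (n / S0) · Σ_i Σ_j w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², where S0 is the sum of all weights; the patent's own normalization may differ. The function name and the four-unit example are hypothetical, and the significance test (z-score or permutation) that would normally accompany the coefficient is omitted.

```python
def global_morans_i(x, w):
    """Global Moran's I for values x and an n-by-n spatial weight matrix w.

    Textbook form (an assumption, not taken from the source formulas).
    Diagonal weights are ignored.
    """
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    s0 = sum(w[i][j] for i, j in pairs)
    num = sum(w[i][j] * z[i] * z[j] for i, j in pairs)
    den = sum(v * v for v in z)
    return (n / s0) * (num / den)


# Four units on a line; binary weights, 1 for immediate neighbours.
w_line = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
i_trend = global_morans_i([1, 2, 3, 4], w_line)        # similar values adjacent
i_alternating = global_morans_i([1, 4, 1, 4], w_line)  # dissimilar values adjacent
```

A positive value indicates that similar values tend to be neighbours, and a negative value indicates the opposite; a "Yes" at this level is what allows the analysis to proceed to Step 3.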

Step 3: Accurate judgment of spatial clustering type based on spatial autocorrelation

The content of this step is "if spatial clustering exists, accurately judging whether it is a high-value cluster or a low-value cluster"; the purpose of this step is "judging whether the cluster is a high-value cluster or a low-value cluster"; the means used in this step is "the global Getis-Ord coefficient among the spatial autocorrelation coefficients".

In other words, the global Getis-Ord coefficient answers whether the spatial autocorrelation is high-value or low-value; that is, the answer is high-value cluster or low-value cluster.

The global Getis-Ord coefficient is computed as shown in formula (5):

where x_i and x_j are the attribute values of feature i and feature j, w_ij is the spatial weight between feature i and feature j, n is the total number of features in the data set, and the constraint j ≠ i indicates that feature i and feature j cannot be the same feature. If the spatial weights are binary (i.e., 0 and 1) or are otherwise less than 1, the value of the global Getis-Ord coefficient always lies between 0 and 1.
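Formula (5) is not reproduced in this text. The sketch below assumes the standard Getis-Ord General G, G = Σ_i Σ_{j≠i} w_ij x_i x_j / Σ_i Σ_{j≠i} x_i x_j, which is consistent with the symbol definitions above. The comparison against the expected value S0 / (n(n − 1)) under spatial randomness is also the standard one; function names and data are hypothetical, and the z-score test is omitted.

```python
def general_g(x, w):
    """Getis-Ord General G and its expected value under spatial randomness."""
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    num = sum(w[i][j] * x[i] * x[j] for i, j in pairs)
    den = sum(x[i] * x[j] for i, j in pairs)
    s0 = sum(w[i][j] for i, j in pairs)
    expected = s0 / (n * (n - 1))
    return num / den, expected


# Four units on a line; binary weights, 1 for immediate neighbours.
w_line = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
g_high, e_g = general_g([10, 10, 1, 1], w_line)  # the two high values are neighbours
g_low, _ = general_g([10, 1, 1, 10], w_line)     # the two low values are neighbours
```

An observed G above its expected value suggests high-value clustering and one below suggests low-value clustering; with binary weights both values stay between 0 and 1, as stated above.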

Step 4: Accurate division of the spatial clustering region based on spatial autocorrelation

The content of this step is "if a high-value (or low-value) cluster exists, accurately delimiting the specific region of the high-value (or low-value) cluster"; the purpose of this step is "accurately dividing the specific region of high-value (or low-value) aggregation"; the means used in this step is "the local Getis-Ord coefficient among the spatial autocorrelation coefficients".

In other words, the local Getis-Ord coefficient is used to detect where the specific aggregation region of the high-value (low-value) cluster lies, and it can mark exactly which regions make up that specific region.

The local Getis-Ord coefficient is computed as shown in formula (6) to formula (8):
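Formulas (6) to (8) are not reproduced in this text. A common three-formula statement of the local Getis-Ord statistic is the Gi* z-score together with the global mean X̄ and the global standard deviation S, and the sketch below assumes that form: Gi* = (Σ_j w_ij x_j − X̄ Σ_j w_ij) / (S · sqrt((n Σ_j w_ij² − (Σ_j w_ij)²) / (n − 1))). Whether the patent uses exactly this variant is an assumption; the names and data are hypothetical.

```python
import math


def local_gi_star(x, w):
    """Gi* z-score for every feature; w[i][i] is expected to be non-zero (star form).

    Assumes no feature is a neighbour of all n features, which would make
    the denominator zero.
    """
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum(v * v for v in x) / n - mean ** 2)
    scores = []
    for i in range(n):
        sw = sum(w[i])
        sw2 = sum(v * v for v in w[i])
        num = sum(w[i][j] * x[j] for j in range(n)) - mean * sw
        den = s * math.sqrt((n * sw2 - sw ** 2) / (n - 1))
        scores.append(num / den)
    return scores


# Five units on a line; each unit neighbours itself and its immediate neighbours.
w_star = [[1, 1, 0, 0, 0],
          [1, 1, 1, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 0, 1, 1, 1],
          [0, 0, 0, 1, 1]]
gi = local_gi_star([9, 8, 1, 1, 1], w_star)
```

Large positive scores mark candidate high-value regions (hot spots) and large negative scores mark candidate low-value regions (cold spots); in the example the highest score falls on the first unit, where the two high values sit.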

Step 5: Accurate division of aggregation spatial anomalies based on spatial autocorrelation

The content of this step is "accurately dividing the anomalous regions other than the high-value (or low-value) aggregation"; the purpose of this step is "accurately dividing the anomalous regions outside the conventional aggregation regions"; the means used in this step is "the local Moran's I coefficient among the spatial autocorrelation coefficients".

In other words, the local Moran's I coefficient (i.e., LISA, Local Indicator for Spatial Auto-Correlation), in addition to providing the specific regions of the conventional clusters above (called clusters, i.e., high-high clusters and low-low clusters), gives the specific extent of the anomalies (called outliers, i.e., high-low clusters and low-high clusters).

The local Moran's I coefficient is computed as shown in formula (9):
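Formula (9) is not reproduced in this text. The sketch below assumes one common form of the local Moran's I, I_i = (z_i / m2) Σ_{j≠i} w_ij z_j with z_i = x_i − x̄ and m2 = Σ z_i² / n; other normalizations exist, and the patent's choice is not known from this text. The quadrant labels follow the cluster/outlier vocabulary used above; names and data are hypothetical, and significance testing is omitted.

```python
def local_morans_i(x, w):
    """Return a list of (I_i, label) using an assumed Anselin-style form."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    m2 = sum(v * v for v in z) / n
    out = []
    for i in range(n):
        lag = sum(w[i][j] * z[j] for j in range(n) if j != i)
        if z[i] > 0 and lag > 0:
            label = "high-high"   # conventional cluster
        elif z[i] < 0 and lag < 0:
            label = "low-low"     # conventional cluster
        elif z[i] > 0 and lag < 0:
            label = "high-low"    # outlier
        elif z[i] < 0 and lag > 0:
            label = "low-high"    # outlier
        else:
            label = "undetermined"
        out.append((z[i] * lag / m2, label))
    return out


# Five units on a line; binary weights, 1 for immediate neighbours.
w_line5 = [[0, 1, 0, 0, 0],
           [1, 0, 1, 0, 0],
           [0, 1, 0, 1, 0],
           [0, 0, 1, 0, 1],
           [0, 0, 0, 1, 0]]
lisa = local_morans_i([10, 9, 1, 2, 10], w_line5)
labels = [label for _, label in lisa]
```

In the example the last unit is a high value whose only neighbour is low, so it is reported as a high-low outlier rather than as part of a cluster.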

In summary, the global Moran's I coefficient is used to judge whether spatial autocorrelation exists, i.e., the answer is Yes or No (Yes indicates that spatial autocorrelation exists, and No indicates that it does not); the global Getis-Ord coefficient answers whether the spatial autocorrelation is high-value or low-value, i.e., the answer is high-value cluster or low-value cluster; the local Getis-Ord coefficient detects where the specific spatial region of the high-value (low-value) cluster lies and can mark exactly which regions make it up; the local Moran's I coefficient (i.e., LISA), in addition to providing the specific regions of the clusters (i.e., high-high clusters and low-low clusters), gives the specific extent of the anomalies (called outliers, i.e., high-low clusters and low-high clusters). The four coefficients are applied progressively.

It is worth noting that the global Moran's I coefficient, the global Getis-Ord coefficient, the local Getis-Ord coefficient, and the local Moran's I coefficient (i.e., LISA) all require a spatial weight matrix. The spatial weight matrix can be computed using the following strategies:

(1) Inverse distance strategy: the inverse distance strategy (Inverse Distance, abbreviated ID) means that the influence of one feature on another decreases as the distance between them increases. In other words, compared with distant features, nearby features have a greater influence on the computation for the target feature (as shown in Fig. 5(a));

(2) Inverse distance squared strategy: the inverse distance squared strategy (Inverse Distance Squared, abbreviated IDS) is similar to the inverse distance strategy, but its slope is steeper, so the influence falls off faster and only the nearest neighbors of the target feature have a significant influence on the computation for that feature;

(3) Fixed distance strategy: the fixed distance strategy (Fixed Distance Band, abbreviated FDB) analyzes each feature in the context of its neighboring features; neighboring features within the specified critical distance are assigned a weight of 1 and have a significant influence on the computation for the target feature, whereas neighboring features outside the specified critical distance are assigned a weight of 0 and have no influence on the computation for the target feature (as shown in Fig. 5(b));

(4) Zone of indifference strategy: the zone of indifference strategy (Zone of Indifference, abbreviated ZI) can be regarded as a combination of the inverse distance strategy and the fixed distance strategy; features within the specified critical distance of the target feature are assigned a weight of 1 and influence the computation for the target feature, and once the critical distance is exceeded, the weight (and the influence of the neighboring feature on the computation for the target feature) decreases as the distance increases (as shown in Fig. 5(c));

(5) K nearest neighbors strategy: the K nearest neighbors strategy (K Nearest Neighborhood, abbreviated KNN) includes the K closest features in the analysis, where K is a specified numeric parameter;

(6) Edge contiguity strategy: the edge contiguity strategy (Contiguity Edges Only, abbreviated CEO) means that only neighboring polygon features that share a boundary or overlap influence the computation for the target feature;

(7) Edge and corner contiguity strategy: the edge and corner contiguity strategy (Contiguity Edges Corners, abbreviated CEC) means that polygon features that share a boundary, share a node, or overlap influence the computation for the target feature;

(8) Delaunay triangulation strategy: the Delaunay triangulation strategy (Delaunay Triangulation, abbreviated DT) first creates a mesh of non-overlapping triangles from the feature centroids, and then treats features associated with triangle nodes that share an edge as neighbors.
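As an illustration of how three of the distance-based strategies above translate into a weight matrix for point features, a minimal sketch follows. It assumes distinct points (coincident points would cause a division by zero in the inverse-distance case), binary KNN weights, and unstandardized rows; the function names are hypothetical.

```python
import math


def _dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def inverse_distance_weights(points, power=1):
    """ID when power=1, IDS when power=2; the diagonal is 0."""
    n = len(points)
    return [[0.0 if i == j else 1.0 / _dist(points[i], points[j]) ** power
             for j in range(n)] for i in range(n)]


def fixed_distance_band_weights(points, threshold):
    """FDB: weight 1 inside the critical distance, 0 outside."""
    n = len(points)
    return [[1.0 if i != j and _dist(points[i], points[j]) <= threshold else 0.0
             for j in range(n)] for i in range(n)]


def knn_weights(points, k):
    """KNN: weight 1 for the k closest features of each feature."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: _dist(points[i], points[j]))
        for j in others[:k]:
            w[i][j] = 1.0
    return w


pts = [(0, 0), (1, 0), (3, 0), (7, 0)]
w_id = inverse_distance_weights(pts)
w_ids = inverse_distance_weights(pts, power=2)
w_fdb = fixed_distance_band_weights(pts, threshold=2)
w_knn = knn_weights(pts, k=1)
```

Note that the KNN matrix is generally asymmetric (the last point's nearest neighbour is the third point, but not the reverse), whereas the fixed distance band can leave isolated features with no neighbours at all.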

Step 6: Accurate positioning of the points contained in the spatial clustering region based on a clustering algorithm

The content of this step is "accurately locating, based on a clustering algorithm, the points contained in the spatial clustering region"; the purpose of this step is "accurately locating the point set contained in the spatial clustering region"; the means used in this step is a "spatial clustering algorithm".

Here, the hierarchical clustering algorithm among the spatial clustering algorithms is used.

Hierarchical clustering algorithms are of two kinds: the bottom-up agglomerative algorithm (AGNES, Agglomerative Nesting) and the top-down divisive algorithm (DIANA, Divisive Analysis). The bottom-up agglomerative algorithm is used here, and its specific steps are as follows:

(1) Regard the n spatial points as n subgroups, each containing exactly one spatial point, and compute the relationships among these n subgroups according to the selected cluster merging criterion. Here the "cluster merging criterion" is the reason (or standard) for choosing two particular subgroups to merge; it includes the shortest distance criterion, the longest distance criterion, the average distance criterion, the centroid distance criterion, and the sum-of-squared-deviations increment criterion (described in detail below);

(2) Merge the two subgroups (points) selected by the above criterion (e.g., nearest distance, minimum sum of squared deviations, or minimum sum-of-squared-deviations increment) into one new subgroup, yielding n-1 subgroups;

(3) Recompute the cluster statistics between every pair of the n-1 subgroups and continue merging subgroups by the same criterion, yielding n-2 subgroups;

(4) Repeat the above steps until all subgroups have been merged into one large subgroup.

The step-by-step merging of subgroups can also be expressed as a clustering dendrogram, which clearly reflects how closely the subgroups are related.
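Steps (1) to (4) can be sketched as a naive agglomerative loop. This is an illustration only, not the patent's implementation: it recomputes all pairwise distances at every step, uses Euclidean distance, and takes the merging criterion as a hypothetical `linkage` parameter (`min` for the shortest distance criterion, `max` for the longest distance criterion).

```python
import math


def agnes(points, linkage=min):
    """Merge clusters bottom-up; return merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]   # step (1): one point per subgroup
    merges = []
    while len(clusters) > 1:                       # step (4): repeat until one subgroup
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # step (2): merge the closest pair
        merged = clusters[a] + clusters[b]
        # step (3): continue with one subgroup fewer
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
history = agnes(pts)   # shortest distance criterion
```

The returned merge history is exactly the information a dendrogram draws: n points always produce n-1 merges, each with the distance at which it occurred.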

Any hierarchical clustering algorithm (indeed any clustering algorithm) involves judging the similarity between spatial points. In spatial clustering algorithms, similarity is measured by spatial distance, i.e., by the "distance metric criterion". Specifically, let the vectors have j different dimensions; the various distances between two different vector individuals x_s and x_t are then computed as follows (the vector individuals x_s and x_t here represent spatial point x_s and spatial point x_t):

(1) Minkowski distance

The Minkowski distance is a generalized distance: the city block distance, the Euclidean distance, and the Chebyshev distance are all special cases of the Minkowski distance. The Minkowski distance is computed as shown in formula (11):

where p denotes the Minkowski exponent.

(2) city block distance (city block distance)

When p is 1 in the Minkowski distance, it specializes to the city block distance, also called the Manhattan distance or taxicab distance. The city block distance is computed as shown in formula (12):

In two-dimensional space the city block distance reduces to formula (13) (as shown in Fig. 6(a)), and in three-dimensional space to formula (14):

$d_{st} = |x_s - x_t| + |y_s - y_t|$ (13)

$d_{st} = |x_s - x_t| + |y_s - y_t| + |z_s - z_t|$ (14)

(3) Euclidean distance (euclidean distance)

When p is 2 in the Minkowski distance, it specializes to the familiar Euclidean distance, computed as shown in formula (15):

In two-dimensional space the Euclidean distance can further be computed by formula (16) (as shown in Fig. 6(b)):

(4) Chebyshev distance

When p tends to infinity in the Minkowski distance, it specializes to the Chebyshev distance, also called the chessboard distance, computed as shown in formula (18); in two-dimensional and three-dimensional space it reduces to formulas (19) and (20), respectively:

$d_{st} = \max(|x_s - x_t|, |y_s - y_t|)$ (19)

$d_{st} = \max(|x_s - x_t|, |y_s - y_t|, |z_s - z_t|)$ (20)
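The relationship between the four distances can be checked numerically. The sketch below assumes the usual Minkowski form, d_st = (Σ_k |x_sk − x_tk|^p)^(1/p), for the general formula, which is not reproduced in this text; the function names are illustrative.

```python
import math


def minkowski(s, t, p):
    """General Minkowski distance with exponent p >= 1 (assumed standard form)."""
    return sum(abs(a - b) ** p for a, b in zip(s, t)) ** (1.0 / p)


def city_block(s, t):
    return sum(abs(a - b) for a, b in zip(s, t))


def euclidean(s, t):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))


def chebyshev(s, t):
    return max(abs(a - b) for a, b in zip(s, t))


s, t = (0, 0), (3, 4)
d1 = minkowski(s, t, 1)       # equals the city block distance
d2 = minkowski(s, t, 2)       # equals the Euclidean distance
d_big = minkowski(s, t, 50)   # approaches the Chebyshev distance as p grows
```

The functions work for any number of dimensions, so the two- and three-dimensional formulas above are covered by the same code.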

Unlike the distance metric described above, the role of the "cluster merging criterion" in the algorithm is to judge, from the distance between two subgroups, whether the two subgroups should be merged. Let the two subgroups be subgroup r and subgroup s, containing n_r and n_s objects respectively. The "cluster merging criteria" can then be defined as follows:

Single linkage, also known as the nearest neighbor method, is shown in Fig. 7(a). This method uses the similarity matrix or distance matrix of the data and defines the between-class distance as the minimum distance between the data of the two classes (shown by the line connecting two points in Fig. 7(a)). This method does not consider class structure and may produce scattered classes; especially with large data sets, it may produce the long chaining phenomenon. Single linkage can be expressed by formula (21):

$D(r, s) = \min\big(\mathrm{dist}(x_{ri}, x_{sj})\big),\ i \in (1, \ldots, n_r),\ j \in (1, \ldots, n_s)$ (21)

Complete linkage, also known as the furthest neighbor method, is shown in Fig. 7(b). This method likewise uses the similarity matrix or distance matrix of the data, but defines the between-class distance as the maximum distance between the data of the two classes (shown by the line connecting two points in Fig. 7(b)). This method likewise does not consider class structure, and it tends to find compact classes. Complete linkage can be expressed by formula (22):

$D(r, s) = \max\big(\mathrm{dist}(x_{ri}, x_{sj})\big),\ i \in (1, \ldots, n_r),\ j \in (1, \ldots, n_s)$ (22)

Group average linkage, also known as UPGMA (Unweighted Pair-Group Method using the Average approach), is shown in Fig. 7(c). This method likewise uses the similarity matrix or distance matrix of the data, but defines the between-class distance as the average of the pairwise distances between the data of the two classes (shown by the multiple connecting lines in Fig. 7(c)). This method therefore considers class structure, and the classes it produces are fairly robust; it tends to merge two classes with small variance, and its distance lies between those of single linkage and complete linkage. Group average linkage can be expressed by formula (23):

Centroid linkage, also known as UPGMC (Unweighted Pair-Group Method using Centroid approach), is shown in Fig. 7(d). Unlike the previous methods, this method starts from both the distance matrix and the original data. The distance is generally defined as the two-dimensional Euclidean distance, taken between an individual and the centroid of a group or between the centroids of two groups (shown by the dashed line in Fig. 7(d)). Other distance measures can also be used, but the "centroid" of the original data may then lack a clear conceptual interpretation. This method considers class structure. Centroid linkage is generally expressed by formula (24):

Ward linkage is also known as the sum-of-squared-deviations increment method (error sum of squares criterion). This method tends to minimize, at each step, the increment of the within-group sum of squared deviations, as shown in Fig. 7(f). It is worth noting that Ward's method has a strong mathematical foundation (described in detail later). By comparison, there is another method called the sum of squares method, which is similar to Ward linkage but is based on the sum of squared deviations of each class rather than on its increment, as shown in Fig. 7(e). The formula for Ward's method is relatively complex, but it has a distinctive feature (it balances the number of points in the subgroups, as explained later); it can be expressed in simplified form by formula (25):

Median linkage, also known as WPGMC (Weighted Pair-Group Method using Centroid approach), differs from UPGMC above in that, when the centroid of a group is computed, the two parts that make up the group are given equal weight; that is, the computed centroid is actually the mean of the centroids of the two parts forming the group. Median linkage can be expressed by formula (26):

Weighted average linkage, also known as WPGMA (Weighted Pair-Group Method using Average approach), is similar to median linkage, but when the between-class distance is computed, the distances are weighted by the inverse of the number of members in the class. Weighted average linkage can be expressed by formula (27):
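Four of the criteria above can be written directly as functions of two subgroups r and s. Formulas (21) and (22) are given in the text; the group average and centroid forms below assume the usual definitions (the mean of all pairwise distances, and the Euclidean distance between the two centroids), since formulas (23) and (24) are not reproduced here. Ward, median, and weighted average linkage are left out of this sketch, and the function names are hypothetical.

```python
import math


def pair_distances(r, s):
    """All Euclidean distances between a point of r and a point of s."""
    return [math.dist(p, q) for p in r for q in s]


def single_linkage(r, s):          # formula (21)
    return min(pair_distances(r, s))


def complete_linkage(r, s):        # formula (22)
    return max(pair_distances(r, s))


def group_average_linkage(r, s):   # assumed form of formula (23)
    d = pair_distances(r, s)
    return sum(d) / len(d)


def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def centroid_linkage(r, s):        # assumed form of formula (24)
    return math.dist(centroid(r), centroid(s))


r = [(0, 0), (0, 2)]
s = [(3, 0), (3, 2)]
d_single = single_linkage(r, s)
d_complete = complete_linkage(r, s)
d_average = group_average_linkage(r, s)
d_centroid = centroid_linkage(r, s)
```

For this pair of subgroups the group average value falls between the single and complete linkage values, as described above.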

Based on the above, in particular, the two-dimensional Euclidean distance is used as the "distance (similarity) metric criterion" and the minimum sum-of-squared-deviations increment criterion (i.e., Ward linkage) is used as the "cluster merging criterion". The computation of this agglomerative hierarchical clustering algorithm (generally called the Ward clustering method) is as follows:

where the centroids of subgroup S and subgroup T are computed by formula (28) to formula (31):

For any subgroup T, the within-subgroup sum of squared deviations is computed as shown in formula (32):

When two subgroups S and T merge into a new subgroup U, the distance between an existing subgroup R and the new subgroup U is as shown in formula (33) to formula (37):

$n_U = n_S + n_T$ (33)

When two subgroups S and T merge into a new subgroup U, the resulting increment of the sum of squared deviations is computed by formula (38):

When subgroups S and T merge into a new subgroup U, the sum-of-squared-deviations increment between subgroups R and U is given recursively by formula (39):

When a subgroup is adjusted (i.e., when point j is moved from subgroup S to subgroup T), the resulting increment is computed by formula (40):

At this point, if subgroup S with point j removed is denoted subgroup R, the adjustment scheme can be summarized as deciding whether point j belongs to subgroup R or to subgroup T, which is computed as shown in formula (41) to formula (43):

$n_R = n_S - 1$ (41)

As formula (43) shows, when the distance between point j and subgroup R equals the distance between point j and subgroup T, point j is assigned to subgroup R rather than subgroup T only if subgroup R contains fewer points than subgroup T. In other words, the "hierarchical clustering method based on the sum-of-squared-deviations increment (i.e., Ward's method)" has the distinctive feature of "balancing the number of points in the subgroups under otherwise equal conditions".
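The balancing property can be checked numerically. The sketch below relies on two standard identities for Ward's method, assumed here because formulas (38) to (43) are not reproduced in this text: merging S and T increases the total sum of squared deviations by n_S n_T / (n_S + n_T) times the squared distance between their centroids, and adding a single point j to a subgroup of n points increases it by n / (n + 1) times the squared distance from j to that subgroup's centroid. Function names and data are hypothetical.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sse(points):
    """Within-subgroup sum of squared deviations from the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)


def ward_increment(s, t):
    """Increase in total SSE caused by merging subgroups s and t."""
    n_s, n_t = len(s), len(t)
    return n_s * n_t / (n_s + n_t) * math.dist(centroid(s), centroid(t)) ** 2


def assignment_cost(point, group):
    """Increase in SSE caused by adding one point to an existing subgroup."""
    n = len(group)
    return n / (n + 1) * math.dist(point, centroid(group)) ** 2


S = [(0, 0), (2, 0)]
T = [(10, 0)]
direct = sse(S + T) - sse(S) - sse(T)   # increment measured directly
closed_form = ward_increment(S, T)      # increment from the closed form

# Point j is equally far (distance 3) from the centroids of R and T2,
# but R holds 2 points and T2 holds 5.
j = (0, 0)
R = [(3, -1), (3, 1)]
T2 = [(-3, -2), (-3, -1), (-3, 0), (-3, 1), (-3, 2)]
cost_R = assignment_cost(j, R)
cost_T2 = assignment_cost(j, T2)
```

Because the factor n / (n + 1) grows with n, the smaller subgroup wins a tie in distance, which is the balancing behaviour described above.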

Herein, the spatial distribution cluster analysis of infectious disease patients in Ningbo City, Zhejiang Province is taken as a specific example to illustrate the method proposed by the present invention.

Ningbo, abbreviated "Yong", is a sub-provincial city, a city specifically designated in the state plan, and the world's fourth-largest port city. Ningbo lies on the southeast coast, in the middle section of the coastline of mainland China and on the south wing of the Yangtze River Delta. To the east it is sheltered by the Zhoushan Islands as a natural barrier, to the north it borders Hangzhou Bay, to the west it adjoins Shengzhou, Xinchang, and Shangyu of Shaoxing City, and to the south it borders Sanmen Bay and adjoins Sanmen and Tiantai of Taizhou. Fig. 8 shows the spatial position of Ningbo City within China and Zhejiang Province and the spatial division of its administrative divisions.

The administrative divisions of China can be divided into multiple levels. In general, these include the national level (i.e., China), the provincial level (e.g., Zhejiang Province), the prefecture level (e.g., Ningbo City), the district/county/county-level city level (labeled Level 1 herein, abbreviated Lv1), the subdistrict/town/township level (labeled Level 2 herein, abbreviated Lv2), the community/residential area/administrative village level (labeled Level 3 herein, abbreviated Lv3), the spatial grid level (labeled Level 4 herein, abbreviated Lv4), and the planar X and Y coordinates of an individual (labeled Level 5 herein, abbreviated Lv5), as shown in Fig. 9.

Based on the above, the administrative divisions contained in Ningbo City are set out in detail, as shown in Fig. 10. The details are as follows:

The complete administrative division of Ningbo City comprises 6 districts (Haishu District, Jiangdong District, Jiangbei District, Zhenhai District, Yinzhou District, and Beilun District), 3 county-level cities (Cixi City, Yuyao City, and Fenghua City), and 2 counties (Ninghai County and Xiangshan County). The specifics are as follows:

(1) it is directed to Haishu District, contains 8 streets, respectively south gate street, the street Jiang Sha, west gate street, lunar lacus street Road, drum tower street, white clouds street, the street Duan Tang, the street Wang Chun；

(2) it is directed to Jiangdong District, contains 8 streets, respectively hundred zhang of streets, the street Ming Lou, white crane street, Dong Liujie Road, the street Dong Sheng, eastern suburb street, the street Fu Ming, the street Xin Ming；

(3) be directed to Jiangbei District, contain 7 streets and 1 town, respectively the street Zhong Ma, white sand street, the street Kong Pu, Culture and education street, the street Hong Tang, the street Zhuan Qiao, the street Yong Jiang, kind cities and towns；

(4) be directed to Zhenhai District, contain 4 streets and 2 towns, respectively camel street, the street Zhuan Shi, the street Jiao Chuan, Recruit Golconda street, the town Jiu Longhu, the Pu Creek town；

(5) it is directed to Yinzhou District, 7 streets, 17 towns, 1 township is contained, is the street Xia Ying, the street Zhong Gongmiao, stone respectively The street ?, the street Mei Xu, the street Zhong He, first South Street road, the street Pan Huo, the town Zhan Qi, the town Xian Xiang, the town Tang Xi, Lake Dongqian town, Wu Town, five small towns, the town Qiu Ai, Yunlong town, the town Heng Xi, the town Jiang Shan, high bridge town, rank street town, the town Ji Shigang, the town Gu Lin, the town Dong Qiao, Yin The town Jiang Zhen, Zhang Shui, the township Long Guan；

(6) it is directed to Beilun District, contains 7 streets, 2 towns, 1 township, is brace street respectively, the street Qi Jiashan, new The street ?, the big street ?, Xiapu street, the street Chai Qiao, Daxie street, the town Chun Xiao, white peak town, plum mountain area；

(7) be directed to Cixi City, contain 5 streets and 15 towns, be respectively the street Zong Han, the street Kan Dun, the street Hu Shan, The street Gu Tang, white sand road street, the town Zhou Xiang, long river town, the town An Dong, the town Chong Shou, Yokogawa town, the town Xin Pu, the town Sheng Shan, the town Yao Lin, The town Kuang Yan, attached Hai Zhen, end of the bridge town, Baywatch Wei Town, the town Zhang Qi, Longshan Town, Tianyuan town；

(8) it is directed to Yuyao City, includes 6 streets, 14 towns, 1 township, is Ditang street, the street Lang Xia, Yang Mingjie respectively Road, Lanjiang River street, Fengshan street, the street Li Zhou, the town Huang Jiabu, the town Lin Shan, the town Mu Shan, the town Si Men, the town Ma Zhu, little Cao pretty young woman town, The town Liang Nong, the town Zhang Ting, the town Lu Bu, the town great Lan, Siming Shan town, Radix Notoginseng town, the town He Mudu, the town great Yin, the township Lu Ting；

(9) it is directed to Ninghai County, 4 streets, 11 towns, 3 townshiies is contained, is allosaurus street, Land of Peach Blossoms street, plum forests respectively It is street, end of the bridge Hu street, the town Chang Jie, the town Li Yang, a town, branch road town, the town Qian Tong, the town Sang Zhou, the town Huang Tan, the town great Jia He, strong Flood dragon town, the town Xi Dian, the depth town Quan, the township Hu Chen, the township Cha Yuan, the township Yue Xi；

(10) be directed to Fenghua City, contain 5 streets and 6 towns, be respectively silk screen street, the street Yue Lin, Jiangkou street, The street Xi Wu, the street Xiao Wangmiao, Xikou Zhen, still Tian Town, the town Chun Hu, fur coat villages and small towns, the town great Yan, the town Song Ao；

(11) it is directed to Xiangshan County, 3 streets, 10 towns, 5 townshiies is contained, is Dandong street, Dan Xijiedao, the rank of nobility respectively Small stream street, the town Shi Pu, Western Zhou Dynasty town, the town He Pu, the town Xian Xiang, the top of a wall town, the Sizhou town Tou Zhen, Ding Tang, the town Tu Ci, the town great Xu, new bridge Town, the township Dong Chen, the township Xiao Tang, the township Huang Bi Ao, the township Mao Yang, the township Gao Tangdao；

In other words, as shown by the dotted portion in Fig. 9, Ningbo City comprises 11 districts/counties/county-level cities (Lv1) and 153 subdistricts/towns/townships (Lv2) in total. In fact, each subdistrict/town/township can be divided more finely into communities/residential areas/administrative villages (Lv3). Typical examples: (1) the south gate street of Haishu District contains 11 communities (e.g., the community Cheng Lang, the community Liu Jin, Wan An community, the community Hong Qi, the community Zhou Jiangan, southern exposure community, and station community); (2) the kind cities and towns of Jiangbei District contains 6 communities (e.g., the community Gu Xiang and the community Jing Ming), 5 residential areas (e.g., the Ci Dong residential area and the Miao Shan residential area), and 37 administrative villages (e.g., the village Ci Hu and Dongshan village). More precisely, the level of detail (LOD, Levels of Details) of the administrative division can be refined further to the community spatial grid (Lv4), and finally to the planar X and Y coordinates of each individual (Lv5). It is worth noting that, in subsequent processing, the divisions at the "community/residential area/administrative village (Lv3)" and "community spatial grid (Lv4)" levels are handled in essentially the same way as the "subdistrict/town/township (Lv2)" level; such divisions are also excessively detailed and of limited research value here, so they are omitted.

In general, the LOD of the administrative division comprises five levels: Lv1 (district/county/county-level city), Lv2 (subdistrict/town/township), Lv3 (community/residential area/administrative village), Lv4 (community spatial grid), and Lv5 (individual spatial coordinates). As stated above, actual processing mainly involves three of these levels: Lv1 (district/county/county-level city; 11 in Ningbo City), Lv2 (subdistrict/town/township; 153 in Ningbo City), and Lv5 (individual spatial coordinates; about 5000 patient records in the 2011 data for Ningbo City). For ease of reference below, Lv1 is called the "district level", Lv2 the "street level", and Lv5 the "individual level".

Meanwhile, the situation of infectious diseases in China is explained here. According to the Law of the People's Republic of China on the Prevention and Treatment of Infectious Diseases and its implementation measures, infectious diseases in China are divided into three categories: Class A infectious diseases, Class B infectious diseases, and Class C infectious diseases. The finer disease classification is as follows (as shown in Fig. 11):

(1) Class A infectious diseases are subdivided into plague and cholera, 2 kinds in total;

(2) Class B infectious diseases are subdivided into viral hepatitis, measles, pertussis, gonorrhea, malaria, syphilis, dengue fever, anthrax, infectious atypical pneumonia (SARS), diphtheria, neonatal tetanus, AIDS, brucellosis, scarlet fever, rabies, typhoid and paratyphoid, meningococcal meningitis, epidemic hemorrhagic fever, human infection with highly pathogenic avian influenza, epidemic encephalitis B (Japanese encephalitis), bacillary and amoebic dysentery, pulmonary tuberculosis, leptospirosis, schistosomiasis, and poliomyelitis, 25 kinds in total;

(3) Class C infectious diseases are subdivided into influenza, mumps, acute hemorrhagic conjunctivitis, rubella, leprosy, epidemic and endemic typhus, kala-azar, echinococcosis, filariasis, infectious diarrhea other than cholera, bacillary and amoebic dysentery, and typhoid and paratyphoid, and hand-foot-and-mouth disease, 11 kinds in total.

In particular, the infectious disease patient data used herein cover the whole of Ningbo City (the spatial scale) for the year 2011 (the time scale) and contain patient information for all kinds of infectious diseases, about 5000 records in total; the infectious disease types actually involved are marked with ticks and numbers in Fig. 11. As shown in Fig. 11, only 11 of the 38 statutory infectious diseases in China appear in the data (none in Class A, 7 in Class B, and 4 in Class C; the numbers in Fig. 11 rank them in descending order of patient count). This is mainly attributable to the many years of effort by China, Zhejiang Province, and Ningbo City in the prevention and control of infectious diseases, so that many of the above infectious diseases have been completely eliminated, eradicated, or kept to an extremely small number of cases in Ningbo City (as shown by the strikethroughs in Fig. 11). The specific situation is as follows:

(1) In 1940, during the War of Resistance against Japan, Ningbo City in Zhejiang Province and the surrounding cities were all struck by plague, cholera, and anthrax; the great majority of these pathogens were released by the Japanese army. By the founding of the People's Republic of China in 1949, these infectious diseases had been brought under effective control;

(2) Poliomyelitis was not eradicated in Ningbo City until 1991, and it does not appear in the case data here;

(3) For highly pathogenic avian influenza, the first 2 cases in Ningbo City were not detected until January 2014, so it does not appear in the case data here either;

(4) Japanese encephalitis was brought under effective control in Ningbo City by 2010, with the annual number of patients held to single digits, which agrees with the case data here;

(5) The situation for dengue fever is similar to that for Japanese encephalitis;

(6) Diphtheria has not been found in Ningbo City since 1989, and it has never appeared in the case data here;

(7) Schistosomiasis and malaria were eliminated in Ningbo City in 1972 and 1989 respectively, and neither appears in the case data here;

(8) Leprosy was eradicated in Ningbo City in 1997, and it does not appear in the case data here;

(9) For typhus, there have been no related reports in Ningbo City in recent years, which agrees with the case data here;

(10) The situation for kala-azar and echinococcosis is similar to that for typhus. In particular, the former was eradicated in China in 1958, and no case of the latter has been found in Zhejiang Province; both facts agree with the case data here;

(11) Filariasis was eliminated in Ningbo City in 1997, and it does not appear in the case data here.

Meanwhile the Eradication of above section infectious disease or elimination completely will also be attributed to the fact that China " planned immunization " works. Specifically, China since founding of New, has reinforced work in terms of disease prevention and cure, rigid and absolute enforcement immunization work, each province Actively implement.Wherein, Zhejiang Province is specifically included for the immunization procedures of children: for preventing the second of virus B hepatitis Liver vaccine (HBV), for preventing BCG vaccine lungy (BCG), the spinal cord ash vaccine (PV) for guarding against poliomyelities, use Vaccine (DPT) is broken, for preventing morbilli, rubeola, the popular parotid gland in one hundred days of prevention pertussis, diphtheria, neo-nataltetenus(NNT) Scorching numb cheek wind vaccine (MMR), the Vaccinum Encephalitidis Epidemicae (JEV) for preventing Japanese Type-B encephalitis, meningococal meningitis epidemic meningitis Vaccine (MCV), the Aimmugen (HAV) for preventing viral hepatitis type A.

The above describes (1) how China's administrative division structure is actually implemented in Ningbo City, Zhejiang Province, and (2) which of China's infectious disease types are actually involved in the present case.

It is worth noting that, before the spatial aggregation analysis of infectious disease patients in Ningbo in 2011 is presented, the following preprocessing is applied so that the calculation and description are more accurate: the index PNID in formula (44) below replaces PN in formula (1) above, the index PDID in formula (44) below replaces PD in formula (1) above, and the index HRID in formula (45) below replaces HR in formula (2) above. Compared with the three original indices, the three substitute indices are more meaningful for the practical application.

Specifically, PNID is the number of infectious disease patients (Patient Number of Infectious Disease); PDID is the density of infectious disease patients (Patient Density of Infectious Disease); HRID is the proportion of infectious disease patients in the total population (Hospitalization Rate of Infectious Disease). The indices PDID and HRID are calculated as shown in formula (44) and formula (45) respectively:
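The two indices can be illustrated with a minimal sketch. It assumes, from the verbal definitions above only, that PDID is PNID divided by the area of the statistical unit and that HRID is PNID divided by the unit's total population. The function names, argument names, and the choice of square kilometres as the area unit are illustrative assumptions and are not taken from formulas (44) and (45).

```python
def pdid(pnid: float, area_km2: float) -> float:
    """Patient density: patients per unit area (km^2 is an assumed unit)."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return pnid / area_km2


def hrid(pnid: float, population: float) -> float:
    """Proportion of infectious disease patients in the total population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return pnid / population


# Hypothetical unit: 120 patients, 8 km^2, 60,000 residents.
density = pdid(120, 8.0)
rate = hrid(120, 60_000)
```

Computing both indices per street/town/township makes units of very different size and population comparable, which the raw count PNID alone does not.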

Apart from the above, the remaining formulas (formula (3) to formula (43)) are unchanged.

Applying step 1 (preliminary judgment of the existence of spatial clustering based on graded statistical maps) yields preliminary visualization results for the three indices at the district/county/county-level city level (11 units in total; Figure 12(a)-(c)) and at the street/town/township level (153 units in total; Figure 13(a)-(c)), together with partial enlargements of the central area (Figure 13(d)-(f)). The corresponding statistics are listed in Table 1. From these results it can be preliminarily judged that the spatial distribution of infectious disease patients across Ningbo City in 2011 is suspected to exhibit spatial aggregation, and that the aggregation is suspected to lie in the central urban area.

Applying step 2 (accurate judgment of the existence of spatial clustering based on spatial autocorrelation) gives an accurate judgment of whether spatial clustering exists for the three indices at the street/town/township level (153 units in total), using the global Moran I coefficient (columns 2, 4, and 6 of Table 2). The result is "clustering exists (99% confidence)". Here the accurate judgment of step 2 confirms the preliminary judgment of step 1.

Applying step 3 (accurate judgment of the spatial clustering type based on spatial autocorrelation) further gives an accurate judgment of the clustering type for the three indices at the street/town/township level (153 units in total), using the global Getis-Ord coefficient (columns 3, 5, and 7 of Table 2). The result is "high-value clustering (99% confidence)". Here the accurate judgment of step 3 refines the judgment of step 2.

Applying step 4 (accurate division of the spatial clustering regions based on spatial autocorrelation) yields an accurate division of the high-value aggregation regions (high-high clusters) and low-value aggregation regions (low-low clusters) for the three indices at the street/town/township level (153 units in total), as shown in Figure 14(a)-(c). Here the regional delimitation of step 4 makes the judgment of step 3 concrete.

Applying step 5 (accurate division of the spatial distribution of clustering anomalies based on spatial autocorrelation) yields the statistics of clusters and anomalous regions computed for the PNID field (Figure 15(a) and (d); Table 3, sorted by LMI ZScore in descending order). Similarly, the statistics of clusters and anomalous regions can be obtained for the HRID field (Figure 15(c) and (f); Table 4, also sorted by LMI ZScore in descending order) and for the PDID field (Figure 15(b) and (e); the table is omitted for reasons of space). From these three tables, the streets/towns/townships with the most pronounced high-high clustering are: Dongliu Street and Minglou Street in Jiangdong District; Wenjiao Street in Jiangbei District; Shiqi Street, Jiangshan Town, and Gulin Town in Yinzhou District; and Jiaochuan Street, Zhuangshi Street, Luotuo Street, and Zhaobaoshan Street in Zhenhai District. In addition, one low-value anomaly (low-high outlier) is found, namely Jiangxia Street in Haishu District. Here step 5 further supplements step 4.

Applying step 6 (precise positioning of the spatial clustering regions based on a clustering algorithm), "two-dimensional Euclidean distance" is used as the "distance metric criterion" while different "cluster merging criteria" are tried; the results are shown in Figure 16(a)-(g). Among them, the Ward hierarchical clustering method (Figure 16(g)), whose distinctive property is that it balances the sizes of the subgroups, is chosen. It yields 4 first-level density centers (denoted C1, C2, C3, C4) and 6 second-level density centers (denoted c1, c2, c3, c4, c5, c6), as shown in Table 5. The details are as follows:

C1 is located in Zhaobaoshan Street of Zhenhai District and covers: the station area of Reuter, Shengli road community, Zong Pu bridge community, after Street community, suitable grand community, the community Xi Menshequ, Bai Long, the prison community ?. C2 is located at the junction of Minglou Street and Dongliu Street in Jiangdong District and covers: the community Jing Jia, the community Xu Jia, the community Xu Rong, the community Jin Yuan, the mill the Dong Liu community, Tai Koo Shing community, community of living in peace, center community. C3 is located in Shiqi Street of Yinzhou District and covers: Shiqi community, east community, new district community, Shiqi village, the village Tang Xi, the village Hou Cang, the village Yue She, the village Che Hedu, the village Lian Feng. C4 is located in Jiangshan Town of Yinzhou District and covers: the age of the small town community, the residential block Jiang Shan, the residential block Shi Shan, the village Qiang Nong, Dongguang village, the village Yu Jia, the upper village Zhang Cun, Hou Mao, Chen Jiatuan village, the village Fan Shidu.

c1 is located in Jiaochuan Street of Zhenhai District and covers: the community Yu Fan, near a river community, five communities Li Pai, clear water Pu village, the village Zhong Guanlu. c2 is located in Zhuangshi Street of Zhenhai District and covers: the community Zhuan Shi, the area of emerging village Reuter, connection Xing Cun. c3 is located in Luotuo Street of Zhenhai District and covers: the community Sheng Jia, the village Luo Xing, village of respecting virtue, Jinhua village, excess-three village. c4 is located in Wenjiao Street of Jiangbei District and covers: the mill the Shuan Dong community, community of cultivating people of ability, the community Cui Dong, the community great Zha, the community Bei Anqinsen. c5 is located in Gulin Town of Yinzhou District and covers: West Lake community, the residential block Gu Lin, the village Gu Lin, the village Shi Jia, the village Guo Xia, the village Dai Jia, the village Feng Li, total village, the village Bu Zheng, the village Bao Jia, the village Zhe Jiao, the village Zhong Yi, the West Gang Cun, Song Yan Wang Cun. c6 is located at the junction of Dandong Street and Danxi Street in Xiangshan County and covers: the park community, East Street community, Tashan Mountain community, enterprising village, village outside east gate, the village Qi Chun, North Road community, newly built community, north gate village, square well head village, the village Wu Feng.

Here the results of step 6 are essentially consistent with those of step 5, and the former refines the latter.

Table 1. Rough statistics of infectious disease patients by administrative division (for reasons of space, only Haishu, Jiangdong, Jiangbei, and Zhenhai are listed)

Table 2. Global Moran I and Global Getis-Ord indices calculated for the infectious disease data

Table 3. Statistics of clusters and anomalous regions calculated for the PNID field (sorted by LMI ZScore in descending order)

Table 4. Statistics of clusters and anomalous regions calculated for the HRID field (sorted by LMI ZScore in descending order)

Table 5. Consistency between the kernel density estimation results and the spatial clustering results

This patent was funded by: the Open Research Fund Program of the Key Laboratory of Digital Mapping and Land Information Application Engineering, NASG (National Administration of Surveying, Mapping and Geoinformation) (Grant No. GCWD201801); the National Natural Science Foundation of China (Grant No. 41601428); the scientific research start-up project of Ningbo Institute of Technology, Zhejiang University (project name: Design and modeling of unified real estate registration in China using LADM, taking Shenzhen City, Guangdong Province and Ningbo City, Zhejiang Province as examples); the Open Fund of the Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Land and Resources (KF-2016-02-001); the 2016 Zhejiang Province merit-based funding for postdoctoral research projects (project name: Modeling of unified real estate registration in China based on LADM, taking Zhejiang Province as an example); the Open Research Fund of the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (15I03); and the Ningbo Innovative Team: The intelligent big data engineering application for life and health (Grant No. 2016C11024).

This patent was also funded by the Ministry of Education Humanities and Social Sciences Research Youth Fund (project name: Research on safety monitoring methods for people with involuntary behavior, fusing indoor location services and video analysis; project number: 16YJCZH112) and the Ningbo Natural Science Foundation (project name: Research on acquisition and mining methods for massive traffic data based on a NoSQL cloud database; project number: 2017A610118).

Claims

1. A multi-level cluster analysis method for point sets that takes spatial position into account, characterized by comprising the following steps:

Step 1: preliminary judgment of the existence of spatial clustering based on graded statistical maps: judging, by the graded statistical mapping method, whether spatial aggregation is suspected to exist;

Step 2: accurate judgment of the existence of spatial clustering based on spatial autocorrelation: judging, by the global Moran I coefficient among the spatial autocorrelation coefficients, whether spatial aggregation really exists;

Step 3: accurate judgment of the spatial clustering type based on spatial autocorrelation: if spatial aggregation really exists, judging, by the global Getis-Ord coefficient among the spatial autocorrelation coefficients, whether the clustering type is high-value clustering or low-value clustering;

Step 4: accurate division of the spatial clustering regions based on spatial autocorrelation: if the type is high-value clustering, accurately delimiting the specific regions of high-value clustering by the local Getis-Ord coefficient among the spatial autocorrelation coefficients; if the type is low-value clustering, accurately delimiting the specific regions of low-value clustering by the local Getis-Ord coefficient among the spatial autocorrelation coefficients;

Step 5: accurate division of the spatial distribution of clustering anomalies based on spatial autocorrelation: accurately marking off, by the local Moran I coefficient among the spatial autocorrelation coefficients, the anomalous regions other than the high-value or low-value aggregations;

Step 6: accurate positioning of the points contained in the spatial clustering regions based on a clustering algorithm: accurately locating, by a spatial clustering algorithm, the points contained in the spatial aggregation regions.

2. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1, characterized in that: in step 1, the graded statistical mapping method is as follows: based on the statistics of each regional division unit, grades are assigned according to the density, intensity, or level of development of the phenomenon; then, according to grade, each zone on the map is filled with a color of different depth or with hatching of different density, so as to show the differences between the regional division units.
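The grading step described above can be sketched as follows. The claim does not fix a classification scheme, so this sketch assumes equal-interval classes; the function name `equal_interval_grades` and the default of five grades are illustrative assumptions.

```python
def equal_interval_grades(values, n_grades=5):
    """Assign each regional unit a grade 0..n_grades-1 by equal intervals.

    Equal-interval classification is an assumption of this sketch; other
    schemes (quantiles, natural breaks) fit the same description.
    """
    if n_grades < 1:
        raise ValueError("n_grades must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All units share one value, so they all fall in the lowest grade.
        return [0 for _ in values]
    width = (hi - lo) / n_grades
    # The maximum value would index one past the last class, so clamp it.
    return [min(int((v - lo) / width), n_grades - 1) for v in values]


# Hypothetical index values (for example, patient density) for six units.
grades = equal_interval_grades([0, 10, 20, 30, 40, 50])
```

Each grade would then be mapped to a fill color or hatching density when the map is drawn; a higher grade corresponds to a darker color or denser hatching.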

3. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1 or claim 2, characterized in that: the graded statistical mapping method is applied to three indices, namely index PN, index PD, and index HR, where PN denotes the number of people in a specific group, PD denotes the density of the specific group, and HR denotes the proportion of the specific group in the total population; the three indices are calculated as shown in formula (1) and formula (2) below:

4. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1, characterized in that: in step 2, the global Moran I coefficient is calculated as shown in formula (3) and formula (4) below:

where z_i is the deviation of the attribute value of feature i from its mean, w_i,j is the spatial weight between feature i and feature j, n is the total number of features, and S₀ is the sum of all spatial weights.
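A minimal sketch of this coefficient, assuming the commonly used form I = (n / S₀) · Σ_i Σ_j w_i,j z_i z_j / Σ_i z_i² with S₀ = Σ_i Σ_j w_i,j, which matches the symbols defined above. The function name and the list-of-lists weight matrix are illustrative assumptions.

```python
def global_morans_i(x, w):
    """Global Moran's I for attribute values x and spatial weight matrix w.

    Assumes I = (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2).
    Raises ZeroDivisionError if all values are equal or all weights are zero.
    """
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]                      # deviations from the mean
    s0 = sum(w[i][j] for i in range(n) for j in range(n))
    cross = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)


# Four units in a row; neighbours share a border (binary weights).
line_w = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
trend = global_morans_i([1, 2, 3, 4], line_w)        # similar values adjacent
alternating = global_morans_i([1, 0, 1, 0], line_w)  # dissimilar values adjacent
```

A positive value indicates that similar values lie near each other and a negative value that dissimilar values do; whether the value is significant is decided from its z-score, which this sketch does not compute.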

5. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1, characterized in that: in step 3, the global Getis-Ord coefficient is calculated as shown in formula (5) below:

where x_i and x_j are the attribute values of feature i and feature j, w_ij is the spatial weight between feature i and feature j, n is the total number of features in the data set, and the condition j ≠ i indicates that feature i and feature j cannot be the same feature.
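A minimal sketch of this coefficient, assuming the commonly used General G form G = Σ_i Σ_j w_ij x_i x_j / Σ_i Σ_j x_i x_j with j ≠ i in both sums, which matches the symbols defined above. The function name and the data are illustrative assumptions.

```python
def general_g(x, w):
    """Global Getis-Ord General G, with both double sums taken over j != i."""
    n = len(x)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    weighted = sum(w[i][j] * x[i] * x[j] for i, j in pairs)
    unweighted = sum(x[i] * x[j] for i, j in pairs)
    return weighted / unweighted


# Four units in a row with binary neighbour weights (hypothetical data).
line_w = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
g = general_g([1, 2, 3, 4], line_w)
```

The statistic is meaningful for non-negative attribute values; a G above its expected value under spatial randomness points to high-value clustering and a G below it to low-value clustering, again subject to a significance test that is outside this sketch.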

6. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1, characterized in that: in step 4, the local Getis-Ord coefficient is calculated as shown in formula (6) to formula (8) below:
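A minimal sketch of a local Getis-Ord calculation, assuming the commonly used Gi* statistic: the weighted local sum minus its expectation, divided by a standard deviation term built from the global mean and the global standard deviation S. Whether formulas (6) to (8) take exactly this form is an assumption, as are the function name and the data.

```python
import math


def getis_ord_gi_star(x, w):
    """Gi* z-score per feature; w should include each feature's self weight.

    Assumes Gi* = (sum_j w_ij x_j - mean * sum_j w_ij)
                  / (S * sqrt((n * sum_j w_ij^2 - (sum_j w_ij)^2) / (n - 1)))
    with S = sqrt(sum_j x_j^2 / n - mean^2).
    """
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum(v * v for v in x) / n - mean ** 2)
    scores = []
    for i in range(n):
        sw = sum(w[i])                             # sum of weights in row i
        sw2 = sum(wij * wij for wij in w[i])       # sum of squared weights
        num = sum(w[i][j] * x[j] for j in range(n)) - mean * sw
        den = s * math.sqrt((n * sw2 - sw ** 2) / (n - 1))
        scores.append(num / den)
    return scores


# Four units in a row; each unit neighbours itself and adjacent units.
w_self = [[1, 1, 0, 0],
          [1, 1, 1, 0],
          [0, 1, 1, 1],
          [0, 0, 1, 1]]
scores = getis_ord_gi_star([1, 2, 3, 10], w_self)
```

In this hypothetical row the last unit, which carries the large value, receives a positive score and the first unit a negative one; in practice a unit is marked as part of a high-value or low-value region only when the magnitude of its score passes the chosen confidence threshold.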

7. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1, characterized in that: in step 5, the local Moran I coefficient is calculated as shown in formula (9) below:

where x_i is the attribute value of feature i, X̄ is the mean of the corresponding attribute, w_i,j is the spatial weight between feature i and feature j, and n is the total number of features.
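A minimal sketch of this coefficient, assuming the commonly used form I_i = (x_i − X̄) / S_i² · Σ_{j≠i} w_i,j (x_j − X̄), with S_i² = Σ_{j≠i} (x_j − X̄)² / (n − 1). The variance term S_i² is an assumption of this sketch, as are the function name and the data.

```python
def local_morans_i(x, w):
    """Local Moran's I per feature (Anselin's LISA), sums taken over j != i."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    total_sq = sum(zi * zi for zi in z)
    result = []
    for i in range(n):
        s2 = (total_sq - z[i] * z[i]) / (n - 1)    # variance excluding i
        lag = sum(w[i][j] * z[j] for j in range(n) if j != i)
        result.append(z[i] / s2 * lag)
    return result


# Four units in a row with binary neighbour weights (hypothetical data).
line_w = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
smooth = local_morans_i([1, 2, 3, 4], line_w)
outlier = local_morans_i([10, 1, 10, 10], line_w)
```

A positive I_i means the feature resembles its neighbours (a high-high or low-low cluster), while a negative I_i marks an anomaly such as a low value surrounded by high values, as for the second unit in the `outlier` example.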

8. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1, characterized in that: the global Moran I coefficient, the global Getis-Ord coefficient, the local Getis-Ord coefficient, and the local Moran I coefficient involved in step 1 to step 6 all require the calculation of a spatial weight matrix, and the strategies used for the spatial weight matrix include: the inverse distance strategy, the inverse distance squared strategy, the fixed distance strategy, the zone of indifference strategy, the K nearest neighbors strategy, the edge contiguity strategy, the edge-and-corner contiguity strategy, and the Delaunay triangulation strategy.
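Two of the listed strategies can be sketched as follows for point features. The function names, the `power` parameter, and the use of binary weights for K nearest neighbors are illustrative assumptions; row standardization and the remaining strategies are not covered.

```python
import math


def inverse_distance_weights(points, power=1):
    """w_ij = 1 / d_ij**power for i != j (power=2 gives inverse distance squared)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if d == 0:
                raise ValueError("coincident points need special handling")
            w[i][j] = 1.0 / d ** power
    return w


def knn_weights(points, k):
    """Binary weights: w_ij = 1 if j is one of the k nearest neighbours of i."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            w[i][j] = 1.0
    return w


pts = [(0, 0), (1, 0), (3, 0)]   # hypothetical point coordinates
w_inv = inverse_distance_weights(pts)
w_knn = knn_weights(pts, 1)
```

Note that K nearest neighbors weights are generally not symmetric: a point may count another as its nearest neighbor without the reverse being true.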

9. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 1, characterized in that: in step 6, the spatial clustering algorithm is a bottom-up merging algorithm, and the specific method of the bottom-up merging algorithm comprises the following steps:

(3) recalculating the inter-class statistic between every pair of the n-1 subgroups, and continuing to merge subgroups according to the same criterion as above, thereby obtaining n-2 subgroups;
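The repeated merge described in step (3) can be sketched as a loop. This sketch assumes the usual agglomerative scheme: start with every point as its own subgroup, merge the closest pair, recompute, and stop at a requested number of subgroups. The stopping rule, the function names, and the use of minimum pairwise distance as the default inter-class statistic are illustrative assumptions.

```python
import math
from itertools import combinations


def min_pair_distance(a, b):
    """Inter-class statistic used by default: smallest point-to-point distance."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, k, linkage=min_pair_distance):
    """Merge subgroups bottom-up until only k subgroups remain."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    groups = [[p] for p in points]                 # n singleton subgroups
    while len(groups) > k:
        # Recompute the statistic for every pair and pick the closest pair.
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]))
        groups[i] = groups[i] + groups[j]          # i < j, so deleting j is safe
        del groups[j]
    return groups


clusters = agglomerate([(0, 0), (0, 1), (10, 0), (10, 1)], 2)
```

Each pass of the loop reduces the number of subgroups by one, from n to n-1 to n-2 and so on; swapping the `linkage` argument changes the cluster merging criterion without changing the loop.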

10. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 9, characterized in that: the bottom-up merging algorithm includes a similarity judgment between spatial points, the similarity between spatial points being measured by a distance metric criterion, the distance metric criterion being as follows: let the vector x have j different dimensions; then the various distances between two distinct vector individuals x_s and x_t are calculated as follows:

where p denotes the Minkowski exponent;

(2) city block distance: when p is 1 in the Minkowski distance, it reduces to the city block distance, which is calculated as shown in formula (12) below:

d_st = |x_s - x_t| + |y_s - y_t| (13)

d_st = |x_s - x_t| + |y_s - y_t| + |z_s - z_t| (14)

(3) Euclidean distance: when p is 2 in the Minkowski distance, it reduces to the ordinary Euclidean distance, which is calculated as shown in formula (15) below:

(4) Chebyshev distance: when p is infinite in the Minkowski distance, it reduces to the Chebyshev distance, which is calculated as shown in formula (18) below:

d_st = max(|x_s - x_t|, |y_s - y_t|) (19)

d_st = max(|x_s - x_t|, |y_s - y_t|, |z_s - z_t|) (20)
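The three special cases can be checked with one short function. This is a minimal sketch of the standard Minkowski distance; the function name and the use of `float("inf")` to select the Chebyshev case are illustrative choices.

```python
def minkowski(s, t, p):
    """Minkowski distance between points s and t for exponent p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1")
    diffs = [abs(a - b) for a, b in zip(s, t)]
    if p == float("inf"):
        return max(diffs)                          # Chebyshev distance
    return sum(d ** p for d in diffs) ** (1 / p)


s, t = (0, 0), (3, 4)
city_block = minkowski(s, t, 1)            # |3| + |4|
euclidean = minkowski(s, t, 2)             # sqrt(3^2 + 4^2)
chebyshev = minkowski(s, t, float("inf"))  # max(|3|, |4|)
```

The same function covers two-dimensional and three-dimensional points, since the sum or maximum simply runs over however many coordinates the points have.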

11. The multi-level cluster analysis method for point sets that takes spatial position into account according to claim 9, characterized in that: the cluster subgroup merging criterion is as follows: whether two subgroups should be merged is judged from the distance between them, and if they can be merged, the two subgroups are selected and merged; let the two subgroups be subgroup r and subgroup s, containing n_r and n_s objects respectively; then the cluster merging criterion can be set as follows:

For single linkage, the similarity matrix or distance matrix of the data is used, and the inter-class distance is defined as the minimum distance between the data of the two classes; single linkage can be expressed as formula (21) below:

D(r, s) = min(dist(x_ri, x_sj)), i ∈ (1, ..., n_r), j ∈ (1, ..., n_s) (21)

For complete linkage, the similarity matrix or distance matrix of the data is used, and the inter-class distance is defined as the maximum distance between the data of the two classes; complete linkage can be expressed as formula (22) below:

D(r, s) = max(dist(x_ri, x_sj)), i ∈ (1, ..., n_r), j ∈ (1, ..., n_s) (22)
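Formulas (21) and (22) can be evaluated directly from a precomputed distance matrix. In this minimal sketch, `dist` is a full pairwise distance matrix and `r` and `s` are lists of row indices belonging to the two subgroups; these names and the data layout are illustrative assumptions.

```python
def single_linkage(dist, r, s):
    """Formula (21): minimum distance between members of subgroups r and s."""
    return min(dist[i][j] for i in r for j in s)


def complete_linkage(dist, r, s):
    """Formula (22): maximum distance between members of subgroups r and s."""
    return max(dist[i][j] for i in r for j in s)


# Hypothetical points at positions 0, 1, 3, 5 on a line.
positions = [0, 1, 3, 5]
dist = [[abs(a - b) for b in positions] for a in positions]
d_single = single_linkage(dist, [0, 1], [2, 3])      # closest pair: 1 and 3
d_complete = complete_linkage(dist, [0, 1], [2, 3])  # farthest pair: 0 and 5
```

Single linkage tends to chain elongated subgroups together because one close pair is enough to merge, whereas complete linkage favors compact subgroups because every pair must be close.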

For group average linkage, the similarity matrix or distance matrix of the data is used, and the inter-class distance is defined as the average of all pairwise distances between the data of the two classes; group average linkage can be expressed as formula (23) below:

For centroid distance, starting from the distance matrix and the original data, the distance is defined as the two-dimensional Euclidean distance, taken either between an individual and the centroid of a group or between the centroids of two groups; centroid distance is expressed as formula (24) below:

For Ward linkage, the aim at each step is to minimize the increase in the within-group sum of squared deviations; Ward linkage can be expressed in simplified form as formula (25) below:

For median linkage, when the centroid of a group is calculated, the two parts merged into the group are given equal weight, so that the calculated centroid is in fact the average of the centroids of the two parts forming the group; median linkage can be expressed as formula (26) below:

For weighted average linkage, when the inter-class distance is calculated, the distances are weighted by the reciprocal of the number of members in each class; weighted average linkage can be expressed as formula (27) below: