CN110728293A - Region-growing and competition-based tourist destination data hierarchical clustering method for variable-scale data density spaces - Google Patents


Info

Publication number: CN110728293A
Application number: CN201910812062.5A
Authority: CN (China)
Prior art keywords: cluster, data, clustering, weight, points
Legal status: Granted
Other languages: Chinese (zh)
Other versions: CN110728293B (en)
Inventors: 何熊熊 (He Xiongxiong), 袁志琴 (Yuan Zhiqin), 庄华亮 (Zhuang Hualiang)
Current Assignee: Zhejiang University of Technology (ZJUT)
Original Assignee: Zhejiang University of Technology (ZJUT)
Application filed 2019-08-30 by Zhejiang University of Technology (ZJUT); priority to CN201910812062.5A
Publication of CN110728293A: 2020-01-24
Application granted; publication of CN110728293B: 2021-10-29
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques
    • G06F 18/232: Non-hierarchical techniques
    • G06F 18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions


Abstract

The invention discloses a region-growing and competition-based hierarchical clustering method for tourist destination data in a variable-scale data density space. Unlike conventional methods, it adopts a hierarchical clustering idea and divides the clustering process into three levels. The first-level clustering partitions the objects into a number of subclasses based on Euclidean distance with a distance threshold R1, which simplifies the algorithm and reduces complexity. The second-level spatial-data region growing then uses the obtained cluster centers as growth seeds, which grow under a growth criterion until a stop condition is reached, addressing the problem of clustering data of variable-scale density. Finally, relation weights between cluster centers are calculated based on a competition idea and a density-similarity principle, and the clusters are merged by suitable rules to solve the problem of non-convex data clustering. Compared with other clustering algorithms, the disclosed method improves clustering accuracy while reducing complexity, has clear advantages in processing massive data, and better meets the requirements of practical engineering applications.

Description

Region-growing and competition-based tourist destination data hierarchical clustering method for variable-scale data density spaces
Technical Field
The invention relates to the field of hierarchical clustering, and in particular to a method that improves the clustering of variable-scale density data by region growing and competition.
Background
Data mining is a hot research topic in the fields of artificial intelligence and databases, and cluster analysis is an important branch of data mining that is widely applied across fields as a data-analysis tool. Clustering is the process of dividing a physical or abstract collection into classes composed of similar objects. Clustering originates in taxonomy but differs from classification: the classes into which clustering partitions the data are unknown in advance, making it an unsupervised process. Clustering algorithms are broadly classified into (1) partition-based methods, such as the K-means algorithm; (2) hierarchy-based methods, such as the BIRCH and CURE algorithms; (3) density-based methods, such as the DBSCAN algorithm; (4) grid-based methods; and (5) neural networks and various other clustering methods.
Among them, the K-means algorithm is one of the most classical. As the most widely applied partition-based clustering algorithm, K-means is simple to implement but has three defects: (1) the user must specify the cluster number k in advance; (2) it is not suited to finding non-convex clusters; (3) it is very sensitive to noise and outliers. DBSCAN decides whether to establish a new cluster around a core object by checking whether the density of the object's ε-neighborhood is high enough, i.e., whether the number of data points within distance ε exceeds a set threshold, and then merges density-reachable clusters, so that clusters of arbitrary shape can be found in a noisy spatial database. However, DBSCAN is sensitive to its two hard-to-determine parameters, ε and the point-count threshold, and its computational complexity is relatively high.
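For orientation, the two baselines above can be run in a few lines with scikit-learn; this is a generic illustration (assuming scikit-learn is installed; the data set and the parameter values k, eps and min_samples are placeholders, not values from this invention):

```python
# Generic baseline runs for the background discussion (not the invention).
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.random.rand(500, 2)                      # toy 2-D data set

# K-means: the cluster number k must be specified in advance (defect 1).
km_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)

# DBSCAN: the result hinges on the hard-to-choose eps and min_samples.
db_labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X)
```

Both calls expose exactly the parameters whose sensitivity motivates the method disclosed below.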
Disclosure of Invention
Traditional clustering algorithms mostly assume a uniform spatial density scale, but real data are often non-convex, with density varying across scales. Applying traditional clustering algorithms to such data exposes various defects; distance-based algorithms such as K-means in particular become more parameter-sensitive and less accurate. To address these limitations of multi-scale data, the invention provides a novel distance-based multi-level clustering algorithm designed around practical needs, solving the clustering problem of multi-scale density data through multi-level, fast, non-convex clustering. Being distance-based, the algorithm avoids density calculations, which simplifies it; seed-region growing and multi-stage aggregation then fuse the results to complete the clustering. The invention reduces complexity while simplifying the algorithm, benefits the clustering of massive data, and is suited to the analysis of tourist destination data.
In order to solve the technical problems, the invention adopts the following technical scheme:
A region-growing and competition-based tourist destination data hierarchical clustering method for variable-scale data density spaces comprises the following steps.
A first stage: the cluster centers are updated by drawing circles with the distance threshold R1 as radius, as follows:
Step 1.1: Input an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P. Randomly take the i-th data object x_i from X and store it as the first cluster center in the set C = {}; then randomly take the j-th data object x_j from X and calculate the Euclidean distance between x_i and x_j by equation (1):

d(x_i, x_j) = ||x_i − x_j||_2 = sqrt( Σ_{p=1..P} (x_{i,p} − x_{j,p})^2 )    (1)

If d(x_i, x_j) is less than R1 (R1 is 10% of the spatial size of the data set), the points x_i and x_j belong to the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C. If d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center: C = {x_i, x_j}. Equation (2) appears only as an image in the source; per the surrounding text, S is the updated cluster center computed from x_i and x_j with a weight coefficient β, consistent with a weighted move of the current center toward the new point, e.g. S = x_i + β(x_j − x_i).    (2)
Step 1.2: From the data set X (excluding x_i, x_j), randomly take the m-th data object x_m and calculate the set of Euclidean distances {d(x_m, C_1), ..., d(x_m, C_n)}, where n is the number of points in the set C; determine the point C_i in the cluster-center set closest to x_m, and use the points x_m and C_i to update the cluster centers by the method of step 1.1.
Step 1.3: Repeat steps 1.1 and 1.2 until all points in X have been traversed, obtaining the updated cluster-center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, and the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}}.
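A minimal Python sketch of this first-level pass, under the stated rules; the center-update form S = x_i + β(x_j − x_i) follows the hedged reconstruction of equation (2) above, and using the data-set diagonal as its "spatial size" is our assumption:

```python
import numpy as np

def first_level(X, beta=1/16):
    """First-level clustering: one pass over X, one circle of radius R1
    per cluster center (a sketch of steps 1.1-1.3, not verbatim)."""
    X = np.asarray(X, dtype=float)
    # R1: 10% of the spatial size of the data set (diagonal extent assumed).
    R1 = 0.1 * np.linalg.norm(X.max(axis=0) - X.min(axis=0))
    centers = [X[0].copy()]                # C: cluster-center set
    members = [[0]]                        # M: point indices per cluster
    for m in range(1, len(X)):
        d = np.linalg.norm(np.asarray(centers) - X[m], axis=1)   # eq. (1)
        i = int(d.argmin())                # nearest center C_i
        if d[i] < R1:                      # same class: move the center
            centers[i] += beta * (X[m] - centers[i])   # eq. (2), assumed form
            members[i].append(m)
        else:                              # new cluster center
            centers.append(X[m].copy())
            members.append([m])
    return np.asarray(centers), members
```

One pass suffices because each point either updates its nearest center or founds a new one, which is what keeps this level cheap.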
A second stage: region growing is carried out as follows:
Step 2.1: Determine the seed sequence. First traverse all points in the cluster-center set C and count the number of points n_i corresponding to the i-th cluster, i = 1, 2, ..., w. If n_i < minC, the corresponding cluster center point C_i is deleted from C, the corresponding cluster C_i{...} is deleted from M, and its points are stored in the set D; the remaining cluster center points form the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w.
Step 2.2: Define the growth criterion and determine the growth-stop condition. Take the first cluster center C_1 in the seed sequence B and draw a circle with R1 as radius. Count the number of points n_1 inside the circle; if n_1 > minC, continue to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, and judge whether the points entering Q_B1 belong to D; if so, set i = i + 1 and continue growing, where

ΔR = e^(sm(x)) / 10 · i^2 · 0.03    (3)

and sm(x) is the average distance between the data in the x-th cluster of the set M. The points entering the circle are stored in the corresponding cluster, giving an updated M.
Step 2.3: The points obtained after a cluster-center region has grown are not treated as growth objects in the next round. The other cluster center points in C are then traversed by the method of step 2.2, giving each cluster center point and the data of its corresponding cluster.
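A sketch of this second level, continuing the first-level sketch above; ΔR follows the literal reading of equation (3), while the stop test (no new points of D enter the enlarged circle) and the minC handling are assumptions consistent with the text:

```python
import numpy as np

def second_level(X, centers, members, R1, min_c):
    """Region growing (steps 2.1-2.3, sketched): dissolve small clusters
    into D, then let each surviving seed absorb points of D by growing."""
    X = np.asarray(X, dtype=float)
    seeds, grown, D = [], [], set()
    for c, idx in zip(centers, members):   # step 2.1: seed sequence B
        if len(idx) < min_c:
            D.update(idx)                  # dissolved small cluster -> D
        else:
            seeds.append(c)
            grown.append(set(idx))
    for s, cluster in zip(seeds, grown):   # step 2.2: grow each seed
        pts = X[sorted(cluster)]
        sm = float(np.mean(np.linalg.norm(pts - s, axis=1)))  # sm(x)
        i, R = 1, R1
        while D:
            R += np.exp(sm) / 10 * i**2 * 0.03                # eq. (3)
            newly = {m for m in D if np.linalg.norm(X[m] - s) <= R}
            if not newly:                  # growth-stop condition (assumed)
                break
            cluster |= newly               # absorbed points join the cluster
            D -= newly                     # step 2.3: never regrown later
            i += 1
    return seeds, grown
```

Because absorbed points are removed from D, they are never treated as growth objects again, matching step 2.3.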
A third stage: the relation weights between all cluster centers are calculated based on a competition idea, and the clusters are merged by suitable rules.
After the second-level clustering of the data set X, suppose that when all cluster centers compete for a data point X_i, the two winners are the cluster centers C*_x and C*_y, and let d be computed from the two winning distances (the formula appears only as an image in the source; a ratio of the second winner's distance to the first winner's is consistent with the criterion d <= 2.5 given in the detailed description). When d has a value within a certain range, the cluster M_x and the cluster M_y are considered to possess a relation weight. The relation weight between the two small clusters is denoted w_x^y, and its increase criterion is given by equation (4) (also shown only as an image; per the text, w_x^y is increased each time the pair of clusters wins together), where x = min(x, y) and y = max(x, y).
Step 3.1: first, for a data set X ═ X1,...,Xi,...,XNFrom the first data X1Starting to traverse in sequence, and finding out two winners of all cluster centers in the process of competing for data for each specific data
Figure BDA0002185348140000047
And
Figure BDA0002185348140000048
then, the clusters corresponding to the two winners are judged according to the relation weight existence criterion
Figure BDA0002185348140000049
And
Figure BDA00021853481400000410
if the cluster with the weight exists, the relationship weight is increased according to a formula (4), and then the next data is traversed; if the relation weight does not exist, directly traversing the next data until all the data are traversed once in sequence;
After the calculation of the relation weights is completed, they form the set {w_x^y}, where the subscript x takes values from 1 up to the number of clusters in M and the superscript y takes values from x upward.
Step 3.2: Calculate the density similarity between clusters. For the cluster set M obtained by the second-stage clustering, first calculate the intra-cluster density ρ_i of each cluster:

ρ_i = n_i / S_i    (5)

where n_i is the number of points in the i-th cluster and S_i is the area of the i-th cluster, giving ρ = {ρ_1, ..., ρ_i, ..., ρ_d}. Then calculate the density difference Sim_x^y between the x-th and y-th clusters by equation (6) (shown only as an image in the source; per the detailed description, Sim measures the difference between the two densities, smaller being better), where the subscript x takes values from 1 up to d and the superscript y from x up to d.
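A sketch of the weight accumulation and density computation of steps 3.1-3.2; the winner-distance ratio, the +1 increment of equation (4), and the bounding-box estimate of the cluster area S_i are our assumptions (the patent shows equations (4) and (6) only as images):

```python
import numpy as np
from collections import defaultdict

def relation_weights(X, seeds, d_max=2.5):
    """Step 3.1 (sketched): for every data point, all centers compete;
    the two winners' clusters gain relation weight when d <= d_max."""
    S = np.asarray(seeds)
    w = defaultdict(int)                       # w[(x, y)] with x <= y
    for p in np.asarray(X, dtype=float):
        dist = np.linalg.norm(S - p, axis=1)
        order = np.argsort(dist)
        i1, i2 = int(order[0]), int(order[1])  # first and second winner
        d = dist[i2] / max(dist[i1], 1e-12)    # existence criterion (ratio assumed)
        if d <= d_max:
            w[(min(i1, i2), max(i1, i2))] += 1 # eq. (4): increment (assumed)
    return w

def intra_cluster_density(X, grown):
    """Step 3.2: rho_i = n_i / S_i, eq. (5); the cluster area S_i is
    approximated here by the bounding-box area (an assumption)."""
    X = np.asarray(X, dtype=float)
    rho = []
    for cluster in grown:
        pts = X[sorted(cluster)]
        span = pts.max(axis=0) - pts.min(axis=0)
        rho.append(len(pts) / max(float(np.prod(span)), 1e-12))
    return np.asarray(rho)
```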
Step 3.3: When the relation weight w_x^y and the density similarity Sim_x^y both satisfy their conditions, the cluster M_x and the cluster M_y may be merged.
Assume the finally formed cluster set is M_k, where each value of the subscript k corresponds to an independent cluster; initialize k = 1, M_k = {}, and the relation-weight subscript x = 1. The subscript x of the relation weight w_x^y runs from 1 up to the number of clusters and the superscript y from x upward; when x = y, set w_x^y = 0 so that only distinct pairs are considered. A relation weight w_x^y that does not satisfy the weight condition and the density-similarity condition leaves the two small clusters unprocessed. A relation weight w_x^y that satisfies both conditions triggers a merge: if M_x or M_y shares elements with the current M_k, both are merged into M_k at the same time; otherwise set k = k + 1 and merge M_x and M_y into a new cluster M_k. Identical elements present in merged clusters are combined into the same item.
Step 3.4: The finally formed cluster set is M_k, k = 1, 2, ..., K, and clustering ends.
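A sketch of the merge logic of steps 3.3-3.4, using the thresholds the detailed description reports (a link threshold of 40% to 50% of the two clusters' combined size, and Sim below 1.5); the Sim form as a max/min density ratio is an assumption:

```python
def third_level(grown, w, rho, link_frac=0.45, sim_max=1.5):
    """Merge cluster pairs whose relation weight and density similarity
    both satisfy their thresholds; overlapping merges share one M_k."""
    final = []                                  # list of merged index sets M_k
    for (x, y), weight in sorted(w.items()):
        sim = max(rho[x], rho[y]) / max(min(rho[x], rho[y]), 1e-12)  # eq. (6), assumed
        if weight >= link_frac * (len(grown[x]) + len(grown[y])) and sim < sim_max:
            pair = grown[x] | grown[y]
            hits = [k for k, mk in enumerate(final) if mk & pair]
            if hits:                            # shares elements with an existing M_k
                keep = hits[0]
                for k in hits[1:]:
                    final[keep] |= final[k]     # identical elements -> same item
                final[keep] |= pair
                final = [mk for k, mk in enumerate(final) if k not in hits[1:]]
            else:                               # otherwise k = k + 1
                final.append(pair)
    return final
```

Pairs that fail either threshold are simply left unprocessed, exactly as the text prescribes.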
The region growing of the invention is a process that gradually aggregates a data point or sub-data-set region into a complete, independent connected region according to a predefined growth rule. For a target region R of interest in the spatial data, with z a seed point found in advance on R, the data within a certain neighborhood of z that meet a similarity criterion are gradually merged with z into a seed group for the next stage of growth, and the growth loops until the stop condition is met, completing the process by which the region of interest grows from one seed point into an independent connected region. The similarity criterion can be the distance between data, density, or other related attributes. A region-growing algorithm is therefore generally implemented in three steps: (1) determine the growing seed points; (2) specify the growth criterion; (3) determine the growth-stop condition.
The invention adopts the idea of hierarchical clustering and divides the clustering process into three levels. The first-level clustering divides the objects into a number of subclasses based on the distance threshold R1; the second level grows these subclasses by region growing over the not-yet-clustered data; finally, the weights among all cluster centers are calculated based on the competition idea and the density-similarity principle, and the clusters are merged by suitable rules.
The beneficial effects of the invention are as follows:
(1) The first-level distance-based clustering simplifies the algorithm and reduces its complexity.
(2) The second level solves the problem of variable-scale density data by the seed-region growing method.
(3) The third-level merging part provides both a relation-weight threshold and a density-similarity threshold, making the merging of small clusters more reasonable and doubly safeguarded; the problem of non-convex clustering is solved effectively and merging accuracy is improved.
(4) Through the reasonable design and fusion of the three-level algorithm, the overall method avoids multi-layer iteration and greatly reduces algorithmic complexity.
Drawings
FIG. 1 is the overall flow diagram of the method of the present invention;
FIG. 2 is a flow chart of the first-level clustering of the algorithm of the present invention;
FIG. 3 is a flow chart of the second-level clustering of the algorithm of the present invention;
FIG. 4 is a flow chart of the third-level clustering of the algorithm of the present invention;
FIG. 5 is the final clustering result of the algorithm of the present invention run on an occlusion data set;
FIG. 6 is the final clustering result of the algorithm of the present invention run on the non-uniform density data set "new".
Detailed Description
For the purpose of illustrating the objects, technical solutions and advantages of the present invention, the present invention will be described in further detail below with reference to specific embodiments and accompanying drawings.
Referring to FIGS. 1 to 6, a hierarchical clustering method based on region growing and competition for variable-scale data density spaces includes the following steps.
A first stage: the cluster centers are updated by drawing circles with the distance threshold R1 as radius, as follows:
Step 1.1: Input an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P, where x denotes a sample point, P the sample dimension, and N the number of samples. Randomly take the i-th data object x_i from X and store it as the first cluster center in the set C = {}; then randomly take the j-th data object x_j and calculate the Euclidean distance d(x_i, x_j) by equation (1). If d(x_i, x_j) is less than R1 (R1 is 10% of the spatial size of the data set), the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C. If d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center: C = {x_i, x_j}. In equation (2), S is the updated cluster center and β is the weight coefficient (β = 1/16).
Step 1.2: From the data set X (excluding x_i, x_j), randomly take the m-th data object x_m, calculate the Euclidean distances to the n points of the set C, determine the closest cluster-center point C_i, and use the points x_m and C_i to update the cluster centers by the method of step 1.1.
Step 1.3: Repeat steps 1.1 and 1.2 until all points in X have been traversed, obtaining the updated cluster-center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, and the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}}.
A second stage: region growing is carried out as follows:
Step 2.1: Determine the seed sequence. Traverse all points in the cluster-center set C and count the number of points n_i of the i-th cluster, i = 1, 2, ..., w. If n_i < minC (minC is 5% of all samples), no cluster is formed: the corresponding cluster center point C_i is deleted from C, the corresponding cluster C_i{...} is deleted from M, and its points are stored in the set D. The remaining cluster center points form the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w.
Step 2.2: Define the growth criterion and determine the growth-stop condition. Take the first cluster center C_1 in the seed sequence B and draw a circle with R1 as radius. Count the number of points n_1 inside; if n_1 > minC, continue to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, with ΔR as in equation (3), and judge whether the points entering Q_B1 belong to D; if so, set i = i + 1 and continue growing. Here sm(x) is the average distance between the data in the x-th cluster of M, and the points entering the circle are stored in the corresponding cluster, giving an updated M.
Step 2.3: The points obtained after a cluster-center region has grown are not treated as growth objects in the next round; the other cluster center points in C are then traversed by the method of step 2.2, giving each cluster center point and the data of its corresponding cluster.
And a third stage: and calculating the relation weight among all cluster centers of the clusters by a competition-based idea, and adopting a proper rule to merge the clusters.
After the data set X is subjected to the second-level clustering, if all cluster centers carry out the second-level clustering on the data XiIn the competition process, the winner is the heart of the cluster respectively
Figure BDA0002185348140000081
And
Figure BDA0002185348140000082
getWhen d has a value in a certain range, we consider the cluster
Figure BDA0002185348140000084
Hezhou cluster
Figure BDA0002185348140000085
There is a relational weight. When d < ═ 2.5, the algorithm has better clustering quality. And taking d < 2.5 as existence criterion of the relation weight. Increase criterion of the relational weight: by using
Figure BDA00021853481400000816
Expressing the weight of the relationship between the two small clusters, the calculation method is as follows (4)
Figure BDA0002185348140000086
Wherein, in the formula (4)
Figure BDA0002185348140000087
Figure BDA0002185348140000088
Where x is min (x, y) and y is max (x, y).
Step 3.1: first, for a data set X ═ X1,...,Xi,...,XNFrom the first data X1Starting to traverse in sequence, and finding out two winners of all cluster centers in the process of competing for data for each specific data
Figure BDA0002185348140000089
And
Figure BDA00021853481400000810
then according to the above-mentioned relation weight existence criterionCluster corresponding to two broken winners
Figure BDA00021853481400000811
And
Figure BDA00021853481400000812
if the cluster with the weight exists, the relationship weight is increased according to a formula (4), and then the next data is traversed; if no relationship weight exists, the next data is directly traversed. Until all data has been traversed once in turn.
After the calculation of the relationship weight is completed, the relationship weight is formed as
Figure BDA00021853481400000813
Where subscript x takes on values from 1 up to M and superscript y takes on values from x up to M.
Step 3.2: calculating density similarity between each cluster, firstly calculating the intra-cluster density rho of each cluster for the cluster set M clustered at the second stagei
ρi=ni/Si(5)
niIs the number of points included in the ith cluster, SiIs the area size of the ith cluster. ρ ═ ρ1,...,ρi,...,ρdAnd calculating a density difference between the x-th cluster and the y-th clusterNamely:
Figure BDA00021853481400000815
subscript x takes on values from 1 up to d, and superscript y takes on values from x up to d.
Step 3.3: When the relation weight w_x^y and the density similarity Sim_x^y both satisfy their conditions, the cluster M_x and the cluster M_y may be merged. Experiments have found that a link threshold for the relation weight of about 40% to 50% of the total number of data points in the two small clusters works well; Sim represents the difference between the two densities, so smaller is better, and a value below 1.5 is used.
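Collected in one place, the concrete parameter values this detailed description fixes (all values are from the text above; the dictionary form and names are ours):

```python
# Parameter values stated in this detailed description.
PARAMS = {
    "R1_fraction": 0.10,    # R1 = 10% of the spatial size of the data set
    "beta": 1 / 16,         # weight coefficient in the center update, eq. (2)
    "minC_fraction": 0.05,  # minC = 5% of all samples
    "d_max": 2.5,           # existence criterion for relation weights
    "link_fraction": (0.40, 0.50),  # weight threshold vs. combined cluster size
    "sim_max": 1.5,         # density-similarity threshold, eq. (6)
}
```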
Assume the finally formed cluster set is M_k, where each value of the subscript k corresponds to an independent cluster; initialize k = 1, M_k = {}, and the relation-weight subscript x = 1. The subscript x of the relation weight w_x^y runs from 1 up to the number of clusters and the superscript y from x upward; when x = y, set w_x^y = 0 so that only distinct pairs are considered. A relation weight w_x^y that does not satisfy the weight condition and the density-similarity condition leaves the two small clusters unprocessed. A relation weight w_x^y that satisfies both conditions triggers a merge: if M_x or M_y shares elements with the current M_k, both are merged into M_k at the same time; otherwise set k = k + 1 and merge M_x and M_y into a new cluster M_k. Identical elements present in merged clusters are combined into the same item.
Step 3.4: The finally formed cluster set is M_k, k = 1, 2, ..., K.
The effects of the present invention can be further illustrated by the following simulation experiments.
1) Simulation conditions
The operating system used for the experiments is Windows 10, the simulation software is Matlab R2018b (64-bit), the processor is an Intel(R) Core(TM) i7, and the installed memory is 8.00 GB.
Table 1 lists partial UCI real data (the table appears only as an image in the source).
2) Simulation results
The algorithm of the invention, the DBSCAN algorithm, and the K-means algorithm were compared on a UCI data set with scale transformation and a group of artificial variable-scale data sets ("new"). To further verify performance on real data sets, experiments were carried out on the 4 data sets in Table 1, and the common ACC and F-measure indices were adopted to evaluate the clustering results; both take values in [0, 1], and larger values indicate a better clustering effect.
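The ACC index used here is conventionally computed with the best one-to-one label matching (Hungarian algorithm); a generic sketch of that standard definition, not code from the patent (assumes integer labels starting at 0):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_acc(y_true, y_pred):
    """ACC: fraction of points correctly labeled under the best
    one-to-one mapping between predicted clusters and true classes."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    n = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                       # contingency matrix
    row, col = linear_sum_assignment(-count)   # maximize matched counts
    return count[row, col].sum() / len(y_true)
```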
TABLE 2 (the comparison table appears only as an image in the source)
As can be seen from Table 2, the method of the present invention achieves better results than the conventional DBSCAN and K-means algorithms, and its complexity is lower than DBSCAN's as measured by running time, especially when the amount of data is large; it therefore has good practical engineering application value.
Details not described in this specification are well known to those skilled in the art.

Claims (1)

1. A region-growing and competition-based tourist destination data hierarchical clustering method for variable-scale data density spaces, characterized by comprising the following steps:
a first stage: the cluster centers are updated by drawing circles with the distance threshold R1 as radius, as follows:
step 1.1: input an unlabeled data set X = {x_1, x_2, ..., x_i, ..., x_N} ∈ R^P; randomly take the i-th data object x_i from X and store it as the first cluster center in the set C = {}; then randomly take the j-th data object x_j and calculate the Euclidean distance between x_i and x_j by equation (1),

d(x_i, x_j) = ||x_i − x_j||_2 = sqrt( Σ_{p=1..P} (x_{i,p} − x_{j,p})^2 )    (1)

if d(x_i, x_j) is less than R1 (R1 is 10% of the spatial size of the data set), the points x_i and x_j are of the same class, and a new cluster center point S is calculated according to equation (2) to replace the point x_i in C; if d(x_i, x_j) is greater than R1, x_i and x_j are not of the same class, and x_j is also stored as a cluster center: C = {x_i, x_j}; in equation (2), S is the updated cluster center and β is the weight coefficient;
step 1.2: from the data set X (excluding x_i, x_j), randomly take the m-th data object x_m, calculate the Euclidean distances to the n points of the set C, determine the closest cluster-center point C_i, and use the points x_m and C_i to update the cluster centers by the method of step 1.1;
step 1.3: repeat steps 1.1 and 1.2 until all points in X have been traversed, obtaining the updated cluster-center set C = {C_1, ..., C_i, ..., C_w}, where w is the number of clusters, and the corresponding cluster set M = {C_1{...}, ..., C_i{...}, ..., C_w{...}};
a second stage: region growing is carried out as follows:
step 2.1: determine the seed sequence: traverse all points in the cluster-center set C and count the number of points n_i of the i-th cluster, i = 1, 2, ..., w; if n_i < minC, no cluster is formed: the corresponding cluster center point C_i is deleted from C, the corresponding cluster C_i{...} is deleted from M, and its points are stored in the set D; the remaining cluster center points form the seed sequence B = {C_1, ..., C_i, ..., C_d}, d <= w;
step 2.2: define the growth criterion and determine the growth-stop condition: take the first cluster center C_1 in the seed sequence B, draw a circle with R1 as radius, and count the number of points n_1 inside; if n_1 > minC, continue to draw a circle Q_B1 centered at C_1 with radius R = R1 + ΔR, and judge whether the points entering Q_B1 belong to D; if so, set i = i + 1 and continue growing, where

ΔR = e^(sm(x)) / 10 · i^2 · 0.03    (3)

and sm(x) is the average distance between the data in the x-th cluster of M; the points entering the circle are stored in the corresponding cluster, giving an updated M;
step 2.3: the points obtained after a cluster-center region has grown are not treated as growth objects in the next round; the other cluster center points in C are then traversed by the method of step 2.2, giving each cluster center point and the data of its corresponding cluster;
a third stage: the relation weights and density similarities between all cluster centers are calculated based on a competition idea, and the clusters are merged by suitable rules;
after the second-level clustering of the data set X, suppose that when all cluster centers compete for a data point X_i, the two winners are the cluster centers C*_x and C*_y, from whose winning distances d is computed; when d has a value within a certain range, the cluster M_x and the cluster M_y are considered to possess a relation weight; the increase criterion of the relation weight: w_x^y denotes the relation weight between the two small clusters and is calculated (incremented) as in equation (4), where x = min(x, y) and y = max(x, y);
step 3.1: for the data set X = {X_1, ..., X_i, ..., X_N}, traverse sequentially from the first data point X_1; for each specific data point, find the two winners C*_x and C*_y among all cluster centers competing for that point; then judge, by the relation-weight existence criterion, whether the corresponding clusters M_x and M_y possess a relation weight; if they do, increase the relation weight according to equation (4) and traverse the next data point; if not, traverse the next data point directly, until all the data have been traversed once in sequence; after the calculation is completed, the relation weights form the set {w_x^y}, where the subscript x takes values from 1 up to the number of clusters in M and the superscript y takes values from x upward;
step 3.2: calculate the density similarity between clusters: for the cluster set M clustered at the second stage, first calculate the intra-cluster density ρ_i of each cluster,

ρ_i = n_i / S_i    (5)

where n_i is the number of points in the i-th cluster and S_i is the area of the i-th cluster, giving ρ = {ρ_1, ..., ρ_i, ..., ρ_d}; then calculate the density difference Sim_x^y between the x-th and y-th clusters by equation (6), where the subscript x takes values from 1 up to d and the superscript y from x up to d;
step 3.3: when the relation weight w_x^y and the density similarity Sim_x^y both satisfy their conditions, the cluster M_x and the cluster M_y may be merged; assume the finally formed cluster set is M_k, where each value of the subscript k corresponds to an independent cluster; initialize k = 1, M_k = {}, and the relation-weight subscript x = 1; the subscript x of the relation weight w_x^y runs from 1 up to the number of clusters and the superscript y from x upward; when x = y, set w_x^y = 0; a relation weight w_x^y that does not satisfy the weight condition and the density-similarity condition leaves the two small clusters unprocessed; a relation weight w_x^y that satisfies both conditions triggers a merge: if M_x or M_y shares elements with the current M_k, both are merged into M_k at the same time; otherwise set k = k + 1 and merge M_x and M_y into a new cluster M_k; identical elements present in merged clusters are combined into the same item;
step 3.4: the finally formed cluster set is M_k, k = 1, 2, ..., K.
CN201910812062.5A 2019-08-30 2019-08-30 Hierarchical clustering method for tourist destination data Active CN110728293B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910812062.5A | 2019-08-30 | 2019-08-30 | Hierarchical clustering method for tourist destination data

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910812062.5A | 2019-08-30 | 2019-08-30 | Hierarchical clustering method for tourist destination data

Publications (2)

Publication Number Publication Date
CN110728293A true CN110728293A (en) 2020-01-24
CN110728293B CN110728293B (en) 2021-10-29

Family

ID=69218832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910812062.5A Active CN110728293B (en) 2019-08-30 2019-08-30 Hierarchical clustering method for tourist heading data

Country Status (1)

Country Link
CN (1) CN110728293B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071140A1 (en) * 2001-05-18 2005-03-31 Asa Ben-Hur Model selection for cluster data analysis
CN101523412A (en) * 2006-10-11 2009-09-02 惠普开发有限公司 Face-based image clustering
US20170161606A1 (en) * 2015-12-06 2017-06-08 Beijing University Of Technology Clustering method based on iterations of neural networks
CN106776849A (en) * 2016-11-28 2017-05-31 西安交通大学 A kind of method and guide system to scheme quick-searching sight spot

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MADAN S et al.: "Modified balanced iterative reducing and clustering using hierarchies (m-BIRCH) for visual clustering", Pattern Analysis and Applications *
LI Chunzhong et al.: "Hierarchical clustering algorithm based on multi-scale information fusion", Chinese Journal of Engineering Mathematics *

Also Published As

Publication number Publication date
CN110728293B (en) 2021-10-29

Similar Documents

Publication Publication Date Title
WO2018086433A1 (en) Medical image segmenting method
WO2018166270A2 (en) Index and direction vector combination-based multi-objective optimisation method and system
CN104217015B Hierarchical clustering method based on sharing mutual nearest neighbors
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN109002858B (en) Evidence reasoning-based integrated clustering method for user behavior analysis
CN109271427A A kind of clustering method based on neighbour's density and manifold distance
CN115641177B (en) Second-prevention killing pre-judging system based on machine learning
CN108416381B (en) Multi-density clustering method for three-dimensional point set
CN113435108A (en) Battlefield target grouping method based on improved whale optimization algorithm
CN113128617B (en) Spark and ASPSO based parallelization K-means optimization method
CN110580252B (en) Space object indexing and query method under multi-objective optimization
Xing et al. Fuzzy c-means algorithm automatically determining optimal number of clusters
CN110781943A (en) Clustering method based on adjacent grid search
CN110728293B (en) Hierarchical clustering method for tourist heading data
CN108897820B (en) Parallelization method of DENCLUE algorithm
CN108446740B (en) A kind of consistent Synergistic method of multilayer for brain image case history feature extraction
CN117093885A (en) Federal learning multi-objective optimization method integrating hierarchical clustering and particle swarm
Mir et al. Improving data clustering using fuzzy logic and PSO algorithm
Patel et al. Study and analysis of particle swarm optimization for improving partition clustering
CN113469107B (en) Bearing fault diagnosis method integrating space density distribution
CN115690476A (en) Automatic data clustering method based on improved harmony search algorithm
Cui et al. Weighted particle swarm clustering algorithm for self-organizing maps
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm
CN115577235A (en) Bearing fault diagnosis method for optimizing fuzzy C-means by improving firework algorithm
CN115510959A (en) Density peak value clustering method based on natural nearest neighbor and multi-cluster combination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant