CN104143009B - Competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box


Info

Publication number
CN104143009B
Authority
CN
China
Prior art keywords
input data
clustering
seed
seed point
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410419179.4A
Other languages
Chinese (zh)
Other versions
CN104143009A (en)
Inventor
陈仁喜
周绍光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201410419179.4A priority Critical patent/CN104143009B/en
Publication of CN104143009A publication Critical patent/CN104143009A/en
Application granted granted Critical
Publication of CN104143009B publication Critical patent/CN104143009B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box, and proposes obtaining the initial seed points by maximum-gap segmentation of a dynamic bounding box: the bounding box of the data is first computed in the multidimensional feature space, the data points inside the bounding box are projected onto its longest axis, the position of the largest spacing between adjacent projection points is found, and the bounding box is split in two there; this is applied recursively until the whole space is cut into enough subspaces, and finally the centers of the subspaces are computed as the initial seed points. For the phenomenon of a single cluster being broken into multiple classes, the invention further proposes merging clusters by a distance-radius analysis, which adaptively reassembles each fragmented class group into a complete cluster. The invention avoids the omissions caused by randomized seed points as well as cluster fragmentation, and is conducive to quickly obtaining the true clustering result.

Description

Competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box
Technical Field
The invention relates to a competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box, and belongs to the technical field of data mining.
Background
Clustering is the process of grouping a collection of physical or abstract data objects into multiple classes or clusters, and it is an effective means for people to recognize and explore the internal relationships between things. Commonly used clustering methods include K-means, ISODATA, fuzzy clustering, and the like. K-means is a clustering method based on the mean-square-error (MSE) minimization criterion, but such algorithms have two major drawbacks: 1) K-means needs the exact number of categories to be fixed in advance, which is difficult to determine in practical applications; 2) the so-called "dead unit" phenomenon easily arises: if an initial cluster center is chosen poorly, no input data may be assigned to it, and it becomes a dead unit. To overcome these drawbacks, researchers have proposed competitive learning (CL) clustering algorithms. Frequency-Sensitive Competitive Learning (FSCL) solves the dead-unit problem by a mechanism that reduces the winning rate of frequently winning seed points. The Rival Penalized Competitive Learning (RPCL) algorithm applies a penalization mechanism to the second-winning (rival) seed point, pushing redundant seed points away from the input sample space and thereby determining the number of categories automatically. Rival Penalization Controlled Competitive Learning (RPCCL) improves RPCL by determining the de-learning rate automatically, avoiding RPCL's sensitivity to that rate. The distance-sensitive (DSRPL) algorithm is based on a cost-function minimization criterion. Although these improved competitive learning algorithms raise performance in some respects, convergence problems remain, and the penalization mechanism in these algorithms also biases the positioning of the cluster centers. The Competitive and Cooperative Learning (CCL) algorithm introduces a cooperation mechanism that keeps redundant seed points from being pushed out of the input sample space while still positioning the cluster centers accurately; the CCL algorithm also avoids the non-convergence problem of the RPCCL clustering algorithm. However, the CCL algorithm still has some unavoidable problems: 1) sensitivity to the initial seed points: common clustering algorithms obtain the initial seed points by randomization, which makes both the number of iterations and the clustering result unstable; 2) unsuitability for heterogeneous data with unbalanced distributions, in which clusters containing few data points cannot be identified correctly; 3) fragmentation of the clustering result: the CCL algorithm sometimes breaks data that originally belong to the same cluster into multiple subclasses, although intuitively these data should belong to one category.
These problems limit the effectiveness and practical value of the CCL clustering algorithm, and it is necessary to remedy these defects of the CCL algorithm.
Disclosure of Invention
The invention aims to provide a competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box, which makes targeted improvements to the original CCL clustering algorithm and obtains the true clustering result more quickly.
In order to achieve this purpose, the technical solution adopted by the invention is as follows:
the competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box comprises the following steps:
1) setting an initial clustering category number K;
2) analyzing the N input data and initializing K seed points with the dynamic bounding-box maximum-gap segmentation algorithm, specifically:
2-1) treating the input data as points in a multidimensional space and computing the minimum bounding rectangle that contains all the input data;
2-2) comparing the lengths of all dimensions of the minimum bounding rectangle and selecting the dimension with the maximum length as the segmentation axis;
2-3) projecting all input data points onto the segmentation axis, then sorting the projection points in ascending order;
2-4) computing the distance between each pair of adjacent projection points, selecting the pair with the largest distance as the segmentation position, and dividing the input data into two subsets along the segmentation axis;
2-5) selecting, from all current subsets, the one whose bounding box has the largest volume, performing steps 2-1)-2-4) on it again, and dividing it into two;
2-6) repeating step 2-5) until K subsets are obtained;
2-7) computing the geometric centers of the K subsets obtained and taking them as the initial seed points;
3) set the win count of each initial seed point to n_k = 1, k = 1, ..., K;
4) for the current input datum x_i, calculate the index function I(j|x_i), where c_p denotes the p-th seed point, r_p denotes the relative win ratio of the p-th seed point,
r_p = n_p / Σ_{j=1}^{K} n_j,
and n_p is the win count of the p-th seed point; find the seed point that satisfies I(j|x_i) = 1 and mark it as the winning seed point c_w;
5) find all seed points inside the circle centered on the winning seed point c_w with radius ||c_w - x_i||;
6) update all seed points in the cooperative group as follows:
c_u^new = c_u^old + η (x_i - c_u^old),
where c_u^old denotes a seed point before the update, c_u^new denotes the updated seed point, and η is the learning-rate parameter;
7) update the win count of the winning seed point c_w as follows:
n_w^new = n_w^old + 1,
where n_w^old is the win count of c_w before the update and n_w^new is its win count after the update;
8) repeat steps 4)-7) until the seed points no longer change;
9) remove repeated seed points;
10) perform the cluster-merging operation to form the final clustering result:
assume that after the iteration and the removal of repeated seed points, M seed points remain (M ≤ K); these are called cluster centers and are denoted d_m, m = 1 ... M; each input datum is then labeled with the cluster center to which it belongs, and the specific operation of cluster merging is as follows:
10-1) using the label information Lab(x_i) recording the cluster center to which each input datum belongs, compute the radius R_m, m = 1 ... M, that each cluster center covers;
10-2) take two cluster centers d_q and d_t, q ∈ [1, M], t ∈ [1, M], q < t, and compute the Euclidean distance D_qt between them; if
D_qt ≤ R_q or D_qt ≤ R_t,
then relabel the input data whose label information Lab(x_i) equals t as q, i.e. merge class t into class q;
10-3) perform the operation of step 10-2) on every pair of cluster centers until no clusters can be merged;
10-4) recompute the cluster centers of the merged classes to obtain the final H (H ≤ M) cluster centers.
The initial number of clustering categories K in step 1) is much greater than the actual number of categories K*.
The value of the learning rate parameter η in the foregoing step 6) is 0.001.
In the foregoing step 9), removing repeated seed points means that when several seed points converge to the same position, all but one of them are deleted.
In the foregoing step 10), labeling each input datum with the cluster center to which it belongs means computing, for every input datum x_i, the nearest cluster center; if x_i is nearest to the s-th cluster center, its label Lab(x_i) is set to s, indicating that it belongs to the s-th cluster center:
Lab(x_i) = s.
In the foregoing step 10-1), the radius R_m is computed as follows: find the distances between the m-th cluster center and all input data belonging to it, and take the maximum as the radius R_m.
The invention has the advantages that:
the invention adopts the maximum gap segmentation method of the dynamic bounding box, can automatically select and obtain initial seed points according to the distribution rule of the input data, accelerates the clustering speed, improves the stability of the algorithm, and can be suitable for heterogeneous data with non-uniform class distribution. In addition, the initial seed point obtained by the method is closer to a real clustering center, so that the convergence speed of the algorithm can be increased.
The invention adopts a maximum gap segmentation method of the dynamic bounding box to obtain the seed points, and can also obtain the initial seed points for certain clustering with rare data, thereby avoiding the omission phenomenon caused by randomizing the seed points.
The invention adopts a distance radius analysis method to carry out merging operation on the clusters, can avoid cluster fragmentation and is beneficial to obtaining a real clustering result.
Drawings
FIG. 1 is a schematic diagram of a method of the present invention for dividing all input data into 2 subsets;
FIG. 2 is a schematic diagram of the division into 3 subsets based on FIG. 1;
FIG. 3 is a schematic diagram of initialization seed points obtained using a randomization method;
FIG. 4 is a schematic diagram of a clustering result obtained by using an original CCL clustering method based on FIG. 3;
FIG. 5 is a schematic diagram of initialization seed points obtained again using the randomization method;
FIG. 6 is a diagram illustrating the clustering result obtained by re-using the original CCL clustering method;
FIG. 7 is a schematic diagram of initialized seed points obtained by the dynamic segmentation method of the present invention;
FIG. 8 is a schematic diagram of the clustering results obtained by the method of the present invention;
FIG. 9 is a diagram illustrating a clustering result obtained by using the original CCL algorithm;
FIG. 10 is a diagram showing a clustering result obtained by the method of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings and detailed description.
Through experiments with and analysis of the original CCL clustering algorithm, the above defects were identified, and the invention remedies them in a targeted manner. The concepts and symbols involved in the invention are defined as follows:
Input sample: i.e. the input data; each input datum is a multi-dimensional vector denoted x_i, where i indexes the i-th input datum.
Seed point: also referred to in the present invention as a cluster center; a vector of the same dimension as the input data, denoted c_i, where i indexes the i-th seed point.
Winner: for the current input datum x_i, the seed point nearest to x_i is called the winner and is denoted c_w; distances here are Euclidean.
Cooperative group: for the current input datum x_i, the seed points lying inside the circle centered on the winner c_w with radius ||c_w - x_i||, together with c_w itself, are called the cooperative group.
The competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box disclosed by the invention comprises the following steps.
For N input data x_1, x_2, ..., x_N:
1) Set the initial number of clustering classes K to be much larger than the actual number of classes K*.
2) Analyze the N input data and initialize K seed points with the dynamic bounding-box maximum-gap segmentation algorithm, as follows:
2-1) Treat the input data as points in a multi-dimensional space and compute the minimum bounding rectangle that contains all the input data; if the input data are 2-dimensional, the minimum bounding rectangle is a planar rectangle, if they are 3-dimensional it is a 3-dimensional cuboid, and so on.
2-2) Compare the lengths of all dimensions of the minimum bounding rectangle and select the dimension with the maximum length as the segmentation axis.
2-3) Project all input data points onto the segmentation axis; the projected points are now effectively 1-dimensional data. Then sort the projection points in ascending order.
2-4) Compute the distance between each pair of adjacent projection points and select the pair with the largest distance as the segmentation position: the smaller of the two projection points, together with all input data whose projections are not larger than it, forms one subset, and the other projection point, together with all input data whose projections are not smaller than it, forms the other subset, so that the input data are divided into two subsets along the segmentation axis. Fig. 1 shows the result of dividing the input data into two subsets along the x-axis, with one subset inside each box.
2-5) Next, select from the current subsets the one whose bounding box has the largest volume, perform steps 2-1)-2-4) on it again, and divide it into two, as shown in Fig. 2, where each subset lies inside one box. Then, from these 3 subsets, the one with the largest bounding-box volume is again divided into two. This process continues until K subsets are obtained.
2-6) Compute the geometric centers of the K subsets obtained and take them as the initial seed points. This yields the K initial seed points needed for clustering.
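A minimal NumPy sketch of this seed-initialization procedure is given below; it is an illustration only, not the reference implementation of the invention. The function names max_gap_split, bbox_volume, and max_gap_seeds, the array-based data layout, and the use of the centroid as the geometric center of each subset are assumptions made for the example.

import numpy as np

def max_gap_split(points):
    # steps 2-1) to 2-4): split one subset in two at the largest gap
    # along the longest axis of its minimum bounding rectangle
    mins, maxs = points.min(axis=0), points.max(axis=0)
    axis = int(np.argmax(maxs - mins))                 # dimension with the maximum length
    order = np.argsort(points[:, axis])                # sort projections in ascending order
    proj = points[order, axis]
    cut = int(np.argmax(np.diff(proj))) + 1            # split after the largest adjacent gap
    return points[order[:cut]], points[order[cut:]]

def bbox_volume(points):
    return float(np.prod(points.max(axis=0) - points.min(axis=0)))

def max_gap_seeds(data, K):
    # steps 2-5) and 2-6): recursively split the subset with the largest
    # bounding-box volume until K subsets exist, then return their centers
    subsets = [np.asarray(data, dtype=float)]
    while len(subsets) < K:
        vols = [bbox_volume(s) if len(s) > 1 else -1.0 for s in subsets]
        i = int(np.argmax(vols))
        if vols[i] < 0:                                # nothing left to split
            break
        left, right = max_gap_split(subsets.pop(i))
        subsets += [left, right]
    return np.array([s.mean(axis=0) for s in subsets]) # centroids as initial seed points

# usage (illustrative): seeds = max_gap_seeds(X, K) for an (N, d) data array X

Calling max_gap_seeds on the data of Figs. 1 and 2 would reproduce the successive splits illustrated there, one subset per box.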
3) Set the win count of each initial seed point to n_k = 1, k = 1, ..., K.
4) For the current input datum x_i, calculate the index function I(j|x_i), where c_p denotes the p-th seed point, r_p denotes the relative win ratio of the p-th seed point,
r_p = n_p / Σ_{j=1}^{K} n_j,
and n_p is the win count of the p-th seed point. Find the seed point that satisfies I(j|x_i) = 1 and take it as the winner, denoted c_w.
5) Find all seed points inside the circle centered on the winning seed point c_w with radius ||c_w - x_i||; these form the cooperative group.
6) Update all seed points in the cooperative group as follows:
c_u^new = c_u^old + η (x_i - c_u^old),
where c_u^old denotes a seed point before the update, c_u^new denotes the updated seed point, and η is the learning-rate parameter, generally taken as 0.001.
7) Update the win count of the winning seed point c_w as
n_w^new = n_w^old + 1,
keeping the win counts of the other seed points unchanged, where n_w^old is the win count of c_w before the update and n_w^new is its win count after the update.
8) Repeat steps 4)-7) until the seed points no longer change.
9) Remove repeated seed points. Since the cooperative-group mechanism in the iterative process may make several different seed points converge to the same position, only one of them needs to be kept.
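The iteration of steps 3) to 9) can be sketched as follows, again only as an illustration. Since the explicit form of the index function I(j|x_i) is not reproduced in the text, the sketch assumes the standard frequency-sensitive winner rule, i.e. the winner minimizes r_p * ||x_i - c_p||^2; the function name ccl_iterate and the tolerance parameter tol used for detecting convergence and duplicate seed points are likewise assumptions.

import numpy as np

def ccl_iterate(data, seeds, eta=0.001, max_epochs=1000, tol=1e-6):
    # steps 3) to 9): competitive-cooperative updates followed by removal
    # of seed points that converged to the same position
    data = np.asarray(data, dtype=float)
    seeds = np.asarray(seeds, dtype=float).copy()
    wins = np.ones(len(seeds))                              # step 3): n_k = 1
    for _ in range(max_epochs):
        old = seeds.copy()
        for x in data:
            r = wins / wins.sum()                           # relative win ratios r_p
            d2 = np.sum((seeds - x) ** 2, axis=1)
            w = int(np.argmin(r * d2))                      # assumed frequency-sensitive winner rule
            radius = np.linalg.norm(seeds[w] - x)           # cooperative-group radius ||c_w - x_i||
            group = np.linalg.norm(seeds - seeds[w], axis=1) <= radius
            seeds[group] += eta * (x - seeds[group])        # c_u_new = c_u_old + eta * (x_i - c_u_old)
            wins[w] += 1                                    # n_w_new = n_w_old + 1
        if np.allclose(seeds, old, atol=tol):               # step 8): seeds no longer change
            break
    kept = []                                               # step 9): drop duplicate seed points
    for c in seeds:
        if not any(np.linalg.norm(c - k) < tol for k in kept):
            kept.append(c)
    return np.array(kept)

For example, ccl_iterate(X, max_gap_seeds(X, K)) would run the competitive-cooperative phase starting from the dynamically initialized seed points of the previous sketch.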
10) Perform the cluster-merging operation, based on analysis of inter-cluster distances and radii, to form the final clustering result.
The purpose of the inter-cluster distance analysis and cluster merging is to eliminate the cluster-fragmentation problem and obtain the true clustering result. Assume that after the preceding iteration and the removal of repeated seed points, M seed points remain; these are called cluster centers and are denoted d_m, m = 1 ... M, with M ≤ K. Then, for every input datum x_i, compute the nearest cluster center. If x_i is nearest to the s-th cluster center, its label Lab(x_i) is set to s, indicating that it belongs to the s-th cluster center:
Lab(x_i) = s.
The specific operation of cluster merging is as follows:
10-1) Using the label information Lab(x_i) recording the cluster center to which each input datum belongs, compute the radius R_m, m = 1 ... M, that each cluster center covers. The radius R_m is computed as follows: find the distances between the m-th cluster center and all input data belonging to it, and take the maximum as R_m.
10-2) Take two cluster centers d_q and d_t, q ∈ [1, M], t ∈ [1, M], q < t, and compute the Euclidean distance D_qt between them. If
D_qt ≤ R_q or D_qt ≤ R_t,
then relabel the input data whose label information Lab(x_i) equals t as q, i.e. merge class t into class q.
10-3) Perform the operation of step 10-2) on every pair of cluster centers until no clusters can be merged.
10-4) Recompute the cluster centers of the merged classes to obtain the final H (H ≤ M) cluster centers.
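A minimal sketch of the labeling and distance-radius merging of step 10) follows, under the assumption that the radii R_m are computed once from the initial labeling and reused during merging (the text does not say whether they are updated); the function name merge_clusters is chosen for the example.

import numpy as np

def merge_clusters(data, centers):
    # step 10): label each input datum with its nearest cluster center, then
    # merge clusters by the distance-radius rule until no merge is possible
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    M = len(centers)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)                       # Lab(x_i) = index of nearest center
    # 10-1) R_m: largest distance from the m-th center to the data labeled m
    radii = np.array([np.linalg.norm(data[labels == m] - centers[m], axis=1).max()
                      if np.any(labels == m) else 0.0 for m in range(M)])
    merged = True
    while merged:                                           # 10-3) repeat until nothing merges
        merged = False
        live = sorted(set(labels.tolist()))
        for i, q in enumerate(live):
            for t in live[i + 1:]:
                if not np.any(labels == t) or not np.any(labels == q):
                    continue
                d_qt = np.linalg.norm(centers[q] - centers[t])
                if d_qt <= radii[q] or d_qt <= radii[t]:    # 10-2) merge condition
                    labels[labels == t] = q                 # merge class t into class q
                    merged = True
    # 10-4) recompute the centers of the merged classes: the final H centers
    final = sorted(set(labels.tolist()))
    return np.array([data[labels == m].mean(axis=0) for m in final]), labels

The first array returned holds the final H cluster centers (H ≤ M), and the second holds the merged class label of each input datum.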
Fig. 3 shows the initial seed points obtained by the first run of the randomization method, with small circles marking the initial seed points; Fig. 4 shows the clustering result obtained with the original CCL clustering method starting from Fig. 3, with 6 clusters, 553 iterations, and a running time of 1.2 seconds. Fig. 5 shows the initial seed points obtained by a second run of the randomization method, again marked by small circles and different from the seed points in Fig. 3; Fig. 6 shows the clustering result obtained by applying the original CCL clustering method to Fig. 5, with 6 clusters, 568 iterations, and a running time of 1.24 seconds, and this result differs from that of Fig. 4. Fig. 7 shows the initial seed points obtained by the dynamic segmentation method of the present invention; the seed points obtained are the same in every run. Fig. 8 shows the clustering result obtained by the method of the present invention, with 6 clusters, 323 iterations, and a running time of 0.72 seconds. As can be seen from Fig. 7, the initial seed points obtained by maximum-gap segmentation of the dynamic bounding box are closer to the true cluster centers, which speeds up the convergence of the algorithm; initial seed points are also obtained for clusters with sparse data, avoiding the omission caused by randomized seed points. Fig. 9 shows a clustering result obtained with the original CCL algorithm, and Fig. 10 shows the result obtained with the method of the present invention; the circled portion of Fig. 9 exhibits fragmentation, whereas Fig. 10 avoids it, which helps obtain the true clustering result.

Claims (6)

1. A competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box, characterized by comprising the following steps:
1) setting an initial number of clustering categories K;
2) analyzing the N input data and initializing K seed points with the dynamic bounding-box maximum-gap segmentation algorithm, specifically:
2-1) treating the input data as points in a multidimensional space and computing the minimum bounding rectangle that contains all the input data;
2-2) comparing the lengths of all dimensions of the minimum bounding rectangle and selecting the dimension with the maximum length as the segmentation axis;
2-3) projecting all input data points onto the segmentation axis, then sorting the projection points in ascending order;
2-4) computing the distance between each pair of adjacent projection points, selecting the pair with the largest distance as the segmentation position, and dividing the input data into two subsets along the segmentation axis;
2-5) selecting, from all current subsets, the one whose bounding box has the largest volume, performing steps 2-1)-2-4) on it again, and dividing it into two;
2-6) repeating step 2-5) until K subsets are obtained;
2-7) computing the geometric centers of the K subsets obtained and taking them as the initial seed points;
3) setting the win count of each initial seed point to n_k = 1, k = 1, ..., K;
4) for the current input datum x_i, calculating the index function I(j|x_i), where c_p denotes the p-th seed point, r_p denotes the relative win ratio of the p-th seed point,
r_p = n_p / Σ_{j=1}^{K} n_j,
and n_p is the win count of the p-th seed point; finding the seed point that satisfies I(j|x_i) = 1 and marking it as the winning seed point c_w;
5) finding all seed points inside the circle centered on the winning seed point c_w with radius ||c_w - x_i||;
6) updating all seed points in the cooperative group as follows:
c_u^new = c_u^old + η (x_i - c_u^old),
where c_u^old denotes a seed point before the update, c_u^new denotes the updated seed point, and η is the learning-rate parameter;
7) updating the win count of the winning seed point c_w as follows:
n_w^new = n_w^old + 1,
where n_w^old is the win count of c_w before the update and n_w^new is its win count after the update;
8) repeating steps 4)-7) until the seed points no longer change;
9) removing repeated seed points;
10) performing the cluster-merging operation to form the final clustering result:
assuming that after the iteration and the removal of repeated seed points, M seed points remain (M ≤ K), these being called cluster centers and denoted d_m, m = 1 ... M; each input datum is then labeled with the cluster center to which it belongs, and the specific operation of cluster merging is as follows:
10-1) using the label information Lab(x_i) recording the cluster center to which each input datum belongs, computing the radius R_m, m = 1 ... M, that each cluster center covers;
10-2) taking two cluster centers d_q and d_t, q ∈ [1, M], t ∈ [1, M], q < t, and computing the Euclidean distance D_qt between them; if
D_qt ≤ R_q or D_qt ≤ R_t,
then relabeling the input data whose label information Lab(x_i) equals t as q, i.e. merging class t into class q;
10-3) performing the operation of step 10-2) on every pair of cluster centers until no clusters can be merged;
10-4) recomputing the cluster centers of the merged classes to obtain the final H (H ≤ M) cluster centers.
2. The competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box according to claim 1, wherein the initial number of clustering categories K in step 1) is much greater than the actual number of categories K*.
3. The competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box according to claim 1, wherein the learning-rate parameter η in step 6) takes the value 0.001.
4. The competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box according to claim 1, wherein removing repeated seed points in step 9) means that when several seed points converge to the same position, all but one of them are deleted.
5. The competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box according to claim 1, wherein labeling each input datum in step 10) with the cluster center to which it belongs means computing, for every input datum x_i, the nearest cluster center; if x_i is nearest to the s-th cluster center, its label Lab(x_i) is set to s, indicating that it belongs to the s-th cluster center:
Lab(x_i) = s.
6. The competitive cooperative clustering method based on maximum-gap segmentation of a dynamic bounding box according to claim 1, wherein in step 10-1) the radius R_m is computed as follows: find the distances between the m-th cluster center and all input data belonging to it, and take the maximum as the radius R_m.
CN201410419179.4A 2014-08-22 2014-08-22 Competition and cooperation clustering method based on the maximal clearance cutting of dynamic encompassing box Expired - Fee Related CN104143009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410419179.4A CN104143009B (en) 2014-08-22 2014-08-22 Competition and cooperation clustering method based on the maximal clearance cutting of dynamic encompassing box

Publications (2)

Publication Number Publication Date
CN104143009A CN104143009A (en) 2014-11-12
CN104143009B true CN104143009B (en) 2017-03-08

Family

ID=51852183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410419179.4A Expired - Fee Related CN104143009B (en) 2014-08-22 2014-08-22 Competition and cooperation clustering method based on the maximal clearance cutting of dynamic encompassing box

Country Status (1)

Country Link
CN (1) CN104143009B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106293547B (en) * 2015-06-03 2019-05-28 深圳维示泰克技术有限公司 A kind of support automatic generation method for 3D printing
CN106610977B (en) * 2015-10-22 2020-06-26 阿里巴巴集团控股有限公司 Data clustering method and device
CN112381953B (en) * 2020-10-28 2024-04-02 华南理工大学 Quick selection method for three-dimensional space unmanned aerial vehicle cluster
CN114462533B (en) * 2022-02-08 2024-06-25 小视科技(江苏)股份有限公司 Clustered object clustering method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5844991A (en) * 1995-08-07 1998-12-01 The Regents Of The University Of California Script identification from images using cluster-based templates
US6993185B2 (en) * 2002-08-30 2006-01-31 Matsushita Electric Industrial Co., Ltd. Method of texture-based color document segmentation
CN101650838A (en) * 2009-09-04 2010-02-17 浙江工业大学 Point cloud simplification processing method based on resampling method and affine clustering algorithm
CN101853485A (en) * 2010-06-04 2010-10-06 浙江工业大学 Non-uniform point cloud simplification processing method based on neighbor communication cluster type

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Competitive and dynamic cooperative learning clustering analysis algorithm; Li Tao; Journal of Harbin Engineering University; 2010-01-31; full text *

Also Published As

Publication number Publication date
CN104143009A (en) 2014-11-12

Similar Documents

Publication Publication Date Title
CN104881706B (en) A kind of power-system short-term load forecasting method based on big data technology
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
US9798808B2 (en) Data visualization system
CN104143009B (en) Competition and cooperation clustering method based on the maximal clearance cutting of dynamic encompassing box
CN113850281B (en) MEANSHIFT optimization-based data processing method and device
CN105354593B (en) A kind of threedimensional model sorting technique based on NMF
CN106845536B (en) Parallel clustering method based on image scaling
CN101196905A (en) Intelligent pattern searching method
CN113344019A (en) K-means algorithm for improving decision value selection initial clustering center
CN106033425A (en) A data processing device and a data processing method
Sapkota et al. Data summarization using clustering and classification: Spectral clustering combined with k-means using nfph
CN105205135A (en) 3D (three-dimensional) model retrieving method based on topic model and retrieving device thereof
Othmani et al. A novel computer-aided tree species identification method based on burst wind segmentation of 3d bark textures
Bruzzese et al. DESPOTA: DEndrogram slicing through a pemutation test approach
CN116564409A (en) Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN113850811B (en) Three-dimensional point cloud instance segmentation method based on multi-scale clustering and mask scoring
KR20140130014A (en) Method for producing co-occurrent subgraph for graph classification
CN112070787B (en) Aviation three-dimensional point cloud plane segmentation method based on opponent reasoning theory
Wei et al. Efficient distribution-based feature search in multi-field datasets
CN107704872A (en) A kind of K means based on relatively most discrete dimension segmentation cluster initial center choosing method
Ibrahim et al. On feature selection methods for accurate classification and analysis of emphysema ct images
US20170017741A1 (en) Computational modelling for engineering analysis
CN109241628B (en) Three-dimensional CAD model segmentation method based on graph theory and clustering
CN115034690B (en) Battlefield situation analysis method based on improved fuzzy C-means clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170308

Termination date: 20190822

CF01 Termination of patent right due to non-payment of annual fee