WO2015001416A1

WO2015001416A1 - Multi-dimensional data clustering

Info

Publication number: WO2015001416A1
Application number: PCT/IB2014/001262
Authority: WO
Inventors: Diptesh DAS; Aniruddha Sinha; Kingshuk CHAKRAVARTY; Amit Konar
Original assignee: Tata Consultancy Services Limited
Priority date: 2013-07-05
Filing date: 2014-07-03
Publication date: 2015-01-08

Abstract

A method for clustering multi-dimensional data comprises obtaining multi-dimensional data comprising a plurality of data points, each data point having multiple dimensions. Initial memberships are assigned to each dimension, for a plurality of clusters, and one of the initial memberships and modified memberships assigned to the dimensions of each data point is aggregated and induced by a fuzziness control parameter. Based on the aggregation, a cluster center of each cluster is computed, and the square of the distance between each dimension of the cluster center and each dimension of each data point is calculated. Based on the calculation, one of the initial memberships and the modified memberships, assigned to the plurality of dimensions of each data point, is modified, and the fuzziness control parameter is updated. A goodness measurement metric indicative of the significance of each dimension is determined for each dimension based on a comparison of a point cluster index and a dimension cluster index.

Description

MULTI-DIMENSIONAL DATA CLUSTERING

TECHNICAL FIELD

[0001] The present subject matter relates, in general, to data processing and, in particular, to a system and a method for clustering multi-dimensional data.

BACKGROUND

[0002] In recent years, dramatic growth in applications such as Internet search, digital imaging, and video surveillance has created many high-volume, high-dimensional data sets. Most of such data sets are unstructured, adding to the difficulty in managing them. Further, the increase in both the volume and the variety of data requires advances in methodology for clustering of the data.

[0003] Data clustering is a method of grouping data points or objects of a given data that are substantially similar in characteristics into clusters. Generally, each cluster is represented by a geometric centroid of the data points lying in the cluster. Clustering techniques can be applied to data that are quantitative (numerical), qualitative (categorical), or a combination of both. Clustering techniques are mostly unsupervised methods that can be used to organize data into clusters based on similarities among the individual data items. The potential of clustering techniques to reveal the underlying structures in data can be exploited in a wide variety of applications including classification, image processing, data mining, pattern recognition, modelling and identification.

BRIEF DESCRIPTION OF THE DRAWINGS

[0004] The detailed description is described with reference to the accompanying figure(s). In the figure(s), the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figure(s) to reference like features and components. Some embodiments of systems and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figure(s), in which:

[0005] Figure 1a illustrates a network environment implementing a clustering system, according to an embodiment of the present subject matter.

[0006] Figure 1b illustrates comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.

[0007] Figures 2a and 2b illustrate a method for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter.

DETAILED DESCRIPTION

[0008] Various techniques of clustering data have been developed in the past few years. Such conventional techniques partition data comprising a set of data points or objects into two or more clusters based on an iterative two-step process. In the first step, each data point is allocated to the nearest cluster center, and in the second step, cluster centers are determined by identifying a centroid for each of the two or more clusters. The centroid is identified for each partition of the data points allocated to each cluster. Such conventional clustering techniques, however, fail to determine the right clusters for data points that reside marginally at the boundaries of the two or more clusters.

[0009] A few attempts have been made in the past to overcome this limitation of the conventional techniques by incorporating a partial membership, or degree of belongingness, of each data point to a particular cluster. However, such attempts have been unsuccessful for very high dimensional datasets. For a very high dimensional dataset, it is possible that odd parametric values of a few dimensions force the data point away from its actual cluster, and it thus becomes difficult to identify the significance of the dimensions for the data point. Therefore, clustering accuracy is significantly reduced.

[0010] In accordance with the present subject matter, a system and a method for determining significant dimensions for clustering multi-dimensional data are described. The clustering may be understood as partitioning a set of data points of the data into a plurality of clusters, such that the data points that belong to the same cluster are as similar as possible and the data points that belong to different clusters are as dissimilar as possible. The system as described herein is a clustering system.

[0011] Initially, a database for storing multi-dimensional data is maintained according to one implementation. The multi-dimensional data may be representative of multimedia data, financial transactions and the like. According to an implementation, the multi-dimensional data is represented by a plurality of data points in a multi-dimensional space, say n-dimensional space. Further, each of the plurality of data points may include a plurality of dimensions or components. In one example, the multi-dimensional data may be an image and pixels of the image may be the plurality of data points. In said example, the components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value) can be the dimensions. The database can be an external repository associated with the clustering system, or an internal repository within the clustering system.

[0012] The data stored in the database may be retrieved whenever clustering is to be performed. Further, the data contained within such database may be updated, whenever required. For example, new data may be added into the database, existing data can be modified, or non-useful data may be deleted from the database. Although it has been described that a database is maintained to store the multi-dimensional data, it is well appreciated that the multi-dimensional data may be received by the clustering system in real-time to identify significant dimensions and then perform clustering of the multi-dimensional data.

[0013] In one implementation, a membership is assigned to each dimension of each of the plurality of data points for each of a plurality of clusters. The membership assigned to each dimension initially may be interchangeably referred to as initial membership. In said implementation, the plurality of clusters may be pre-defined. A membership assigned to a dimension of a data point may be understood as the strength of association between the dimension of the data point and a particular cluster. In other words, the membership assigned to a dimension of a data point may be understood as the degree of belonging of the dimension to a particular cluster. Each dimension may belong to several clusters simultaneously, with different degrees of membership. Further, the dimensions can be assigned a membership between 0 and 1, indicating their partial memberships. According to an implementation, the memberships can be initialized in a random fashion using a random number between 0 and 1.

[0014] The memberships assigned to the dimensions of the plurality of data points are then aggregated. In one implementation, the memberships may be induced by a fuzziness control parameter (m). The fuzziness control parameter (m) determines the level of cluster fuzziness. A large value of the fuzziness control parameter (m) results in smaller memberships and hence, fuzzier clusters. In an example, the value of the fuzziness control parameter (m) may be 2. Thereafter, a cluster center of each of the plurality of clusters is computed based on the aggregated memberships. A cluster center of a cluster is the average of all data points in the cluster. The computation of the cluster center is explained later in detail (using equation 4).

[0015] Further, the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points is calculated. The calculation of the square of the distance is explained later in detail (using equation 6). In a scenario where the square of the distance is greater than zero, the membership assigned to each dimension of each of the plurality of data points is modified. In another scenario where the square of the distance is equal to zero, the membership assigned to each dimension of each of the plurality of data points is set to 1 for the corresponding cluster and set to zero for the remaining clusters.

[0016] In an implementation, once the cluster centers are computed and the memberships are modified, the fuzziness control parameter (m) is updated to stabilize the cluster centers. The stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) which has the minimum effect on the change of cluster centers due to the change in memberships or membership values. In one implementation, the partial derivative of each dimension of the cluster centers may be taken with respect to the membership degree of each dimension of each of the plurality of data points and then it may be set to zero. In said implementation, the fuzziness control parameter (m) may be updated using an update factor weight (a). For instance, the value of the update factor weight (a) may be 0.95. The fuzziness control parameter (m) may be updated based on a weighted sum of the initial or previous fuzziness control parameter (m) and the modified fuzziness control parameter (m). The computation of the initial or previous fuzziness control parameter (m) and the modified fuzziness control parameter (m) is explained later in detail.

[0017] According to an implementation, the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated in a plurality of iterations. In one implementation, the plurality of iterations terminates when the sum of absolute differences between the modified memberships and the initial or previously updated memberships is less than a pre-defined limit (ε). As an example, the value of the pre-defined limit (ε) may be 0.01. In another implementation, the number of iterations is predefined, and the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated until the predefined number of iterations is exhausted.
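The first termination criterion described above may be sketched as follows. This is a minimal illustration only: the function name `has_converged` and the nested-list layout `mu[i][k][j]` (cluster i, data point k, dimension j) are assumptions of the sketch and are not taken from the present description.

```python
def has_converged(mu_new, mu_old, eps=0.01):
    """Return True when the summed absolute membership change is below eps.

    mu_new and mu_old are nested lists indexed as mu[i][k][j]
    (cluster i, data point k, dimension j); this layout is an assumption.
    """
    total_change = sum(
        abs(new - old)
        for cluster_new, cluster_old in zip(mu_new, mu_old)
        for point_new, point_old in zip(cluster_new, cluster_old)
        for new, old in zip(point_new, point_old)
    )
    return total_change < eps


# One cluster, one data point, two dimensions (hypothetical values).
previous = [[[0.5, 0.5]]]
small_step = [[[0.502, 0.497]]]   # total change of about 0.005
large_step = [[[0.6, 0.5]]]       # total change of about 0.1
```

With the example limit of 0.01, `has_converged(small_step, previous)` would stop the iterations, whereas `has_converged(large_step, previous)` would not.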

[0018] Thereafter, a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters is performed. To perform the hard assignment, a point cluster index is identified for each of the plurality of data points. The point cluster index for a data point may be understood as the index which has the maximum number of 1s in a binary rank matrix for each dimension of the data point. The binary rank matrix is indicative of the membership representation of each dimension of a data point. The membership may be represented in binary notation, i.e., either as 1s or as 0s. In an implementation, if more than one cluster index has an equal number of 1s, then the hard assignment is done based on a membership rank matrix for each of the plurality of data points. A membership rank matrix is indicative of the average membership of the dimensions of a data point for which the binary rank matrix entry is equal to 1.
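By way of illustration only, the identification of the point cluster index for a single data point may be sketched as below. The sketch assumes that the binary rank matrix is held as `binary_rank[i][j]` (cluster i, dimension j) and that the membership rank values are already available as one average per cluster; the function name, the data layout, and the zero-based indices are assumptions of the sketch.

```python
def point_cluster_index(binary_rank, membership_rank):
    """Pick the cluster with the most 1s; break ties by membership rank.

    binary_rank[i][j] is 1 if dimension j is ranked into cluster i, else 0.
    membership_rank[i] is the average membership of the dimensions of the
    data point whose binary rank entry for cluster i equals 1.
    Indices are zero-based in this sketch.
    """
    counts = [sum(row) for row in binary_rank]
    most_ones = max(counts)
    tied = [i for i, count in enumerate(counts) if count == most_ones]
    # With a single candidate this simply returns it; otherwise the
    # candidate with the highest membership rank is selected.
    return max(tied, key=lambda i: membership_rank[i])


# Hypothetical data point with four dimensions and two clusters.
tie_rank = [[1, 0, 1, 0],
            [0, 1, 0, 1]]
tie_index = point_cluster_index(tie_rank, [0.75, 0.70])

clear_rank = [[0, 0, 1],
              [1, 1, 0]]
clear_index = point_cluster_index(clear_rank, [0.6, 0.8])
```

In the first case both clusters hold two 1s, so the higher membership rank decides; in the second case the count of 1s alone decides.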

[0019] According to an implementation, the hard assignment of each of the dimensions of the plurality of data points to the cluster centers of the plurality of clusters may also be performed by identifying a dimension cluster index for each dimension of each of the plurality of data points. The dimension cluster index for a dimension of a data point may be understood as the index for which the binary rank matrix entry for the dimension of the data point is 1.

[0020] Thereafter, a measurement metric is determined for each dimension of each of the plurality of data points. The measurement metric may be understood as a goodness measure for each dimension of a data point. In an implementation, the measurement metric can be used for performing dimensionality reduction. Dimensionality reduction can be performed by tracking the dimensions which follow data points well as compared to other dimensions. If the value of goodness measure is high then a dimension follows the data points well indicating the significance of the dimension in the process of clustering. For instance, if value of goodness measure is high then the significance of the dimension for the process of clustering is also very high. Thus, dimensionality reduction can be performed by selecting a set of dimensions that have higher values of goodness measure. The set of dimensions that have higher goodness measure can be used for clustering the n-dimensional data.

[0021] According to an implementation, the measurement metric for each dimension of each of the plurality of data points may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points. The measurement metric may be interchangeably referred to as a goodness measurement metric. The measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. Otherwise, the measurement metric is equal to 0.
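The comparison described above may be sketched as follows. The per-dimension score, computed here as the fraction of data points for which the metric equals 1, is an assumption of this sketch; the description states only that dimensions with higher values of the goodness measure are selected. All names and the zero-based indices are likewise illustrative.

```python
def goodness_metric(point_indices, dimension_indices):
    """g[k][j] is 1 if dimension j of point k shares the point's cluster."""
    return [
        [1 if dim_index == point_index else 0 for dim_index in dims]
        for point_index, dims in zip(point_indices, dimension_indices)
    ]


def dimension_scores(metric):
    """Fraction of data points for which each dimension's metric is 1.

    Averaging over data points is an assumption of this sketch.
    """
    num_points = len(metric)
    num_dims = len(metric[0])
    return [sum(row[j] for row in metric) / num_points for j in range(num_dims)]


# Three hypothetical data points with three dimensions each.
point_idx = [0, 1, 1]                        # point cluster index per point
dim_idx = [[0, 0, 1], [1, 1, 0], [1, 0, 0]]  # dimension cluster indices
metric = goodness_metric(point_idx, dim_idx)
scores = dimension_scores(metric)
```

Here the first dimension agrees with the point cluster index for every data point and the third never does, so a dimensionality reduction step under these assumptions would retain the first dimension before the others.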

[0022] According to the present subject matter, each dimension of the data points independently contributes to the process of determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the corresponding dimension of the cluster center is significantly minimized and, as a result, the accuracy of clustering of the input data improves significantly. Further, the cluster assignment of a data point made by considering the highest aggregated membership for a cluster can also be ascertained from the set of dimensions having higher membership in that cluster. Although the independent contribution of the dimensions to the determination of the cluster center results in a fluctuation in the cluster centers, the fluctuation can be stabilized by membership aggregation and by adapting the fuzziness control parameter (m) towards its convergence. Therefore, the clustering of the data is performed reliably and accurately.

[0023] Figure 1a illustrates a network environment 100 implementing a clustering system 102, in accordance with an embodiment of the present subject matter.

[0024] In one implementation, the network environment 100 can be a public network environment, including thousands of personal computers, laptops, various servers, such as blade servers, and other computing devices. In another implementation, the network environment 100 can be a private network environment with a limited number of computing devices, such as personal computers, servers, and laptops.

[0025] The clustering system 102 may be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. Further, it will be understood that the clustering system 102 is connected to a plurality of user devices 104-1, 104-2, 104-3..., and 104-N, collectively referred to as user devices 104 and individually referred to as a user device 104. As shown in Figure 1a, the user devices 104 are communicatively coupled to the clustering system 102 over a network 106 through one or more communication links, enabling one or more end users to access and operate the clustering system 102. The user device 104 may include, but is not limited to, a desktop computer, a portable computer, a handheld computing device, and a workstation.

[0026] In one implementation, the network 106 may be a wireless network, a wired network, or a combination thereof. The network 106 may also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet. The network 106 may be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other. Further, the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.

[0027] The network environment 100 further comprises a database 108 communicatively coupled to the clustering system 102. The database 108 may store multi-dimensional data. The data may be representative of multimedia data, financial transactions and the like. According to an implementation, the data is represented as a plurality of data points in a multi-dimensional space, say n-dimensional space. Although the database 108 is shown external to the clustering system 102, it will be appreciated by a person skilled in the art that the database 108 can also be implemented internal to the clustering system 102, where the multi-dimensional data may be stored within a memory component of the clustering system 102.

[0028] According to an implementation, the clustering system 102 includes processor(s) 110, interface(s) 112, and memory 114 coupled to the processor(s) 110. The processor(s) 110 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 110 may be configured to fetch and execute computer-readable instructions stored in the memory 114.

[0029] The memory 114 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM), and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.

[0030] Further, the interface(s) 112 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a product board, a mouse, an external memory, and a printer. Additionally, the interface(s) 112 may enable the clustering system 102 to communicate with other devices, such as web servers and external repositories. The interface(s) 112 may also facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. For the purpose, the interface(s) 112 may include one or more ports.

[0031] The clustering system 102 also includes module(s) 116 and data 118. The module(s) 116 include, for example, an assignment module 120, a modification module 122, an identification module 124, and a determination module 126, and other module(s) 128. The other module(s) 128 may include programs or coded instructions that supplement applications or functions performed by the clustering system 102. The data 118 may be membership data 130, index data 132, and other data 134. The other data 134, amongst other things, may serve as a repository for storing data that is processed, received, or generated as a result of the execution of one or more modules in the module(s) 116.

[0032] According to an implementation, the assignment module 120 of the clustering system 102 may retrieve the multi-dimensional data from the database 108. For instance, the multi-dimensional data may be an n-dimensional data. As indicated earlier, the multi-dimensional data may be represented by a plurality of data points in a multi-dimensional space, say n-dimensional space. Further, each of the plurality of data points may include a plurality of dimensions or components. The n-dimensional data is mathematically represented by the expression provided below:

$X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N\}, \quad \vec{x}_k = (x_k^1, x_k^2, \ldots, x_k^n)$

.... (1)

[0033] In the above expression, (X) represents the n-dimensional data of size N. In one example, the multi-dimensional data may be an image and pixels of the image may be the plurality of data points. In said example, the components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value) may be the dimensions.
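As a small illustration of the image example, the pixels of an RGB image may be turned into three-dimensional data points whose dimensions are hue, saturation and value. The sketch below uses the `colorsys` module of the Python standard library; the function name, the assumption of 8-bit RGB input, and the sample pixels are hypothetical.

```python
import colorsys


def pixels_to_hsv_points(rgb_pixels):
    """Convert 8-bit (R, G, B) pixels into [H, S, V] data points in [0, 1]."""
    points = []
    for r, g, b in rgb_pixels:
        # colorsys expects floating-point components between 0 and 1.
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        points.append([h, s, v])
    return points


# Two hypothetical pixels: pure red and a mid grey.
points = pixels_to_hsv_points([(255, 0, 0), (128, 128, 128)])
```

Each resulting data point has n = 3 dimensions, and an image with N pixels yields N such data points.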

[0034] In an example, the clustering system 102 may partition a two-dimensional image or a three-dimensional (3D) image into two or more clusters. In said example, an image of 481 by 321 pixels is taken and is transformed from the RGB plane into the HSV plane. Each data point includes three components, i.e., Hue (H), Saturation (S) and Value (V). These components are clustered into three clusters based on the HSV values of the background, the subject's skin, and the dress color of the subject in the image. Therefore, in this case, the total number of data points (N) is 154401 (481 x 321), the total number of dimensions (n) is 3, and the total number of clusters (c) is 3. The performance of this segmentation is assessed in terms of the number of points that originally belong to the subject but are misclassified as the background. It is experimentally established that the misclassification error produced by the traditional clustering system is 16.82 %, whereas the same for the present clustering system 102 is 10.34 %. Therefore, the present clustering system 102 outperforms the traditional clustering system by reducing the misclassification error. Though the present clustering system 102 is validated experimentally using 2D image data, the same can be applied to more generic image segmentation problems, including segmentation of 3D image cloud points. Figure 1b depicts the original image (140) and the clustered outputs generated by the traditional clustering system and by the present clustering system 102.

[0035] Thereafter, the assignment module 120 may assign a membership to each dimension of each of the plurality of data points for each of a plurality of clusters. The membership assigned to each dimension may be interchangeably referred to as initial membership. In an implementation, the number of clusters may be pre-defined depending upon the application. A membership assigned to a dimension of a data point may be understood as the strength of association between the dimension of the data point and a particular cluster. In other words, the membership assigned to a dimension of a data point may be understood as the degree of belonging of the dimension to a particular cluster. Each dimension may belong to several clusters simultaneously, with different degrees of membership.

[0036] According to an implementation, the assignment module 120 may initialize the membership of each dimension in a random fashion using a random number ranging between 0 and 1, indicating partial membership. The membership assigned to the dimensions is mathematically represented by the expression provided below:

$\mu_{A_i}(x_k^j), \quad 1 \le j \le n, \ 1 \le k \le N, \ 1 \le i \le c$

.... (2)

[0037] In the above expression, $x_k^j$ denotes the j^th dimension of the k^th data point and $\mu_{A_i}(x_k^j)$ denotes the membership of $x_k^j$ to belong to the i^th cluster. Further, (N) is the size of the n-dimensional data and (c) is the number of clusters for the n-dimensional data.
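A minimal sketch of the random initialization is given below. The nested-list layout `mu[i][k][j]`, the function name, and the optional seed are assumptions of the sketch. No normalization of the memberships across clusters is applied, since none is stated for the initialization; the standard-library generator returns values in the half-open interval from 0 (inclusive) to 1 (exclusive).

```python
import random


def init_memberships(num_points, num_dims, num_clusters, seed=None):
    """Return mu[i][k][j]: membership of dimension j of point k in cluster i.

    Every entry is an independent random number in [0, 1).
    """
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
        for _ in range(num_clusters)
    ]


# Hypothetical sizes: N = 4 data points, n = 3 dimensions, c = 2 clusters.
mu = init_memberships(num_points=4, num_dims=3, num_clusters=2, seed=0)
```

Passing a seed only makes the sketch reproducible; it is not part of the described method.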

[0038] In one implementation, the memberships assigned to the dimensions by the assignment module 120 may be stored as the membership data 130 within the clustering system 102.

[0039] Further, the assignment module 120 aggregates the memberships assigned to each dimension of each of the plurality of data points. In one implementation, the assignment module 120 may aggregate the memberships using a fuzziness control parameter (m). The fuzziness control parameter (m) determines the level of fuzziness in a cluster. A large value of the fuzziness control parameter (m) results in smaller memberships and hence, fuzzier clusters. In an example, the value of the fuzziness control parameter (m) may be 2. According to one implementation, the assignment module 120 aggregates the memberships using equation (3) provided below:

$\mu_{A_i}^{agg}(\vec{x}_k) = \frac{1}{n} \sum_{j=1}^{n} \left[ \mu_{A_i}(x_k^j) \right]^m, \quad 1 \le k \le N, \ 1 \le i \le c$

.... (3)

where,

$\mu_{A_i}^{agg}(\vec{x}_k)$ represents the aggregated membership of $\vec{x}_k$ to belong to the i^th cluster,

$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the i^th cluster; and

m represents the fuzziness control parameter.
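The aggregation step may be sketched as follows, under the assumption that the aggregated membership of a data point to a cluster is the mean, over its n dimensions, of the per-dimension memberships raised to the power m. The normalization by n, the function name, and the `mu[i][k][j]` layout are assumptions of this sketch.

```python
def aggregate_memberships(mu, m=2.0):
    """agg[i][k] = mean over dimensions j of mu[i][k][j] ** m.

    mu is indexed as mu[i][k][j] (cluster i, data point k, dimension j).
    """
    return [
        [sum(u ** m for u in dims) / len(dims) for dims in cluster]
        for cluster in mu
    ]


# Two clusters, two data points, two dimensions (hypothetical values).
mu = [
    [[0.5, 1.0], [0.2, 0.4]],  # memberships to cluster 0
    [[0.5, 0.0], [0.8, 0.6]],  # memberships to cluster 1
]
agg = aggregate_memberships(mu, m=2.0)
```

Because every membership lies between 0 and 1, raising it to a larger m yields a smaller value, which matches the statement that a large m results in smaller memberships.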

[0040] Based on the aggregated memberships, the modification module 122 computes a cluster center of each of the plurality of clusters. A cluster center of a cluster may be understood as the average of all data points in the cluster. According to an implementation, the modification module 122 computes a cluster center using equation (4) provided below:

$v_i^j = \frac{\sum_{k=1}^{N} \mu_{A_i}^{agg}(\vec{x}_k) \, x_k^j}{\sum_{k=1}^{N} \mu_{A_i}^{agg}(\vec{x}_k)}, \quad 1 \le j \le n, \ 1 \le i \le c$

.... (4)

where,

$v_i^j$ represents the j^th dimension of the i^th cluster center,

$x_k^j$ represents the j^th dimension of the k^th data point, and

$\mu_{A_i}^{agg}(\vec{x}_k)$ represents the aggregated membership of $\vec{x}_k$ to belong to the i^th cluster.
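Assuming that each cluster center is the average of the data points weighted by their aggregated memberships, the computation may be sketched as below. The function name and data layout are illustrative, and the sketch does not guard against a cluster whose aggregated memberships are all zero.

```python
def cluster_centers(points, agg):
    """centers[i][j] = sum_k agg[i][k] * points[k][j] / sum_k agg[i][k].

    points[k][j] is dimension j of data point k; agg[i][k] is the
    aggregated membership of data point k to cluster i.
    """
    num_dims = len(points[0])
    centers = []
    for weights in agg:
        total = sum(weights)  # assumed non-zero in this sketch
        centers.append([
            sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(num_dims)
        ])
    return centers


# Two hypothetical data points and two clusters.
points = [[0.0, 0.0], [2.0, 4.0]]
agg = [[1.0, 1.0],   # cluster 0 weights both points equally
       [3.0, 1.0]]   # cluster 1 favours the first point
centers = cluster_centers(points, agg)
```

With equal weights the center is the plain mean of the points, while unequal weights pull the center towards the data point with the higher aggregated membership.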

[0041] In an implementation, the modification module 122 may initially compute the cluster center of each of the plurality of clusters using equation (5) provided below and then compute the new cluster center of each of the plurality of clusters using equation (4) provided above:

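$v_i^j = \frac{\sum_{k=1}^{N} \left[ \mu_{A_i}(x_k^j) \right]^m x_k^j}{\sum_{k=1}^{N} \left[ \mu_{A_i}(x_k^j) \right]^m}, \quad 1 \le j \le n, \ 1 \le i \le c$

.... (5)

where,

$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the i^th cluster; and

m represents the fuzziness control parameter.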

[0042] The modification module 122 may then calculate the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points. The square of the distance calculated between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points is mathematically represented by the expression provided below:

$(x_k^j - v_i^j)^2, \quad 1 \le j \le n, \ 1 \le k \le N, \ 1 \le i \le c$

.... (6)

[0043] In the above expression, $(x_k^j - v_i^j)^2$ denotes the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points, $x_k^j$ denotes the j^th dimension of the k^th data point and $v_i^j$ denotes the j^th dimension of the i^th cluster center.

[0044] Further, the modification module 122 determines a modified membership for each dimension of each of the plurality of data points. In an implementation, the modified membership is determined based on modifying the initial membership assigned to each dimension based on the cluster center of each of the plurality of clusters. In said implementation, the modification module 122 determines the modified membership based on equation 7 (provided below).

[0045] According to an implementation, the modification module 122 modifies the membership based on the calculation of the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points. For instance, if $(x_k^j - v_i^j)^2 > 0$, then the membership assigned to each dimension is modified. In another instance, if $(x_k^j - v_i^j)^2 = 0$, then the membership assigned to each dimension of each of the plurality of data points is set to 1 for the corresponding cluster and set to zero for the remaining clusters.

[0046] According to an implementation, the modification module 122 modifies the memberships using equation (7) provided below:

$\mu_{A_i}(x_k^j) = \left[ \sum_{l=1}^{c} \left( \frac{(x_k^j - v_i^j)^2}{(x_k^j - v_l^j)^2} \right)^{\frac{1}{m-1}} \right]^{-1}$

.... (7)

where,

$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the i^th cluster,

$(x_k^j - v_i^j)^2$ represents the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points, and

m represents the fuzziness control parameter.
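The membership modification of paragraphs [0045] and [0046] may be sketched as below. The sketch assumes a per-dimension update of the conventional fuzzy c-means form, in which the membership is the reciprocal of a sum of squared-distance ratios raised to the power 1/(m - 1); that form, the function name, and the rule that the first coinciding cluster takes the full membership when several centers coincide are assumptions of the sketch.

```python
def update_memberships(points, centers, m=2.0):
    """Return mu[i][k][j] from per-dimension squared distances.

    points[k][j] is dimension j of data point k; centers[i][j] is
    dimension j of cluster center i. Requires m > 1.
    """
    num_clusters, num_dims = len(centers), len(points[0])
    mu = [[[0.0] * num_dims for _ in points] for _ in range(num_clusters)]
    exponent = 1.0 / (m - 1.0)
    for k, point in enumerate(points):
        for j in range(num_dims):
            d2 = [(point[j] - centers[i][j]) ** 2 for i in range(num_clusters)]
            coinciding = [i for i in range(num_clusters) if d2[i] == 0.0]
            if coinciding:
                # Zero squared distance: 1 for that cluster, 0 for the rest.
                mu[coinciding[0]][k][j] = 1.0
                continue
            for i in range(num_clusters):
                ratio_sum = sum((d2[i] / d2[l]) ** exponent
                                for l in range(num_clusters))
                mu[i][k][j] = 1.0 / ratio_sum
    return mu


# One hypothetical data point, two clusters, two dimensions.
points = [[1.0, 0.0]]
centers = [[0.0, 0.0], [3.0, 4.0]]
mu = update_memberships(points, centers, m=2.0)
```

In this example the first dimension is closer to the first center and receives the larger membership there, while the second dimension coincides with the first center and is therefore assigned memberships of 1 and 0.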

[0047] In an implementation, the modification module 122 computes the cluster center of each of the plurality of clusters and modifies the memberships using equation (8) provided below:

$J = \sum_{i=1}^{c} \sum_{k=1}^{N} \sum_{j=1}^{n} \left[ \mu_{A_i}(x_k^j) \right]^m (x_k^j - v_i^j)^2 - \sum_{k=1}^{N} \sum_{j=1}^{n} \lambda_k^j \left( \sum_{i=1}^{c} \mu_{A_i}(x_k^j) - 1 \right)$

.... (8)

where,

$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the i^th cluster;

$(x_k^j - v_i^j)^2$ represents the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points,

$\lambda_k^j$ represents Lagrange's multiplier, and

m represents the fuzziness control parameter.

[0048] In said implementation, partial derivative of equation (8) is taken with respect to memberships, cluster centers, and Lagrange's multiplier to obtain equation (5) and equation (7).

[0049] Since the dimensions of each data point independently contribute to the determination of the cluster centers of the plurality of clusters, they may cause fluctuation in the cluster centers. To stabilize the fluctuation in the cluster centers, the modification module 122 aggregates the memberships or membership values and adapts the fuzziness control parameter (m) towards its convergence, i.e., the fuzziness control parameter (m) is placed in the less sensitive region of the cluster centers.

[0050] In one implementation, the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers. The stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) which has the minimum effect in the change of cluster centers due to the change in membership values.

[0051] According to an implementation, the modification module 122 updates the fuzziness control parameter (m) based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m). In said implementation, the modification module 122 takes partial derivative of each dimension of the cluster centers with respect to membership or membership degree of each dimension of each of the plurality of data points and then it may be set to zero.

[0052] The modification module 122 selects the value of the fuzziness control parameter (m) such that it gives minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points. In an implementation, the modification module 122 may update the fuzziness control parameter (m) using an update factor weight (a). For instance, the value of the update factor weight (a) may be 0.95.

[0053] In an implementation, the modification module 122 updates the fuzziness control parameter (m) using equation (9) and (10) provided below:

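[0054]

$$m_{modified} = \arg\min_{m} \left| \sum_{k=1}^{N} \sum_{i=1}^{c} \sum_{j=1}^{n} \frac{\partial v_i^j}{\partial \mu_{A_i}(x_k^j)} \right|, \quad 1.1 < m < 2.5$$ .... (9)

$$m_{new} = a \times m + (1 - a) \times m_{modified}$$ .... (10)

where,

\mu_{A_i}(x_k^j) represents membership of x_k^j to belong to the i^th cluster;

the summand is the partial derivative of the j^th dimension of the i^th cluster center with respect to that membership;

m represents the initial or previous fuzziness control parameter (m);

m_modified represents the modified fuzziness control parameter (m), where m_modified is calculated using equation (9);

a is the weight factor; and

m_new represents the updated fuzziness control parameter (m).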
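The update described by equations (9) and (10) can be sketched in Python as follows. This is a minimal illustration and not the implementation of the modification module 122: the function names, the grid search over the open interval (1.1, 2.5), the number of grid steps, and the caller-supplied `accumulated_derivative` callable (which stands in for the triple sum of equation (9)) are all assumptions made for the example.

```python
from typing import Callable


def select_m_modified(
    accumulated_derivative: Callable[[float], float],
    lower: float = 1.1,
    upper: float = 2.5,
    steps: int = 139,
) -> float:
    """Pick the m in the open interval (lower, upper) with the smallest
    absolute accumulated derivative, in the spirit of equation (9).

    `accumulated_derivative` is a hypothetical callable returning the
    derivative summed over all data points, clusters, and dimensions.
    A uniform grid search is an assumption; the source does not say how
    the minimum is located.
    """
    width = (upper - lower) / (steps + 1)
    candidates = [lower + width * (s + 1) for s in range(steps)]
    return min(candidates, key=lambda m: abs(accumulated_derivative(m)))


def update_m(m_previous: float, m_modified: float, a: float = 0.95) -> float:
    """Weighted sum of the previous and modified parameters, equation (10)."""
    return a * m_previous + (1.0 - a) * m_modified
```

With a = 0.95, a previous value of 2.0 and a modified value of 1.5 give 0.95 × 2.0 + 0.05 × 1.5 = 1.975, so the parameter moves only a small step per iteration.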

[0055] According to an implementation, the modification module 122 may update the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) in a plurality of iterations until a sum of absolute difference between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε). For example, the value of the pre-defined limit (ε) may be 0.01. In an implementation, the number of iterations is predefined, and in said implementation, the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.

[0056] Thereafter, the identification module 124 of the clustering system 102 calculates a binary rank matrix for each dimension of each of the plurality of data points. The binary rank matrix is indicative of membership representation of each dimension of a data point. The membership may be represented in terms of binary notation, i.e., either as 1s or as 0s. In an implementation, the matrix dimension of a binary rank matrix is equal to ratio of total number of clusters to total number of dimensions of a data point. The identification module 124 may assign a value of 1 to that cluster which corresponds to maximum value of membership and all other clusters are assigned a value of 0. Further, the identification module 124 computes a membership rank matrix for each of the plurality of data points. The membership rank matrix may be indicative of average membership of dimensions of a data point for which binary rank matrix entry is equal to 1.
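The two matrices can be illustrated for a single data point with a short Python sketch. It is offered under stated assumptions only: memberships are held as a list of rows, one row per cluster and one column per dimension; ties in the maximum are resolved in favor of the lowest cluster index; and a cluster that wins no dimension is given a membership rank of 0.0. None of these choices, nor the function names, come from the specification.

```python
from typing import List


def binary_rank_matrix(memberships: List[List[float]]) -> List[List[int]]:
    """memberships[i][j] is the membership of dimension j to cluster i.

    For every dimension, the cluster with the largest membership gets 1
    and every other cluster gets 0.
    """
    n_clusters = len(memberships)
    n_dims = len(memberships[0])
    binary = [[0] * n_dims for _ in range(n_clusters)]
    for j in range(n_dims):
        # Assumption: on a tie, max() keeps the lowest cluster index.
        best = max(range(n_clusters), key=lambda i: memberships[i][j])
        binary[best][j] = 1
    return binary


def membership_rank(
    memberships: List[List[float]], binary: List[List[int]]
) -> List[float]:
    """Per cluster, average the memberships of the dimensions whose
    binary rank entry is 1 (0.0 if the cluster wins no dimension)."""
    ranks = []
    for mu_row, bin_row in zip(memberships, binary):
        winners = [mu for mu, flag in zip(mu_row, bin_row) if flag == 1]
        ranks.append(sum(winners) / len(winners) if winners else 0.0)
    return ranks
```

For a point with three dimensions and two clusters, memberships `[[0.7, 0.2, 0.6], [0.3, 0.8, 0.4]]` yield the binary rank matrix `[[1, 0, 1], [0, 1, 0]]`, because the first cluster holds the larger membership in the first and third dimensions.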

[0057] Based on the binary rank matrix and the membership rank matrix, the identification module 124 performs a hard assignment of each of the dimensions of each of the plurality of data points to the cluster centers of the plurality of clusters. To perform the hard assignment, the identification module 124 identifies a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each data point to the cluster centers of the clusters. The point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point. The point cluster index for the data point is mathematically represented by the expression provided below:

$$C_k^{data} = \arg\max_{i} \left( \sum_{j=1}^{n} M_{ij}(x_k) \right), \quad 1 \le k \le N, \; 1 \le i \le c$$ .... (11)

[0058] In the above expression, C_k^{data} denotes the point cluster index of the k^th data point, which has the maximum number of 1s in the binary rank matrix, and M_{ij}(x_k) denotes the binary rank matrix.

[0059] In an implementation, if more than one cluster index has an equal number of 1s, or has the same value of the binary rank matrix, then the hard assignment is done based on the membership rank matrix for each of the plurality of data points. The point cluster index for this case is mathematically represented by the expression provided below:

$$C_k^{data} = \arg\max_{i} \left( U_{A_i}(x_k) \right), \quad 1 \le k \le N, \; 1 \le i \le c$$ .... (12)

[0060] In the above expression, U_{A_i}(x_k) denotes the membership rank matrix.
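The selection rule of expressions (11) and (12) can be sketched in Python for one data point. The sketch takes a binary rank matrix (one row per cluster, one column per dimension) and the per-cluster membership rank values as plain lists. The function name and the zero-based indexing are illustrative, and restricting the tie-break to the tied clusters is an assumption: expression (12) as written ranges over all clusters.

```python
from typing import List


def point_cluster_index(binary: List[List[int]], ranks: List[float]) -> int:
    """Return the zero-based cluster index of a data point.

    binary[i][j] is the binary rank entry of dimension j for cluster i;
    ranks[i] is the membership rank value of cluster i.
    """
    # Expression (11): count the 1s each cluster collects over the dimensions.
    counts = [sum(row) for row in binary]
    top = max(counts)
    tied = [i for i, count in enumerate(counts) if count == top]
    if len(tied) == 1:
        return tied[0]
    # Expression (12): break the tie with the membership rank values.
    # Assumption: only the tied clusters are compared.
    return max(tied, key=lambda i: ranks[i])
```

When one cluster wins two of three dimensions it is chosen outright; when two clusters win one dimension each, the cluster with the higher membership rank value is chosen.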

[0061] According to an implementation, the identification module 124 also performs hard assignment to assign each of the plurality of data points to cluster centers of the plurality of clusters based on identifying a dimension cluster index for each dimension of each of the plurality of data points. The dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.

[0062] In an implementation, the dimension cluster index for a dimension of a data point is mathematically represented by the expression provided below:

$$C_k^{j} = \arg_{i} \left( M_{ij}(x_k) = 1 \right), \quad 1 \le k \le N, \; 1 \le i \le c, \; 1 \le j \le n$$ .... (13)

[0063] In the above expression, C_k^{j} denotes the dimension cluster index of the j^th dimension of the k^th data point and M_{ij}(x_k) denotes the binary rank matrix. In one implementation, the point cluster index and the dimension cluster index identified by the identification module 124 may be stored as the index data 132 within the clustering system 102.

[0064] Thereafter, the determination module 126 determines a measurement metric for each dimension of each of the plurality of data points. The measurement metric may be interchangeably referred to as a goodness measurement metric. The measurement metric may be understood as a goodness measure (G_k^j) for each dimension of a data point. If the value of the goodness measure (G_k^j) is high, then a dimension follows the data points well, indicating the significance of the dimension in the process of clustering. The value of the measurement metric for each dimension accumulated over all the data points provides the measure of significance of the dimension. For instance, if the value of the measurement metric is high, then the measure of significance of the dimension may also be high. In one implementation, a set of dimensions that have a higher goodness measure (G_k^j) may be selected to be used for clustering the n-dimensional data.

[0065] According to an implementation, the determination module 126 may determine the measurement metric for each dimension of each of the plurality of data points based on comparison of the point cluster index and the dimension cluster index. The measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. If the point cluster index and the dimension cluster index are not equal, then the measurement metric is equal to 0.
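The comparison described above can be sketched in Python. The sketch assumes a binary rank matrix held as a list of rows (one per cluster, one column per dimension) with exactly one 1 in each column, zero-based cluster indices, and illustrative function names that do not appear in the specification.

```python
from typing import List


def dimension_cluster_indices(binary: List[List[int]]) -> List[int]:
    """For each dimension, return the cluster whose binary rank entry is 1."""
    n_dims = len(binary[0])
    return [
        next(i for i, row in enumerate(binary) if row[j] == 1)
        for j in range(n_dims)
    ]


def goodness(point_index: int, dimension_indices: List[int]) -> List[int]:
    """1 where a dimension's cluster index equals the point's, else 0."""
    return [1 if d == point_index else 0 for d in dimension_indices]


def dimension_significance(goodness_per_point: List[List[int]]) -> List[int]:
    """Accumulate the goodness measure of each dimension over all points."""
    return [sum(column) for column in zip(*goodness_per_point)]
```

For a point assigned to the first cluster whose dimensions point to clusters `[0, 1, 0]`, the goodness values are `[1, 0, 1]`: the first and third dimensions agree with the point-level assignment and the second does not.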

[0066] According to the present subject matter, each dimension of the data points independently contributes to the process of determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the cluster center of that dimension of the data point is significantly minimized and, as a result, accuracy of clustering of input data improves significantly. Further, the cluster assignment of a data point, made by considering the highest aggregated membership for a cluster, can also be ascertained from the set of dimensions having higher membership to belong to that cluster. Although independent contribution of the dimensions to determine the cluster center results in a fluctuation in the cluster centers, the fluctuation can be stabilized by membership aggregation and by adapting the fuzziness control parameter (m) towards its convergence. Therefore, the clustering of the data is performed reliably and accurately.

[0067] Figure 1b illustrates comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.

[0068] As shown in figure 1b, image 140 is an original image that is to be segmented. Further, image 142 is the segmented image that is obtained as a result of image segmentation performed by the conventional clustering system, and image 144 is the segmented image that is obtained as a result of the image segmentation performed by the present clustering system 102, i.e., the clustering system described in accordance with the present subject matter. The performance of the segmentation process is assessed in terms of the number of data points that originally belong to the subjects but are misclassified as the background. As can be seen in figure 1b, the present clustering system 102 outperforms the conventional clustering system by minimizing the misclassification error.

[0069] It is to be noted that the original image 140 as depicted in figure 1b for the image segmentation experiment has been taken from the source "P. Arbelaez, M. Maire, C. Fowlkes and J. Malik, "Contour Detection and Hierarchical Image Segmentation", IEEE TPAMI, Vol. 33, No. 5, pp. 898-916, May 2011."

[0070] Figures 2a and 2b illustrate a method 200 for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter. The method 200 is implemented in a computing device, such as the clustering system 102. The method 200 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network.

[0071] The order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method. Furthermore, the method 200 can be implemented in any suitable hardware, software, firmware or combination thereof.

[0072] At block 202, the method 200 includes obtaining multi-dimensional data, where the multi-dimensional data includes a plurality of data points. In an example, the multi-dimensional data may be n-dimensional data represented by a plurality of data points. Further, each of the plurality of data points may include a plurality of dimensions. In one example, the multi-dimensional data may be multimedia data, say an image. Thus, the data points of the image can be the pixels, and the components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value), may be the dimensions. In an implementation, the assignment module 120 of the clustering system 102 may obtain the multi-dimensional data from the database 108.

[0073] At block 204, the method 200 includes assigning initial memberships to each dimension of each of the plurality of data points for a plurality of clusters. In an implementation, the plurality of clusters may be pre-defined. A membership assigned to a dimension of a data point may be understood as the strength of association between the dimension of the data point and a particular cluster. In one implementation, the memberships can be assigned in a random fashion using a random number between 0 and 1. The assignment module 120 of the clustering system 102 assigns a membership to each dimension of each of the plurality of data points for the plurality of clusters.
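A random initialization of this kind might look as follows in Python. The nested-list layout (point, then cluster, then dimension), the function name, and the optional seed are assumptions for illustration. The sketch applies no normalization across clusters, since this paragraph does not describe one.

```python
import random
from typing import List, Optional


def init_memberships(
    n_points: int, n_clusters: int, n_dims: int, seed: Optional[int] = None
) -> List[List[List[float]]]:
    """Return memberships[k][i][j]: a random value in [0, 1) for
    dimension j of data point k with respect to cluster i."""
    rng = random.Random(seed)  # seed only makes the sketch repeatable
    return [
        [[rng.random() for _ in range(n_dims)] for _ in range(n_clusters)]
        for _ in range(n_points)
    ]
```

For an RGB image with 1,000 pixels and 4 clusters, `init_memberships(1000, 4, 3)` would produce one membership per pixel, cluster, and color component.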

[0074] At block 206, the method 200 includes aggregating the initial memberships assigned to the dimensions of the plurality of data points. The aggregation of the memberships may be induced by the fuzziness control parameter (m). The fuzziness control parameter (m) determines the level of fuzziness in a cluster. In an example, the value of the fuzziness control parameter (m) may be 2. According to one implementation, the assignment module 120 aggregates the memberships based on the equation (3) described in the previous section.

[0075] At block 208, the method 200 includes computing a cluster center of each of the plurality of clusters based on the aggregated memberships. A cluster center of a cluster may be understood as average of all data points in the cluster. According to an implementation, the modification module 122 computes a cluster center based on the equation (4) described in the previous section.

[0076] At block 210, the method 200 includes calculating square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points. According to an implementation, the modification module 122 calculates a square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points.

[0077] At block 212, the method 200 includes modifying the initial memberships assigned to the dimensions of each of the plurality of data points. For instance, if the square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points is greater than 0, then the membership assigned to each dimension is modified. In another instance, if the square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points is equal to 0, then the membership is set to 1 for the corresponding cluster and set to 0 for the rest of the clusters. In an implementation, the modification module 122 modifies the membership assigned to each dimension of each of the plurality of data points based on the equation (6) described in the previous section.
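The two cases of block 212 can be sketched in Python for one dimension of one data point. The sketch assumes the usual fuzzy c-means form of the update, in which the squared distance is raised to the power -1/(m-1) and normalized over all cluster centers; it is not a transcription of equation (6). The function name is illustrative, and giving the membership of 1 to the first coinciding center when several centers have zero distance is an assumption.

```python
from typing import List


def update_dimension_memberships(
    x_j: float, centers_j: List[float], m: float
) -> List[float]:
    """Memberships of one dimension value x_j to every cluster.

    centers_j[i] is the same dimension of the i-th cluster center, and
    m > 1 is the fuzziness control parameter.
    """
    squared = [(x_j - v) ** 2 for v in centers_j]
    zeros = [i for i, d in enumerate(squared) if d == 0.0]
    if zeros:
        # Zero distance: membership 1 for that cluster, 0 for the rest.
        # Assumption: the first coinciding center wins.
        return [1.0 if i == zeros[0] else 0.0 for i in range(len(squared))]
    exponent = -1.0 / (m - 1.0)
    weights = [d ** exponent for d in squared]
    total = sum(weights)
    return [w / total for w in weights]
```

With m = 2, a value of 1.0 and centers at 0.0 and 3.0 have squared distances 1 and 4, weights 1 and 0.25, and memberships 0.8 and 0.2, so the nearer center receives the larger membership.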

[0078] At block 214, the method 200 includes updating a fuzziness control parameter (m). In one implementation, the fuzziness control parameter (m) may be updated based on a weighted sum of the initial or previous fuzziness control parameter (m) and a modified fuzziness control parameter (m). The modified fuzziness control parameter (m) is computed by taking the partial derivative of each dimension of the cluster centers with respect to the membership, or membership degree, of each dimension of each of the plurality of data points, and the partial derivative may then be set to zero. The modification module 122 selects the value of the fuzziness control parameter (m) such that it gives the minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points. The values of the cluster centers, and hence the membership values, are updated until the sum of absolute difference between the modified membership and the initial membership is less than a pre-defined limit (ε) or a predefined number of iterations has been exhausted. For instance, the pre-defined limit (ε) may be 0.01. According to an implementation, the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers.

[0079] At block 216, the method 200 includes identifying a point cluster index for each data point and a dimension cluster index for dimensions of each data point. The point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point. The dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1. According to an implementation, the identification module 124 identifies a point cluster index for each data point and a dimension cluster index for dimensions of each data point.

[0080] At block 218, the method 200 includes assigning each of the plurality of data points to cluster centers of the plurality of clusters based on the point cluster index. According to an implementation, the identification module 124 performs a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters using the point cluster index.

[0081] At block 220, the method 200 includes determining a measurement metric for each dimension of each of the plurality of data points. In one implementation, the measurement metric may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points. The measurement metric may be understood as a goodness measure (G_k^j) for each dimension of a data point. If the value of the goodness measure (G_k^j) is high, then a dimension follows the data points well, indicating that the dimension and the corresponding data point index belong to the same cluster. In one implementation, a set of dimensions that have a higher goodness measure (G_k^j) may be selected to be used for clustering the n-dimensional data. The determination module 126 determines the measurement metric for each dimension of each of the plurality of data points.

[0082] The method blocks 206, 208, 210, 212, and 214 described above are repeated in a plurality of iterations. In one implementation, the plurality of iterations terminate when a sum of absolute difference between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε). In another implementation, the plurality of iterations is predefined, and the method blocks 206, 208, 210, 212, and 214 are repeated till the predefined number of iterations is exhausted.
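The two termination rules can be combined in one loop, sketched here in Python. The `step` callable is hypothetical: it stands in for one pass through blocks 206 to 214 and returns the updated memberships. Holding the memberships as a flat list and the default values of `epsilon` and `max_iterations` are likewise assumptions for the example.

```python
from typing import Callable, List, Tuple


def iterate_until_converged(
    memberships: List[float],
    step: Callable[[List[float]], List[float]],
    epsilon: float = 0.01,
    max_iterations: int = 100,
) -> Tuple[List[float], int]:
    """Repeat `step` until the sum of absolute membership differences
    drops below epsilon or max_iterations passes have been made."""
    iterations = 0
    while iterations < max_iterations:
        updated = step(memberships)
        change = sum(abs(new - old) for new, old in zip(updated, memberships))
        memberships = updated
        iterations += 1
        if change < epsilon:
            break
    return memberships, iterations
```

Either rule alone can be recovered from the sketch: a very large `max_iterations` leaves only the ε test, and an `epsilon` of 0 leaves only the iteration count, because a sum of absolute differences is never below 0.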

[0083] Although embodiments for methods and systems for clustering multi-dimensional data have been described in a language specific to structural features and/or methods, it is to be understood that the invention is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary embodiments for clustering multi-dimensional data.

Claims

I/We claim:

1. A clustering system (102) to determine significant dimensions for clustering multi-dimensional data, the clustering system (102) comprising:

a processor (110);

an assignment module (120), coupled to the processor (110), the assignment module (120) to:

obtain the multi-dimensional data comprising a plurality of data points, each of the plurality of data points having a plurality of dimensions;

assign initial memberships to each of the plurality of dimensions of each of the plurality of data points of the multidimensional data, for a plurality of clusters; and

aggregate initial memberships assigned to the plurality of dimensions of each of the plurality of data points induced by a fuzziness control parameter (m);

a modification module (122), coupled to the processor (110), the modification module (122) to:

modify the initial memberships assigned to the plurality of dimensions based on a cluster center of each of the plurality of clusters, wherein the cluster center of each of the plurality of clusters is computed based on the aggregated initial memberships; and

update the fuzziness control parameter (m) based on weighted sum of a modified fuzziness control parameter (m), and at least one of an initial fuzziness control parameter (m) and a previous fuzziness control parameter (m); and

a determination module (126), coupled to the processor (110), the determination module (126) to determine a goodness measurement metric for each dimension based on comparison of a point cluster index for each data point, and a dimension cluster index for each dimension of each of the plurality of data points, wherein value of goodness measurement metric is indicative of significance of the each dimension.

2. The clustering system (102) as claimed in claim 1, wherein the initial memberships, the aggregated initial memberships, the cluster centers, and the fuzziness control parameter (m) are updated in a plurality of iterations till a sum of absolute difference between the modified memberships and the initial memberships is less than a pre-defined limit (ε).

3. The clustering system (102) as claimed in claim 2, wherein the plurality of iterations are predefined, and the initial memberships, the aggregated initial memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.

4. The clustering system (102) as claimed in claim 1, wherein the modified fuzziness control parameter (m) is determined based on minimizing sum of derivative of cluster centers with respect to memberships of dimensions of each of the plurality of clusters for the plurality of data points.

5. The clustering system (102) as claimed in claim 1, wherein the modification module (122) further:

calculates square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points followed by a power of a negative of a ratio of 1 and fuzziness control parameter minus 1 (m-1); and

updates the membership values for each dimension using the calculated square of distance and normalization with respect to the square of distance from all the cluster centers.

6. The clustering system (102) as claimed in claim 1, wherein the identification module (124) further:

calculates a binary rank matrix for each dimension of each of the plurality of data points, wherein the binary rank matrix is indicative of membership representation of each dimension of a data point;

computes a membership rank matrix for each of the plurality of data points, wherein the membership rank matrix is indicative of average membership of dimensions of a data point for which binary rank matrix entry is equal to 1; and

identify a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each of the plurality of data points to cluster centers of the plurality of clusters, based on the binary rank matrix and the membership rank matrix.

7. The clustering system (102) as claimed in claim 1, wherein the initial membership is initialized to each dimension of each of the plurality of data points using a random number ranging between 0 and 1.

8. The clustering system (102) as claimed in claim 1, wherein value of the goodness measurement metric is 1 if the point cluster index is equal to the dimension cluster index.

9. The clustering system (102) as claimed in claim 1, wherein value of the goodness measurement metric is 0 if the point cluster index is not equal to the dimension cluster index.

10. A method for determining significant dimensions for clustering multi-dimensional data, the method comprising:

obtaining, by the clustering system 102, the multi-dimensional data comprising a plurality of data points, each of the plurality of data points having a plurality of dimensions;

assigning, by the clustering system 102, initial memberships to each of the plurality of dimensions of each of the plurality of data points of the multi-dimensional data, for a plurality of clusters;

aggregating, by the clustering system 102, one of the initial memberships and modified memberships assigned to the plurality of dimensions of each of the plurality of data points induced by a fuzziness control parameter (m);

computing, by the clustering system 102, a cluster center of each of the plurality of clusters based on the aggregated one of the initial memberships and the modified memberships;

calculating, by the clustering system 102, square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points;

modifying, by the clustering system 102, one of the initial memberships and the modified memberships, assigned to the plurality of dimensions of each of the plurality of data points, based on the calculation;

updating, by the clustering system 102, the fuzziness control parameter (m) based on weighted sum of a modified fuzziness control parameter (m), and at least one of an initial fuzziness control parameter (m) and a previous fuzziness control parameter (m); and

determining, by the clustering system 102, a goodness measurement metric for each dimension based on comparison of a point cluster index for each data point, and a dimension cluster index for each dimension of each of the plurality of data points, wherein value of measurement metric is indicative of significance of the each dimension.

11. The method as claimed in claim 10 further comprising:

updating the fuzziness control parameter (m) till a sum of absolute difference between the modified memberships and the initial memberships is less than a pre-defined limit (ε);

calculating a binary rank matrix for each dimension of each of the plurality of data points, wherein the binary rank matrix is indicative of membership representation of each dimension of a data point;

computing a membership rank matrix for each of the plurality of data points, wherein the membership rank matrix is indicative of average membership of all dimensions of a data point for which binary rank matrix entry is equal to 1; and

identifying a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each of the plurality of data points to cluster centers of the plurality of clusters, based on the binary rank matrix and the membership rank matrix.

12. The method as claimed in claim 10, wherein the initial memberships, the aggregated initial memberships, and the cluster centers are updated in a plurality of iterations till a sum of absolute difference between the modified memberships and the initial memberships is less than a predefined limit (ε), and wherein the plurality of iterations are predefined, and the initial memberships, the aggregated initial memberships, and the cluster centers are updated till the predefined number of iterations is exhausted.

13. The method as claimed in claim 10, wherein the initial memberships are modified based on square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points followed by a power of a negative of a ratio of 1 and fuzziness control parameter minus 1 (m-1) and normalization with respect to the square of distance from all the cluster centers.

14. The method as claimed in claim 10, wherein value of the goodness measurement metric is 1 if the point cluster index is equal to the dimension cluster index and the value of goodness measurement metric is 0 if the point cluster index is not equal to the dimension cluster index.

15. A non-transitory computer-readable medium having embodied thereon a computer program for executing a method comprising:

obtaining, by the clustering system 102, the multi-dimensional data comprising a plurality of data points, each of the plurality of data points having a plurality of dimensions;

assigning, by the clustering system 102, initial memberships to each of the plurality of dimensions of each of the plurality of data points of the multi-dimensional data, for a plurality of clusters;