WO2015001416A1 - Multi-dimensional data clustering - Google Patents

Multi-dimensional data clustering

Info

Publication number
WO2015001416A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
memberships
data points
cluster
initial
Prior art date
Application number
PCT/IB2014/001262
Other languages
French (fr)
Inventor
Diptesh DAS
Aniruddha Sinha
Kingshuk CHAKRAVARTY
Amit Konar
Original Assignee
Tata Consultancy Services Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Limited

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image

Definitions

  • the present subject matter relates, in general, to data processing and, in particular, to a system and a method for clustering multi-dimensional data.
  • Data clustering is a method of grouping data points or objects of a given data that are substantially similar in characteristics into clusters. Generally, each cluster is represented by a geometric centroid of the data points lying in the cluster. Clustering techniques can be applied to data that are quantitative (numerical), qualitative (categorical), or a combination of both. Clustering techniques are mostly unsupervised methods that can be used to organize data into clusters based on similarities among the individual data items. The potential of clustering techniques to reveal the underlying structures in data can be exploited in a wide variety of applications including classification, image processing, data mining, pattern recognition, modelling and identification.
  • Figure 1b illustrates comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.
  • Figures 2a and 2b illustrate a method for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter.
  • Such conventional techniques partition data comprising a set of data points or objects into two or more clusters based on an iterative two-steps process. In the first step each data point is allocated to nearest cluster center, and in the second step cluster centers are determined based on identifying a centroid for each of the two or more clusters. The centroid is identified for each partition of the data points allocated to each cluster.
  • Such conventional clustering techniques fail to determine right clusters for data points that reside marginally at boundaries of the two or more clusters.
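  • The iterative two-step process of such conventional techniques can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance and caller-supplied initial centers; the function names are hypothetical and are not taken from the present subject matter.

```python
def assign_to_nearest(points, centers):
    """Step 1: allocate each data point to the nearest cluster center."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def recompute_centers(points, labels, centers):
    """Step 2: move each center to the centroid of the points allocated to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == i]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(list(old))
            continue
        new_centers.append(
            [sum(p[j] for p in members) / len(members) for j in range(len(old))]
        )
    return new_centers


def two_step_clustering(points, centers, max_iter=100):
    """Alternate the two steps until the allocation stops changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign_to_nearest(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
        centers = recompute_centers(points, labels, centers)
    return labels, centers
```

  • Because step 1 makes an all-or-nothing allocation, a data point lying almost midway between two centers is forced into one cluster, which is the boundary behaviour noted above.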
  • the clustering may be understood as partitioning a set of data points of the data into a plurality of clusters, such that the data points that belong to the same cluster are as similar as possible and the data points that belong to different clusters are as dissimilar as possible.
  • the system as described herein is a clustering system.
  • a database for storing multi-dimensional data is maintained according to one implementation.
  • the multi-dimensional data may be representative of multimedia data, financial transactions and the like.
  • the multi-dimensional data is represented by a plurality of data points in a multi-dimensional space, say n-dimensional space.
  • each of the plurality of data points may include a plurality of dimensions or components.
  • the multi-dimensional data may be an image and pixels of the image may be the plurality of data points.
  • the components of the pixels i.e., RGB (red, green and blue) or HSV (hue, saturation and value) can be the dimensions.
  • the database can be an external repository associated with the clustering system, or an internal repository within the clustering system.
  • the data stored in the database may be retrieved whenever clustering is to be performed. Further, the data contained within such database may be updated, whenever required. For example, new data may be added into the database, existing data can be modified, or non-useful data may be deleted from the database.
  • a database is maintained to store the multi-dimensional data, however, it is well appreciated that the multidimensional data may be received by the clustering system in real-time to identify significant dimensions and then perform clustering of the multidimensional data.
  • a membership is assigned to each dimension of each of the plurality of data points to a plurality of clusters.
  • the membership assigned to each dimension initially may be interchangeably referred to as initial membership.
  • the plurality of clusters may be pre-defined.
  • a membership assigned to a dimension of a data point may be understood as strength of association between the dimension of the data point and a particular cluster.
  • the membership assigned to a dimension of a data point may be understood as degree of belonging of the dimension to a particular cluster.
  • Each dimension may belong to several clusters simultaneously, with different degrees of membership.
  • the dimensions can be assigned a membership between 0 and 1, indicating their partial memberships.
  • the memberships can be initialized in a random fashion using a random number between 0 and 1.
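  • A minimal sketch of such a random initialization is given below. The nested-list layout `u[k][j][i]` (data point k, dimension j, cluster i), the function name, and the optional seed are assumptions made for illustration only.

```python
import random


def init_memberships(n_points, n_dims, n_clusters, seed=None):
    """Return u[k][j][i]: a random membership in [0, 1) for every
    dimension j of every data point k with respect to every cluster i."""
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(n_clusters)] for _ in range(n_dims)]
        for _ in range(n_points)
    ]
```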
  • the memberships assigned to the dimensions of the plurality of data points are then aggregated.
  • the memberships may be induced by a fuzziness control parameter (m).
  • the fuzziness control parameter (m) determines the level of cluster fuzziness. A large value of fuzziness control parameter (m) results in smaller memberships and hence, fuzzier clusters.
  • the value of fuzziness control parameter (m) may be 2.
  • a cluster center of each of the plurality of clusters is computed based on the aggregated memberships. A cluster center of a cluster is average of all data points in the cluster. The computation of the cluster center has been explained later in detail (using equation 4), in the forthcoming description.
  • the fuzziness control parameter (m) is updated to stabilize the cluster centers.
  • the stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) which has the minimum effect in the change of cluster centers due to the change in memberships or membership values.
  • partial derivative of each dimension of the cluster centers may be taken with respect to membership degree of each dimension of each of the plurality of data points and then it may be set to zero.
  • the fuzziness control parameter (m) may be updated using an update factor weight (a). For instance, the value of the update factor weight (a) may be 0.95.
  • the fuzziness control parameter (m) may be updated based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m).
  • the computation of the initial or previous fuzziness control parameter (m) and the modified fuzziness control parameter (m) has been explained later in detail, in the forthcoming description.
  • the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated in a plurality of iterations.
  • the plurality of iterations terminates when a sum of absolute differences between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit.
  • the value of the pre-defined limit may be 0.01.
  • the plurality of iterations is predefined, and the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.
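  • The two stopping rules described above can be sketched as follows. The callback `update_step`, which stands for one full pass of membership, cluster center and fuzziness updates, is a hypothetical placeholder; memberships are assumed to be stored as nested lists indexed `[k][j][i]`.

```python
def membership_change(previous, modified):
    """Sum of absolute differences between two membership tables [k][j][i]."""
    total = 0.0
    for prev_point, mod_point in zip(previous, modified):
        for prev_dim, mod_dim in zip(prev_point, mod_point):
            for p, q in zip(prev_dim, mod_dim):
                total += abs(q - p)
    return total


def run_iterations(update_step, memberships, limit=0.01, max_iterations=None):
    """Iterate until the change drops below `limit`, or until a predefined
    number of iterations is exhausted when `max_iterations` is given."""
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        modified = update_step(memberships)
        iterations += 1
        converged = membership_change(memberships, modified) < limit
        memberships = modified
        if converged:
            break
    return memberships, iterations
```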
  • a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters is performed.
  • a point cluster index is identified for each of the plurality of data points.
  • the point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point.
  • the binary rank matrix is indicative of membership representation of each dimension of a data point.
  • the membership may be represented in terms of binary notation i.e. either as 1s or as 0s.
  • the hard assignment is done based on a membership rank matrix for each of the plurality of data points.
  • a membership rank matrix is indicative of average membership of dimensions of a data point for which binary rank matrix entry is equal to 1.
  • the hard assignment to assign each of the dimensions of plurality of data points to cluster centers of the plurality of clusters may also be performed based on identifying a dimension cluster index for each dimension of each of the plurality of data points.
  • the dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
  • a measurement metric is determined for each dimension of each of the plurality of data points.
  • the measurement metric may be understood as a goodness measure for each dimension of a data point.
  • the measurement metric can be used for performing dimensionality reduction. Dimensionality reduction can be performed by tracking the dimensions which follow data points well as compared to other dimensions. If the value of goodness measure is high then a dimension follows the data points well indicating the significance of the dimension in the process of clustering. For instance, if value of goodness measure is high then the significance of the dimension for the process of clustering is also very high. Thus, dimensionality reduction can be performed by selecting a set of dimensions that have higher values of goodness measure. The set of dimensions that have higher goodness measure can be used for clustering the n-dimensional data.
  • the measurement metric for each dimension of each of the plurality of data points may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points.
  • the measurement metric may be interchangeably referred to as a goodness measurement metric.
  • the measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. Otherwise, the measurement metric is equal to 0.
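  • Dimensionality reduction based on the goodness measure can be sketched as follows. The source does not state how many dimensions are retained, so the `keep` parameter and the function names are assumptions for illustration; `goodness[k][j]` holds the 0/1 measurement metric of dimension j of data point k.

```python
def accumulate_goodness(goodness):
    """Total the 0/1 goodness values of each dimension over all data points."""
    n_dims = len(goodness[0])
    return [sum(row[j] for row in goodness) for j in range(n_dims)]


def select_dimensions(goodness, keep):
    """Return the indices of the `keep` dimensions with the highest totals."""
    totals = accumulate_goodness(goodness)
    ranked = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return sorted(ranked[:keep])
```

  • For example, with four data points and three dimensions, a dimension that agrees with the point cluster index for all four points is ranked above one that agrees for only one point.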
  • each dimension of the data points independently contributes to determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the cluster center of that dimension is significantly minimized and, as a result, the accuracy of clustering of input data improves significantly. Further, the cluster assignment of a data point, made by considering the highest aggregated membership for that cluster, can also be ascertained from the set of dimensions having higher membership to belong to that cluster.
  • Figure 1a illustrates a network environment 100 implementing a clustering system 102, in accordance with an embodiment of the present subject matter.
  • the network environment 100 can be a public network environment, including thousands of personal computers, laptops, various servers, such as blade servers, and other computing devices.
  • the network environment 100 can be a private network environment with a limited number of computing devices, such as personal computers, servers, and laptops.
  • the clustering system 102 may be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. Further, it will be understood that the clustering system 102 is connected to a plurality of user devices 104-1, 104-2, 104-3..., and 104-N, collectively referred to as user devices 104 and individually referred to as a user device 104. As shown in figure 1a, the user devices 104 are communicatively coupled to the clustering system 102 over a network 106 through one or more communication links for facilitating one or more end users to access and operate the clustering system 102.
  • the user device 104 may include, but is not limited to, a desktop computer, a portable computer, a handheld computing device, and a workstation.
  • the network 106 may be a wireless network, a wired network, or a combination thereof.
  • the network 106 may also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet.
  • the network 106 may be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and such.
  • the network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other.
  • the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
  • the network environment 100 further comprises a database 108 communicatively coupled to the clustering system 102.
  • the database 108 may store multi-dimensional data.
  • the data may be representative of multimedia data, financial transactions and the like. According to an implementation, the data is represented as a plurality of data points in a multi-dimensional space, say n-dimensional space.
  • the database 108 is shown external to the clustering system 102, it will be appreciated by a person skilled in the art that the database 108 can also be implemented internal to the clustering system 102, where the multi-dimensional data may be stored within a memory component of the clustering system 102.
  • the clustering system 102 includes processor(s) 110, interface(s) 112, and memory 114 coupled to the processor(s) 110.
  • the processor(s) 110 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the processor(s) 110 may be configured to fetch and execute computer-readable instructions stored in the memory 114.
  • the memory 114 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM), and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • the interface(s) 112 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a product board, a mouse, an external memory, and a printer. Additionally, the interface(s) 112 may enable the clustering system 102 to communicate with other devices, such as web servers and external repositories. The interface(s) 112 may also facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. For the purpose, the interface(s) 112 may include one or more ports.
  • the clustering system 102 also includes module(s) 116 and data 118.
  • the module(s) 116 include, for example, an assignment module 120, a modification module 122, an identification module 124, and a determination module 126, and other module(s) 128.
  • the other module(s) 128 may include programs or coded instructions that supplement applications or functions performed by the clustering system 102.
  • the data 118 may be membership data 130, index data 132, and other data 134.
  • the other data 134 may serve as a repository for storing data that is processed, received, or generated as a result of the execution of one or more modules in the module(s) 116.
  • the assignment module 120 of the clustering system 102 may retrieve the multi-dimensional data from the database 108.
  • the multi-dimensional data may be an n-dimensional data.
  • the multi-dimensional data may be represented by a plurality of data points in a multi-dimensional space, say n-dimensional space. Further, each of the plurality of data points may include a plurality of dimensions or components.
  • the n-dimensional data is mathematically represented by the expression provided below:
  • ( X ) represents the n-dimensional data of size N.
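  • One notation consistent with the definitions used in this description is shown below. It is offered as an assumption for readability, with $\vec{x}_k$ denoting the k-th data point and $x_k^j$ its j-th dimension.

```latex
X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_N\}, \qquad
\vec{x}_k = \left(x_k^1, x_k^2, \ldots, x_k^n\right), \quad 1 \le k \le N
```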
  • the multi-dimensional data may be an image and pixels of the image may be the plurality of data points.
  • the components of the pixels i.e., RGB (red, green and blue) or HSV (hue, saturation and value) may be the dimensions.
  • the clustering system 102 may partition a two-dimensional image or a three-dimensional (3D) image into two or more clusters.
  • an image of 481 by 321 pixel dimension is taken and is transformed from RGB plane into HSV plane.
  • Each data point includes three components, i.e., Hue (H), Saturation (S) and Value (V). These components are clustered into three clusters based on the HSV value of background, subject skin and dress color of the subject in the image. Therefore, in this case, the total number of data points (N) is 154401 (481 × 321), the total number of dimensions (n) is 3, and the total number of clusters (c) is 3.
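  • The quantities in this example can be derived as in the sketch below, which uses the `colorsys` module from the Python standard library for the RGB to HSV transformation. The assumption that pixels arrive as 8-bit (r, g, b) tuples, and the helper name, are illustrative only.

```python
import colorsys


def pixels_to_hsv_points(rgb_pixels):
    """Convert 8-bit RGB pixels into HSV data points of three dimensions each."""
    points = []
    for r, g, b in rgb_pixels:
        # colorsys expects each channel as a float in [0, 1].
        points.append(colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0))
    return points


width, height = 481, 321
N = width * height  # total number of data points (one per pixel)
n = 3               # dimensions per data point: H, S and V
c = 3               # clusters: background, subject skin, dress color
```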
  • the assignment module 120 may assign a membership to each dimension of each of the plurality of data points to a plurality of clusters.
  • the membership assigned to each dimension may be interchangeably referred to as initial membership.
  • the number of plurality of clusters may be pre-defined depending upon the application.
  • a membership assigned to a dimension of a data point may be understood as strength of association between the dimension of the data point and a particular cluster.
  • the membership assigned to a dimension of a data point may be understood as degree of belonging of the dimension to a particular cluster.
  • Each dimension may belong to several clusters simultaneously, with different degrees of membership.
  • the assignment module 120 may initialize membership to each dimension using a random number ranging between 0 and 1, indicating their partial memberships.
  • the memberships can be assigned in a random fashion using a random number ranging between 0 and 1.
  • the membership assigned to the dimensions is mathematically represented by the expression provided below: $\mu_{A_i}(x_k^j)$, $1 \le j \le n$, $1 \le k \le N$, $1 \le i \le c$
  • $x_k^j$ denotes the j-th dimension of the k-th data point and $\mu_{A_i}(x_k^j)$ denotes the membership of $x_k^j$ to belong to the i-th cluster.
  • (N) is the size of the n-dimensional data and (c) is the numbers of clusters for the n-dimensional data.
  • $\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the i-th cluster; and m represents the fuzziness control parameter.
  • Based on the aggregated memberships, the modification module 122 computes a cluster center of each of the plurality of clusters.
  • a cluster center of a cluster may be understood as average of all data points in the cluster.
  • the modification module 122 computes a cluster center using equation (4) provided below:
  • $x_k^j$ represents the j-th dimension of the k-th data point
  • $\mu_{A_i}(\vec{x}_k)$ represents the aggregated membership of $\vec{x}_k$ to belong to the i-th cluster.
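  • A cluster center computed from aggregated memberships can be sketched as below. Equation (4) is not reproduced in this text, so the sketch assumes the membership-weighted mean used in standard fuzzy c-means, with each aggregated membership raised to the power m; the function name and the data layout (`points[k][j]`, `aggregated[k][i]`) are likewise assumptions.

```python
def cluster_centers(points, aggregated, m=2.0):
    """Return centers[i][j] as the mean of the data points weighted by
    aggregated[k][i] ** m (assumed standard fuzzy c-means form)."""
    n_clusters = len(aggregated[0])
    n_dims = len(points[0])
    centers = []
    for i in range(n_clusters):
        weights = [row[i] ** m for row in aggregated]
        total = sum(weights)  # assumed non-zero for every cluster
        centers.append(
            [sum(w * p[j] for w, p in zip(weights, points)) / total
             for j in range(n_dims)]
        )
    return centers
```

  • With memberships of exactly 0 or 1 this reduces to the plain average of the data points in the cluster.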
  • the modification module 122 may initially compute the cluster center of each of the plurality of clusters using equation (5) provided below and then compute new cluster center of each of the plurality of clusters using equation (4) provided above:
  • the modification module 122 may then calculate square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points.
  • the square of distance calculated between the each dimension of cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points is mathematically represented by the expression provided below:
  • $(x_k^j - v_i^j)^2$ denotes the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points
  • $x_k^j$ denotes the j-th dimension of the k-th data point
  • $v_i^j$ denotes the j-th dimension of the i-th cluster center.
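  • The per-dimension squared distance defined above can be computed as in the following sketch. The function name and the nested-list layout (`points[k][j]`, `centers[i][j]`, result `d[k][j][i]`) are assumptions for illustration.

```python
def squared_distances(points, centers):
    """Return d[k][j][i] = (x_k^j - v_i^j) ** 2 for every data point k,
    dimension j and cluster center i."""
    return [
        [[(x - center[j]) ** 2 for center in centers] for j, x in enumerate(point)]
        for point in points
    ]
```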
  • the modification module 122 determines a modified membership for each dimension of each of the plurality of data points.
  • the modified membership is determined based on modifying the initial membership assigned to each dimension based on the cluster center of each of the plurality of clusters.
  • the modification module 122 determines the modified membership based on equation 7 (provided below).
  • the modification module 122 modifies the memberships using the expression provided below:
  • $(x_k^j - v_i^j)^2$ represents the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points
  • m represents the fuzziness control parameter.
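  • A membership modification of this kind can be sketched as follows. Equation (7) is not reproduced in this text; the sketch assumes the standard fuzzy c-means update applied to each dimension separately, in which the membership is proportional to the inverse squared distance raised to 1/(m - 1) and the memberships of a dimension sum to 1 over the clusters. The guard value `eps` and the function name are hypothetical.

```python
def update_memberships(sq_dist, m=2.0, eps=1e-12):
    """sq_dist[k][j][i] -> u[k][j][i], assuming the standard fuzzy c-means
    form applied per dimension; requires m > 1."""
    exponent = 1.0 / (m - 1.0)
    memberships = []
    for point in sq_dist:
        point_u = []
        for dists in point:
            # eps avoids division by zero when a value coincides with a center.
            inverse = [(1.0 / max(d, eps)) ** exponent for d in dists]
            total = sum(inverse)
            point_u.append([w / total for w in inverse])
        memberships.append(point_u)
    return memberships
```

  • Under this assumed form, raising m lowers the largest membership of a dimension and spreads membership more evenly over the clusters, which matches the statement that a larger fuzziness control parameter yields fuzzier clusters.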
  • the modification module 122 computes the cluster center of each of the plurality of clusters and modifies the memberships using equation (8) provided below:
  • $\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the i-th cluster
  • $(x_k^j - v_i^j)^2$ represents the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points
  • m represents the fuzziness control parameter.
  • The partial derivative of equation (8) is taken with respect to the memberships, the cluster centers, and the Lagrange multiplier to obtain equation (5) and equation (7).
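  • For orientation, a Lagrangian of the kind used in standard fuzzy c-means, written per dimension, is shown below. This is an assumed form and not necessarily equation (8) itself; in particular the constraint that the memberships of each dimension sum to 1 over the clusters, and the multipliers $\lambda_k^j$, are assumptions.

```latex
J = \sum_{i=1}^{c} \sum_{k=1}^{N} \sum_{j=1}^{n}
      \left[\mu_{A_i}(x_k^j)\right]^{m} \left(x_k^j - v_i^j\right)^{2}
    - \sum_{k=1}^{N} \sum_{j=1}^{n} \lambda_k^j
      \left( \sum_{i=1}^{c} \mu_{A_i}(x_k^j) - 1 \right)

\frac{\partial J}{\partial v_i^j} = 0
  \;\Rightarrow\;
  v_i^j = \frac{\sum_{k=1}^{N} \left[\mu_{A_i}(x_k^j)\right]^{m} x_k^j}
               {\sum_{k=1}^{N} \left[\mu_{A_i}(x_k^j)\right]^{m}}

\frac{\partial J}{\partial \mu_{A_i}(x_k^j)} = 0,\;
\frac{\partial J}{\partial \lambda_k^j} = 0
  \;\Rightarrow\;
  \mu_{A_i}(x_k^j) =
  \left[ \sum_{l=1}^{c}
    \left( \frac{(x_k^j - v_i^j)^{2}}{(x_k^j - v_l^j)^{2}} \right)^{\frac{1}{m-1}}
  \right]^{-1}
```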
  • the modification module 122 aggregates the memberships or membership values and adapts the fuzziness control parameter (m) towards its convergence, i.e., the fuzziness control parameter (m) is placed in the less sensitive region of the cluster centers.
  • the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers.
  • the stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) which has the minimum effect in the change of cluster centers due to the change in membership values.
  • the modification module 122 updates the fuzziness control parameter (m) based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m).
  • the modification module 122 takes partial derivative of each dimension of the cluster centers with respect to membership or membership degree of each dimension of each of the plurality of data points and then it may be set to zero.
  • the modification module 122 selects the value of the fuzziness control parameter (m) such that it gives minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points.
  • the modification module 122 may update the fuzziness control parameter (m) using an update factor weight (a). For instance, the value of the update factor weight (a) may be 0.95.
  • the modification module 122 updates the fuzziness control parameter (m) using equation (9) and (10) provided below:
  • $m$ represents the initial or previous fuzziness control parameter (m);
  • $m_{modified}$ represents the modified fuzziness control parameter (m), where $m_{modified}$ is calculated using equation (9);
  • $a$ is the update factor weight; and
  • $m_{new}$ represents the updated fuzziness control parameter (m).
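  • The weighted-sum update of the fuzziness control parameter can be sketched as below. Equation (10) is not reproduced in this text; the sketch assumes a convex combination in which the update factor weight a is applied to the previous value and (1 - a) to the modified value. The function name is hypothetical.

```python
def update_fuzziness(m_previous, m_modified, a=0.95):
    """Assumed form: m_new = a * m_previous + (1 - a) * m_modified."""
    return a * m_previous + (1.0 - a) * m_modified
```

  • With a = 0.95, each iteration moves the parameter only 5% of the way toward the modified value, so m changes gradually while the cluster centers settle.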
  • the modification module 122 may update the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) in a plurality of iterations until a sum of absolute differences between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit.
  • the value of the pre-defined limit may be 0.01.
  • the plurality of iterations is predefined, and in said implementation, the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.
  • the clustering system 102 calculates a binary rank matrix for each dimension of each of the plurality of data points.
  • the binary rank matrix is indicative of membership representation of each dimension of a data point.
  • the membership may be represented in terms of binary notation i.e. either as 1s or as 0s.
  • the matrix dimension of a binary rank matrix is equal to ratio of total number of clusters to total number of dimensions of a data point.
  • the identification module 124 may assign a value of 1 to that cluster which corresponds to maximum value of membership and all other clusters are assigned a value of 0. Further, the identification module 124 computes a membership rank matrix for each of the plurality of data points.
  • the membership rank matrix may be indicative of average membership of dimensions of a data point for which binary rank matrix entry is equal to 1.
  • Based on the binary rank matrix and the membership rank matrix, the identification module 124 performs a hard assignment of each of the dimensions of each of the plurality of data points to the cluster centers of the plurality of clusters. To perform the hard assignment, the identification module 124 identifies a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each data point to the cluster centers of the clusters.
  • the point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point.
  • the point cluster index for the data point is mathematically represented by the expression provided below:
  • $C_k^{data}$ denotes a point cluster index of the k-th data point which has the maximum number of 1s in the binary rank matrix and $M_{ij}(\vec{x}_k)$ denotes the binary rank matrix.
  • the hard assignment is done based on the membership rank matrix for each of the plurality of data points.
  • the point cluster index for this case is mathematically represented by the expression provided below:
  • $U_{A_i}(\vec{x}_k)$ denotes the membership rank matrix
  • the identification module 124 also performs hard assignment to assign each of the plurality of data points to cluster centers of the plurality of clusters based on identifying a dimension cluster index for each dimension of each of the plurality of data points.
  • the dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
  • the dimension cluster index for a dimension of a data point is mathematically represented by the expression provided below:
  • $C_k^j$ denotes a dimension cluster index of the j-th dimension of the k-th data point and $M_{ij}(\vec{x}_k)$ denotes the binary rank matrix.
  • the point cluster index and the dimension cluster index identified by the identification module 124 may be stored as the index data 132 within the clustering system 102.
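  • The binary rank matrix, the membership rank matrix and the two cluster indices for a single data point can be sketched as follows. The row-per-dimension layout `memberships_k[j][i]`, the zero-based indices, the tie-breaking toward the lowest cluster index, and the function names are assumptions for illustration.

```python
def binary_rank_matrix(memberships_k):
    """For each dimension j, mark the cluster with the maximum membership
    with 1 and all other clusters with 0."""
    matrix = []
    for dim_memberships in memberships_k:
        best = dim_memberships.index(max(dim_memberships))
        matrix.append([1 if i == best else 0 for i in range(len(dim_memberships))])
    return matrix


def membership_rank(memberships_k, binary):
    """Average membership, per cluster, over the dimensions whose binary
    rank entry is 1 (0.0 when no dimension selects the cluster)."""
    ranks = []
    for i in range(len(binary[0])):
        chosen = [memberships_k[j][i] for j in range(len(binary)) if binary[j][i] == 1]
        ranks.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return ranks


def dimension_cluster_indices(binary):
    """Index of the cluster whose binary rank entry is 1, per dimension."""
    return [row.index(1) for row in binary]


def point_cluster_index(binary):
    """Index of the cluster with the maximum number of 1s over all dimensions."""
    counts = [sum(row[i] for row in binary) for i in range(len(binary[0]))]
    return counts.index(max(counts))
```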
  • the determination module 126 determines a measurement metric for each dimension of each of the plurality of data points.
  • the measurement metric may be interchangeably referred to as a goodness measurement metric.
  • the measurement metric may be understood as a goodness measure ($G_k^j$) for each dimension of a data point. If the value of goodness measure ($G_k^j$) is high then a dimension follows the data points well indicating the significance of the dimension in the process of clustering. The value of the measurement metric for each dimension accumulated over all the data points provides the measure of significance of the dimension. For instance, if the value of the measurement metric is high, then measure of significance of the dimension may also be high. In one implementation, a set of dimensions that have higher goodness measure ($G_k^j$) may be selected to be used for clustering the n-dimensional data.
  • the determination module 126 may determine the measurement metric for each dimension of each of the plurality of data points based on comparison of the point cluster index and the dimension cluster index.
  • the measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. If the point cluster index and the dimension cluster index are not equal, then the measurement metric is equal to 0.
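  • The comparison described above amounts to the following sketch for one data point; the function name and argument layout are assumptions for illustration.

```python
def goodness_measure(point_index, dimension_indices):
    """Return 1 for each dimension whose dimension cluster index equals the
    point cluster index of the data point, and 0 otherwise."""
    return [1 if d == point_index else 0 for d in dimension_indices]
```

  • For a data point whose point cluster index is 0 and whose three dimensions have dimension cluster indices 0, 0 and 2, the first two dimensions receive a measurement metric of 1 and the third receives 0.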
  • each dimension of the data points independently contributes to determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the cluster center of that dimension is significantly minimized and, as a result, the accuracy of clustering of input data improves significantly. Further, the cluster assignment of a data point, made by considering the highest aggregated membership for that cluster, can also be ascertained from the set of dimensions having higher membership to belong to that cluster.
  • FIG. lb illustrates comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.
  • image 140 is an original image that is to be segmented.
  • image 142 is the segmented image that is obtained as a result of image segmentation performed by the conventional clustering system.
  • image 144 is the segmented image that is obtained as a result of the image segmentation performed by the present clustering system 102, i.e., clustering system described in accordance with the present subject matter.
  • the performance of the segmentation process is measured in terms of the number of data points that originally belong to the subjects but are misclassified as the background.
  • the present clustering system 102 outperforms the conventional clustering system by minimizing the misclassification error.
  • Figures 2a and 2b illustrate a method 200 for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter.
  • the method 200 is implemented in a computing device, such as the clustering system 102.
  • the method 200 may be described in the general context of computer executable instructions.
  • computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types.
  • the method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network.
  • the order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method.
  • the method 200 can be implemented in any suitable hardware, software, firmware or combination thereof.
  • the method 200 includes obtaining multi-dimensional data, where the multi-dimensional data includes a plurality of data points.
  • the multi-dimensional data may be an n-dimensional data and the multi-dimensional data may be represented by a plurality of data points.
  • each of the plurality of data-points may include a plurality of dimensions.
  • the multi-dimensional data may be multimedia data, say an image.
  • the data points of the image can be the pixels and components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value) may be the dimensions.
  • the assignment module 120 of the clustering system 102 may obtain the multi-dimensional data from the database 108.
  • the method 200 includes assigning initial memberships to each dimension of each of the plurality of data points for a plurality of clusters.
  • the plurality of clusters may be pre-defined.
  • a membership assigned to a dimension of a data point may be understood as strength of association between the dimension of the data point and a particular cluster.
  • the memberships can be assigned in a random fashion using a random number between 0 and 1.
  • the assignment module 120 of the clustering system 102 assigns a membership to each dimension of each of the plurality of data points to a plurality of clusters.
  • the method 200 includes aggregating the initial memberships assigned to the dimensions of the plurality of data points.
  • the aggregated memberships may be induced by a fuzziness control parameter (m).
  • the fuzziness control parameter (m) determines the level of fuzziness in a cluster.
  • the value of fuzziness control parameter (m) may be 2.
  • the assignment module 120 aggregates the memberships based on the equation (3) described in the previous section.
  • the method 200 includes computing a cluster center of each of the plurality of clusters based on the aggregated memberships.
  • a cluster center of a cluster may be understood as average of all data points in the cluster.
  • the modification module 122 computes a cluster center based on the equation (4) described in the previous section.
  • the method 200 includes calculating square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points.
  • the modification module 122 calculates a square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points.
  • the method 200 includes modifying the initial memberships assigned to the dimensions of each of the plurality of data points. For instance, if the square of distance between a dimension of the cluster center and the corresponding dimension of a data point is greater than 0, then the membership assigned to that dimension is modified. In another instance, if the square of distance between a dimension of the cluster center and the corresponding dimension of a data point is equal to 0, then the membership is set to 1 for the corresponding cluster and set to 0 for the rest of the clusters.
  • the modification module 122 modifies the membership assigned to each dimension of each of the plurality of data points based on the equation (6) described in the previous section.
  • the method 200 includes updating a fuzziness control parameter (m).
  • the fuzziness control parameter (m) may be updated based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m).
  • the modified fuzziness control parameter (m) is computed by taking the partial derivative of each dimension of the cluster centers with respect to the membership degree of each dimension of each of the plurality of data points, which may then be set to zero.
  • the modification module 122 selects the value of the fuzziness control parameter (m) such that it gives minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points.
  • the values of the cluster centers, and hence the membership values, are updated until the sum of absolute differences between the modified memberships and the initial memberships is less than a pre-defined limit (ε) or a predefined number of iterations has been exhausted.
  • the pre-defined limit (ε) may be 0.01.
  • the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers.
  • the method 200 includes identifying a point cluster index for each data point and a dimension cluster index for dimensions of each data point.
  • the point cluster index for a data point may be understood as the index which has the maximum number of 1s in a binary rank matrix for each dimension of the data point.
  • the dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
  • the identification module 124 identifies a point cluster index for each data point and a dimension cluster index for dimensions of each data point.
  • the method 200 includes assigning each of the plurality of data points to cluster centers of the plurality of clusters based on the point cluster index.
  • the identification module 124 performs a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters using the point cluster index.
  • the method 200 includes determining a measurement metric for each dimension of each of the plurality of data points.
  • the measurement metric may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points.
  • the measurement metric may be understood as a goodness measure (G_k) for each dimension of a data point. If the value of the goodness measure (G_k) is high, the dimension follows the data points well, indicating that the dimension and the corresponding data point belong to the same cluster. In one implementation, a set of dimensions that have a higher goodness measure (G_k) may be selected to be used for clustering the n-dimensional data.
  • the determination module 126 determines the measurement metric for each dimension of each of the plurality of data points.
  • the method blocks 206, 208, 210, 212, and 214 described above are repeated in a plurality of iterations.
  • the plurality of iterations terminates when the sum of absolute differences between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε).
  • the pre-defined limit (ε) may be, for example, 0.01.
  • alternatively, the number of iterations is predefined, and the method blocks 206, 208, 210, 212, and 214 are repeated till the predefined number of iterations is exhausted.
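
The two stopping rules described above (a change in memberships below the limit, or an exhausted iteration budget) can be sketched as a small driver loop. This is a minimal sketch, not the patented implementation: `update_step` is a hypothetical callable standing in for one pass of method blocks 206 to 214, memberships are flattened into a plain list for simplicity, and the names `eps` and `max_iters` are illustrative.

```python
def total_abs_change(previous, updated):
    """Sum of absolute differences between two flat lists of memberships."""
    return sum(abs(a - b) for a, b in zip(previous, updated))


def iterate(memberships, update_step, eps=0.01, max_iters=100):
    """Repeat update_step until the change drops below eps or max_iters is hit.

    update_step is a hypothetical placeholder for one pass of the method
    blocks; it takes and returns a flat list of membership values.
    """
    iteration = 0
    for iteration in range(1, max_iters + 1):
        updated = update_step(memberships)
        change = total_abs_change(memberships, updated)
        memberships = updated
        if change < eps:
            break
    return memberships, iteration


# Toy update (an assumption for illustration only): move halfway to a target.
target = [0.5, 0.5]


def toy_step(u):
    return [0.5 * (v + t) for v, t in zip(u, target)]


final, iterations = iterate([0.0, 1.0], toy_step, eps=0.01, max_iters=100)
capped, capped_iterations = iterate([0.0, 1.0], toy_step, eps=0.01, max_iters=3)
```

With the toy update, the total change halves on every pass, so the first call stops on the limit while the second call stops on the iteration budget.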


Abstract

A method for clustering multi-dimensional data comprises obtaining multi-dimensional data comprising a plurality of data points, each data point having multiple dimensions. Initial memberships are assigned to each dimension, for a plurality of clusters, and one of the initial memberships and modified memberships assigned to the dimensions of each data point is aggregated and induced by a fuzziness control parameter. Based on the aggregation, a cluster center of each cluster is computed, and the square of distance between each dimension of the cluster center and each dimension of each data point is calculated. Based on the calculation, one of the initial memberships and the modified memberships, assigned to the plurality of dimensions of each data point, is modified, and the fuzziness control parameter is updated. A goodness measurement metric indicative of the significance of each dimension is determined for each dimension based on a comparison of a point cluster index and a dimension cluster index.

Description

MULTI-DIMENSIONAL DATA CLUSTERING
TECHNICAL FIELD
[0001] The present subject matter relates, in general, to data processing and, in particular, to a system and a method for clustering multi-dimensional data.
BACKGROUND
[0002] In recent years, dramatic growth in applications such as Internet search, digital imaging, and video surveillance has created many high-volume, high-dimensional data sets. Most of such data sets are unstructured, adding to the difficulty in managing them. Further, the increase in both the volume and the variety of data requires advances in methodology for clustering of the data.
[0003] Data clustering is a method of grouping data points or objects of a given data that are substantially similar in characteristics into clusters. Generally, each cluster is represented by a geometric centroid of the data points lying in the cluster. Clustering techniques can be applied to data that are quantitative (numerical), qualitative (categorical), or a combination of both. Clustering techniques are mostly unsupervised methods that can be used to organize data into clusters based on similarities among the individual data items. The potential of clustering techniques to reveal the underlying structures in data can be exploited in a wide variety of applications including classification, image processing, data mining, pattern recognition, modelling and identification.
BRIEF DESCRIPTION OF THE DRAWINGS
[0004] The detailed description is described with reference to the accompanying figure(s). In the figure(s), the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the figure(s) to reference like features and components. Some embodiments of systems and/or methods in accordance with embodiments of the present subject matter are now described, by way of example only, and with reference to the accompanying figure(s), in which:
[0005] Figure 1a illustrates a network environment implementing a clustering system, according to an embodiment of the present subject matter.
[0006] Figure 1b illustrates a comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.
[0007] Figures 2a and 2b illustrate a method for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter.
DETAILED DESCRIPTION
[0008] Various techniques for clustering data have been developed in the past few years. Such conventional techniques partition data comprising a set of data points or objects into two or more clusters based on an iterative two-step process. In the first step, each data point is allocated to the nearest cluster center, and in the second step, cluster centers are determined by identifying a centroid for each of the two or more clusters. The centroid is identified for each partition of the data points allocated to each cluster. Such conventional clustering techniques, however, fail to determine the right clusters for data points that reside marginally at the boundaries of the two or more clusters.
[0009] A few attempts have been made in the past to overcome this limitation of the conventional techniques by incorporating a partial membership, i.e., a degree of belongingness, of each data point to a particular cluster. However, such attempts have been unsuccessful for very high dimensional datasets. For a very high dimensional dataset, odd parametric values of a few dimensions may force a data point away from its actual cluster, and thus it becomes difficult to identify the significance of the dimensions for the data point. Therefore, clustering accuracy reduces significantly.
[0010] In accordance with the present subject matter, a system and a method for determining significant dimensions for clustering multi-dimensional data are described. The clustering may be understood as partitioning a set of data points of the data into a plurality of clusters, such that the data points that belong to the same cluster are as similar as possible and the data points that belong to different clusters are as dissimilar as possible. The system as described herein is a clustering system.
[0011] Initially, a database for storing multi-dimensional data is maintained according to one implementation. The multi-dimensional data may be representative of multimedia data, financial transactions and the like. According to an implementation, the multi-dimensional data is represented by a plurality of data points in a multi-dimensional space, say n-dimensional space. Further, each of the plurality of data points may include a plurality of dimensions or components. In one example, the multi-dimensional data may be an image and pixels of the image may be the plurality of data points. In said example, the components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value) can be the dimensions. The database can be an external repository associated with the clustering system, or an internal repository within the clustering system.
[0012] The data stored in the database may be retrieved whenever clustering is to be performed. Further, the data contained within such database may be updated, whenever required. For example, new data may be added into the database, existing data can be modified, or non-useful data may be deleted from the database. Although it has been described that a database is maintained to store the multi-dimensional data, it is well appreciated that the multi-dimensional data may be received by the clustering system in real-time to identify significant dimensions and then perform clustering of the multi-dimensional data.
[0013] In one implementation, a membership is assigned to each dimension of each of the plurality of data points for a plurality of clusters. The membership initially assigned to each dimension may be interchangeably referred to as initial membership. In said implementation, the plurality of clusters may be pre-defined. A membership assigned to a dimension of a data point may be understood as the strength of association between the dimension of the data point and a particular cluster. In other words, the membership assigned to a dimension of a data point may be understood as the degree of belonging of the dimension to a particular cluster. Each dimension may belong to several clusters simultaneously, with different degrees of membership. Further, the dimensions can be assigned a membership between 0 and 1, indicating their partial memberships. According to an implementation, the memberships can be initialized in a random fashion using a random number between 0 and 1.
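
The random initialization described above can be sketched as follows. This is an illustrative sketch only: the nested-list layout `u[i][k][j]` (cluster i, data point k, dimension j), the function name, and the `seed` argument are assumptions, and no normalization across clusters is applied because the text does not state one.

```python
import random


def init_memberships(num_clusters, num_points, num_dims, seed=None):
    """Return u[i][k][j]: membership of dimension j of point k in cluster i.

    Each value is drawn independently and uniformly from [0, 1).
    """
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
        for _ in range(num_clusters)
    ]


# Example: 3 clusters, 4 data points, 3 dimensions (e.g. H, S and V).
u = init_memberships(num_clusters=3, num_points=4, num_dims=3, seed=0)
```

Keeping one value per (cluster, point, dimension) triple reflects the statement that each dimension may belong to several clusters at once with different degrees of membership.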
[0014] The memberships assigned to the dimensions of the plurality of data points are then aggregated. In one implementation, the aggregated memberships may be induced by a fuzziness control parameter (m). The fuzziness control parameter (m) determines the level of cluster fuzziness. A large value of the fuzziness control parameter (m) results in smaller memberships and hence, fuzzier clusters. In an example, the value of the fuzziness control parameter (m) may be 2. Thereafter, a cluster center of each of the plurality of clusters is computed based on the aggregated memberships. A cluster center of a cluster is the average of all data points in the cluster. The computation of the cluster center is explained later in detail (using equation 4), in the forthcoming description.
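
A small sketch may help show how the fuzziness control parameter (m) shapes the aggregated memberships. It assumes that the aggregate for a data point and a cluster is the mean, over the dimensions, of each per-dimension membership raised to the power m; that averaging form is an assumption made here for illustration, and the function name and data layout are hypothetical.

```python
def aggregate_memberships(u, m=2.0):
    """Return agg[i][k]: mean over dimensions j of u[i][k][j] ** m.

    u[i][k][j] is the membership of dimension j of point k in cluster i.
    The mean-of-powers form is an assumption of this sketch.
    """
    return [
        [sum(v ** m for v in dims) / len(dims) for dims in cluster]
        for cluster in u
    ]


u = [
    [[0.9, 0.8], [0.2, 0.1]],  # cluster 0: two points, two dimensions each
    [[0.1, 0.2], [0.8, 0.9]],  # cluster 1
]
agg_m2 = aggregate_memberships(u, m=2.0)
agg_m3 = aggregate_memberships(u, m=3.0)
```

Because every membership lies between 0 and 1, raising it to a larger power shrinks it, which matches the remark that a large value of m yields smaller memberships.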
[0015] Further, the square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points is calculated. In a scenario where the square of distance is greater than zero, the membership assigned to each dimension of each of the plurality of data points is modified. The calculation of the square of distance is explained later in detail (using equation 6), in the description. In another scenario, where the square of distance is equal to zero, the membership assigned to each dimension of each of the plurality of data points is set to 1 for the corresponding cluster and set to zero for the remaining clusters.
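
The per-dimension squared distance and the zero-distance rule can be sketched for a single data point as follows. This is a hedged illustration: the function names and the layout (`centers[i][j]`, `memberships[i][j]`) are assumptions, the handling of ties between clusters is an arbitrary choice, and the update for the positive-distance case is deliberately left out because it depends on equation (6).

```python
def squared_distances(point, centers):
    """Return d2[i][j] = (point[j] - centers[i][j]) ** 2."""
    return [[(x - c) ** 2 for x, c in zip(point, center)] for center in centers]


def apply_zero_distance_rule(d2, memberships):
    """Set membership to 1/0 for dimensions lying exactly on a cluster center.

    Dimensions with a strictly positive distance to every center are returned
    unchanged here; their update (equation 6) is outside this sketch.
    """
    result = [row[:] for row in memberships]
    num_clusters, num_dims = len(d2), len(d2[0])
    for j in range(num_dims):
        hits = [i for i in range(num_clusters) if d2[i][j] == 0]
        if hits:
            winner = hits[0]  # assumption: the first matching cluster wins
            for i in range(num_clusters):
                result[i][j] = 1.0 if i == winner else 0.0
    return result


point = [2.0, 5.0]
centers = [[2.0, 1.0], [4.0, 8.0]]
d2 = squared_distances(point, centers)
updated = apply_zero_distance_rule(d2, [[0.6, 0.3], [0.4, 0.7]])
```

In this example the first dimension of the point coincides with the first cluster center, so only that dimension is forced to 1 for cluster 0 and 0 for cluster 1.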
[0016] In an implementation, once the cluster centers are computed and the memberships are modified, the fuzziness control parameter (m) is updated to stabilize the cluster centers. The stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) that has the minimum effect on the change of cluster centers due to the change in membership values. In one implementation, the partial derivative of each dimension of the cluster centers may be taken with respect to the membership degree of each dimension of each of the plurality of data points and then set to zero. In said implementation, the fuzziness control parameter (m) may be updated using an update factor weight (α). For instance, the value of the update factor weight (α) may be 0.95. The fuzziness control parameter (m) may be updated based on a weighted sum of the initial or previous fuzziness control parameter (m) and the modified fuzziness control parameter (m). The computation of the initial or previous fuzziness control parameter (m) and the modified fuzziness control parameter (m) is explained later in detail, in the forthcoming description.
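
The weighted-sum update of the fuzziness control parameter can be sketched as below. Two assumptions are made and should be read as such: the update factor weight (named `alpha` here) is applied to the previous value and its complement to the modified value, and the modified value is picked from a list of candidates using a caller-supplied score. The score function in the example is a toy stand-in, not the accumulated partial derivative of the cluster centers.

```python
def update_fuzziness(m_previous, m_modified, alpha=0.95):
    """Blend the previous and modified values of m as a weighted sum.

    Assumption: alpha weights the previous value and (1 - alpha) the
    modified value; the text only states that a weighted sum is used.
    """
    return alpha * m_previous + (1.0 - alpha) * m_modified


def pick_modified_m(candidates, accumulated_abs_derivative):
    """Return the candidate m with the smallest accumulated score."""
    return min(candidates, key=accumulated_abs_derivative)


def toy_score(m):
    # Hypothetical score used only to make the example runnable.
    return abs(m - 2.4)


m_modified = pick_modified_m([1.5, 2.0, 2.5, 3.0], toy_score)
m_next = update_fuzziness(2.0, m_modified)
```

With a weight close to 1, the parameter moves only slightly per iteration (here from 2.0 to roughly 2.025), which is consistent with the stated goal of stabilizing the cluster centers.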
[0017] According to an implementation, the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated in a plurality of iterations. In one implementation, the plurality of iterations terminates when the sum of absolute differences between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε). As an example, the value of the pre-defined limit (ε) may be 0.01. In another implementation, the number of iterations is predefined, and the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.
[0018] Thereafter, a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters is performed. To perform the hard assignment, a point cluster index is identified for each of the plurality of data points. The point cluster index for a data point may be understood as the index which has the maximum number of 1s in a binary rank matrix for each dimension of the data point. The binary rank matrix is indicative of the membership representation of each dimension of a data point. The membership may be represented in binary notation, i.e., either as 1s or as 0s. In an implementation, if more than one cluster index has an equal number of 1s, then the hard assignment is done based on a membership rank matrix for each of the plurality of data points. A membership rank matrix is indicative of the average membership of the dimensions of a data point for which the binary rank matrix entry is equal to 1.
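
The hard assignment with its tie-break can be sketched for one data point as follows. The sketch assumes that the binary rank matrix marks, for each dimension, the cluster with the highest membership with a 1 and all other clusters with a 0; the text does not spell out how the matrix is built, so that construction, the function names, and the layout `memberships[i][j]` are assumptions.

```python
def binary_rank_matrix(memberships):
    """Return b[i][j] = 1 where cluster i has the top membership for dim j."""
    num_clusters, num_dims = len(memberships), len(memberships[0])
    b = [[0] * num_dims for _ in range(num_clusters)]
    for j in range(num_dims):
        best = max(range(num_clusters), key=lambda i: memberships[i][j])
        b[best][j] = 1
    return b


def point_cluster_index(memberships):
    """Pick the cluster with the most 1s; break ties by average membership."""
    b = binary_rank_matrix(memberships)
    counts = [sum(row) for row in b]
    top = max(counts)
    tied = [i for i, count in enumerate(counts) if count == top]
    if len(tied) == 1:
        return tied[0]

    def average_membership(i):
        # Average only over dimensions whose binary rank entry is 1.
        values = [memberships[i][j] for j in range(len(b[i])) if b[i][j] == 1]
        return sum(values) / len(values)

    return max(tied, key=average_membership)


clear_case = point_cluster_index([[0.9, 0.2, 0.3], [0.1, 0.8, 0.7]])
tied_case = point_cluster_index([[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.6, 0.7]])
```

In the tied example both clusters win two dimensions each, so the decision falls to the average membership over the winning dimensions (0.85 against 0.65).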
[0019] According to an implementation, the hard assignment to assign each of the dimensions of plurality of data points to cluster centers of the plurality of clusters may also be performed based on identifying a dimension cluster index for each dimension of each of the plurality of data points. The dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
[0020] Thereafter, a measurement metric is determined for each dimension of each of the plurality of data points. The measurement metric may be understood as a goodness measure for each dimension of a data point. In an implementation, the measurement metric can be used for performing dimensionality reduction. Dimensionality reduction can be performed by tracking the dimensions which follow data points well as compared to other dimensions. If the value of goodness measure is high then a dimension follows the data points well indicating the significance of the dimension in the process of clustering. For instance, if value of goodness measure is high then the significance of the dimension for the process of clustering is also very high. Thus, dimensionality reduction can be performed by selecting a set of dimensions that have higher values of goodness measure. The set of dimensions that have higher goodness measure can be used for clustering the n-dimensional data.
[0021] According to an implementation, the measurement metric for each dimension of each of the plurality of data points may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points. The measurement metric may be interchangeably referred to as a goodness measurement metric. The measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. Otherwise, the measurement metric is equal to 0.
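
The goodness measure and its use for dimensionality reduction can be sketched as follows. The comparison rule (1 when the indices match, 0 otherwise) comes from the description above; the function names, the list layouts, and the `keep` parameter that fixes how many dimensions are retained are hypothetical choices made for illustration.

```python
def goodness(point_idx, dim_idx):
    """Return g[k][j] = 1 if dim_idx[k][j] equals point_idx[k], else 0."""
    return [
        [1 if d == p else 0 for d in dims]
        for p, dims in zip(point_idx, dim_idx)
    ]


def dimension_significance(g):
    """Accumulate the goodness of each dimension over all data points."""
    return [sum(column) for column in zip(*g)]


def select_dimensions(significance, keep):
    """Return the indices of the `keep` most significant dimensions."""
    ranked = sorted(
        range(len(significance)), key=lambda j: significance[j], reverse=True
    )
    return sorted(ranked[:keep])


point_idx = [0, 1, 1, 2]                               # one index per point
dim_idx = [[0, 0, 1], [1, 1, 0], [1, 2, 1], [2, 2, 0]]  # one index per dimension
g = goodness(point_idx, dim_idx)
significance = dimension_significance(g)
selected = select_dimensions(significance, keep=2)
```

Here the first dimension agrees with the point-level assignment for every data point, so it ranks as the most significant, while the third dimension agrees only once and is dropped.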
[0022] According to the present subject matter, each dimension of the data points independently contributes to the process of determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the corresponding dimension of the cluster center is significantly minimized and, as a result, the accuracy of clustering of input data improves significantly. Further, the cluster assignment of a data point, made by considering the highest aggregated membership for that cluster, can also be ascertained from the set of dimensions having higher membership in that cluster. Although the independent contribution of the dimensions to determining the cluster center results in a fluctuation in the cluster centers, the fluctuation can be stabilized by membership aggregation and by adapting the fuzziness control parameter (m) towards its convergence. Therefore, the clustering of the data is performed reliably and accurately.
[0023] Figure 1a illustrates a network environment 100 implementing a clustering system 102, in accordance with an embodiment of the present subject matter.
[0024] In one implementation, the network environment 100 can be a public network environment, including thousands of personal computers, laptops, various servers, such as blade servers, and other computing devices. In another implementation, the network environment 100 can be a private network environment with a limited number of computing devices, such as personal computers, servers, and laptops.
[0025] The clustering system 102 may be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. Further, it will be understood that the clustering system 102 is connected to a plurality of user devices 104-1, 104-2, 104-3..., and 104-N, collectively referred to as user devices 104 and individually referred to as a user device 104. As shown in Figure 1a, the user devices 104 are communicatively coupled to the clustering system 102 over a network 106 through one or more communication links for facilitating one or more end users to access and operate the clustering system 102. The user device 104 may include, but is not limited to, a desktop computer, a portable computer, a handheld computing device, and a workstation.
[0026] In one implementation, the network 106 may be a wireless network, a wired network, or a combination thereof. The network 106 may also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet. The network 106 may be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and such. The network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other. Further, the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
[0027] The network environment 100 further comprises a database 108 communicatively coupled to the clustering system 102. The database 108 may store multi-dimensional data. The data may be representative of multimedia data, financial transactions and the like. According to an implementation, the data is represented as a plurality of data points in a multi-dimensional space, say n-dimensional space. Although the database 108 is shown external to the clustering system 102, it will be appreciated by a person skilled in the art that the database 108 can also be implemented internal to the clustering system 102, where the multi-dimensional data may be stored within a memory component of the clustering system 102.
[0028] According to an implementation, the clustering system 102 includes processor(s) 110, interface(s) 112, and memory 114 coupled to the processor(s) 110. The processor(s) 110 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor(s) 110 may be configured to fetch and execute computer-readable instructions stored in the memory 114.
[0029] The memory 114 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM), and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
[0030] Further, the interface(s) 112 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a product board, a mouse, an external memory, and a printer. Additionally, the interface(s) 112 may enable the clustering system 102 to communicate with other devices, such as web servers and external repositories. The interface(s) 112 may also facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. For the purpose, the interface(s) 112 may include one or more ports.
[0031] The clustering system 102 also includes module(s) 116 and data 118. The module(s) 116 include, for example, an assignment module 120, a modification module 122, an identification module 124, and a determination module 126, and other module(s) 128. The other module(s) 128 may include programs or coded instructions that supplement applications or functions performed by the clustering system 102. The data 118 may be membership data 130, index data 132, and other data 134. The other data 134, amongst other things, may serve as a repository for storing data that is processed, received, or generated as a result of the execution of one or more modules in the module(s) 116.
[0032] According to an implementation, the assignment module 120 of the clustering system 102 may retrieve the multi-dimensional data from the database 108. For instance, the multi-dimensional data may be an n-dimensional data. As indicated earlier, the multi-dimensional data may be represented by a plurality of data points in a multi-dimensional space, say n-dimensional space. Further, each of the plurality of data points may include a plurality of dimensions or components. The n-dimensional data is mathematically represented by the expression provided below:
[Equation image imgf000012_0001: expression for the n-dimensional data X]
.... (1)
[0033] In the above expression, ( X ) represents the n-dimensional data of size N. In one example, the multi-dimensional data may be an image and pixels of the image may be the plurality of data points. In said example, the components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value) may be the dimensions.
[0034] In an example, the clustering system 102 may partition a two- dimensional image or a three-dimensional (3D) image into two or more clusters. In said example, an image of 481 by 321 pixel dimension is taken and is transformed from RGB plane into HSV plane. Each data point includes three components, i.e., Hue (H), Saturation (S) and Value (V). These components are clustered into three clusters based on the HSV value of background, subject skin and dress color of the subject in the image. Therefore, in this case, the total number of data points (N) are 155401 (481 x321), the total number of dimensions (n) are 3, and the total number of clusters (c) are 3. The performance of this segmentation is justified in terms of number of points originally belonging to the subjects are misclassified as the background. It is experimentally established that the misclassification error produced by traditional clustering system is 16.82 %, whereas the same for the present clustering system 102 is 10.34 %. Therefore the present clustering system 102 outperforms the traditional clustering system by minimizing the misclassification error. Though the experimental verification of the present clustering system 102 is validated using 2D image data, same can be applied in a more generic image segmentation problems including segmentation of 3D image cloud points. Figure lb depicts the original image (140), the clustered outputs generated by the traditional clustering system and for the present clustering system 102.
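
The figures in this example can be checked with a short calculation. The product 481 × 321 evaluates to 154401, and the two reported error rates imply an absolute and a relative reduction; the variable names below are illustrative, and the relative reduction is a derived quantity that the description itself does not state.

```python
width, height = 481, 321
num_points = width * height  # N, the number of pixels treated as data points

traditional_error = 16.82  # misclassification error in percent, as reported
present_error = 10.34      # misclassification error in percent, as reported

# Difference in percentage points between the two systems.
absolute_reduction = traditional_error - present_error
# Share of the traditional system's error that is removed.
relative_reduction = absolute_reduction / traditional_error
```

The error rates are defined over the points that belong to the subjects, so they are not multiplied by N here; the calculation only compares the two reported percentages, giving a reduction of about 6.48 percentage points, or roughly 38.5 % of the traditional system's error.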
[0035] Thereafter, the assignment module 120 may assign a membership to each dimension of each of the plurality of data points for a plurality of clusters. The membership assigned to each dimension may be interchangeably referred to as the initial membership. In an implementation, the number of clusters may be pre-defined depending upon the application. A membership assigned to a dimension of a data point may be understood as the strength of association between the dimension of the data point and a particular cluster. In other words, the membership assigned to a dimension of a data point may be understood as the degree of belonging of the dimension to a particular cluster. Each dimension may belong to several clusters simultaneously, with different degrees of membership.
[0036] According to an implementation, the assignment module 120 may initialize the membership of each dimension using a random number ranging between 0 and 1, indicating partial membership. The membership assigned to the dimensions is mathematically represented by the expression provided below:
$\mu_{A_i}(x_k^j),\quad 1 \le j \le n,\ 1 \le k \le N,\ 1 \le i \le c$
.... (2)
[0037] In the above expression, $x_k^j$ denotes the jth dimension of the kth data point and $\mu_{A_i}(x_k^j)$ denotes the membership of $x_k^j$ to belong to the ith cluster. Further, (N) is the size of the n-dimensional data and (c) is the number of clusters for the n-dimensional data.
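The random initialization described above can be sketched as follows. The function name `init_memberships`, the nested-list layout `mu[i][k][j]` and the optional `seed` argument are assumptions made for illustration. No normalization across clusters is applied, since the text only calls for random numbers between 0 and 1.

```python
import random


def init_memberships(N, n, c, seed=None):
    """Return mu[i][k][j]: membership of dimension j of data point k in cluster i."""
    rng = random.Random(seed)
    # rng.random() draws a float in [0.0, 1.0), i.e. a partial membership.
    return [[[rng.random() for _ in range(n)] for _ in range(N)]
            for _ in range(c)]


# Example: 4 data points, 3 dimensions each, 2 clusters.
mu = init_memberships(N=4, n=3, c=2, seed=0)
```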
[0038] In one implementation, the memberships assigned to the dimensions by the assignment module 120 may be stored as the membership data 130 within the clustering system 102.
[0039] Further, the assignment module 120 aggregates the memberships assigned to each dimension of each of the plurality of data points. In one implementation, the assignment module 120 may aggregate the memberships using a fuzziness control parameter (m). The fuzziness control parameter (m) determines the level of fuzziness in a cluster. A large value of the fuzziness control parameter (m) results in smaller memberships and hence, fuzzier clusters. In an example, the value of the fuzziness control parameter (m) may be 2. According to one implementation, the assignment module 120 aggregates the memberships using equation (3) provided below:
$\mu_{A_i}^{agg}(\vec{x}_k) = \frac{1}{n}\sum_{j=1}^{n}\left[\mu_{A_i}(x_k^j)\right]^m,\quad 1 \le k \le N,\ 1 \le i \le c$
.... (3)
where,
$\mu_{A_i}^{agg}(\vec{x}_k)$ represents the aggregated membership of $\vec{x}_k$ to belong to the ith cluster,
$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the ith cluster; and
m represents the fuzziness control parameter.
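A minimal sketch of the aggregation step is given below. It assumes that equation (3) averages the m-th powers of the per-dimension memberships over the n dimensions of a data point. The function name `aggregate_memberships` and the `mu[i][k][j]` layout are illustrative assumptions.

```python
def aggregate_memberships(mu, m=2.0):
    """Return agg[i][k] = (1/n) * sum over j of mu[i][k][j] ** m."""
    return [[sum(v ** m for v in dims) / len(dims) for dims in cluster]
            for cluster in mu]


# Two clusters, two data points, two dimensions per point.
mu = [[[0.5, 1.0], [0.2, 0.4]],
      [[0.5, 0.0], [0.8, 0.6]]]
agg = aggregate_memberships(mu, m=2.0)
```

Because every membership lies between 0 and 1, raising it to a larger power m shrinks it, which is consistent with the statement that a large m yields smaller memberships.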
[0040] Based on the aggregated memberships, the modification module 122 computes a cluster center of each of the plurality of clusters. A cluster center of a cluster may be understood as the average of all data points in the cluster. According to an implementation, the modification module 122 computes a cluster center using equation (4) provided below:
$v_i^j = \frac{\sum_{k=1}^{N}\mu_{A_i}^{agg}(\vec{x}_k)\,x_k^j}{\sum_{k=1}^{N}\mu_{A_i}^{agg}(\vec{x}_k)},\quad 1 \le j \le n,\ 1 \le i \le c$
.... (4)
where,
$v_i^j$ represents the jth dimension of the ith cluster center,
$x_k^j$ represents the jth dimension of the kth data point, and
$\mu_{A_i}^{agg}(\vec{x}_k)$ represents the aggregated membership of $\vec{x}_k$ to belong to the ith cluster.
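The cluster center computation can be sketched as a weighted mean, assuming that equation (4) weights each data point by its aggregated membership. The name `cluster_centers` and the layouts `points[k][j]` and `agg[i][k]` are assumptions for illustration. The sketch also assumes that the aggregated memberships of each cluster sum to a positive value.

```python
def cluster_centers(points, agg):
    """Return centers[i][j], the aggregated-membership-weighted mean of dimension j."""
    n = len(points[0])
    centers = []
    for weights in agg:  # one list of N weights per cluster
        total = sum(weights)  # assumed to be positive
        centers.append([sum(w * p[j] for w, p in zip(weights, points)) / total
                        for j in range(n)])
    return centers


points = [[0.0, 0.0], [2.0, 4.0]]
agg = [[1.0, 1.0], [3.0, 1.0]]
centers = cluster_centers(points, agg)
```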
[0041] In an implementation, the modification module 122 may initially compute the cluster center of each of the plurality of clusters using equation (5) provided below and then compute the new cluster center of each of the plurality of clusters using equation (4) provided above:
Figure imgf000015_0001
.... (5)
where,
$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the ith cluster; and
m represents the fuzziness control parameter.
[0042] The modification module 122 may then calculate the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points. The square of the distance calculated between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points is mathematically represented by the expression provided below:
$(x_k^j - v_i^j)^2,\quad 1 \le j \le n,\ 1 \le k \le N,\ 1 \le i \le c$
.... (6)
[0043] In the above expression, $(x_k^j - v_i^j)^2$ denotes the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points, $x_k^j$ denotes the jth dimension of the kth data point and $v_i^j$ denotes the jth dimension of the ith cluster center.
[0044] Further, the modification module 122 determines a modified membership for each dimension of each of the plurality of data points. In an implementation, the modified membership is determined based on modifying the initial membership assigned to each dimension based on the cluster center of each of the plurality of clusters. In said implementation, the modification module 122 determines the modified membership based on equation 7 (provided below).
[0045] According to an implementation, the modification module 122 modifies the membership based on the calculation of the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points. For instance, if $(x_k^j - v_i^j)^2 > 0$ then the membership assigned to each dimension is modified. In another instance, if $(x_k^j - v_i^j)^2 = 0$ then the membership assigned to each dimension of each of the plurality of data points is set to 1 for the corresponding cluster and set to zero for the remaining clusters.
[0046] According to an implementation, the modification module 122 modifies the memberships using equation (7) provided below:
Figure imgf000016_0001
.... (7)
where,
$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the ith cluster,
$(x_k^j - v_i^j)^2$ represents the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points, and
m represents the fuzziness control parameter.
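The membership modification can be sketched as follows. Equation (7) is not legible in this text, so the sketch assumes the form described in words in the claims: the squared distance raised to the power -1/(m-1), normalized over all cluster centers. The name `update_memberships` and the data layouts are hypothetical. Giving full membership to the first cluster when several centers coincide with a data point in a dimension is also an assumption.

```python
def update_memberships(points, centers, m=2.0):
    """Return mu[i][k][j] from per-dimension squared distances to the centers."""
    c, N, n = len(centers), len(points), len(points[0])
    power = -1.0 / (m - 1.0)
    mu = [[[0.0] * n for _ in range(N)] for _ in range(c)]
    for k in range(N):
        for j in range(n):
            d2 = [(points[k][j] - centers[i][j]) ** 2 for i in range(c)]
            zeros = [i for i, d in enumerate(d2) if d == 0.0]
            if zeros:
                # Zero distance: membership 1 for that cluster, 0 for the rest.
                mu[zeros[0]][k][j] = 1.0
                continue
            inv = [d ** power for d in d2]
            total = sum(inv)
            for i in range(c):
                mu[i][k][j] = inv[i] / total
    return mu


points = [[1.0], [0.0]]
centers = [[0.0], [3.0]]
mu = update_memberships(points, centers, m=2.0)
```

For the first data point the squared distances are 1 and 4, so the memberships are 0.8 and 0.2. The second data point coincides with the first center and receives memberships 1 and 0.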
[0047] In an implementation, the modification module 122 computes the cluster center of each of the plurality of clusters and modifies the memberships using equation (8) provided below:
Figure imgf000016_0002
.... (8)
where,
$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the ith cluster;
$(x_k^j - v_i^j)^2$ represents the square of the distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points,
$\lambda$ represents Lagrange's multiplier, and
m represents the fuzziness control parameter.
[0048] In said implementation, partial derivative of equation (8) is taken with respect to memberships, cluster centers, and Lagrange's multiplier to obtain equation (5) and equation (7).
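The derivation mentioned above can be illustrated with the following sketch. Equations (5), (7) and (8) are reproduced only as images, so the objective $J$ below is an assumption: it takes the standard fuzzy c-means form applied to each dimension separately, with one multiplier $\lambda_k^j$ per dimension of each data point enforcing that the memberships over the clusters sum to 1. The resulting membership expression agrees with the wording of the claims (squared distance raised to the power $-1/(m-1)$, normalized over all cluster centers).

```latex
% Assumed objective (a plausible form of equation (8)):
J = \sum_{i=1}^{c}\sum_{k=1}^{N}\sum_{j=1}^{n}
      \left[\mu_{A_i}(x_k^j)\right]^{m}\,(x_k^j - v_i^j)^2
    - \sum_{k=1}^{N}\sum_{j=1}^{n}\lambda_k^j
      \left(\sum_{i=1}^{c}\mu_{A_i}(x_k^j) - 1\right)

% Setting the derivative with respect to the cluster center to zero
% gives a per-dimension center (a plausible form of equation (5)):
\frac{\partial J}{\partial v_i^j} = 0
\;\Rightarrow\;
v_i^j = \frac{\sum_{k=1}^{N}\left[\mu_{A_i}(x_k^j)\right]^{m} x_k^j}
             {\sum_{k=1}^{N}\left[\mu_{A_i}(x_k^j)\right]^{m}}

% Setting the derivative with respect to the membership to zero:
\frac{\partial J}{\partial \mu_{A_i}(x_k^j)} = 0
\;\Rightarrow\;
m\left[\mu_{A_i}(x_k^j)\right]^{m-1}(x_k^j - v_i^j)^2 = \lambda_k^j

% Eliminating \lambda_k^j with the constraint \sum_i \mu_{A_i}(x_k^j) = 1
% gives the membership update (a plausible form of equation (7)):
\mu_{A_i}(x_k^j) =
  \frac{\left[(x_k^j - v_i^j)^2\right]^{-\frac{1}{m-1}}}
       {\sum_{l=1}^{c}\left[(x_k^j - v_l^j)^2\right]^{-\frac{1}{m-1}}}
```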
[0049] Since dimensions of each data point independently contribute to determination of cluster centers of the plurality of clusters it may cause fluctuation in the cluster centers. To stabilize the fluctuation in the cluster centers, the modification module 122 aggregates the memberships or membership values and adapts the fuzziness control parameter (m) towards its convergence, i.e., the fuzziness control parameter (m) is placed in the less sensitive region of the cluster centers.
[0050] In one implementation, the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers. The stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) which has the minimum effect in the change of cluster centers due to the change in membership values.
[0051] According to an implementation, the modification module 122 updates the fuzziness control parameter (m) based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m). In said implementation, the modification module 122 takes partial derivative of each dimension of the cluster centers with respect to membership or membership degree of each dimension of each of the plurality of data points and then it may be set to zero.
[0052] The modification module 122 selects the value of the fuzziness control parameter (m) such that it gives minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points. In an implementation, the modification module 122 may update the fuzziness control parameter (m) using an update factor weight (a). For instance, the value of the update factor weight (a) may be 0.95.
[0053] In an implementation, the modification module 122 updates the fuzziness control parameter (m) using equation (9) and (10) provided below:
[0054]
$m_{modified} = \arg\min_{m}\left|\sum_{i=1}^{c}\sum_{j=1}^{n}\sum_{k=1}^{N}\frac{\partial v_i^j}{\partial \mu_{A_i}(x_k^j)}\right|,\quad 1.1 < m < 2.5$
.... (9)
$m_{new} = a \times m + (1 - a) \times m_{modified}$
.... (10)
where,
$\mu_{A_i}(x_k^j)$ represents the membership of $x_k^j$ to belong to the ith cluster;
m represents the initial or previous fuzziness control parameter (m);
$m_{modified}$ represents the modified fuzziness control parameter (m), where $m_{modified}$ is calculated using equation (9),
a is the weight factor, and
$m_{new}$ represents the updated fuzziness control parameter (m).
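The update of the fuzziness control parameter can be sketched as follows. The names `blend_m` and `search_m` are hypothetical. The `score` argument is an assumed caller-supplied function that returns the partial derivative accumulated over all clusters, dimensions and data points for a candidate m; how that derivative is evaluated is not shown here. The grid search over the open interval (1.1, 2.5) is one simple way to approximate the arg min and is an assumption, not a method stated in the text.

```python
def blend_m(m_prev, m_modified, a=0.95):
    """Weighted sum of the previous and the modified fuzziness parameter."""
    return a * m_prev + (1.0 - a) * m_modified


def search_m(score, lo=1.1, hi=2.5, steps=13):
    """Return the grid value of m strictly inside (lo, hi) minimising abs(score(m))."""
    grid = [lo + (hi - lo) * t / (steps + 1) for t in range(1, steps + 1)]
    return min(grid, key=lambda m: abs(score(m)))


# Toy score whose absolute value is smallest near m = 1.8.
m_modified = search_m(lambda m: m - 1.8)
m_new = blend_m(2.0, m_modified, a=0.95)
```

With a = 0.95 the updated value stays close to the previous m, so the parameter changes slowly from one iteration to the next.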
[0055] According to an implementation, the modification module 122 may update the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) in a plurality of iterations until a sum of absolute difference between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε). For example, the value of the pre-defined limit (ε) may be 0.01. In an implementation, the number of iterations is predefined, and in said implementation, the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.
[0056] Thereafter, the identification module 124 of the clustering system 102 calculates a binary rank matrix for each dimension of each of the plurality of data points. The binary rank matrix is indicative of membership representation of each dimension of a data point. The membership may be represented in binary notation, i.e., either as 1s or as 0s. In an implementation, the matrix dimension of a binary rank matrix is equal to ratio of total number of clusters to total number of dimensions of a data point. The identification module 124 may assign a value of 1 to that cluster which corresponds to the maximum value of membership, and all other clusters are assigned a value of 0. Further, the identification module 124 computes a membership rank matrix for each of the plurality of data points. The membership rank matrix may be indicative of the average membership of the dimensions of a data point for which the binary rank matrix entry is equal to 1.
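The two matrices can be sketched for a single data point as follows. The function names `binary_rank` and `membership_rank` are hypothetical, and the input layout `mu_k[i][j]` (cluster i, dimension j of one data point) is an assumption. Two further assumptions are labeled in the comments: ties between clusters go to the lowest cluster index, and a cluster with no binary rank entry equal to 1 receives an average of 0.

```python
def binary_rank(mu_k):
    """Return M[i][j]: 1 for the cluster with the largest membership in dimension j."""
    c, n = len(mu_k), len(mu_k[0])
    M = [[0] * n for _ in range(c)]
    for j in range(n):
        # Assumption: max() keeps the lowest cluster index when memberships tie.
        best = max(range(c), key=lambda i: mu_k[i][j])
        M[best][j] = 1
    return M


def membership_rank(mu_k, M):
    """Return U[i]: mean membership over the dimensions whose entry M[i][j] is 1."""
    U = []
    for row_mu, row_M in zip(mu_k, M):
        picked = [u for u, flag in zip(row_mu, row_M) if flag == 1]
        # Assumption: a cluster that wins no dimension gets an average of 0.
        U.append(sum(picked) / len(picked) if picked else 0.0)
    return U


mu_k = [[0.7, 0.2, 0.6],
        [0.3, 0.8, 0.4]]
M = binary_rank(mu_k)
U = membership_rank(mu_k, M)
```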
[0057] Based on the binary rank matrix and the membership rank matrix, the identification module 124 performs a hard assignment of each of the dimensions of each of the plurality of data points to the cluster centers of the plurality of clusters. To perform the hard assignment, the identification module 124 identifies a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each of the data points to cluster centers of the clusters. The point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point. The point cluster index for the data point is mathematically represented by the expression provided below:
$C_k^{data} = \arg\max_{i}\left(\sum_{j=1}^{n} M_{ij}(\vec{x}_k)\right),\quad 1 \le k \le N,\ 1 \le i \le c$
.... (11)
[0058] In the above expression, $C_k^{data}$ denotes the point cluster index of the kth data point, which has the maximum number of 1s in the binary rank matrix, and $M_{ij}(\vec{x}_k)$ denotes the binary rank matrix.
[0059] In an implementation, if more than one cluster index has an equal number of 1s or has the same value of the binary rank matrix, then the hard assignment is done based on the membership rank matrix for each of the plurality of data points. The point cluster index for this case is mathematically represented by the expression provided below:
$C_k^{data} = \arg\max_{i}\left(U_{A_i}(\vec{x}_k)\right),\quad 1 \le k \le N,\ 1 \le i \le c$
.... (12)
[0060] In the above expression, $U_{A_i}(\vec{x}_k)$ denotes the membership rank matrix.
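The point cluster index, including the tie case, can be sketched as follows. The name `point_cluster_index` is hypothetical and cluster indices are 0-based here. Restricting the membership-rank comparison to the tied clusters is an assumption about how the tie case is meant to be applied.

```python
def point_cluster_index(M, U):
    """Return the cluster with the most 1s in M; break ties with the larger U."""
    counts = [sum(row) for row in M]  # number of 1s per cluster
    top = max(counts)
    tied = [i for i, cnt in enumerate(counts) if cnt == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: fall back to the membership rank matrix (assumed: tied clusters only).
    return max(tied, key=lambda i: U[i])


clear = point_cluster_index([[1, 0, 1], [0, 1, 0]], [0.65, 0.8])
tie = point_cluster_index([[1, 0], [0, 1]], [0.6, 0.9])
```

In the first call cluster 0 holds two of the three 1s and is selected directly. In the second call both clusters hold one 1 each, so the larger membership rank selects cluster 1.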
[0061] According to an implementation, the identification module 124 also performs hard assignment to assign each of the plurality of data points to cluster centers of the plurality of clusters based on identifying a dimension cluster index for each dimension of each of the plurality of data points. The dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
[0062] In an implementation, the dimension cluster index for a dimension of a data point is mathematically represented by the expression provided below:
$C_k^j = \arg_{i}\left(M_{ij}(\vec{x}_k) = 1\right),\quad 1 \le k \le N,\ 1 \le i \le c,\ 1 \le j \le n$
.... (13)
[0063] In the above expression, $C_k^j$ denotes the dimension cluster index of the jth dimension of the kth data point and $M_{ij}(\vec{x}_k)$ denotes the binary rank matrix. In one implementation, the point cluster index and the dimension cluster index identified by the identification module 124 may be stored as the index data 132 within the clustering system 102.
[0064] Thereafter, the determination module 126 determines a measurement metric for each dimension of each of the plurality of data points. The measurement metric may be interchangeably referred to as a goodness measurement metric. The measurement metric may be understood as a goodness measure ($G_k^j$) for each dimension of a data point. If the value of the goodness measure ($G_k^j$) is high, then a dimension follows the data points well, indicating the significance of the dimension in the process of clustering. The value of the measurement metric for each dimension accumulated over all the data points provides the measure of significance of the dimension. For instance, if the value of the measurement metric is high, then the measure of significance of the dimension may also be high. In one implementation, a set of dimensions that have a higher goodness measure ($G_k^j$) may be selected to be used for clustering the n-dimensional data.
[0065] According to an implementation, the determination module 126 may determine the measurement metric for each dimension of each of the plurality of data points based on comparison of the point cluster index and the dimension cluster index. The measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. If the point cluster index and the dimension cluster index are not equal, then the measurement metric is equal to 0.
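The comparison described above can be sketched as follows. The names `dimension_cluster_index`, `goodness` and `dimension_significance` are hypothetical, and cluster indices are 0-based. The binary rank matrix `M[i][j]` of a data point is assumed to contain exactly one 1 per dimension.

```python
def dimension_cluster_index(M, j):
    """Return the cluster i whose binary rank entry M[i][j] equals 1."""
    return next(i for i, row in enumerate(M) if row[j] == 1)


def goodness(M, point_index):
    """Return one value per dimension: 1 if it agrees with the point cluster index."""
    n = len(M[0])
    return [1 if dimension_cluster_index(M, j) == point_index else 0
            for j in range(n)]


def dimension_significance(goodness_rows):
    """Accumulate the goodness values of each dimension over all data points."""
    return [sum(col) for col in zip(*goodness_rows)]


g = goodness([[1, 0, 1], [0, 1, 0]], point_index=0)
significance = dimension_significance([[1, 0, 1], [1, 1, 0]])
```

A dimension with a larger accumulated value agrees with the cluster of its data point more often, which is the sense in which it is treated as more significant.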
[0066] According to the present subject matter, each dimension of the data points independently contributes to the process of determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the cluster center of that dimension is significantly minimized and, as a result, the accuracy of clustering of input data improves significantly. Further, the cluster assignment of a data point, made by considering the highest aggregated membership for a cluster, can also be ascertained from the set of dimensions having higher membership to belong to that cluster. Although the independent contribution of the dimensions to the determination of the cluster centers results in a fluctuation in the cluster centers, the fluctuation can be stabilized by membership aggregation and by adapting the fuzziness control parameter (m) towards its convergence. Therefore, the clustering of the data is performed reliably and accurately.
[0067] Figure 1b illustrates a comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.
[0068] As shown in figure 1b, image 140 is an original image that is to be segmented. Further, image 142 is the segmented image that is obtained as a result of image segmentation performed by the conventional clustering system, and image 144 is the segmented image that is obtained as a result of the image segmentation performed by the present clustering system 102, i.e., the clustering system described in accordance with the present subject matter. The performance of the segmentation process is measured in terms of the number of data points originally belonging to the subjects that are misclassified as the background. As can be seen in figure 1b, the present clustering system 102 outperforms the conventional clustering system by reducing the misclassification error.
[0069] It is to be noted that the original image 140 as depicted in figure 1b for the image segmentation experiment has been taken from the source "P. Arbelaez, M. Maire, C. Fowlkes and J. Malik, "Contour Detection and Hierarchical Image Segmentation", IEEE TPAMI, Vol. 33, No. 5, pp. 898-916, May 2011."
[0070] Figures 2a and 2b illustrate a method 200 for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter. The method 200 is implemented in computing device, such as a clustering system 102. The method 200 may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network.
[0071] The order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method. Furthermore, the method 200 can be implemented in any suitable hardware, software, firmware or combination thereof.
[0072] At block 202, the method 200 includes obtaining multidimensional data, where the multi-dimensional data includes a plurality of data points. In an example, the multi-dimensional data may be an n-dimensional data and the multi-dimensional data may be represented by a plurality of data points. Further, each of the plurality of data-points may include a plurality of dimensions. In one example, the multi-dimensional data may be multimedia data, say an image. Thus, the data points of the image can be the pixels and components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value) may be the dimensions. In an implementation, the assignment module 120 of the clustering system 102 may obtain the multi-dimensional data from the database 108.
[0073] At block 204, the method 200 includes assigning initial memberships to each dimension of each of the plurality of data points for a plurality of clusters. In an implementation, the number of clusters may be pre-defined. A membership assigned to a dimension of a data point may be understood as the strength of association between the dimension of the data point and a particular cluster. In one implementation, the memberships can be assigned in a random fashion using a random number between 0 and 1. According to an implementation, the assignment module 120 of the clustering system 102 assigns a membership to each dimension of each of the plurality of data points for the plurality of clusters.
[0074] At block 206, the method 200 includes aggregating the initial memberships assigned to the dimensions of the plurality of data points. The memberships may be aggregated using the fuzziness control parameter (m). The fuzziness control parameter (m) determines the level of fuzziness in a cluster. In an example, the value of the fuzziness control parameter (m) may be 2. According to one implementation, the assignment module 120 aggregates the memberships based on the equation (3) described in the previous section.
[0075] At block 208, the method 200 includes computing a cluster center of each of the plurality of clusters based on the aggregated memberships. A cluster center of a cluster may be understood as average of all data points in the cluster. According to an implementation, the modification module 122 computes a cluster center based on the equation (4) described in the previous section.
[0076] At block 210, the method 200 includes calculating square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points. According to an implementation, the modification module 122 calculates a square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points.
[0077] At block 212, the method 200 includes modifying the initial memberships assigned to the dimensions of each of the plurality of data points. For instance, if the square of the distance between each dimension of the cluster center and each dimension of each of the plurality of data points is greater than 0, then the membership assigned to each dimension is modified. In another instance, if the square of the distance between each dimension of the cluster center and each dimension of each of the plurality of data points is equal to 0, then the membership is set to 1 for the corresponding cluster and set to 0 for the rest of the clusters. In an implementation, the modification module 122 modifies the membership assigned to each dimension of each of the plurality of data points based on the equation (7) described in the previous section.
[0078] At block 214, the method 200 includes updating a fuzziness control parameter (m). In one implementation, the fuzziness control parameter (m) may be updated based on a weighted sum of the initial or previous fuzziness control parameter (m) and the modified fuzziness control parameter (m). The modified fuzziness control parameter (m) is computed by taking the partial derivative of each dimension of the cluster centers with respect to the membership or membership degree of each dimension of each of the plurality of data points and then it may be set to zero. The modification module 122 selects the value of the fuzziness control parameter (m) such that it gives the minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points. The values of the cluster centers, and hence the membership values, are updated until the sum of absolute difference between the modified membership and the initial membership is less than a pre-defined limit (ε) or a predefined number of iterations has been exhausted. For instance, the pre-defined limit (ε) may be 0.01. According to an implementation, the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers.
[0079] At block 216, the method 200 includes identifying a point cluster index for each data point and a dimension cluster index for dimensions of each data point. The point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point. The dimension cluster index for a dimension of a data point may be understood as the index for which the binary rank matrix for the dimension of the data point is 1. According to an implementation, the identification module 124 identifies a point cluster index for each data point and a dimension cluster index for dimensions of each data point.
[0080] At block 218, the method 200 includes assigning each of the plurality of data points to cluster centers of the plurality of clusters based on the point cluster index. According to an implementation, the identification module 124 performs a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters using the point cluster index.
[0081] At block 220, the method 200 includes determining a measurement metric for each dimension of each of the plurality of data points. In one implementation, the measurement metric may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points. The measurement metric may be understood as a goodness measure ($G_k^j$) for each dimension of a data point. If the value of the goodness measure ($G_k^j$) is high, then a dimension follows the data points well, indicating that the dimension and the corresponding data point index belong to the same cluster. In one implementation, a set of dimensions that have a higher goodness measure ($G_k^j$) may be selected to be used for clustering the n-dimensional data. The determination module 126 determines the measurement metric for each dimension of each of the plurality of data points.
[0082] The method blocks 206, 208, 210, 212, and 214 described above are repeated in a plurality of iterations. In one implementation, the plurality of iterations terminate when a sum of absolute difference between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε). In another implementation, the plurality of iterations is predefined, and the method blocks 206, 208, 210, 212, and 214 are repeated till the predefined number of iterations is exhausted.
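The two termination conditions can be sketched together as follows. The names `membership_change` and `iterate` are hypothetical. The `step` argument is an assumed callable that performs one pass of blocks 206 to 214 and returns the updated memberships in the `mu[i][k][j]` layout. Combining the limit (ε) with an iteration cap in one loop is a design choice of this sketch.

```python
def membership_change(old, new):
    """Sum of absolute differences between two membership arrays mu[i][k][j]."""
    return sum(abs(a - b)
               for c_old, c_new in zip(old, new)
               for p_old, p_new in zip(c_old, c_new)
               for a, b in zip(p_old, p_new))


def iterate(mu, step, eps=0.01, max_iter=100):
    """Repeat step() until the change drops below eps or max_iter passes are done."""
    for it in range(1, max_iter + 1):
        new_mu = step(mu)
        if membership_change(mu, new_mu) < eps:
            return new_mu, it
        mu = new_mu
    return mu, max_iter


def toy_step(mu):
    """Stand-in for blocks 206-214: move every membership halfway towards 1."""
    return [[[0.5 * (v + 1.0) for v in p] for p in cluster] for cluster in mu]


final, n_iter = iterate([[[0.0]]], toy_step, eps=0.01)
capped, n_capped = iterate([[[0.0]]], toy_step, eps=0.0, max_iter=3)
```

In the toy run the change halves on every pass and first falls below 0.01 on the seventh pass. With eps set to 0 the loop stops only when the predefined number of iterations is exhausted.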
[0083] Although embodiments for methods and systems for clustering multi-dimensional data have been described in a language specific to structural features and/or methods, it is to be understood that the invention is not necessarily limited to the specific features or methods described. Rather, the specific features and methods are disclosed as exemplary embodiments for clustering multi-dimensional data.

Claims

I/We claim:
1. A clustering system (102) to determine significant dimensions for clustering multi-dimensional data, the clustering system (102) comprising:
a processor (110);
an assignment module (120), coupled to the processor (110), the assignment module (120) to:
obtain the multi-dimensional data comprising a plurality of data points, each of the plurality of data points having a plurality of dimensions;
assign initial memberships to each of the plurality of dimensions of each of the plurality of data points of the multidimensional data, for a plurality of clusters; and
aggregate initial memberships assigned to the plurality of dimensions of each of the plurality of data points induced by a fuzziness control parameter (m);
a modification module (122), coupled to the processor (110), the modification module (122) to:
modify the initial memberships assigned to the plurality of dimensions based on a cluster center of each of the plurality of clusters, wherein the cluster center of each of the plurality of clusters is computed based on the aggregated initial memberships; and
update the fuzziness control parameter (m) based on weighted sum of a modified fuzziness control parameter (m), and at least one of an initial fuzziness control parameter (m) and a previous fuzziness control parameter (m); and
a determination module (126), coupled to the processor (110), the determination module (126) to determine a goodness measurement metric for each dimension based on comparison of a point cluster index for each data point, and a dimension cluster index for each dimension of each of the plurality of data points, wherein value of goodness measurement metric is indicative of significance of the each dimension.
2. The clustering system (102) as claimed in claim 1, wherein the initial memberships, the aggregated initial memberships, the cluster centers, and the fuzziness control parameter (m) are updated in a plurality of iterations till a sum of absolute difference between the modified memberships and the initial memberships is less than a pre-defined limit (ε).
3. The clustering system (102) as claimed in claim 2, wherein the plurality of iterations are predefined, and the initial memberships, the aggregated initial memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.
4. The clustering system (102) as claimed in claim 1, wherein the modified fuzziness control parameter (m) is determined based on minimizing sum of derivative of cluster centers with respect to memberships of dimensions of each of the plurality of clusters for the plurality of data points.
5. The clustering system (102) as claimed in claim 1, wherein the modification module (122) further:
calculates square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points followed by a power of a negative of a ratio of 1 and fuzziness control parameter minus 1 (m-1); and
updates the membership values for each dimension using the calculated square of distance and normalization with respect to the square of distance from all the cluster centers.
6. The clustering system (102) as claimed in claim 1, wherein the identification module (124) further:
calculates a binary rank matrix for each dimension of each of the plurality of data points, wherein the binary rank matrix is indicative of membership representation of each dimension of a data point;
computes a membership rank matrix for each of the plurality of data points, wherein the membership rank matrix is indicative of average membership of dimensions of a data point for which binary rank matrix entry is equal to 1; and
identifies a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each of the plurality of data points to cluster centers of the plurality of clusters, based on the binary rank matrix and the membership rank matrix.
7. The clustering system (102) as claimed in claim 1, wherein the initial membership is initialized to each dimension of each of the plurality of data points using a random number ranging between 0 and 1.
8. The clustering system (102) as claimed in claim 1, wherein value of the goodness measurement metric is 1 if the point cluster index is equal to the dimension cluster index.
9. The clustering system (102) as claimed in claim 1, wherein value of the goodness measurement metric is 0 if the point cluster index is not equal to the dimension cluster index.
A method for determining significant dimensions for clustering multidimensional data, the method comprising:
obtaining, by the clustering system 102, the multi-dimensional data comprising a plurality of data points, each of the plurality of data points having a plurality of dimensions; assigning, by the clustering system 102, initial memberships to each of the plurality of dimensions of each of the plurality of data points of the multi-dimensional data, for a plurality of clusters;
aggregating, by the clustering system 102, one of the initial memberships and modified memberships assigned to the plurality of dimensions of each of the plurality of data points induced by a fuzziness control parameter (m);
computing, by the clustering system 102, a cluster center of each of the plurality of clusters based on the aggregated one of the initial memberships and the modified memberships;
calculating, by the clustering system 102, square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points;
modifying, by the clustering system 102, one of the initial memberships and the modified memberships, assigned to the plurality of dimensions of each of the plurality of data points, based on the calculation;
updating, by the clustering system 102, the fuzziness control parameter (m) based on a weighted sum of a modified fuzziness control parameter (m), and at least one of an initial fuzziness control parameter (m) and a previous fuzziness control parameter (m); and
determining, by the clustering system 102, a goodness measurement metric for each dimension based on comparison of a point cluster index for each data point, and a dimension cluster index for each dimension of each of the plurality of data points, wherein a value of the goodness measurement metric is indicative of significance of each dimension.
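The assigning, aggregating, center-computing, distance-calculating and m-updating steps of the method above can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed implementation: the center formula is the dimension-wise analogue of the standard fuzzy c-means center, the weight `alpha` of the weighted sum is a hypothetical parameter, and how the modified value of m is obtained is not stated in the claims, so it is passed in as an input.

```python
import random


def init_memberships(n_points, n_dims, n_clusters, seed=0):
    """U[i][j][c]: membership of dimension j of point i in cluster c,
    initialized with a random number between 0 and 1."""
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(n_clusters)] for _ in range(n_dims)]
        for _ in range(n_points)
    ]


def cluster_centers(X, U, m):
    """Aggregate the memberships raised to the power m and use them as
    weights for each dimension of each cluster center (assumed formula)."""
    n, d, k = len(X), len(X[0]), len(U[0][0])
    V = [[0.0] * d for _ in range(k)]
    for c in range(k):
        for j in range(d):
            num = sum((U[i][j][c] ** m) * X[i][j] for i in range(n))
            den = sum(U[i][j][c] ** m for i in range(n))
            V[c][j] = num / den
    return V


def squared_distances(X, V):
    """D[i][j][c]: squared distance between dimension j of point i and
    dimension j of the center of cluster c."""
    return [
        [[(X[i][j] - V[c][j]) ** 2 for c in range(len(V))]
         for j in range(len(X[0]))]
        for i in range(len(X))
    ]


def update_m(m_modified, m_previous, alpha=0.5):
    """Weighted sum of the modified and the previous (or initial) m.
    alpha is a hypothetical weight, not a value taken from the claims."""
    return alpha * m_modified + (1.0 - alpha) * m_previous
```

With equal memberships every center dimension reduces to the plain mean of that dimension, which gives a quick sanity check of `cluster_centers`.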
The method as claimed in claim 10, further comprising:
updating the fuzziness control parameter (m) till a sum of absolute difference between the modified memberships and the initial memberships is less than a pre-defined limit (ε);
calculating a binary rank matrix for each dimension of each of the plurality of data points, wherein the binary rank matrix is indicative of membership representation of each dimension of a data point;
computing a membership rank matrix for each of the plurality of data points, wherein the membership rank matrix is indicative of average membership of all dimensions of a data point for which binary rank matrix entry is equal to 1; and
identifying a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each of the plurality of data points to cluster centers of the plurality of clusters, based on the binary rank matrix and the membership rank matrix.
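One possible reading of the rank-matrix steps above is sketched below in Python. The claims do not define the matrices in detail, so the following are assumptions made for illustration: the binary rank entry is 1 for the cluster with the highest membership in a given dimension of a data point, the membership rank is the average membership over the dimensions flagged for that cluster, and a data point is assigned to the cluster flagged by most of its dimensions, with ties broken by the membership rank. All function names are hypothetical.

```python
def binary_rank(U):
    """B[i][j][c] = 1 for the cluster with the highest membership in
    dimension j of point i, else 0 (assumed definition)."""
    B = []
    for point in U:
        rows = []
        for memberships in point:
            best = max(range(len(memberships)), key=memberships.__getitem__)
            rows.append([1 if c == best else 0 for c in range(len(memberships))])
        B.append(rows)
    return B


def membership_rank(U, B):
    """R[i][c]: average membership of the dimensions of point i whose
    binary rank entry for cluster c equals 1 (0.0 if there are none)."""
    R = []
    for point_u, point_b in zip(U, B):
        k = len(point_u[0])
        row = []
        for c in range(k):
            vals = [point_u[j][c] for j in range(len(point_u)) if point_b[j][c] == 1]
            row.append(sum(vals) / len(vals) if vals else 0.0)
        R.append(row)
    return R


def cluster_indices(B, R):
    """Dimension cluster index: the cluster flagged in B for that dimension.
    Point cluster index (assumption): the cluster flagged by most
    dimensions, ties broken by the membership rank."""
    dim_idx = [[row.index(1) for row in point_b] for point_b in B]
    point_idx = []
    for point_b, r in zip(B, R):
        k = len(r)
        votes = [sum(row[c] for row in point_b) for c in range(k)]
        point_idx.append(max(range(k), key=lambda c: (votes[c], r[c])))
    return point_idx, dim_idx
```

The two index lists returned by `cluster_indices` are the inputs that the goodness measurement metric compares.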
The method as claimed in claim 10, wherein the initial memberships, the aggregated initial memberships, and the cluster centers are updated in a plurality of iterations till a sum of absolute difference between the modified memberships and the initial memberships is less than a predefined limit (ε), and wherein the number of iterations is predefined, and the initial memberships, the aggregated initial memberships, and the cluster centers are updated till the predefined number of iterations is exhausted.
The method as claimed in claim 10, wherein the initial memberships are modified based on the square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points, raised to a power equal to the negative of the ratio of 1 to the fuzziness control parameter minus 1, that is, -1/(m-1), and normalized with respect to the square of distance from all the cluster centers.
The method as claimed in claim 10, wherein value of the goodness measurement metric is 1 if the point cluster index is equal to the dimension cluster index and the value of goodness measurement metric is 0 if the point cluster index is not equal to the dimension cluster index.
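The membership modification recited above (squared distance raised to the power -1/(m-1), then normalized over all cluster centers) can be written as a short Python sketch. The layout `D[i][j][c]` for the squared distances and the small floor `eps`, which guards against a zero distance, are assumptions introduced here and are not taken from the claims.

```python
def update_memberships(D, m, eps=1e-12):
    """D[i][j][c]: squared distance between dimension j of point i and
    dimension j of the center of cluster c. Requires m > 1.
    Returns U[i][j][c], normalized so that each (i, j) row sums to 1."""
    exponent = -1.0 / (m - 1.0)
    U = []
    for point in D:
        rows = []
        for dists in point:
            # eps is a hypothetical guard so that a zero distance does
            # not cause a division by zero when raised to a negative power.
            powered = [max(dsq, eps) ** exponent for dsq in dists]
            total = sum(powered)
            rows.append([p / total for p in powered])
        U.append(rows)
    return U


if __name__ == "__main__":
    print(update_memberships([[[1.0, 4.0]]], m=2.0))
```

For m = 2 the exponent is -1, so squared distances of 1 and 4 give memberships in the ratio 4 to 1: the closer center receives the larger membership.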
A non-transitory computer-readable medium having embodied thereon a computer program for executing a method comprising:
obtaining, by the clustering system 102, the multi-dimensional data comprising a plurality of data points, each of the plurality of data points having a plurality of dimensions;
assigning, by the clustering system 102, initial memberships to each of the plurality of dimensions of each of the plurality of data points of the multi-dimensional data, for a plurality of clusters;
aggregating, by the clustering system 102, one of the initial memberships and modified memberships assigned to the plurality of dimensions of each of the plurality of data points induced by a fuzziness control parameter (m);
computing, by the clustering system 102, a cluster center of each of the plurality of clusters based on the aggregated one of the initial memberships and the modified memberships;
calculating, by the clustering system 102, square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points;
modifying, by the clustering system 102, one of the initial memberships and the modified memberships, assigned to the plurality of dimensions of each of the plurality of data points, based on the calculation;
updating, by the clustering system 102, the fuzziness control parameter (m) based on a weighted sum of a modified fuzziness control parameter (m), and at least one of an initial fuzziness control parameter (m) and a previous fuzziness control parameter (m); and
determining, by the clustering system 102, a goodness measurement metric for each dimension based on comparison of a point cluster index for each data point, and a dimension cluster index for each dimension of each of the plurality of data points, wherein a value of the goodness measurement metric is indicative of significance of each dimension.
PCT/IB2014/001262 2013-07-05 2014-07-03 Multi-dimensional data clustering WO2015001416A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2282/MUM/2013 2013-07-05
IN2282MU2013 2013-07-05

Publications (1)

Publication Number Publication Date
WO2015001416A1 (en) 2015-01-08

Family

ID=51399676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2014/001262 WO2015001416A1 (en) 2013-07-05 2014-07-03 Multi-dimensional data clustering

Country Status (1)

Country Link
WO (1) WO2015001416A1 (en)

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIANG PANG ET AL: "A Improved Clustering Analysis Method Based on Fuzzy C-Means Algorithm by Adding PSO Algorithm", 28 March 2012, HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, SPRINGER BERLIN HEIDELBERG, BERLIN, HEIDELBERG, PAGE(S) 231 - 242, ISBN: 978-3-642-28941-5, XP019174522 *
P. ARBELAEZ; M. MAIRE; C. FOWLKES; J. MALIK.: "Contour Detection and Hierarchical Image Segmentation", IEEE TPAMI, vol. 33, no. 5, May 2011 (2011-05-01), pages 898 - 916
R. SUGANYA ET AL: "Fuzzy C-Means Algorithm - A Review", INTERNATIONAL JOURNAL OF SCIENTIFIC AND RESEARCH PUBLICATIONS, vol. 2, no. 11, November 2012 (2012-11-01), pages 440 - 442, XP055151575 *
WEINA WANG ET AL: "The Global Fuzzy C-Means Clustering Algorithm", INTELLIGENT CONTROL AND AUTOMATION, 2006. WCICA 2006. THE SIXTH WORLD CONGRESS ON DALIAN, CHINA 21-23 JUNE 2006, PISCATAWAY, NJ, USA,IEEE, vol. 1, 21 June 2006 (2006-06-21), pages 3604 - 3607, XP010946075, ISBN: 978-1-4244-0332-5, DOI: 10.1109/WCICA.2006.1713041 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017062530A1 (en) * 2015-10-05 2017-04-13 Bayer Healthcare Llc Generating orthotic product recommendations
US11134863B2 (en) 2015-10-05 2021-10-05 Scholl's Wellness Company Llc Generating orthotic product recommendations
WO2017149139A1 (en) 2016-03-03 2017-09-08 Curevac Ag Rna analysis by total hydrolysis
US11920174B2 (en) 2016-03-03 2024-03-05 CureVac SE RNA analysis by total hydrolysis and quantification of released nucleosides
US11315177B2 (en) * 2019-06-03 2022-04-26 Intuit Inc. Bias prediction and categorization in financial tools
CN110610200A (en) * 2019-08-27 2019-12-24 浙江大搜车软件技术有限公司 Vehicle and merchant classification method and device, computer equipment and storage medium
CN113298115A (en) * 2021-04-19 2021-08-24 百果园技术(新加坡)有限公司 User grouping method, device, equipment and storage medium based on clustering
CN113919449A (en) * 2021-12-15 2022-01-11 国网江西省电力有限公司供电服务管理中心 Resident electric power data clustering method and device based on precise fuzzy clustering algorithm
CN113919449B (en) * 2021-12-15 2022-03-15 国网江西省电力有限公司供电服务管理中心 Resident electric power data clustering method and device based on precise fuzzy clustering algorithm
CN114863151A (en) * 2022-03-20 2022-08-05 西北工业大学 Image dimensionality reduction clustering method based on fuzzy theory
CN114863151B (en) * 2022-03-20 2024-02-27 西北工业大学 Image dimension reduction clustering method based on fuzzy theory

Similar Documents

Publication Publication Date Title
US10713597B2 (en) Systems and methods for preparing data for use by machine learning algorithms
WO2015001416A1 (en) Multi-dimensional data clustering
Manzanera et al. Line and circle detection using dense one-to-one Hough transforms on greyscale images
CN108764726B (en) Method and device for making decision on request according to rules
US20150039538A1 (en) Method for processing a large-scale data set, and associated apparatus
US20220300528A1 (en) Information retrieval and/or visualization method
US11775610B2 (en) Flexible imputation of missing data
CN107832456B (en) Parallel KNN text classification method based on critical value data division
Hetland et al. Ptolemaic access methods: Challenging the reign of the metric space model
JP2015203946A (en) Method for calculating center of gravity of histogram
CN110147455A (en) A kind of face matching retrieval device and method
Wu et al. 3D scene reconstruction based on improved ICP algorithm
CN111026865A (en) Relation alignment method, device and equipment of knowledge graph and storage medium
Akgül et al. Density-based 3D shape descriptors
Zhang et al. An adaptive mean shift clustering algorithm based on locality-sensitive hashing
Yang et al. Geometric-inspired graph-based Incomplete Multi-view Clustering
Barger et al. k-means for streaming and distributed big sparse data
Dharamsotu et al. k-NN Sampling for Visualization of Dynamic data using LION-tSNE
Burdescu et al. A Spatial Segmentation Method.
Burdescu et al. Multimedia data for efficient detection of visual objects
Beck et al. Distributed mean shift clustering with approximate nearest neighbours.
Myasnikov Evaluation of space partitioning data structures for nonlinear mapping
Park et al. Encouraging second-order consistency for multiple graph matching
Denisova et al. The Algorithms of Hierarchical Histogram computation for multichannel images
Zhan et al. Graph entropy-based clustering algorithm in medical brain image database

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 14755705; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 14755705; Country of ref document: EP; Kind code of ref document: A1)