CN106484818B - Hierarchical clustering method based on Hadoop and HBase - Google Patents


Info

Publication number
CN106484818B
CN106484818B
Authority
CN
China
Prior art keywords
cluster
hbase
distance
algorithm
hadoop
Prior art date
Legal status
Active
Application number
CN201610851970.1A
Other languages
Chinese (zh)
Other versions
CN106484818A (en)
Inventor
刘发贵
周晓场
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201610851970.1A priority Critical patent/CN106484818B/en
Publication of CN106484818A publication Critical patent/CN106484818A/en
Application granted granted Critical
Publication of CN106484818B publication Critical patent/CN106484818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/2282 Tablespace storage structures; Management thereof
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hierarchical clustering method based on Hadoop and HBase. In the method, the distance matrix is computed in parallel with Hadoop, the result is converted into HFile files, and the HFiles are imported into HBase with the Bulk Load method. HBase stores the distance matrix in two main tables, one ordered by cluster ID pair and the other ordered by the distance between clusters, so that the two closest clusters can conveniently be retrieved and merged in each iteration. Finally, a multithreaded algorithm, combined with a caching technique, processes the distance matrix in HBase to carry out the hierarchical clustering; several tunable parameters are reserved, and the algorithm supports the three clustering methods single-linkage, complete-linkage and average-linkage. The proposed scheme exploits the parallel computing capability of Hadoop and the mass-data storage capability of HBase, improving the big-data processing capability and scalability of the hierarchical clustering algorithm.

Description

Hierarchical clustering method based on Hadoop and HBase
Technical Field
The invention relates to the technical fields of hierarchical clustering algorithms, Hadoop, and HBase, and in particular to the design and implementation of a hierarchical clustering method based on Hadoop and HBase.
Background
As a simple and widely accepted clustering algorithm, hierarchical clustering has been applied in many areas, such as information retrieval and bioinformatics. Its advantage is that it represents the clustering result in greater detail: the relations among clusters are organized into a dendrogram, so the user can see clearly how the clusters were merged together, a result that other clustering algorithms do not provide. Moreover, unlike algorithms such as k-means, hierarchical clustering does not require the user to specify the number of clusters in advance. Although hierarchical clustering has many advantages and is widely used, with the rapid growth of data volume the performance of stand-alone implementations no longer meets practical needs, and the algorithm's high complexity and inherent data dependencies make it difficult to run efficiently on large data sets. Yet more useful information can often be extracted only from more ample data, and data-set size has become an important factor in machine learning. This motivates a hierarchical clustering algorithm that can run on large data sets.
Hadoop is a software framework for distributed processing of massive data that provides a reliable, efficient, and scalable way to process data. HBase, another important member of the Hadoop ecosystem, is a non-relational distributed database. It provides a highly reliable, high-performance, scalable, column-oriented storage system suitable for storing unstructured data. As very important big-data processing technologies, Hadoop and HBase are widely used in many big-data fields.
Hierarchical clustering algorithms rely on a distance matrix whose space complexity is O(n²), which means they neither handle large data sets well on a single machine nor scale well. Leveraging the parallel computing capacity of Hadoop and the high-performance mass-data storage of HBase can therefore provide an effective solution for hierarchical clustering. However, at present there is no hierarchical clustering algorithm on the Hadoop and HBase platforms that supports the multiple clustering methods single-linkage, complete-linkage and average-linkage.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a hierarchical clustering method based on Hadoop and HBase that simultaneously supports the three clustering methods single-linkage, complete-linkage and average-linkage. The specific technical scheme is as follows.
A hierarchical clustering method based on Hadoop and HBase uses Hadoop to parallelize the computation of the distance matrix and HBase to store it. The table design adopts a RowKey scheme that fully exploits HBase's ordering, and the distance matrix in HBase is processed with a combination of multithreading and caching, yielding a scalable hierarchical clustering method applicable to big data.
Further, the parallelized computation of the distance matrix is implemented with Hadoop as two MapReduce jobs: the first computes the distances and stores the result in an intermediate file, and the second converts the intermediate result into the HFile format. Finally, the result is imported into HBase with the Bulk Load method.
Further, there are two main tables in HBase. In one table the RowKey is the Cluster ID pair, with the Cluster IDs padded with leading zeros to a common length so that the records are ordered by Cluster ID, and the value is the distance between the clusters; the distances associated with a specified Cluster ID can be retrieved quickly from this table. In the other table the RowKey is the distance between clusters, padded with leading zeros so that the records are ordered from small to large, and the value is the Cluster ID pair; by reading the first row of this table, the two closest clusters can be obtained quickly.
Further, in the initial stage of the clustering method, the two tables are pre-partitioned by Cluster ID and by distance, and the partitions are distributed across all nodes of the HBase cluster.
Further, during clustering, the merge path of each cluster and the number of singleton clusters it contains are recorded in memory. With the single-linkage or complete-linkage method, the distance between a newly formed cluster and an existing cluster is computed by retrieving from HBase or the cache only the distances between the two clusters that formed the new cluster and the remaining clusters. With the average-linkage method, in addition to those distances, the number of singleton clusters contained in each cluster is read from memory to compute the average.
Further, in the cache-aware multithreaded algorithm, the two closest clusters are obtained from HBase and merged into a new cluster while the distances between those two clusters and the other clusters are fetched in parallel from the cache or HBase. During this period a deletion thread is started to remove invalidated data from HBase, and a computation thread is started to compute the distances between the new cluster and the clusters whose distance information has already been fetched, while the new distances are written back to the cache or HBase in parallel. Caching is employed here to reduce the algorithm's network IO.
Further, Hadoop is used to compute the distance matrix and HBase to store it, and the algorithm is designed and implemented with multithreading and caching, improving the scalability of the hierarchical clustering algorithm and its capacity for processing big data.
Compared with the prior art, the invention has the following advantages and technical effects:
according to the invention, a distance matrix parallelization calculation algorithm is realized based on Hadoop, and the result is converted into an HFile file and is imported into the HBase by a Bulk Load method. The HBase is used for storing the distance matrix, a unique RowKey design is adopted in the design of the table, the ordering function of the HBase is fully utilized, and the algorithm can conveniently and rapidly acquire distance information and acquire two Cluster closest to the distance. Simultaneously, the algorithm is designed and realized by combining the multithreading and cache technology, and a plurality of adjustable parameters are reserved. Through the parallel computing capability of Hadoop and the mass data storage capability of HBase, the expansibility and the big data processing capability of the hierarchical clustering algorithm are improved.
Drawings
Fig. 1 is a schematic diagram of a distance matrix calculation algorithm in an example.
Fig. 2 is a schematic diagram of a table design in an example.
FIG. 3 is a schematic diagram of a hierarchical tree of a hierarchical clustering algorithm in an example.
FIG. 4 is a schematic diagram of a thread execution sequence.
Fig. 5 is an illustration of algorithm parameters in an example.
Detailed Description
In order to make the technical solution and advantages of the present invention more apparent, a further detailed description is given below with reference to the accompanying drawings, but the practice and protection of the present invention are not limited thereto. Note that any symbols or procedures not specifically described below can be understood or implemented by those skilled in the art with reference to the prior art.
1. Parallelization calculation algorithm of distance matrix
The parallelized distance-matrix computation aims to speed up the computation of the distance matrix and to import it into HBase quickly. During clustering, the hierarchical clustering algorithm relies on a distance matrix whose space complexity is O(n²); for its computation, a Hadoop-based parallel algorithm is designed and implemented, as shown in Fig. 1. The algorithm first distributes the file containing the data to be clustered to every task as a global cache file, then partitions the file storing the Cluster IDs, with each task processing one block. Each task iterates over its IDs and computes the distance between each ID and all clusters with a larger ID value; e.g. for the cluster with ID=2, it computes the distances to the clusters with ID=3, 4, 5, ... and writes the results to an intermediate file in the format (distance, ID1, ID2, timestamp). Inserting the O(n²) distance matrix into HBase row by row in the reducer would be relatively slow, so this implementation uses the Bulk Load method to import the data quickly. Bulk Load exploits the fact that HBase stores its data on HDFS in a specific format; accordingly, a second MapReduce program converts the intermediate result into files in the HFile format, which are then imported into HBase with the Bulk Load method. Hadoop and Bulk Load together thus provide parallel computation of the distance matrix and its rapid import into HBase.
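The per-ID work of the first MapReduce job can be sketched as follows. This is a minimal single-process Python illustration of the mapper logic only; the names `map_task` and `euclidean` are hypothetical, and the real implementation distributes the ID blocks across Hadoop tasks and appends a timestamp to each output record.

```python
import math

def euclidean(p, q):
    # Distance measure between two points; per the text, only the
    # Euclidean measure was implemented during testing.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def map_task(cluster_id, points):
    """One task's work for a single ID: emit (distance, id1, id2) for
    every cluster whose ID is larger, so each pair is computed once."""
    return [(euclidean(points[cluster_id], points[other]), cluster_id, other)
            for other in range(cluster_id + 1, len(points))]

points = [(0.0, 0.0), (3.0, 4.0), (0.0, 2.0)]
rows = [r for cid in range(len(points)) for r in map_task(cid, points)]
# Every pair (i, j) with i < j appears exactly once in rows.
```

Because each ID only pairs with larger IDs, the union of all tasks' outputs covers the upper triangle of the distance matrix with no duplicates.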
2. Table design
The design of the tables on HBase strongly affects the performance of the algorithm; an unreasonable design degrades performance severely and thus harms the algorithm's usability. In this implementation the tables are designed specifically for the characteristics of the hierarchical clustering algorithm, as shown in Fig. 2. There are two main tables in HBase: a distance table and a sortedDistance table. The distance table stores the distance matrix ordered by Cluster ID: the RowKey is the Cluster ID pair, with each Cluster ID padded with leading zeros so that all RowKeys have the same length and the records sort by Cluster ID, and the value is the distance between the clusters. The distances associated with a specified Cluster ID can be retrieved quickly from this table. The sortedDistance table stores the distance matrix ordered by distance from small to large: the RowKey is the distance between two clusters, padded with leading zeros to a uniform length so that the records sort from small to large, and the value is the Cluster ID pair. By reading the first row of this table, the two closest clusters can be obtained quickly. Parameters are also reserved to specify the initial number of regions of the two tables: in the initial stage of the algorithm the two tables are pre-partitioned by Cluster ID and by distance, and the resulting regions are distributed across the nodes of the HBase cluster to improve HBase's concurrency.
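The zero-padding trick can be sketched as follows. The widths, the separator, and the fixed-point scaling of the distance are assumptions for illustration; the patent specifies only that IDs and distances are padded with leading zeros to a uniform length so that HBase's lexicographic byte order matches numeric order.

```python
ID_WIDTH = 6     # assumed fixed widths for the sketch
DIST_WIDTH = 12

def id_pair_rowkey(id1, id2):
    """distance table: RowKey is the zero-padded Cluster ID pair, so the
    lexicographic order HBase stores rows in matches numeric ID order."""
    a, b = sorted((id1, id2))
    return f"{a:0{ID_WIDTH}d}-{b:0{ID_WIDTH}d}"

def distance_rowkey(dist, id1, id2):
    """sortedDistance table: RowKey leads with the zero-padded distance
    (scaled to an integer here), so scanning the first row of the table
    yields the closest pair."""
    return f"{round(dist * 1000):0{DIST_WIDTH}d}-{id_pair_rowkey(id1, id2)}"

keys = sorted(distance_rowkey(d, i, j)
              for d, i, j in [(2.5, 1, 9), (0.3, 2, 5), (10.0, 3, 4)])
closest = keys[0]  # smallest distance sorts first: the pair (2, 5)
```

Without the padding, "10" would sort before "2" as a string; with it, string order and numeric order coincide, which is what lets a single Scan of the first row replace a full search for the minimum.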
3. Simplifying computation
When computing the distance between two clusters that were themselves formed from other clusters, in principle the distances between all point pairs across the two clusters are needed; for example, if cluster D was formed from singleton clusters A and B, then computing the distance between singleton cluster C and D involves the distances from C to A and from C to B. However, it is not necessary to compare the distances of all singleton clusters each time; only the distances between the immediate sub-clusters of the two clusters are needed. Fig. 3 shows a hierarchical tree of the agglomerative hierarchical clustering algorithm. We define dis(A, B) as the distance between clusters A and B, min as the minimum, max as the maximum, avg as the average, and count(A) as the number of points contained in cluster A. With the single-linkage method, the distance dis(7, 9) between cluster 7 and cluster 9 is calculated as follows:
dis(7,9)=min(dis(1,5),dis(1,6),dis(2,5),dis(2,6),dis(3,5),dis(3,6))
=min(min(dis(1,5),dis(1,6)),min(dis(2,5),dis(2,6)),min(dis(3,5),dis(3,6)))
=min(dis(1,7),dis(2,7),dis(3,7))
=min(dis(1,7),min(dis(2,7),dis(3,7)))
=min(dis(1,7),dis(8,7))
it can be seen that only dis (1, 7), dis (8, 7) are needed to calculate the distance dis (7, 9) between cluster7and cluster9, and not all the distances of clusters 1-4 and clusters 4-5. The complex-link method is similar to the single-link method, except that min in the above formula is replaced with max. When the average-link method is used to calculate the distance dis (7, 9) between the cluster7and the cluster9, there are:
dis(7,9)=(count(1)×dis(1,7)+count(8)×dis(8,7))/(count(1)+count(8))
it can be seen that not only dis (1, 7), dis (8, 7) but also the number of points contained in each of the clusters 7and 9 are needed when using average-linkage so that the average value can be calculated.
4. Design and implementation of multithreading algorithm combined with cache
Besides an efficient way to access the distance matrix, a parallel algorithm is needed to carry out the clustering process itself. Here the design and implementation combine multithreading with caching. According to the principle of the hierarchical clustering algorithm, the two closest clusters must first be obtained and merged into a new cluster. As described above, the cluster pairs and the distances between them are stored in the sortedDistance table and, most importantly, are sorted from small to large, so only the first record of the sortedDistance table needs to be read. This first row is obtained with HBase's Scan API, with the scan cache set to 1 to avoid fetching redundant data.
For ease of illustration, assume an initial data set with 10 points; each point initially forms its own cluster, so there are 10 clusters, C1 to C10. Suppose the two closest clusters in the first iteration are C1 and C2, which are merged into a new cluster C11. As described in the previous subsection, the distances between C1, C2 and the clusters C3 to C10 are needed to compute the distances between C11 and C3 to C10. On a cache miss the distances must be read from HBase; since this is an IO-intensive operation, multiple threads concurrently read the relevant distances from the distance table with the Scan API, and the number of rows returned per Scan read is adjustable. While the distances are being read, several threads are also started in parallel to delete the distances related to the two merged clusters from the sortedDistance table. Since both the read and write operations are IO-intensive, the data already fetched can be processed ahead of time instead of waiting until all of it has arrived: a computation thread is started at the same time as the scan threads rather than after they finish. Likewise, additional threads write the new distances back to the cache or HBase in parallel, so the computation is essentially complete by the time all data has been read.
The execution order of the parallel threads is shown in Fig. 4. Synchronization techniques are needed to coordinate the threads; the algorithm uses BlockingQueue and barrier techniques. A BlockingQueue coordinates communication between two threads that alternately put elements into and take elements out of the queue. The barrier technique is useful in parallel iterative algorithms that split a problem into sub-problems executed in parallel: each thread waits when it reaches the barrier until all threads have reached it.
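The scan/compute overlap can be sketched with a blocking queue. This is an illustrative single-pair sketch, not the patent's Java implementation: Python's `queue.Queue` stands in for Java's BlockingQueue, the function names are hypothetical, and the HBase Scan is replaced by an in-memory list of rows.

```python
import queue
import threading

def overlapped_update(scan_rows, combine):
    """Sketch of the Fig. 4 pattern: a scan thread feeds distance rows
    into a blocking queue while a compute thread consumes them, so
    computation overlaps the IO-bound reads instead of waiting for the
    scan to finish."""
    q = queue.Queue(maxsize=4)
    results = []

    def scanner():
        for row in scan_rows:   # stands in for Scan reads from HBase
            q.put(row)          # blocks when the queue is full
        q.put(None)             # sentinel: scan finished

    def computer():
        while (row := q.get()) is not None:
            other_id, d_a, d_b = row
            results.append((other_id, combine(d_a, d_b)))

    threads = [threading.Thread(target=scanner),
               threading.Thread(target=computer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Single-linkage: each row is (other_id, dis(A, other), dis(B, other)).
out = overlapped_update([(3, 1.0, 2.0), (4, 5.0, 0.5)], min)
```

The bounded queue also provides natural backpressure: if computation falls behind, the scanner blocks instead of buffering the whole distance matrix in memory.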
5. Adjustable parameter description
When using Hadoop and HBase, many parameters need to be tuned to the actual application to achieve good performance. In the design of the algorithm, besides the parameters supported by Hadoop and HBase themselves, several algorithm-specific parameters are reserved so that they can be adjusted conveniently in practice; the main parameters are shown in Fig. 5. The main custom parameters are described next.
The distance matrix generated during hierarchical clustering is relatively large; it is stored in the HBase tables, with a caching layer added on the client. To fully utilize the machines of the HBase cluster, the load should be spread across them, so when the tables are created they are pre-partitioned: each table is split into multiple regions and each server is responsible for a subset of them. The RegionCountDM and RegionCountSD parameters specify how many regions the distance table and the sortedDistance table are divided into, respectively. A caching layer is also maintained on the client side to improve lookup performance, and a cacheSize parameter adjusts the number of records the cache holds. Corresponding to the single-linkage, complete-linkage and average-linkage methods, a similarity_method parameter specifies which method to use. A distance_method parameter specifies which distance measure is used between two points; only the Euclidean method was implemented during testing. Because the algorithm uses multithreading, parameters are provided to adjust the numbers of threads: the putThread parameter adjusts the number of writer threads, and the pagesNum parameter controls the number of threads reading data from HBase. The data to be read from HBase is paged, with each thread reading one page, and pagesNum controls the number of pages; for example, with pagesNum=10 and IDs 0 to 10000 to read, the work is divided among 10 threads reading 0 to 1000, 1001 to 2000, and so on.
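The paging of the ID range across reader threads can be sketched as follows. The function name and the exact boundary handling are assumptions; the text's example (pagesNum=10, IDs 0 to 10000 read as 0 to 1000, 1001 to 2000, ...) describes the same even split.

```python
def pages(start_id, end_id, pages_num):
    """Split the half-open ID range [start_id, end_id) into pages_num
    pages, one per reader thread, mirroring the pagesNum parameter."""
    total = end_id - start_id
    size = -(-total // pages_num)  # ceiling division
    return [(start_id + i * size, min(start_id + (i + 1) * size, end_id))
            for i in range(pages_num)]

reader_pages = pages(0, 10000, 10)  # ten pages of 1000 IDs each
```

Each page maps naturally onto one Scan with a start and stop row, so the reader threads never overlap.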
Finally, to control the termination of the hierarchical clustering algorithm, the maxClusterNum parameter specifies into how many clusters the data is to be grouped, or the minDistance parameter specifies that clustering ends once the minimum distance between two clusters exceeds the specified value.
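The two termination conditions can be sketched in a toy driver loop. This is a single-linkage, in-memory stand-in for the HBase-backed tables; the names only loosely mirror the parameters, and the direction of the minDistance comparison (stop once the closest pair is farther apart than the threshold) is an assumption about the intended semantics.

```python
def agglomerate(dists, n, max_cluster_num=1, min_distance=None):
    """Toy agglomerative loop showing both stopping rules: stop when
    only max_cluster_num clusters remain, or when the closest pair is
    farther apart than min_distance (single-linkage)."""
    active = set(range(n))  # clusters are numbered 0..n-1
    next_id = n
    while len(active) > max_cluster_num:
        (a, b), d = min(dists.items(), key=lambda kv: kv[1])
        if min_distance is not None and d > min_distance:
            break  # minDistance termination
        new = next_id
        next_id += 1
        for c in active - {a, b}:
            da = dists.pop(tuple(sorted((a, c))))
            db = dists.pop(tuple(sorted((b, c))))
            dists[tuple(sorted((c, new)))] = min(da, db)
        del dists[(a, b)]
        active -= {a, b}
        active.add(new)
    return active

# Merging stops as soon as two clusters remain:
remaining = agglomerate({(0, 1): 1.0, (0, 2): 4.0, (1, 2): 5.0}, 3,
                        max_cluster_num=2)
```

In the real system the `min(dists.items(), ...)` step corresponds to reading the first row of the sortedDistance table, and the dict updates correspond to the parallel delete and write-back threads.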

Claims (1)

1. A hierarchical clustering method based on Hadoop and HBase, characterized in that Hadoop is used to parallelize the computation of the distance matrix, HBase is used to store the distance matrix, the table design adopts a RowKey scheme that fully exploits HBase's ordering, and the distance matrix in HBase is processed with a combination of multithreading and caching, yielding a scalable hierarchical clustering method applicable to big data; the parallelized computation of the distance matrix is implemented with Hadoop as two MapReduce jobs, the first computing the distances and storing the result in an intermediate file and the second converting the intermediate result into the HFile format, the result finally being imported into HBase with the Bulk Load method; two tables are arranged in HBase, in one of which the RowKey is the Cluster ID pair, with the Cluster IDs padded with leading zeros to a common length so that the records are ordered by Cluster ID, the value being the distance between the clusters, so that the distances associated with a specified Cluster ID can be retrieved quickly; in the other table the RowKey is the distance between clusters, padded with leading zeros so that the records are ordered from small to large, the value being the Cluster ID pair, so that by reading the first row of this table the two closest clusters can be obtained quickly;
in the initial stage of the clustering method, the two tables are pre-partitioned by Cluster ID and by distance, and the partitions are distributed across all nodes of the HBase cluster; during clustering, the merge path of each cluster and the number of singleton clusters it contains are recorded in memory; with the single-linkage or complete-linkage method, the distance between a newly formed cluster and an existing cluster is computed by retrieving from HBase or the cache only the distances between the two clusters that formed the new cluster and the remaining clusters; with the average-linkage method, in addition to those distances, the number of singleton clusters contained in each cluster is read from memory to compute the average; in the cache-aware multithreaded algorithm, the two closest clusters are obtained from HBase and merged into a new cluster while the distances between those two clusters and the other clusters are fetched in parallel from the cache or HBase, during which a deletion thread removes invalidated data from HBase and a computation thread computes the distances between the new cluster and the clusters whose distance information has already been fetched, the new distances being written back to the cache or HBase in parallel.
CN201610851970.1A 2016-09-26 2016-09-26 Hierarchical clustering method based on Hadoop and HBase Active CN106484818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610851970.1A CN106484818B (en) 2016-09-26 2016-09-26 Hierarchical clustering method based on Hadoop and HBase


Publications (2)

Publication Number Publication Date
CN106484818A CN106484818A (en) 2017-03-08
CN106484818B true CN106484818B (en) 2023-04-28

Family

ID=58268853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610851970.1A Active CN106484818B (en) 2016-09-26 2016-09-26 Hierarchical clustering method based on Hadoop and HBase

Country Status (1)

Country Link
CN (1) CN106484818B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106932184A (en) * 2017-03-15 2017-07-07 国网四川省电力公司广安供电公司 A kind of Diagnosis Method of Transformer Faults based on improvement hierarchical clustering
CN112668622A (en) * 2020-12-22 2021-04-16 中国矿业大学(北京) Analysis method and analysis and calculation device for coal geological composition data
CN113268333B (en) * 2021-06-21 2024-03-19 成都锋卫科技有限公司 Hierarchical clustering algorithm optimization method based on multi-core computing

Citations (1)

Publication number Priority date Publication date Assignee Title
CN104965823A (en) * 2015-07-30 2015-10-07 成都鼎智汇科技有限公司 Big data based opinion extraction method


Non-Patent Citations (1)

Title
Xu Xiaolong; Li Yongping. A MapReduce-based knowledge clustering and statistics mechanism. Journal of Electronics & Information Technology. 2016, (01), full text. *

Also Published As

Publication number Publication date
CN106484818A (en) 2017-03-08


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant