CN111695667A

CN111695667A - MapReduce-based distributed particle swarm clustering algorithm

Info

Publication number: CN111695667A
Application number: CN202010460098.4A
Authority: CN
Inventors: 赵彦 (Zhao Yan)
Original assignee: Jiangsu Vocational College of Information Technology
Current assignee: Jiangsu Vocational College of Information Technology
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2020-09-22

Abstract

The invention relates to the field of artificial intelligence and big data analysis, and in particular to a MapReduce-based distributed particle swarm clustering algorithm comprising the following steps. Step 1: a MapReduce job updates the centroids of the particle swarm. Step 2: a second MapReduce job evaluates the fitness of the swarm with the new particle centroids generated in step 1 and computes a new fitness value for the updated swarm; the fitness evaluation is based on a fitness function that measures the distance between all data points and the particle centroids as an average distance. Step 3: the fitness values computed in step 2 are merged with the updated swarm generated in step 1, the best personal centroids and the best global centroid are updated at the same time, and the algorithm returns to step 1 for the next iteration. The invention effectively solves the clustering problem for very large commercial data sets and achieves high-quality clustering.

Description

MapReduce-based distributed particle swarm clustering algorithm

Technical Field

The invention relates to the field of artificial intelligence big data analysis, in particular to a distributed particle swarm clustering algorithm based on MapReduce.

Background

With the development of internet technology, the volume of data that must be stored, analyzed, and processed has grown explosively, and the data being created or collected is also becoming more complex. Generating, managing, and analyzing such data effectively, and extracting useful results from it, requires a comprehensive end-to-end approach that covers every stage from the raw data to the final analysis. Clustering is a data mining technique used in this analysis. The main goal of a clustering algorithm is to divide a set of unlabeled data objects into clusters so that the members of each cluster share common characteristics and are more similar to one another than to members of other clusters. To achieve high-quality clustering, the similarity between data objects within a cluster is maximized and the similarity between data objects in different clusters is minimized. Clustering social network user information, classifying library articles, analyzing student learning in intelligent teaching, and analyzing shoppers' interests and preferences are all problems of clustering very large, high-dimensional data sets. At present, most sequential clustering algorithms scale poorly as the data set grows, and their high time and space complexity drives up the cost of clustering.

MapReduce programming model

MapReduce is a programming model introduced by Google, mainly applied to the parallel processing of large data sets (larger than 1 TB); it parallelizes tasks automatically and provides fault tolerance and load balancing. The model uses Map operations and Reduce operations to cover the whole process from problem formulation to functional abstraction. A Map operation breaks a complex task into several simple tasks: it iterates over a large number of records, extracts useful information from each record, and sends all values with the same key to the same Reduce operation. A Reduce operation aggregates the intermediate results that share a key generated by the Map operations and produces the final results. The Map and Reduce operations are shown in equations (1) and (2).

Map operation:

$$\mathrm{map}(k, v) \rightarrow [(k', v')] \tag{1}$$

Reduce operation:

$$\mathrm{reduce}(k', [v']) \rightarrow [(k', v')] \tag{2}$$
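The key/value flow of equations (1) and (2) can be illustrated with a small single-process sketch. It is only a stand-in for the programming model, not Hadoop itself: the helper `run_mapreduce` and the word-count functions are hypothetical names introduced here, and no distribution, fault tolerance, or load balancing is modeled.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for a MapReduce job (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: every input pair (k, v) yields a list of pairs (k', v').
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # shuffle: collect values by k'
    # Reduce phase: each k' is reduced together with all of its values [v'].
    results = []
    for out_key in sorted(groups):
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results


def count_map(doc_id, text):
    # map(k, v) -> [(k', v')]: emit (word, 1) for every word in the text.
    return [(word, 1) for word in text.split()]


def count_reduce(word, counts):
    # reduce(k', [v']) -> [(k', v')]: sum the counts collected for one word.
    return [(word, sum(counts))]


word_counts = run_mapreduce([(1, "a b a"), (2, "b c")], count_map, count_reduce)
print(word_counts)
```

All values emitted under the same intermediate key reach the same `reduce_fn` call, which is the property the clustering steps later rely on.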

Apache Hadoop is the mainstream open-source framework implementing the MapReduce programming model. Released under the Apache license, it supports data-intensive distributed applications and enables them to work with thousands of independent computers and petabytes of data. The Hadoop Distributed File System (HDFS, the storage component) and MapReduce (the processing component) are the core components of Apache Hadoop. HDFS provides high-throughput access to data while maintaining fault tolerance by creating multiple replicas of each data block. Fig. 1 shows the Hadoop architecture with its operation cycle.

Particle swarm optimization algorithm

In 1995, Kennedy and Eberhart first proposed particle swarm optimization (PSO), a swarm intelligence method. PSO is inspired by a flock of birds searching for the best food source, where the direction in which each bird moves is influenced by its current motion, the best food source it has found so far, and the best food source found by any bird in the flock. In PSO, each particle represents a candidate solution, and the solution changes as the particle moves through the search space. The movement of a particle is influenced by its inertia, its personal best position, and the global best position. A swarm is composed of many particles, each with a fitness value assigned by an objective function that is optimized with respect to the particle's position. In addition to its position and fitness value, a particle carries other information such as its velocity. Each particle also keeps its best personal position and the corresponding best fitness value, and the swarm keeps the best global position and best fitness value reached by any particle. The aim of this work is to compare and analyze distributed implementations of various improved PSO algorithms on the open-source Apache Hadoop MapReduce framework, to solve the large-scale data clustering problem effectively, and to compare the scope and effect of the various MapReduce-based distributed PSO clustering algorithms.

The particle swarm optimization algorithm moves particles within the problem search space using equation (3), where $X_i$ is the position of particle $i$, $t$ is the iteration number, and $V_i$ is the velocity of particle $i$. The particle velocity is updated using equation (4), where $W$ is the inertia weight, $r_1$ and $r_2$ are randomly generated numbers, $\mathit{cons}_1$ and $\mathit{cons}_2$ are constant coefficients, $\mathit{XP}_i$ is the current best position of particle $i$, and $\mathit{XG}$ is the current best global position of the entire swarm.

$$X_i(t+1) = X_i(t) + V_i(t+1) \tag{3}$$

$$V_i(t+1) = W \times V_i(t) + (r_1 \times \mathit{cons}_1) \times [\mathit{XP}_i - X_i(t)] + (r_2 \times \mathit{cons}_2) \times [\mathit{XG} - X_i(t)] \tag{4}$$
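Equations (3) and (4) translate almost directly into code. The following is a minimal sketch for a single particle with list-valued vectors. The function name `pso_step` is hypothetical, and drawing r1 and r2 once per particle per step is an assumption, since the text only says they are randomly generated numbers.

```python
import random


def pso_step(x, v, xp, xg, w, cons1, cons2, rng=random):
    """Return (new_position, new_velocity) for one particle."""
    # Assumption: one random draw per particle per step.
    r1, r2 = rng.random(), rng.random()
    # Equation (4): inertia term + pull toward personal best + pull toward global best.
    new_v = [
        w * vi + (r1 * cons1) * (xpi - xi) + (r2 * cons2) * (xgi - xi)
        for xi, vi, xpi, xgi in zip(x, v, xp, xg)
    ]
    # Equation (3): move the particle by its new velocity.
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]
    return new_x, new_v


# A particle sitting on both its personal best and the global best:
# the two attraction terms vanish, so only the inertia term W * V remains.
position = [1.0, 2.0]
new_x, new_v = pso_step(position, [2.0, -4.0], xp=position, xg=position,
                        w=0.5, cons1=2.0, cons2=2.0)
print(new_x, new_v)
```

In the example the random factors multiply zero differences, so the result is deterministic: the velocity is halved by the inertia weight and the position moves by that halved velocity.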

Disclosure of Invention

The invention aims to provide a distributed particle swarm clustering algorithm based on MapReduce, which effectively solves the clustering problem of a super-large-scale commercial data set and realizes high-quality clustering.

In order to solve the technical problems, the technical scheme of the invention is as follows: the distributed particle swarm clustering algorithm based on MapReduce comprises the following steps:

Step 1: a MapReduce job updates the centroids of the particle swarm;

Step 2: a MapReduce job evaluates the fitness of the swarm with the new particle centroids generated in step 1 and computes a new fitness value for the updated swarm; the fitness evaluation is based on a fitness function that measures the distance between all data points and the particle centroids as an average distance;

Step 3: the fitness values computed in step 2 are merged with the updated swarm generated in step 1, and the best personal centroids and the best global centroid are updated at the same time; the algorithm then returns to step 1 for the next iteration.

According to this scheme, step 1 is specifically as follows. The Map function in MapReduce receives the particles together with their identification numbers, where the particle ID is the key and the particle itself is the value. The value comprises the particle's centroid vector, velocity vector, fitness value, best personal centroid, best personal fitness value, best global centroid, and best global fitness value.

In the Map function, the centroid is updated according to the following formulas:

$$X_i(t+1) = X_i(t) + V_i(t+1) \tag{3}$$

$$V_i(t+1) = W \times V_i(t) + (r_1 \times \mathit{cons}_1) \times [\mathit{XP}_i - X_i(t)] + (r_2 \times \mathit{cons}_2) \times [\mathit{XG} - X_i(t)] \tag{4}$$

Equation (3) moves particles within the problem search space, where $X_i$ is the position of particle $i$, $t$ is the iteration number, and $V_i$ is the velocity of particle $i$. The particle velocity is updated according to equation (4), where $W$ is the inertia weight, $r_1$ and $r_2$ are randomly generated numbers, $\mathit{cons}_1$ and $\mathit{cons}_2$ are constant coefficients, $\mathit{XP}_i$ is the current best position of particle $i$, and $\mathit{XG}$ is the current best global position of the entire swarm. The PSO coefficients $\mathit{cons}_1$ and $\mathit{cons}_2$ and the inertia weight $W$ used in equation (4) are retrieved from a configuration file. The Map function then passes the particles with updated centroids to the Reduce function.

In step 1, the Reduce function is an identity Reduce function, which sorts the Map results and combines them into a single output file; the particle swarm is stored in the distributed file system for use in steps 2 and 3.

According to this scheme, step 2 is specifically as follows. The Map function receives the data records together with their record IDs, where the recordID is the key and the data record itself is the value. The Map function first retrieves the particle swarm from the distributed cache of the MapReduce framework, then extracts the centroid vectors of each particle, computes the distance between the record and each centroid vector, and finally obtains the minimum distance together with the ID of the corresponding centroid. The Map function forms a new composite key from the particle ID and the ID of the nearest centroid, and forms the new value from the minimum distance; it then sends the new key and the new value to the Reduce function.

The Reduce function computes the average distance over the values that share a key and assigns it as the fitness value of that centroid within that particle; it then emits the key together with the average distance as the new fitness value, which is stored in the distributed file system. The fitness value is calculated as follows:

$$\mathit{Fitness} = \frac{1}{k} \sum_{j=1}^{k} \frac{\sum_{i=1}^{n_j} \mathit{Distance}(R_i, C_j)}{n_j} \tag{5}$$

$$\mathit{Distance}(R_i, C_j) = \sum_{v=1}^{D} \lvert R_{iv} - C_{jv} \rvert \tag{6}$$

In formula (5), $n_j$ is the number of records belonging to cluster $j$; $R_i$ is the $i$-th record; $k$ is the number of available clusters; and $\mathit{Distance}(R_i, C_j)$ is the distance between record $R_i$ and cluster centroid $C_j$, computed with the Manhattan distance formula.

In formula (6), $D$ is the dimension of record $R_i$; $R_{iv}$ is the value of dimension $v$ in record $R_i$; and $C_{jv}$ is the value of dimension $v$ in centroid $C_j$.
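The fitness evaluation can be sketched sequentially before it is split into Map and Reduce functions. The sketch assumes that formula (5) is the mean, over the k clusters, of the mean Manhattan distance between each centroid and the records assigned to it, and that each record is assigned to its nearest centroid. The function names and the handling of empty clusters (they contribute zero) are assumptions made for illustration.

```python
def manhattan(record, centroid):
    # Formula (6): sum of absolute differences over all D dimensions.
    return sum(abs(r - c) for r, c in zip(record, centroid))


def particle_fitness(records, centroids):
    """Fitness of one particle, i.e. of one set of k cluster centroids."""
    k = len(centroids)
    sums, counts = [0.0] * k, [0] * k
    for record in records:
        distances = [manhattan(record, c) for c in centroids]
        j = min(range(k), key=distances.__getitem__)  # nearest centroid
        sums[j] += distances[j]
        counts[j] += 1
    # Assumption: a cluster with no records contributes 0 to the sum.
    return sum(s / n for s, n in zip(sums, counts) if n) / k


records = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
centroids = [[1.0, 0.0], [10.0, 11.0]]
fitness = particle_fitness(records, centroids)
print(fitness)
```

Every record in the example lies at Manhattan distance 1 from its nearest centroid, so both per-cluster averages are 1 and the particle fitness is 1. A smaller value means tighter clusters.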

According to this scheme, step 3 is specifically as follows. The outputs of step 1 and step 2 are merged to form the new swarm: the new fitness value of each particle is obtained by summing all of that particle's centroid fitness values generated in step 2, and the swarm is then updated with the new fitness values. The best personal fitness value of each particle is compared with the particle's new fitness value; if the new value is smaller, the best personal fitness value and its centroid are updated. If the fitness value of any particle is smaller than the current best global fitness value, the best global fitness value and its centroid are updated. The new swarm, with the new information, is then saved in the distributed file system as input for the next iteration.

The invention has the following beneficial effects. The MapReduce-based PSO clustering algorithm (PSOC-MR) adopts the MapReduce model and combines the Hadoop framework with the particle swarm algorithm; it completes the two main operations of particle centroid updating and fitness evaluation and achieves high-quality clustering of large-scale data. By running in MapReduce's distributed parallel mode, PSOC-MR overcomes the low efficiency of PSO clustering on large data sets. The algorithm first formulates the clustering task as an optimization problem and then obtains the best solution by minimizing the distance between the data points and the centroids of their clusters. The algorithm is similar to k-means clustering, except that the centroid of each cluster is updated according to the velocity of the particles. PSOC-MR shows good scalability and speedup as the number of clusters and the size of the data set increase, and can effectively solve the clustering problem for very large commercial data sets.

Drawings

FIG. 1 is a schematic diagram of a Hadoop architecture in the prior art;

FIG. 2 is a diagram of the architectural framework for the PSOC-MR algorithm of an embodiment of the present invention;

FIG. 3 is a flowchart of the Map function body of module 1 according to the embodiment of the present invention;

FIG. 4 is a flow chart of Reduce function of module 1 according to an embodiment of the present invention;

FIG. 5 is a flowchart of the Map function of module 2 according to an embodiment of the present invention;

FIG. 6 is a flowchart of Reduce function of module 2 according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Referring to figs. 1 to 6, the present invention is a MapReduce-based distributed particle swarm clustering algorithm, described in detail below.

PSO algorithm design

The MapReduce-based PSO clustering algorithm (PSOC-MR) expresses the clustering task as an optimization problem, obtaining the best solution by minimizing the distance between the data points and the centroids of their clusters. PSOC-MR is a clustering algorithm similar to k-means, in which each cluster is represented by its centroid. In k-means clustering, the centroid is computed as a weighted average of the points in the cluster, whereas in PSOC-MR the centroid of each cluster is updated according to the velocity of the swarm's particles.

In the clustering process of the PSOC-MR algorithm, each particle P_i contains the information shown in Table 1.

Table 1. Information contained in particle P_i

Information name	Meaning
Centroid vector (CV)	Current cluster centroid vector
Velocity vector (VV)	Current velocity vector
Fitness value (FV)	Current fitness value of the particle at iteration t
Best personal centroid (BPC)	Best personal centroid found so far by P_i
Best personal fitness value (BPCFV)	Best personal fitness value found so far by P_i
Best global centroid (BGC)	Best global centroid found so far in the entire swarm
Best global fitness value (BGCFV)	Best global fitness value found so far in the entire swarm

During each iteration, the information above is updated from the previous swarm state. The PSOC-MR algorithm must complete two main operations, particle centroid updating and fitness evaluation, to cluster large-scale data. In each iteration, every particle's centroids are updated according to the PSO motion equations (3) and (4); because the swarm is large, updating the centroids takes a long time. The fitness evaluation is based on a fitness function that measures the distance between all data points and the particle centroids as an average distance, according to equations (5) and (6).

$$\mathit{Fitness} = \frac{1}{k} \sum_{j=1}^{k} \frac{\sum_{i=1}^{n_j} \mathit{Distance}(R_i, C_j)}{n_j} \tag{5}$$

$$\mathit{Distance}(R_i, C_j) = \sum_{v=1}^{D} \lvert R_{iv} - C_{jv} \rvert \tag{6}$$

In formula (5), $n_j$ is the number of records belonging to cluster $j$; $R_i$ is the $i$-th record; $k$ is the number of available clusters; and $\mathit{Distance}(R_i, C_j)$ is the distance between record $R_i$ and cluster centroid $C_j$, computed with the Manhattan distance formula shown in formula (6).

When the PSO algorithm is used to cluster a large data set, the fitness evaluation takes a long time to execute. Experiments have shown that if a data set contains 50 million ($5 \times 10^7$) data points in 100 dimensions, the number of clusters is 5, and the swarm size is 30, then the algorithm must perform $5 \times 10^7 \times 5 \times 100 \times 30 = 75 \times 10^{10}$ distance-value calculations to complete one iteration. This task required 460 hours on a 4.6 GHz processor.
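The operation count quoted above can be checked with a few lines of arithmetic. The sketch reads the first factor of the product as 5 × 10^7 records and counts one elementary calculation per record, cluster, dimension, and particle; the variable names are illustrative.

```python
records = 5 * 10**7   # data points in the data set (first factor of the product)
clusters = 5          # number of clusters
dimensions = 100      # dimensions per data point
swarm_size = 30       # particles in the swarm

# One elementary calculation per (record, cluster, dimension, particle) combination.
calculations_per_iteration = records * clusters * dimensions * swarm_size
print(f"{calculations_per_iteration:.2e}")
```

The product equals 75 × 10^10, or 7.5 × 10^11, calculations for a single iteration, which is why the fitness evaluation is the operation most in need of parallelization.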

PSOC-MR algorithm architecture framework design

The PSOC-MR algorithm adopts the MapReduce model and distributed processing, which effectively improves the efficiency of clustering very large data sets. It comprises three modules.

Module 1 and module 2 are both MapReduce jobs: module 1 updates the particle swarm centroids, and module 2 evaluates the fitness of the swarm with the new particle centroids generated by module 1. Module 3 merges the fitness values computed by module 2 with the updated swarm generated by module 1, and at the same time updates the best personal centroids and the best global centroid in preparation for the next iteration. The architecture of the PSOC-MR algorithm is shown in fig. 2.

PSOC-MR algorithm implementation

The purpose of module 1 is to launch a MapReduce job that updates the particle centroids. The Map function receives the particles together with their identification numbers, where the particle ID is the key and the particle itself is the value; the value contains all the information of the particle, as shown in Table 1. In the Map function, the centroid is updated according to equations (3) and (4). The job retrieves the information used by equation (4), such as the PSO coefficients (cons1, cons2) and the inertia weight (W), from the configuration file; the Map function then passes the particles with updated centroids to the Reduce function. So that the PSO algorithm benefits from the MapReduce framework, the number of Map tasks is chosen in relation to the number of cluster nodes and the swarm size. The Reduce function in module 1 is an identity Reduce function, which sorts the Map results and combines them into a single output file; the particle swarm is stored in the distributed file system for use by the other two modules. The flows of the Map function and Reduce function of module 1 are shown in figs. 3 and 4.
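A minimal, single-process sketch of module 1 follows. It is not Hadoop code: the particle is modeled as a plain dictionary keyed by the abbreviations of Table 1, `config` stands in for the configuration file, and the function names, the dictionary layout, and the one-draw-per-particle choice for r1 and r2 are all assumptions.

```python
import random


def module1_map(particle_id, particle, config, rng=random):
    """Map: key = particle ID, value = particle; emits the particle with moved centroids."""
    w, c1, c2 = config["W"], config["cons1"], config["cons2"]
    r1, r2 = rng.random(), rng.random()  # assumption: one draw per particle
    new_cv, new_vv = [], []
    for x, v, xp, xg in zip(particle["CV"], particle["VV"],
                            particle["BPC"], particle["BGC"]):
        # Equation (4): new velocity of this centroid.
        nv = [w * vi + (r1 * c1) * (pi - xi) + (r2 * c2) * (gi - xi)
              for xi, vi, pi, gi in zip(x, v, xp, xg)]
        new_vv.append(nv)
        # Equation (3): new position of this centroid.
        new_cv.append([xi + nvi for xi, nvi in zip(x, nv)])
    return [(particle_id, {**particle, "CV": new_cv, "VV": new_vv})]


def identity_reduce(particle_id, particles):
    """Identity Reduce: passes every particle through unchanged."""
    return [(particle_id, p) for p in particles]


config = {"W": 0.5, "cons1": 2.0, "cons2": 2.0}
particle = {"CV": [[0.0, 0.0]], "VV": [[1.0, 2.0]],
            "BPC": [[0.0, 0.0]], "BGC": [[0.0, 0.0]]}
mapped = module1_map(7, particle, config)
swarm = dict(pair for key, value in mapped for pair in identity_reduce(key, [value]))
print(swarm[7]["CV"])
```

Because each particle is updated independently of all others, the Map calls need no communication, which is what makes this step straightforward to distribute.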

The purpose of module 2 is to launch a second MapReduce job that computes new fitness values for the updated swarm. The Map function receives the data records together with their record IDs, where the recordID is the key and the data record itself is the value. The Map function first retrieves the particle swarm from the distributed cache of the MapReduce framework, then extracts the centroid vectors of each particle, computes the distance between the record and each centroid vector, and finally obtains the minimum distance together with the ID of the corresponding centroid. The Map function forms a new composite key from the particle ID and the ID of the nearest centroid, and forms the new value from the minimum distance; it then sends the new key and the new value to the Reduce function.

The Reduce function computes the average distance over the values that share a key and assigns it as the fitness value of that centroid within that particle; it then emits the key together with the average distance as the new fitness value, which is stored in the distributed file system. The flows of the Map function and Reduce function of module 2 are shown in figs. 5 and 6.
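The composite key and the averaging step of module 2 can be sketched in a few lines. The sketch runs in a single process, represents the cached swarm as a dictionary from particle ID to that particle's centroid vectors, and replaces the framework's shuffle with a dictionary of lists; these representations and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def manhattan(record, centroid):
    return sum(abs(r - c) for r, c in zip(record, centroid))


def module2_map(record_id, record, swarm):
    """Map: for every particle, emit ((particle ID, nearest centroid ID), min distance)."""
    pairs = []
    for particle_id, centroids in swarm.items():
        distances = [manhattan(record, c) for c in centroids]
        centroid_id = min(range(len(centroids)), key=distances.__getitem__)
        pairs.append(((particle_id, centroid_id), distances[centroid_id]))
    return pairs


def module2_reduce(key, distances):
    """Reduce: average distance of all records assigned to one centroid of one particle."""
    return [(key, sum(distances) / len(distances))]


swarm = {0: [[0.0, 0.0], [10.0, 10.0]]}          # stand-in for the distributed cache
records = {0: [1.0, 0.0], 1: [0.0, 3.0], 2: [10.0, 12.0]}

groups = defaultdict(list)                        # stand-in for the shuffle phase
for record_id, record in records.items():
    for key, distance in module2_map(record_id, record, swarm):
        groups[key].append(distance)

centroid_fitness = dict(pair for key, values in groups.items()
                        for pair in module2_reduce(key, values))
print(centroid_fitness)
```

In the example, records 0 and 1 fall to centroid 0 with distances 1 and 3, and record 2 falls to centroid 1 with distance 2, so both centroids receive a fitness value of 2.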

The purpose of module 3 is to merge the outputs of module 1 and module 2 to form the new swarm. The new fitness value of each particle is obtained by summing all of that particle's centroid fitness values generated by module 2, and the swarm is then updated with the new fitness values. Next, the best personal fitness value BPCFV of each particle is compared with the particle's new fitness value. If the new fitness value is smaller than the current BPCFV, the BPCFV and its centroid are updated. In addition, if the fitness value of any particle is smaller than the current best global fitness value BGCFV, the BGCFV and its centroid are updated. The new swarm, with the new information, is then saved in the distributed file system as input for the next iteration.
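The merge performed by module 3 can be sketched as a plain function over in-memory data. Particles are modeled as dictionaries keyed by the abbreviations of Table 1, and the centroid fitness values are keyed by (particle ID, centroid ID); both layouts and the function name are assumptions. The particle fitness is taken as the sum of its centroid fitness values, as described above. Dividing that sum by the fixed number of clusters would not change any comparison.

```python
def module3_merge(swarm, centroid_fitness):
    """Merge fitness values into the swarm and refresh personal and global bests."""
    # Current global best, as recorded in the particles from the previous iteration.
    holder = min(swarm.values(), key=lambda p: p["BGCFV"])
    bgcfv, bgc = holder["BGCFV"], holder["BGC"]
    for particle_id, p in swarm.items():
        # Particle-level fitness: sum of the fitness values of its centroids.
        p["FV"] = sum(fv for (owner, _), fv in centroid_fitness.items()
                      if owner == particle_id)
        if p["FV"] < p["BPCFV"]:          # better than the personal best
            p["BPCFV"], p["BPC"] = p["FV"], [list(c) for c in p["CV"]]
        if p["FV"] < bgcfv:               # better than the global best
            bgcfv, bgc = p["FV"], [list(c) for c in p["CV"]]
    for p in swarm.values():              # every particle carries the global best
        p["BGCFV"], p["BGC"] = bgcfv, bgc
    return swarm


swarm = {
    0: {"CV": [[0.0], [5.0]], "FV": 9.0, "BPC": [[0.5], [5.5]], "BPCFV": 5.0,
        "BGC": [[9.0], [9.5]], "BGCFV": 4.0},
    1: {"CV": [[1.0], [6.0]], "FV": 9.0, "BPC": [[1.5], [6.5]], "BPCFV": 3.0,
        "BGC": [[9.0], [9.5]], "BGCFV": 4.0},
}
centroid_fitness = {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 6.0, (1, 1): 1.0}
merged = module3_merge(swarm, centroid_fitness)
print(merged[0]["BPCFV"], merged[1]["BGCFV"])
```

In the example, particle 0 reaches a fitness of 2.0, which improves both its personal best and the global best, while particle 1 reaches 7.0 and keeps its earlier personal best of 3.0.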

By running in MapReduce's distributed parallel mode, the PSOC-MR algorithm overcomes the low efficiency of PSO clustering on large data sets. The algorithm first formulates the clustering task as an optimization problem and then obtains the best solution by minimizing the distance between the data points and the centroids of their clusters. The algorithm is similar to k-means clustering, except that the centroid of each cluster is updated according to the velocity of the particles. The scalability and speedup of the algorithm were verified on a real data set. The experimental results show that the algorithm can be parallelized successfully on commodity hardware, that it scales well as the data size grows rapidly, that it approaches linear speedup while maintaining clustering quality, and that its clustering quality, scalability, and speedup are all superior to those of the sequential k-means algorithm. The algorithm can therefore effectively solve the clustering problem for massive commercial data and further improve the effectiveness of intelligent data analysis and decision making. It is planned to apply the algorithm next to large-scale analysis of student learning in intelligent teaching.

The foregoing is a more detailed description of the present invention that is presented in conjunction with specific embodiments, and the practice of the invention is not to be considered limited to those descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

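1. A MapReduce-based distributed particle swarm clustering algorithm, characterized in that the algorithm comprises the following steps:

step 1: updating the centroids of the particle swarm by a MapReduce job;

step 2: evaluating, by a MapReduce job, the fitness of the swarm with the new particle centroids generated in step 1, and computing a new fitness value for the updated swarm, wherein the fitness evaluation is based on a fitness function that measures the distance between all data points and the particle centroids as an average distance;

step 3: merging the fitness values computed in step 2 with the updated swarm generated in step 1, and updating the best personal centroids and the best global centroid at the same time; and returning to step 1 for the next iteration.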

2. The MapReduce-based distributed particle swarm clustering algorithm of claim 1, wherein step 1 is specifically as follows: the Map function in MapReduce receives the particles together with their identification numbers, where the particle ID is the key and the particle itself is the value; the value comprises the particle's centroid vector, velocity vector, fitness value, best personal centroid, best personal fitness value, best global centroid, and best global fitness value;

in the Map function, the centroid is updated according to the following formulas:

$$X_i(t+1) = X_i(t) + V_i(t+1) \tag{3}$$

$$V_i(t+1) = W \times V_i(t) + (r_1 \times \mathit{cons}_1) \times [\mathit{XP}_i - X_i(t)] + (r_2 \times \mathit{cons}_2) \times [\mathit{XG} - X_i(t)] \tag{4}$$

formula (3) moves particles within the problem search space, where $X_i$ is the position of particle $i$, $t$ is the iteration number, and $V_i$ is the velocity of particle $i$; the particle velocity is updated according to formula (4), where $W$ is the inertia weight, $r_1$ and $r_2$ are randomly generated numbers, $\mathit{cons}_1$ and $\mathit{cons}_2$ are constant coefficients, $\mathit{XP}_i$ is the current best position of particle $i$, and $\mathit{XG}$ is the current best global position of the entire swarm; the PSO coefficients $\mathit{cons}_1$ and $\mathit{cons}_2$ and the inertia weight $W$ used in formula (4) are retrieved from a configuration file; the Map function then passes the particles with updated centroids to the Reduce function;

3. The MapReduce-based distributed particle swarm clustering algorithm of claim 1, wherein step 2 is specifically as follows: the Map function receives the data records together with their record IDs, where the recordID is the key and the data record itself is the value; the Map function first retrieves the particle swarm from the distributed cache of the MapReduce framework, then extracts the centroid vectors of each particle, computes the distance between the record and each centroid vector, and finally obtains the minimum distance together with the ID of the corresponding centroid; the Map function forms a new composite key from the particle ID and the ID of the nearest centroid, and forms the new value from the minimum distance; the Map function then sends the new key and the new value to the Reduce function;

in formula (5), $n_j$ is the number of records belonging to cluster $j$; $R_i$ is the $i$-th record; $k$ is the number of available clusters; and $\mathit{Distance}(R_i, C_j)$ is the distance between record $R_i$ and cluster centroid $C_j$, computed with the Manhattan distance formula;

4. The MapReduce-based distributed particle swarm clustering algorithm of claim 1, wherein step 3 is specifically as follows: the outputs of step 1 and step 2 are merged to form the new swarm, the new fitness value of each particle being obtained by summing all of that particle's centroid fitness values generated in step 2, and the swarm is then updated with the new fitness values; the best personal fitness value of each particle is compared with the particle's new fitness value, and if the new value is smaller than the current best personal fitness value, the best personal fitness value and its centroid are updated; if the fitness value of any particle is smaller than the current best global fitness value, the best global fitness value and its centroid are updated; the new swarm, with the new information, is then saved in the distributed file system as input for the next iteration.