CN104598565A

CN104598565A - K-means large-scale data clustering method based on stochastic gradient descent algorithm

Info

Publication number: CN104598565A
Application number: CN201510011974.4A
Authority: CN
Inventors: 韩海韵; 丁杰; 戴江鹏; 周爱华; 孙玉宝
Original assignee: State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI
Current assignee: State Grid Corp of China SGCC; China Electric Power Research Institute Co Ltd CEPRI; Smart Grid Research Institute of SGCC
Priority date: 2015-01-09
Filing date: 2015-01-09
Publication date: 2015-05-06
Anticipated expiration: 2035-01-09
Also published as: CN104598565B

Abstract

The invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, comprising the following steps: K cluster centres are randomly initialized; a data sample is sampled and assigned to the class to which it belongs; the objective function is iterated; and steps 1 to 3 are repeated until the cluster centres converge. The method greatly improves the execution efficiency of the algorithm and achieves a better clustering result. Data can be mined more rapidly and effectively, and the proposed method offers one possible approach to processing large-scale electric power data and other data problems.

Description

A K-means large-scale data clustering method based on a stochastic gradient descent algorithm

Technical field

The present invention relates to a clustering method, and in particular to a K-means large-scale data clustering method based on a stochastic gradient descent algorithm.

Background art

In recent years, as the means and capacity for data collection have improved, the volume of data available to individuals, and especially to enterprises, has increased sharply. For example, after State Grid Corporation of China completed the SG186 project, its eight major business applications together added on average more than 50 million records (144 GB) per day; with the construction of the smart grid and SG-ERP, the company's data growth rate will multiply several times over. Storage, backup and disaster recovery of ultra-large-scale composite information will all become important technical fields, and the quality of data-center and disaster-recovery-center construction will directly affect the continuity of the enterprise's overall business. How to use powerful algorithms to make full use of the historical data, real-time data, predicted data and data of different geographical areas and levels produced in electric power production control and enterprise operation, and so complete the "purification" of data value more quickly, is an urgent problem for electric power big data.

Enterprise data come from a wide range of sources and keep growing in scale. In a sense, the proportion of valuable information in a company's data is declining, and finding useful information in massive data is becoming more and more difficult. Organizing and analysing data effectively and thoroughly, reducing or compressing worthless data, and raising the value of valid data can reduce data storage size and the computing resources consumed by data analysis, thereby directly guiding the optimization of enterprise information assets.

With the rapid development of computer technology and storage devices, people can easily obtain tens of thousands or even millions of data records. How to extract from these data the information that is useful or interesting to us has become an urgent problem. The traditional K-means clustering algorithm is one of the most widely used methods in data mining: it first randomly initializes K cluster centres, then divides all samples into K classes according to each sample's distance to the cluster centres, and finally updates each cluster centre with the mean of all samples in its class; the whole process is iterated until convergence. Clearly, every iteration must compute the distance from every sample to all K cluster centres; on large-scale data this computation takes a great deal of time and greatly reduces the execution efficiency of the algorithm.
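For reference, the classical procedure described above can be sketched as follows. This is a minimal NumPy illustration, not code from the patent; the function name `batch_kmeans` and its parameters are hypothetical, and drawing the initial centres from the data is an assumption. The N x K distance matrix rebuilt in every pass is the per-iteration cost the paragraph refers to.

```python
import numpy as np

def batch_kmeans(data, k, max_iter=100, seed=0):
    """Classical (batch) K-means: every pass touches every sample."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centres are k distinct samples drawn at random.
    centres = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # N x K squared distances: each sample against each centre.
        sq_dist = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        new_centres = centres.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members):  # an empty class keeps its previous centre
                new_centres[j] = members.mean(axis=0)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```

Each pass costs on the order of N * K * d operations for N samples of dimension d, which is why the running time grows with the size of the data set.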

At present, the big-data processing flow can generally be summarized in four steps: data acquisition, import and preprocessing, statistics and analysis, and mining and decision support. Of these, mining and decision support mainly performs computation on existing data with various algorithms in order to provide prediction and decision support and thereby meet higher-level data analysis needs; a typical example is the K-means algorithm for clustering. The biggest problem facing traditional data mining techniques, however, is poor real-time performance: processing the data takes a great deal of time. For data that change in real time, useful information is hard to obtain in time, which affects enterprise decision-making.

Summary of the invention

To overcome the above deficiencies of the prior art, the invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, which greatly improves the execution efficiency of the algorithm and achieves a better clustering result. Data can be mined more rapidly and effectively, and the proposed method offers one possible approach to processing electric power big data and other data problems.

To achieve the above object, the invention adopts the following technical solution:

The invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, the method comprising the following steps:

Step 1: randomly initialize K cluster centres;

Step 2: sample a data sample and assign it to the class to which it belongs;

Step 3: iterate on the objective function;

Step 4: repeat steps 1-3 until the cluster centres converge.

In step 1, for a data set to be partitioned into K classes, K cluster centres w_1, w_2, ..., w_k, ..., w_K ∈ R^d are randomly initialized, where R denotes the real numbers and d the dimension, so that R^d denotes d-dimensional real space, and w_k denotes the cluster centre corresponding to the k-th class.

In step 1, the sample counts n_1, n_2, ..., n_k, ..., n_K ∈ N of the cluster centres are initialized to 0, where N denotes the integers and n_k denotes the number of data samples corresponding to the k-th class.

In step 2, a data sample z ∈ R^d is drawn at random and assigned to the class whose cluster centre is at minimum distance from it.

The index of the class whose cluster centre is at minimum distance is denoted k^*, where:

k^{*} = \arg \min_{k} {(z - w_{k})}^{2}

where (z - w_k)^2 denotes the squared distance from data sample z to w_k.
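The assignment rule can be illustrated with a few lines of NumPy. The helper name `nearest_centre` and the example centres are hypothetical and serve only to show the arg min over squared distances.

```python
import numpy as np

def nearest_centre(z, centres):
    """Return k* = argmin_k (z - w_k)^2 for a single sample z (0-based index)."""
    sq_dist = ((centres - z) ** 2).sum(axis=1)  # squared distance to each w_k
    return int(np.argmin(sq_dist))

# Three illustrative centres in R^2 and one sample.
centres = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])
z = np.array([3.0, 3.5])
print(nearest_centre(z, centres))  # squared distances 21.25, 1.25, 11.25 -> prints 1
```

Only K distances are computed per sampled point, independent of the size of the data set.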

Step 3 specifically comprises the following steps:

Step 3-1: let the objective function be Q_kmeans, where:

Q_{kmeans} = \min_{k} \frac{1}{2} {(z - w_{k})}^{2}

The derivative of Q_kmeans with respect to w_{k^*} is denoted ∇_{w_{k^*}} Q_kmeans, where:

\nabla_{w_{k^{*}}} Q_{kmeans} = \frac{\partial Q_{kmeans}}{\partial w_{k^{*}}} = -(z - w_{k^{*}}) = w_{k^{*}} - z

where w_{k^*} is the cluster centre corresponding to the k^*-th class;

Step 3-2: let n_{k^*} denote the number of data samples corresponding to the k^*-th class; n_{k^*} and w_{k^*} are then updated respectively, n_{k^*} by incrementing the count and w_{k^*} by a gradient step on Q_kmeans.
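One way to write the two updates of step 3-2 explicitly, assuming the per-class learning rate 1/n_{k^*} that is customary for online K-means (the rate is an assumption, not stated in the text here), is:

n_{k^{*}} \leftarrow n_{k^{*}} + 1, \qquad w_{k^{*}} \leftarrow w_{k^{*}} - \frac{1}{n_{k^{*}}} \nabla_{w_{k^{*}}} Q_{kmeans} = w_{k^{*}} + \frac{1}{n_{k^{*}}} (z - w_{k^{*}})

With this rate, w_{k^*} is the running mean of the samples assigned to class k^* so far, which mirrors the mean-based centre update of classical K-means.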

In step 4, steps 1-3 are executed repeatedly; if the distance between the cluster centres of two successive iterations is below the threshold of 10^-6, the cluster centres w_1, w_2, ..., w_k, ..., w_K are deemed to have converged.
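The four steps can be put together in a short Python sketch. It is an illustration under stated assumptions, not the patented implementation: the learning rate 1/n_{k^*} is assumed, the initial centres are drawn from the data, initialization is performed once rather than on every repetition, and an "iteration" is taken to be a block of `samples_per_iter` sampled points after which the 10^-6 criterion is checked. The names `sgd_kmeans`, `samples_per_iter` and `max_iter` are hypothetical.

```python
import numpy as np

def sgd_kmeans(data, k, tol=1e-6, samples_per_iter=1000, max_iter=500, seed=0):
    """Online K-means sketch: each sampled point updates a single centre."""
    rng = np.random.default_rng(seed)
    # Step 1: random centres (assumed: k distinct samples) and zero counts.
    centres = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    counts = np.zeros(k, dtype=int)
    for it in range(1, max_iter + 1):
        previous = centres.copy()
        for _ in range(samples_per_iter):
            z = data[rng.integers(len(data))]  # Step 2: sample z
            k_star = int(np.argmin(((z - centres) ** 2).sum(axis=1)))
            counts[k_star] += 1  # n_{k*} <- n_{k*} + 1
            # Step 3: gradient step w <- w - (w - z) / n, with the assumed rate 1/n.
            centres[k_star] -= (centres[k_star] - z) / counts[k_star]
        # Step 4: stop once no centre has moved by more than tol in this block.
        shift = np.linalg.norm(centres - previous, axis=1).max()
        if shift < tol:
            break
    return centres, counts, it
```

Because the centres keep moving slightly with every sample, the `max_iter` cap bounds the run when the tolerance is not reached.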

Compared with the prior art, the beneficial effects of the invention are as follows:

The K-means large-scale data clustering method based on a stochastic gradient descent algorithm provided by the invention greatly reduces the computational complexity of the algorithm, converges faster, and obtains a better clustering result. Because a sample is chosen at random at each iteration, without regard to the samples chosen previously, the stochastic gradient descent algorithm is in essence a process of minimizing the expected risk. The proposed method offers one possible approach to processing electric power big data and other data problems.
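The expected-risk reading can be made explicit. Assuming the samples z are drawn independently from a distribution P and that \gamma_t is the step size at iteration t (both symbols are introduced here for illustration and do not appear in the original), the quantity being minimized and the stochastic step are:

E(w) = \mathbb{E}_{z \sim P} \left[ \min_{k} \frac{1}{2} {(z - w_{k})}^{2} \right], \qquad w_{k^{*}} \leftarrow w_{k^{*}} - \gamma_{t} \nabla_{w_{k^{*}}} Q_{kmeans}(z_{t}, w)

Each step uses the gradient at a single random sample z_t as a noisy estimate of the gradient of E(w), so no pass over the full data set is needed.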

Brief description of the drawings

Fig. 1 is a schematic diagram of the stochastic gradient descent algorithm in the embodiment of the invention;

Fig. 2 is the distribution of the raw data in the embodiment of the invention;

Fig. 3 is the clustering result of the prior-art K-means clustering method;

Fig. 4 is the clustering result of K-means based on the stochastic gradient descent algorithm in the embodiment of the invention.

Detailed description

The invention is described in further detail below with reference to the accompanying drawings.

Embodiment

First, two "moon"-shaped clusters of samples are generated at random, represented by triangles and dots respectively, as shown in Fig. 2. The data have two feature dimensions and each class contains 200,000 samples, 400,000 in total, which constitutes a large-scale data processing problem; for ease of display, only part of the data is plotted. The computer used for the experiment in this embodiment has a 64-bit operating system, 16 GB of memory and an Intel processor, and the software environment is MATLAB R2012a. The detailed process is as follows:

a) Randomly initialize 2 cluster centres w_1, w_2 ∈ R^2, and initialize the sample counts of the two classes, n_1, n_2 ∈ N, to 0;

b) Randomly sample a data sample z ∈ R^2 and assign it to the corresponding class according to the formula

k^{*} = \arg \min_{k = 1,2} {(z - w_{k})}^{2}

c) For the objective function

Q_{kmeans} = \min_{k = 1,2} \frac{1}{2} {(z - w_{k})}^{2}

compute its derivative with respect to w_{k^*}:

\nabla_{w_{k^{*}}} Q_{kmeans} = w_{k^{*}} - z

d) Update n_{k^*} and w_{k^*};

e) Repeat steps b) to d) until the cluster centres w_1, w_2 converge.
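A scaled-down version of this experiment can be sketched in Python (the embodiment itself used MATLAB). Everything specific here is an assumption made for illustration: the moon generator `make_two_moons`, its noise level, the sample size, the number of sampled points, the 1/n_{k^*} step and the accuracy helper `two_class_accuracy` are hypothetical and are not taken from the patent.

```python
import numpy as np

def make_two_moons(n_per_class, noise=0.1, seed=0):
    """Two interleaved half-circles, a stand-in for the 'moon' data of Fig. 2."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, size=n_per_class)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1.0 - np.cos(t), 0.5 - np.sin(t)])
    data = np.vstack([upper, lower])
    data = data + rng.normal(scale=noise, size=data.shape)
    labels = np.repeat([0, 1], n_per_class)
    return data, labels

def two_class_accuracy(data, labels, centres):
    """Share of samples whose nearest centre matches the true class,
    taking the better of the two possible cluster-to-class labelings."""
    sq_dist = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    agree = float((sq_dist.argmin(axis=1) == labels).mean())
    return max(agree, 1.0 - agree)

data, labels = make_two_moons(2000)
rng = np.random.default_rng(1)
# a) two random centres (assumed: drawn from the data) and zero counts
centres = data[rng.choice(len(data), size=2, replace=False)].copy()
counts = np.zeros(2, dtype=int)
for _ in range(20000):  # e) repeat b) to d) for a fixed budget of samples
    z = data[rng.integers(len(data))]  # b) sample z
    k_star = int(((z - centres) ** 2).sum(axis=1).argmin())  # b) nearest centre
    counts[k_star] += 1  # d) count update
    centres[k_star] += (z - centres[k_star]) / counts[k_star]  # c), d) assumed 1/n step
print(f"accuracy: {two_class_accuracy(data, labels, centres):.4f}")
```

Since two straight-line Voronoi cells cannot follow the curved moons, the accuracy of any K-means variant on such data stays well below 100%.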

Fig. 3 shows the result of the classical K-means clustering algorithm after 3 iterations, which took 32 seconds in total. Fig. 4 shows the result of the K-means clustering algorithm based on stochastic gradient descent after 500 iterations, which took 17 seconds. The circled "x" marks represent the two cluster centres. As the figures show, the cluster centres in the two plots are almost identical. In quantitative terms, classical K-means clustering takes 32 seconds, whereas K-means clustering based on the stochastic gradient descent algorithm takes only 17 seconds and reaches an accuracy of 78.41%, slightly higher than the 78.1% of classical K-means clustering.

Finally, it should be noted that the above embodiment is intended only to illustrate, not to limit, the technical solution of the invention. Those of ordinary skill in the art may still modify the specific embodiments of the invention or replace them with equivalents with reference to the above embodiment, and any such modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the invention.

Claims

1. A K-means large-scale data clustering method based on a stochastic gradient descent algorithm, characterized in that the method comprises the following steps:

Step 1: randomly initialize K cluster centres;

Step 2: sample a data sample and assign it to the class to which it belongs;

Step 3: iterate on the objective function;

Step 4: repeat steps 1-3 until the cluster centres converge.

2. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 1, characterized in that: in step 1, for a data set to be partitioned into K classes, K cluster centres w_1, w_2, ..., w_k, ..., w_K ∈ R^d are randomly initialized, where R denotes the real numbers and d the dimension, so that R^d denotes d-dimensional real space, and w_k denotes the cluster centre corresponding to the k-th class.

3. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 2, characterized in that: in step 1, the sample counts n_1, n_2, ..., n_k, ..., n_K ∈ N of the cluster centres are initialized to 0, where N denotes the integers and n_k denotes the number of data samples corresponding to the k-th class.

4. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 3, characterized in that: in step 2, a data sample z ∈ R^d is drawn at random and assigned to the class whose cluster centre is at minimum distance from it.

5. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 4, characterized in that: the index of the class whose cluster centre is at minimum distance is denoted k^*, where:

k^{*} = \arg \min_{k} {(z - w_{k})}^{2}

where (z - w_k)^2 denotes the squared distance from data sample z to w_k.

6. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 4, characterized in that step 3 specifically comprises the following steps:

Step 3-1: let the objective function be Q_kmeans, where:

Q_{kmeans} = \min_{k} \frac{1}{2} {(z - w_{k})}^{2}

The derivative of Q_kmeans with respect to w_{k^*} is denoted ∇_{w_{k^*}} Q_kmeans, where:

\nabla_{w_{k^{*}}} Q_{kmeans} = \frac{\partial Q_{kmeans}}{\partial w_{k^{*}}} = -(z - w_{k^{*}}) = w_{k^{*}} - z

where w_{k^*} is the cluster centre corresponding to the k^*-th class;

Step 3-2: let n_{k^*} denote the number of data samples corresponding to the k^*-th class; n_{k^*} and w_{k^*} are updated respectively, the count by

n_{k^{*}} \leftarrow n_{k^{*}} + 1

and the cluster centre w_{k^*} by a gradient step on Q_kmeans.

7. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 6, characterized in that: in step 4, steps 1-3 are executed repeatedly; if the distance between the cluster centres of two successive iterations is below the threshold of 10^-6, the cluster centres w_1, w_2, ..., w_k, ..., w_K are deemed to have converged.