CN104598565A - K-means large-scale data clustering method based on stochastic gradient descent algorithm - Google Patents

K-means large-scale data clustering method based on stochastic gradient descent algorithm

Info

Publication number
CN104598565A
Authority
CN
China
Prior art keywords
data
cluster centre
gradient descent
stochastic gradient
descent algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510011974.4A
Other languages
Chinese (zh)
Other versions
CN104598565B (en)
Inventor
韩海韵
丁杰
戴江鹏
周爱华
孙玉宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Smart Grid Research Institute of SGCC
Original Assignee
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI
Priority to CN201510011974.4A
Publication of CN104598565A
Application granted
Publication of CN104598565B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, comprising the following steps: K cluster centers are randomly initialized; a data sample is sampled and assigned to the class it belongs to; the objective function is iterated; and steps 1 to 3 are repeated until the cluster centers converge. The method greatly improves the execution efficiency of the algorithm and achieves a better clustering result. Data can be mined more quickly and effectively, and the method offers a possible way to handle large-scale power data and other data problems.

Description

A K-means large-scale data clustering method based on a stochastic gradient descent algorithm
Technical field
The present invention relates to a clustering method, and in particular to a K-means large-scale data clustering method based on a stochastic gradient descent algorithm.
Background art
In recent years, as the means and capacity for data collection have improved, the volume of data available to individuals, and especially to enterprises, has grown sharply. For example, after State Grid Corporation of China completed the SG186 project, its eight major business applications have been adding on average more than 50 million records (144 GB) per day, and with the construction of the smart grid and SG-ERP the company's data growth rate will multiply several times over. Storage, backup and disaster recovery for ultra-large-scale composite information will all become important technical fields, and how well data centers and disaster recovery centers are built will directly affect the continuity of the enterprise's overall business. How to use powerful algorithms to make full use of the historical data, real-time data, forecast data and data from different geographic regions and levels produced in electric power production control and enterprise operation, and so extract the value of the data more quickly, is a pressing problem for electric power big data.
Enterprise data comes from a wide range of sources and keeps growing in scale. In a sense, the proportion of valuable information within a company's data is falling, and finding useful information in massive data is becoming harder and harder. Organizing and analyzing data effectively and thoroughly, reducing or compressing data of no value, and raising the value of useful data can reduce the scale of data storage and the computing resources consumed by data analysis, and thereby directly guide the optimization of enterprise information assets.
With the rapid development of computer technology and storage devices, people can easily obtain tens of thousands or even millions of data items. How to extract from these data the information that is useful or interesting to us has become an urgent problem. The traditional K-means clustering algorithm is a widely used method in data mining. It first randomly initializes K cluster centers, then assigns all samples to K different classes according to the distance from each sample to the cluster centers, and finally updates each cluster center with the mean of all samples in its class; the whole process is iterated until convergence. Each iteration therefore has to compute the distance from every sample to all K cluster centers, so on large-scale data the computation takes a great deal of time and greatly reduces the execution efficiency of the algorithm.
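To make the per-iteration cost concrete, the following Python sketch performs one batch K-means iteration as described above and counts the distance evaluations. It is an illustration under assumptions: the names `squared_distance` and `lloyd_iteration`, the four-point data set, and the policy of leaving the center of an empty cluster unchanged are hypothetical and do not come from the patent.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_iteration(samples, centers):
    """One batch K-means iteration: assign every sample, then recompute means.

    Returns the new centers and the number of distance evaluations performed.
    """
    k = len(centers)
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    evaluations = 0
    for z in samples:
        # Every sample is compared against every center: n * K distances.
        distances = [squared_distance(z, w) for w in centers]
        evaluations += k
        nearest = distances.index(min(distances))
        counts[nearest] += 1
        for j in range(dim):
            sums[nearest][j] += z[j]
    new_centers = []
    for c in range(k):
        if counts[c] == 0:
            # Assumed policy: an empty cluster keeps its previous center.
            new_centers.append(list(centers[c]))
        else:
            new_centers.append([s / counts[c] for s in sums[c]])
    return new_centers, evaluations


# Hypothetical toy data: two obvious groups of two points each.
samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [[0.0, 0.0], [10.0, 10.0]]
centers, evaluations = lloyd_iteration(samples, centers)
```

With n samples and K centers the counter reaches n × K in every iteration (8 in this toy case), which is the cost that grows with the size of the data set.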
At present, the processing flow for big data can generally be summarized in four steps: data acquisition, import and preprocessing, statistics and analysis, and mining and decision support. Mining and decision support mainly runs computations based on various algorithms over existing data to provide prediction and decision support and so meet the needs of higher-level data analysis; a typical example is the K-means algorithm for clustering. However, the biggest problem facing traditional data mining techniques is poor real-time performance: processing the data takes a great deal of time. For data that changes in real time it is difficult to obtain useful information promptly, which affects enterprise decision-making.
Summary of the invention
In order to overcome the above deficiencies of the prior art, the invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, which greatly improves the execution efficiency of the algorithm and achieves a better clustering result. Data can be mined more quickly and effectively, and the method offers a possible way to handle electric power big data and other data problems.
In order to achieve the above object, the invention adopts the following technical solution:
The invention provides a K-means large-scale data clustering method based on a stochastic gradient descent algorithm, the method comprising the following steps:
Step 1: randomly initialize K cluster centers;
Step 2: sample a data sample and assign it to the class it belongs to;
Step 3: iterate the objective function;
Step 4: repeat steps 1-3 until the cluster centers converge.
In step 1, for a data set of K classes to be processed, K cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K \in \mathbb{R}^d$ are randomly initialized, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the dimension, so that $\mathbb{R}^d$ denotes the $d$-dimensional real space, and $w_k$ denotes the cluster center corresponding to the $k$-th class of the data set.
In step 1, the numbers of data samples in the cluster centers, $n_1, n_2, \ldots, n_k, \ldots, n_K \in \mathbb{N}$, are initialized to 0, where $\mathbb{N}$ denotes the integers and $n_k$ denotes the number of data samples corresponding to the $k$-th class of the data set.
In step 2, a data sample $z \in \mathbb{R}^d$ is sampled at random and assigned to the class of the cluster center at minimum distance.
The index of the class whose cluster center is at minimum distance is denoted $k^*$:
$$k^* = \arg\min_k (z - w_k)^2$$
where $(z - w_k)^2$ denotes the distance from the data sample $z$ to $w_k$.
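The assignment rule of step 2 can be illustrated with a short Python sketch. The function name `nearest_center` and the example coordinates are hypothetical, and the returned index is 0-based, whereas the text numbers the classes from 1.

```python
def nearest_center(z, centers):
    """Return k*, the index of the center with minimum squared distance to z."""
    best_index = 0
    best_distance = float("inf")
    for k, w in enumerate(centers):
        # Squared Euclidean distance (z - w_k)^2 summed over all dimensions.
        distance = sum((zi - wi) ** 2 for zi, wi in zip(z, w))
        if distance < best_distance:
            best_index, best_distance = k, distance
    return best_index


# Hypothetical example with K = 3 centers in two dimensions.
centers = [(0.0, 0.0), (4.0, 4.0), (0.0, 5.0)]
k_star = nearest_center((3.0, 3.5), centers)
```

Only K distances are evaluated per sampled point, independent of the size of the data set.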
Step 3 specifically comprises the following steps:
Step 3-1: let the objective function be $Q_{kmeans}$:
$$Q_{kmeans} = \min_k \frac{1}{2}(z - w_k)^2$$
The derivative of $Q_{kmeans}$ with respect to $w_{k^*}$ is denoted $\nabla_{w_{k^*}} Q_{kmeans}$:
$$\nabla_{w_{k^*}} Q_{kmeans} = \frac{\partial Q_{kmeans}}{\partial w_{k^*}} = -(z - w_{k^*}) = w_{k^*} - z$$
where $w_{k^*}$ is the cluster center corresponding to the $k^*$-th class of the data set;
Step 3-2: let $n_{k^*}$ denote the number of data samples corresponding to the $k^*$-th class of the data set; $w_{k^*}$ is updated by a gradient descent step using $\nabla_{w_{k^*}} Q_{kmeans}$, and $n_{k^*}$ is updated by $n_{k^*} \leftarrow n_{k^*} + 1$.
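The gradient and update of step 3 might be sketched in Python as follows. The text above does not spell out the step size, so the sketch assumes a per-center rate of $1/n_{k^*}$ applied after the count is incremented, a common choice for online K-means; that rate, the order of the two updates, and the names `gradient` and `sgd_update` are assumptions rather than the patent's wording.

```python
def gradient(z, w):
    """Gradient of 0.5 * (z - w)^2 with respect to w, which equals w - z."""
    return [wi - zi for zi, wi in zip(z, w)]


def sgd_update(z, w, n):
    """Apply one update to the winning center w that has seen n samples.

    Assumption: the count is incremented first and the step size is 1 / n.
    """
    n += 1
    rate = 1.0 / n
    grad = gradient(z, w)
    w = [wi - rate * gi for wi, gi in zip(w, grad)]
    return w, n


# Hypothetical example: three samples all assigned to the same center.
w, n = [5.0, 5.0], 0
for z in [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]:
    w, n = sgd_update(z, w, n)
```

Under the assumed rate, each center is the running mean of the samples assigned to it so far; in the example the center ends at the mean (3, 5) of the three samples regardless of its starting value.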
In step 4, steps 1-3 are executed repeatedly; if the distance between the cluster centers of two successive iterations is less than $10^{-6}$, the cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K$ are deemed to have converged.
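A minimal Python sketch of this stopping test is given below. The text does not say how the distance between two sets of centers is aggregated, so the sketch assumes the largest Euclidean displacement of any single center; the function name `centers_converged` is hypothetical.

```python
import math


def centers_converged(previous, current, tolerance=1e-6):
    """True if every center moved less than `tolerance` between two iterations."""
    # Assumed aggregation: the largest displacement over all centers.
    largest_shift = max(math.dist(p, c) for p, c in zip(previous, current))
    return largest_shift < tolerance


# Hypothetical example: the second center moved by 5e-7, below the threshold.
previous = [(0.0, 0.0), (1.0, 1.0)]
current = [(0.0, 0.0), (1.0, 1.0 + 5e-7)]
converged = centers_converged(previous, current)
```

Other aggregations, such as the sum of displacements, would fit the wording equally well; the maximum is used here only because it bounds the movement of every center.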
Compared with the prior art, the beneficial effects of the invention are as follows:
The K-means large-scale data clustering method based on a stochastic gradient descent algorithm provided by the invention greatly reduces the computational complexity of the algorithm, converges faster, and can obtain a better clustering result. Because a sample is chosen at random in each iteration and previously seen samples need not be considered, the stochastic gradient descent algorithm is in essence a process of minimizing the expected risk. The method offers a possible way to handle electric power big data and other data problems.
Brief description of the drawings
Fig. 1 is a schematic diagram of the stochastic gradient descent algorithm in an embodiment of the invention;
Fig. 2 is a distribution plot of the raw data in an embodiment of the invention;
Fig. 3 is the clustering result of the prior-art K-means clustering method;
Fig. 4 is the K-means clustering result based on the stochastic gradient descent algorithm in an embodiment of the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings.
Embodiment
First, two "moon"-shaped sample clusters are randomly generated and represented by triangles and dots respectively, as shown in Fig. 2. The data consist of two-dimensional features; each class contains 200,000 samples, for a total of 400,000 data points, which makes this a big data processing problem. For ease of display, only part of the data is plotted. The computer used for the experiment in this embodiment has a 64-bit operating system, 16 GB of memory and an Intel processor; the software environment is MATLAB R2012a. The detailed process is as follows:
a) randomly initialize 2 cluster centers $w_1, w_2 \in \mathbb{R}^2$, and initialize the per-class sample counts $n_1, n_2 \in \mathbb{N}$ to 0;
b) randomly sample a data sample $z \in \mathbb{R}^2$ and assign it to the corresponding class according to $k^* = \arg\min_{k=1,2} (z - w_k)^2$;
c) for the objective function $Q_{kmeans} = \min_{k=1,2} \frac{1}{2}(z - w_k)^2$, compute its derivative with respect to $w_{k^*}$, $\nabla_{w_{k^*}} Q_{kmeans} = w_{k^*} - z$;
d) update $w_{k^*}$ and $n_{k^*}$;
e) repeat steps b) to d) until the cluster centers $w_1, w_2$ converge.
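Steps a) to e) can be put together in a compact Python sketch. It is not the MATLAB program used in the embodiment. Several choices are assumptions made only for illustration: the centers are initialized from randomly chosen samples, the step size is 1 divided by the per-center count, convergence is tested against a snapshot taken every `check_every` updates, and two Gaussian blobs stand in for the moon-shaped data. The name `sgd_kmeans` and all parameter names are hypothetical.

```python
import math
import random


def sgd_kmeans(samples, k, max_iterations=500, tolerance=1e-6,
               check_every=100, seed=0):
    """Online K-means following steps a) to e).

    Returns (centers, counts, iterations performed).
    """
    rng = random.Random(seed)
    # a) Random initialization (assumed: k distinct samples); counts start at 0.
    centers = [list(z) for z in rng.sample(samples, k)]
    counts = [0] * k
    snapshot = [list(w) for w in centers]
    iterations = 0
    while iterations < max_iterations:
        # b) Draw one sample and find the index k* of the nearest center.
        z = rng.choice(samples)
        distances = [sum((zi - wi) ** 2 for zi, wi in zip(z, w)) for w in centers]
        k_star = distances.index(min(distances))
        # c) Gradient of 0.5 * (z - w)^2 with respect to the winning center.
        grad = [wi - zi for zi, wi in zip(z, centers[k_star])]
        # d) Increment the count, then take a gradient step (assumed rate 1 / n).
        counts[k_star] += 1
        rate = 1.0 / counts[k_star]
        centers[k_star] = [wi - rate * gi for wi, gi in zip(centers[k_star], grad)]
        iterations += 1
        # e) Periodically stop if no center moved more than the tolerance.
        if iterations % check_every == 0:
            shift = max(math.dist(old, new) for old, new in zip(snapshot, centers))
            if shift < tolerance:
                break
            snapshot = [list(w) for w in centers]
    return centers, counts, iterations


# Stand-in data (hypothetical): two well-separated Gaussian blobs of 200 points.
data_rng = random.Random(1)
samples = [(data_rng.gauss(0.0, 0.5), data_rng.gauss(0.0, 0.5)) for _ in range(200)]
samples += [(data_rng.gauss(5.0, 0.5), data_rng.gauss(5.0, 0.5)) for _ in range(200)]
centers, counts, iterations = sgd_kmeans(samples, k=2, max_iterations=2000)
```

Each pass through the loop touches a single sample and K centers, so the cost of an update does not depend on the size of the data set. Only steps b) to d) are repeated here, as in step e); the initialization in step a) runs once.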
Fig. 3 shows the result of the classical K-means clustering algorithm after 3 iterations, which took 32 seconds in total. Fig. 4 shows the result of the K-means clustering algorithm based on the stochastic gradient descent algorithm, obtained in 17 seconds after 500 iterations; the "x" markers indicate the two cluster centers. As the figures show, the cluster centers in the two plots are almost identical. Quantitatively, classical K-means clustering takes 32 seconds, whereas K-means clustering based on the stochastic gradient descent algorithm takes only 17 seconds and reaches an accuracy of 78.41%, slightly higher than the 78.1% of classical K-means clustering.
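The text does not state how the accuracy figures were computed. One common convention for two clusters, assumed in the Python sketch below, is to compare cluster indices with the true class labels under whichever of the two possible label matchings agrees more often. The function name `clustering_accuracy` and the eight-sample example are hypothetical.

```python
def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy for two clusters, taking the better of the two label matchings.

    Both sequences are assumed to contain only the values 0 and 1.
    """
    agree = sum(1 for t, c in zip(true_labels, cluster_ids) if t == c)
    total = len(true_labels)
    # Cluster indices are arbitrary, so swapping 0 and 1 is also a valid matching.
    return max(agree, total - agree) / total


# Hypothetical example: the clusters are found with the indices swapped
# and two of the eight samples end up on the wrong side.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
cluster_ids = [1, 1, 1, 0, 0, 0, 0, 1]
accuracy = clustering_accuracy(true_labels, cluster_ids)
```

Here six of eight samples agree under the swapped matching, giving an accuracy of 0.75.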
Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the invention and not to limit it. A person of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent substitutions with reference to the above embodiment, and any such modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the invention.

Claims (7)

1. A K-means large-scale data clustering method based on a stochastic gradient descent algorithm, characterized in that the method comprises the following steps:
Step 1: randomly initialize K cluster centers;
Step 2: sample a data sample and assign it to the class it belongs to;
Step 3: iterate the objective function;
Step 4: repeat steps 1-3 until the cluster centers converge.
2. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 1, characterized in that: in step 1, for a data set of K classes to be processed, K cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K \in \mathbb{R}^d$ are randomly initialized, where $\mathbb{R}$ denotes the real numbers and $d$ denotes the dimension, so that $\mathbb{R}^d$ denotes the $d$-dimensional real space, and $w_k$ denotes the cluster center corresponding to the $k$-th class of the data set.
3. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 2, characterized in that: in step 1, the numbers of data samples in the cluster centers, $n_1, n_2, \ldots, n_k, \ldots, n_K \in \mathbb{N}$, are initialized to 0, where $\mathbb{N}$ denotes the integers and $n_k$ denotes the number of data samples corresponding to the $k$-th class of the data set.
4. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 3, characterized in that: in step 2, a data sample $z \in \mathbb{R}^d$ is sampled at random and assigned to the class of the cluster center at minimum distance.
5. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 4, characterized in that: the index of the class whose cluster center is at minimum distance is denoted $k^*$:
$$k^* = \arg\min_k (z - w_k)^2$$
where $(z - w_k)^2$ denotes the distance from the data sample $z$ to $w_k$.
6. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 4, characterized in that step 3 specifically comprises the following steps:
Step 3-1: let the objective function be $Q_{kmeans}$:
$$Q_{kmeans} = \min_k \frac{1}{2}(z - w_k)^2$$
The derivative of $Q_{kmeans}$ with respect to $w_{k^*}$ is denoted $\nabla_{w_{k^*}} Q_{kmeans}$:
$$\nabla_{w_{k^*}} Q_{kmeans} = \frac{\partial Q_{kmeans}}{\partial w_{k^*}} = -(z - w_{k^*}) = w_{k^*} - z$$
where $w_{k^*}$ is the cluster center corresponding to the $k^*$-th class of the data set;
Step 3-2: let $n_{k^*}$ denote the number of data samples corresponding to the $k^*$-th class of the data set; $w_{k^*}$ is updated by a gradient descent step using $\nabla_{w_{k^*}} Q_{kmeans}$, and $n_{k^*}$ is updated by $n_{k^*} \leftarrow n_{k^*} + 1$.
7. The K-means large-scale data clustering method based on a stochastic gradient descent algorithm according to claim 6, characterized in that: in step 4, steps 1-3 are executed repeatedly; if the distance between the cluster centers of two successive iterations is less than $10^{-6}$, the cluster centers $w_1, w_2, \ldots, w_k, \ldots, w_K$ are deemed to have converged.
CN201510011974.4A 2015-01-09 2015-01-09 K-means large-scale data clustering method based on stochastic gradient descent algorithm Active CN104598565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510011974.4A CN104598565B (en) 2015-01-09 2015-01-09 K-means large-scale data clustering method based on stochastic gradient descent algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510011974.4A CN104598565B (en) 2015-01-09 2015-01-09 K-means large-scale data clustering method based on stochastic gradient descent algorithm

Publications (2)

Publication Number Publication Date
CN104598565A (en) 2015-05-06
CN104598565B (en) 2018-08-14

Family

ID=53124350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510011974.4A Active CN104598565B (en) 2015-01-09 2015-01-09 K-means large-scale data clustering method based on stochastic gradient descent algorithm

Country Status (1)

Country Link
CN (1) CN104598565B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139277A (en) * 2015-08-18 2015-12-09 国家电网公司 Electric power distribution network information clustering system and method
CN105681089A (en) * 2016-01-26 2016-06-15 上海晶赞科技发展有限公司 Network user behavior clustering method, device and terminal
CN108460499A (en) * 2018-04-02 2018-08-28 福州大学 A kind of micro-blog user force arrangement method of fusion user time information
CN108846532A (en) * 2018-03-21 2018-11-20 宁波工程学院 Business risk appraisal procedure and device applied to logistics supply platform chain
CN110288004A (en) * 2019-05-30 2019-09-27 武汉大学 A kind of diagnosis method for system fault and device excavated based on log semanteme
US10503580B2 (en) 2017-06-15 2019-12-10 Microsoft Technology Licensing, Llc Determining a likelihood of a resource experiencing a problem based on telemetry data
CN111385243A (en) * 2018-12-27 2020-07-07 ***通信集团山西有限公司 DDoS detection method, device and equipment
US10805317B2 (en) 2017-06-15 2020-10-13 Microsoft Technology Licensing, Llc Implementing network security measures in response to a detected cyber attack
US10922627B2 (en) 2017-06-15 2021-02-16 Microsoft Technology Licensing, Llc Determining a course of action based on aggregated data
US11062226B2 (en) 2017-06-15 2021-07-13 Microsoft Technology Licensing, Llc Determining a likelihood of a user interaction with a content element

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070118492A1 (en) * 2005-11-18 2007-05-24 Claus Bahlmann Variational sparse kernel machines
CN101488189A (en) * 2009-02-04 2009-07-22 天津大学 Brain-electrical signal processing method based on isolated component automatic clustering process
US20100095254A1 (en) * 2005-08-12 2010-04-15 Demaris David L System and method for testing pattern sensitive algorithms for semiconductor design
CN101872343A (en) * 2009-04-24 2010-10-27 罗彤 Semi-supervised mass data hierarchy classification method
CN103810261A (en) * 2014-01-26 2014-05-21 西安理工大学 K-means clustering method based on quotient space theory

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100095254A1 (en) * 2005-08-12 2010-04-15 Demaris David L System and method for testing pattern sensitive algorithms for semiconductor design
US20070118492A1 (en) * 2005-11-18 2007-05-24 Claus Bahlmann Variational sparse kernel machines
CN101488189A (en) * 2009-02-04 2009-07-22 天津大学 Brain-electrical signal processing method based on isolated component automatic clustering process
CN101872343A (en) * 2009-04-24 2010-10-27 罗彤 Semi-supervised mass data hierarchy classification method
CN103810261A (en) * 2014-01-26 2014-05-21 西安理工大学 K-means clustering method based on quotient space theory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴小涛等: "基于遗传算法和梯度下降法的聚类新算法", 《计算技术与信息发展》 *
汪宝彬等: "随机梯度下降法的一些性质", 《数学杂志》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105139277B (en) * 2015-08-18 2018-09-11 国家电网公司 A kind of power matching network information cluster system and method
CN105139277A (en) * 2015-08-18 2015-12-09 国家电网公司 Electric power distribution network information clustering system and method
CN105681089B (en) * 2016-01-26 2019-10-18 上海晶赞科技发展有限公司 Networks congestion control clustering method, device and terminal
CN105681089A (en) * 2016-01-26 2016-06-15 上海晶赞科技发展有限公司 Network user behavior clustering method, device and terminal
US10503580B2 (en) 2017-06-15 2019-12-10 Microsoft Technology Licensing, Llc Determining a likelihood of a resource experiencing a problem based on telemetry data
US10805317B2 (en) 2017-06-15 2020-10-13 Microsoft Technology Licensing, Llc Implementing network security measures in response to a detected cyber attack
US10922627B2 (en) 2017-06-15 2021-02-16 Microsoft Technology Licensing, Llc Determining a course of action based on aggregated data
US11062226B2 (en) 2017-06-15 2021-07-13 Microsoft Technology Licensing, Llc Determining a likelihood of a user interaction with a content element
CN108846532A (en) * 2018-03-21 2018-11-20 宁波工程学院 Business risk appraisal procedure and device applied to logistics supply platform chain
CN108460499A (en) * 2018-04-02 2018-08-28 福州大学 A kind of micro-blog user force arrangement method of fusion user time information
CN108460499B (en) * 2018-04-02 2022-03-08 福州大学 Microblog user influence ranking method integrating user time information
CN111385243A (en) * 2018-12-27 2020-07-07 ***通信集团山西有限公司 DDoS detection method, device and equipment
CN110288004A (en) * 2019-05-30 2019-09-27 武汉大学 A kind of diagnosis method for system fault and device excavated based on log semanteme

Also Published As

Publication number Publication date
CN104598565B (en) 2018-08-14

Similar Documents

Publication Publication Date Title
CN104598565A (en) K-means large-scale data clustering method based on stochastic gradient descent algorithm
Zhang et al. A large-scale multiobjective satellite data transmission scheduling algorithm based on SVM+ NSGA-II
Goh et al. Wind energy assessment considering wind speed correlation in Malaysia
CN102222092A (en) Massive high-dimension data clustering method for MapReduce platform
CN102542051A (en) Design method for multi-target cooperative sampling scheme of randomly-distributed geographic elements
Lai et al. Application of big data in smart grid
Cheng et al. Attribute reduction based on genetic algorithm for the coevolution of meteorological data in the industrial internet of things
Kaplan et al. A novel method based on Weibull distribution for short-term wind speed prediction
CN111984702A (en) Method, device, equipment and storage medium for analyzing spatial evolution of village and town settlement
CN108335010A (en) A kind of wind power output time series modeling method and system
Myers Solar radiation resource assessment for renewable energy conversion
Sun et al. Spatial modelling the location choice of large-scale solar photovoltaic power plants: Application of interpretable machine learning techniques and the national inventory
CN111159152B (en) Secondary operation and data fusion method based on big data processing technology
CN102945198A (en) Method for characterizing application characteristics of high performance computing
CN103793438A (en) MapReduce based parallel clustering method
CN104022505A (en) Distribution network reconstruction method with important node voltage dip economic losses considered
Hu et al. Parallel clustering of big data of spatio-temporal trajectory
Kim et al. Tutorial on time series prediction using 1D-CNN and BiLSTM: A case example of peak electricity demand and system marginal price prediction
CN107590225A (en) A kind of Visualized management system based on distributed data digging algorithm
Zhao et al. Optimisation algorithm for decision trees and the prediction of horizon displacement of landslides monitoring
Zhang et al. Clustering and decision tree based analysis of typical operation modes of power systems
CN114676931B (en) Electric quantity prediction system based on data center technology
Derzko et al. Optimal exploration and consumption of a national resource-stochastic case
CN115292361A (en) Method and system for screening distributed energy abnormal data
Lagomarsino-Oneto et al. Physics informed shallow machine learning for wind speed prediction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160425

Address after: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant after: State Grid Corporation of China

Applicant after: China Electric Power Research Institute

Applicant after: State Grid Smart Grid Institute

Address before: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant before: State Grid Corporation of China

Applicant before: China Electric Power Research Institute

CB02 Change of applicant information

Address after: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant after: State Grid Corporation of China

Applicant after: China Electric Power Research Institute

Applicant after: GLOBAL ENERGY INTERCONNECTION RESEARCH INSTITUTE

Address before: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant before: State Grid Corporation of China

Applicant before: China Electric Power Research Institute

Applicant before: State Grid Smart Grid Institute

COR Change of bibliographic data
GR01 Patent grant