CN106909932A - Method and device for website clustering - Google Patents
- Publication number
- CN106909932A CN201510982364.9A
- Authority
- CN
- China
- Prior art keywords
- distance
- cluster centre
- point
- cluster
- centre point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present invention disclose a method and device for website clustering. The method filters out every second cluster centre point in the cluster centre set whose predicted third distance is greater than or equal to twice the predicted first distance. The resulting cluster result includes clustering information obtained by clustering the websites of a website cluster with field information, structural information and visitor information as the reference dimensions, so the cluster result can provide data support for subsequent website construction. In addition, during the current clustering distance traversal there is no need to calculate the distance between the second cluster centre point and the sample point, or the distance between the second cluster centre point and the other cluster centre points still to be traversed. This reduces the time and computation that those distance calculations would consume and improves the computational efficiency of data clustering.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for website clustering.
Background art
With the development of the times, websites have become an important channel through which people obtain information, and websites of all kinds present all kinds of information. For example, music websites present music, video websites present video, and news websites present news. The structures these websites use also differ: some use a flat structure and others a diversified structure, which gives people different experiences. People choose websites according to their own preferences, so the number of visits differs from one website to another. Once the access data of these websites has been recorded, it can be looked up in the corresponding big data and the information it contains can be analyzed, for example to determine which type of website users prefer, thereby providing data support for subsequent website construction.
At present, this kind of big data is usually analyzed with a clustering algorithm. For example, when the samples in a sample set S {S1, S2, S3 ... Sn} are clustered, the following first scheme is used. In the K-th iteration, for any sample Si, the distance from Si to each cluster centre point in the cluster centre set M {M1, M2 ... Mj ... Mk} is calculated, and Si is assigned to the cluster set of the nearest cluster centre point. The cluster centre points in M are then updated by taking means. Finally, the difference between the cluster sets produced by the current iteration and those produced by the previous iteration is calculated, and iteration continues until this difference meets a preset error condition. When computing the cluster sets, this method must calculate the distance between every sample in S and every cluster centre point in M, that is, n*k point-to-point distance calculations, so the amount of computation is large and the running time is long.
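As a minimal sketch of this first scheme (an illustration, not code from the patent), the loop below assigns samples, updates centres by means, and stops when few enough samples change cluster set. The names `assign_naive`, `update_centres`, `kmeans` and the stopping parameter `max_changed` are assumptions introduced here; the "preset error condition" is modelled as a cap on the number of changed assignments.

```python
import math

def assign_naive(samples, centres):
    """Assign every sample to its nearest centre.

    Returns (labels, number of point-to-point distance calculations).
    """
    labels, calls = [], 0
    for s in samples:
        dists = [math.dist(s, m) for m in centres]   # k distances per sample
        calls += len(dists)
        labels.append(dists.index(min(dists)))
    return labels, calls

def update_centres(samples, labels, centres):
    """Replace each centre by the mean of its cluster set (unchanged if empty)."""
    updated = []
    for j, old in enumerate(centres):
        members = [s for s, lab in zip(samples, labels) if lab == j]
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(tuple(old))
    return updated

def kmeans(samples, centres, max_changed=0, max_iter=100):
    """Iterate until at most `max_changed` samples switch cluster set (assumed criterion)."""
    labels = None
    for _ in range(max_iter):
        new_labels, _ = assign_naive(samples, centres)
        changed = (len(samples) if labels is None
                   else sum(a != b for a, b in zip(labels, new_labels)))
        labels = new_labels
        centres = update_centres(samples, labels, centres)
        if changed <= max_changed:
            break
    return labels, centres
```

Each call to `assign_naive` performs exactly n*k distance calculations, which is the cost the later schemes try to reduce.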
To address the heavy computation and long running time of the first scheme, the prior art also provides a second scheme, which improves the step of the first scheme that assigns Si to the cluster set of the nearest cluster centre point. The improved scheme is as follows. The distance between every two cluster centre points in the cluster centre set M {M1, M2 ... Mj ... Mk} is calculated and saved. Based on the triangle inequality, Luj is compared with 2Lui, where Luj is the distance between cluster centre point Mu and cluster centre point Mj, Mu is the cluster centre point currently nearest to Si, Mj is the cluster centre point to be traversed in the current traversal, and Lui is the distance between Si and Mu. If Luj is greater than or equal to 2Lui, Mj is ignored and the next cluster centre point is traversed or, once the traversal is complete, Si is assigned to the cluster set of Mu. If Luj is less than 2Lui, the distance Lij between the sample point Si and the cluster centre point Mj is calculated. When Lij is less than Lui, Lui = Lij and Mu = Mj are set, and the next cluster centre point is traversed or, once the traversal is complete, Si is assigned to the cluster set of Mu.
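The second scheme can be sketched as follows, assuming Euclidean distance. The helper names `pairwise` and `assign_with_pruning` are illustrative and do not come from the patent. The skip is safe because, by the triangle inequality, Lij >= Luj - Lui, so Luj >= 2Lui implies Lij >= Lui.

```python
import math

def pairwise(centres):
    """Distances between every two cluster centre points, calculated once and saved."""
    return [[math.dist(a, b) for b in centres] for a in centres]

def assign_with_pruning(sample, centres, cc):
    """Find the nearest centre for one sample.

    cc[u][j] is the saved distance Luj between centres u and j.
    Returns (index of nearest centre, number of sample-to-centre distances computed).
    """
    u, l_ui = 0, math.dist(sample, centres[0])   # start with M1 as Mu
    calls = 1
    for j in range(1, len(centres)):
        if cc[u][j] >= 2 * l_ui:
            continue                 # Luj >= 2Lui: Mj cannot be closer than Mu
        l_ij = math.dist(sample, centres[j])
        calls += 1
        if l_ij < l_ui:
            u, l_ui = j, l_ij        # set Mu = Mj and Lui = Lij
    return u, calls
```

The saving comes from the skipped `math.dist` calls; the cost that remains is `pairwise`, which in this scheme must be recomputed after every centre update.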
Either of the two schemes above can obtain the clustering information in big data by clustering it. However, in implementing the second scheme, the inventors found the following problem. When judging whether a given cluster centre point is the cluster centre point of a sample, once the cluster centre point Mu nearest to sample Si in the cluster centre set M has been determined, the cluster centre points in M that cannot be the cluster centre point of Si are discarded on the basis of the triangle inequality, without calculating the distance between the discarded cluster centre points and Si. This reduces the amount of computation and shortens the running time to some extent. However, when there are many cluster centre points, or when finer clustering is required, every iteration must recalculate the pairwise distances between cluster centre points, so the amount of computation remains large and the running time long.
Therefore, clustering algorithms in the prior art have the technical problem that, because every iteration must calculate the pairwise distances between cluster centre points, the amount of computation is large and the running time is long.
Summary of the invention
The embodiments of the present invention provide a method and device for website clustering, to solve the technical problem in the prior art that clustering algorithms require a large amount of computation and a long running time because every iteration must calculate the pairwise distances between cluster centre points.
A first aspect of the embodiments of the present invention provides a method for website clustering, comprising:
obtaining a sample set for website clustering and a cluster centre set of the sample set, wherein each sample point in the sample set includes description information of a website in a website cluster, and the description information includes at least field information, structural information and visitor information;
for each sample point in the sample set, traversing in turn each cluster centre point in the cluster centre set, determining the cluster centre point in the cluster centre set nearest to the sample point, and assigning the sample point to the set corresponding to that nearest cluster centre point, to obtain the cluster set corresponding to each cluster centre point in the cluster centre set;
obtaining the mean of the sample points in each cluster set, and updating the cluster centre set according to the mean;
obtaining a predicted value of a first distance according to the difference of a first cluster centre point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
obtaining a predicted value of a third distance according to a second distance, the difference of the first cluster centre point before and after its last update, and the difference of a second cluster centre point before and after its last update, wherein the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is the cluster centre point to be traversed in the current clustering distance traversal;
comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality;
if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discarding the second cluster centre point, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
performing the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded, to obtain a cluster result of the sample set, wherein the cluster result includes clustering information obtained by clustering the websites in the website cluster with the field information, the structural information and the visitor information as the reference dimensions.
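The filtering step of this first aspect can be sketched as below. Several points are assumptions made for illustration and are not stated in the text above: the "difference before and after the last update" of a centre is taken to be the Euclidean distance it moved; the predicted first distance is taken to be the old sample-to-centre distance plus that movement (an upper bound); and the third distance is read as the distance between the updated first and second cluster centre points. The function name `surviving_centres` is hypothetical.

```python
import math

def surviving_centres(l_ui_old, u, old_centres, new_centres, old_cc):
    """Return the indices of centres that still need an exact distance check.

    l_ui_old : distance from the sample to its nearest centre before the update
    u        : index of that nearest centre (the first cluster centre point)
    old_cc   : centre-to-centre distances saved from the previous traversal
               (old_cc[u][j] plays the role of the second distance)
    """
    # ASSUMPTION: "difference before and after update" = distance each centre moved.
    shift = [math.dist(a, b) for a, b in zip(old_centres, new_centres)]
    # ASSUMPTION: predicted first distance = old distance + movement of the first centre.
    l1_pred = l_ui_old + shift[u]
    keep = []
    for j in range(len(new_centres)):
        if j == u:
            continue
        # Predicted third distance: second distance minus both differences.
        l3_pred = old_cc[u][j] - shift[u] - shift[j]
        if l3_pred >= 2 * l1_pred:
            continue          # discard this second cluster centre point
        keep.append(j)
    return keep
```

Under these assumptions no new centre-to-centre or sample-to-centre distance is computed for a discarded centre; only the per-centre movements are needed, which costs k distance calculations per update instead of one per pair of centres.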
Optionally, after the cluster result of the sample set is obtained, the method further comprises:
analyzing the cluster result to evaluate the clustering method.
Optionally, analyzing the cluster result to evaluate the clustering method specifically comprises:
analyzing the cluster result with an entropy verification algorithm or a purity verification algorithm;
when the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, determining that the clustering method meets a preset requirement; or
when the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value, determining that the clustering method meets the preset requirement.
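The text does not define the entropy and purity verification algorithms, so the sketch below uses the common textbook definitions and assumes that a reference class label is available for every website. The names `entropy_and_purity` and `meets_requirement`, and the two threshold parameters, are hypothetical.

```python
import math
from collections import Counter

def entropy_and_purity(cluster_ids, class_ids):
    """Size-weighted cluster entropy (in bits) and overall purity of a clustering."""
    n = len(cluster_ids)
    by_cluster = {}
    for c, k in zip(cluster_ids, class_ids):
        by_cluster.setdefault(c, Counter())[k] += 1
    entropy, majority = 0.0, 0
    for counts in by_cluster.values():
        size = sum(counts.values())
        # Entropy of the class distribution inside this cluster.
        h = -sum((v / size) * math.log2(v / size) for v in counts.values())
        entropy += (size / n) * h
        majority += max(counts.values())   # samples of the dominant class
    return entropy, majority / n

def meets_requirement(entropy, purity, first_preset, second_preset):
    """Entropy below the first preset value, or purity above the second."""
    return entropy < first_preset or purity > second_preset
```

Lower entropy and higher purity both indicate that each cluster set is dominated by a single reference class.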
Optionally, the method further comprises:
if the predicted value of the third distance is less than twice the predicted value of the first distance, performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update.
Optionally, performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update comprises:
calculating the distance between the first cluster centre point after the last update and the sample point, to obtain an actual value of the first distance;
comparing the actual value of the first distance with the predicted value of the third distance according to the triangle inequality;
if the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
if the predicted value of the third distance is less than twice the actual value of the first distance, calculating a fourth distance and determining whether the fourth distance is less than the actual value of the first distance, wherein the fourth distance is the distance between the sample point and the second cluster centre point;
if the fourth distance is less than the actual value of the first distance, determining the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal;
if the fourth distance is greater than or equal to the actual value of the first distance, determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
Optionally, determining the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal comprises:
if the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assigning the second cluster centre point to the first cluster centre point after the last update, and assigning the fourth distance to the actual value of the first distance;
if the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assigning the second cluster centre point to the first cluster centre point after the last update, assigning the fourth distance to the actual value of the first distance, and continuing to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point and the actual value of the first distance after assignment.
Optionally, determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal comprises:
if the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal;
if the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continuing to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
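The optional steps above amount to a staged traversal for one sample point: a test on predicted values, a test on the actual first distance, and only then the fourth distance. The sketch below is one possible reading. The function name `nearest_centre` and the matrix `l3_pred` of predicted centre-to-centre distances are assumptions, and the pruning is only safe if those predictions never exceed the true distances.

```python
import math

def nearest_centre(sample, u, l1_pred, new_centres, l3_pred):
    """Traverse the updated centres for one sample point.

    u        : index of the first cluster centre point (nearest centre so far)
    l1_pred  : predicted value of the first distance for that centre
    l3_pred  : l3_pred[a][b] = predicted distance between updated centres a and b
    Returns (index of nearest centre, number of exact distances computed).
    """
    l1_actual = None     # actual value of the first distance, computed lazily
    calls = 0
    for j in range(len(new_centres)):
        if j == u:
            continue
        # Stage 1: test with predicted values only.
        if l1_actual is None and l3_pred[u][j] >= 2 * l1_pred:
            continue
        # Stage 2: obtain the actual first distance once, then retest.
        if l1_actual is None:
            l1_actual = math.dist(sample, new_centres[u])
            calls += 1
        if l3_pred[u][j] >= 2 * l1_actual:
            continue
        # Stage 3: fourth distance, from the sample point to the candidate.
        l4 = math.dist(sample, new_centres[j])
        calls += 1
        if l4 < l1_actual:
            u, l1_actual = j, l4   # candidate becomes the first cluster centre point
    return u, calls
```

After a candidate replaces the first cluster centre point, the loop keeps using the newly assigned actual distance, which matches the "continue to traverse based on the values after assignment" branches above.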
Optionally, before the fourth distance is calculated, the method further comprises:
calculating a fifth distance, wherein the fifth distance is the distance between the second cluster centre point and the first cluster centre point after the last update;
comparing the actual value of the first distance with the fifth distance according to the triangle inequality;
if the fifth distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that, during the clustering traversal, neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
and calculating the fourth distance comprises:
if the fifth distance is less than twice the actual value of the first distance, performing the calculation of the fourth distance.
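A small sketch of this last gate before the fourth distance, assuming Euclidean distance; the function name `fourth_distance_or_none` is hypothetical.

```python
import math

def fourth_distance_or_none(sample, first_centre, second_centre, l1_actual):
    """Compute the fourth distance only if the fifth-distance test does not discard.

    l1_actual is the actual value of the first distance (sample to first centre).
    Returns None when the second cluster centre point is discarded.
    """
    l5 = math.dist(first_centre, second_centre)   # fifth distance, exact
    if l5 >= 2 * l1_actual:
        return None   # the candidate cannot be closer than the first centre
    return math.dist(sample, second_centre)       # fourth distance
```

Because the fifth distance is exact rather than predicted, this test can discard centres that the predicted third distance was too loose to rule out, at the cost of one centre-to-centre distance.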
Optionally, obtaining the predicted value of the third distance according to the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update comprises:
obtaining the value of the first cluster centre point before its last update and its value after the update, and calculating a first difference between the first cluster centre point before and after the update;
obtaining the value of the second cluster centre point before its last update and its value after the update, and calculating a second difference between the second cluster centre point before and after the update;
subtracting the first difference and the second difference from the second distance, to obtain the predicted value of the third distance.
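The subtraction can be justified with the triangle inequality. The derivation below is a sketch under two assumptions that the text does not state explicitly: each "difference" is the distance between a centre's position before and after the update, and the third distance is the distance between the updated centres. Here M_u, M_j denote the centres before the update, M_u', M_j' the centres after it, L_{uj} the second distance, and \Delta_u, \Delta_j the first and second differences.

```latex
\begin{align*}
\hat{L}_3 &= L_{uj} - \Delta_u - \Delta_j,
  \qquad \Delta_u = d(M_u, M_u'),\quad \Delta_j = d(M_j, M_j') \\
L_{uj} = d(M_u, M_j) &\le d(M_u, M_u') + d(M_u', M_j') + d(M_j', M_j) \\
\Rightarrow\quad d(M_u', M_j') &\ge L_{uj} - \Delta_u - \Delta_j = \hat{L}_3
\end{align*}
```

So the predicted value never exceeds the true distance between the updated centres, and a test of the form \hat{L}_3 \ge 2 L_1 can only discard a centre that the exact test would also discard.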
Optionally, after the second cluster centre point is discarded, and before the distance traversal is performed based on the cluster centre set from which the second cluster centre point has been discarded to obtain the cluster result of the sample set, the method further comprises:
judging whether the current clustering distance traversal is complete;
if the traversal is not complete, continuing to traverse the next cluster centre point in the current cluster centre set;
if the traversal is complete, determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
A second aspect of the embodiments of the present invention further provides a device for website clustering, comprising:
an obtaining unit, configured to obtain a sample set for website clustering and a cluster centre set of the sample set, wherein each sample point in the sample set includes description information of a website in a website cluster, and the description information includes at least field information, structural information and visitor information;
a cluster set obtaining unit, configured to, for each sample point in the sample set, traverse in turn each cluster centre point in the cluster centre set, determine the cluster centre point in the cluster centre set nearest to the sample point, and assign the sample point to the set corresponding to that nearest cluster centre point, to obtain the cluster set corresponding to each cluster centre point in the cluster centre set;
a mean obtaining unit, configured to obtain the mean of the sample points in each cluster set, and update the cluster centre set according to the mean;
a first acquisition unit, configured to obtain a predicted value of a first distance according to the difference of a first cluster centre point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
a second acquisition unit, configured to obtain a predicted value of a third distance according to a second distance, the difference of the first cluster centre point before and after its last update, and the difference of a second cluster centre point before and after its last update, wherein the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is the cluster centre point to be traversed in the current clustering distance traversal;
a comparing unit, configured to compare, according to the triangle inequality, the predicted value of the first distance obtained by the first acquisition unit with the predicted value of the third distance obtained by the second acquisition unit;
a discarding unit, configured to discard the second cluster centre point when the comparing unit finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
a cluster result obtaining unit, configured to perform the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded, to obtain a cluster result of the sample set, wherein the cluster result includes clustering information obtained by clustering the websites in the website cluster with the field information, the structural information and the visitor information as the reference dimensions.
Optionally, the device further comprises:
an analysis unit, configured to analyze the cluster result after the cluster result obtaining unit obtains the cluster result, to evaluate the clustering method.
Optionally, the analysis unit is specifically configured to analyze the cluster result with an entropy verification algorithm or a purity verification algorithm, wherein, when the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, it is determined that the clustering method meets a preset requirement, or, when the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value, it is determined that the clustering method meets the preset requirement.
Optionally, the device further comprises:
a processing unit, configured to perform data clustering processing on the second cluster centre point according to the first cluster centre point after the last update when the comparing unit finds that the predicted value of the third distance is less than twice the predicted value of the first distance.
Optionally, the processing unit is specifically configured to: calculate the distance between the first cluster centre point after the last update and the sample point, to obtain an actual value of the first distance;
compare, according to the triangle inequality, the actual value of the first distance calculated by the first computing module with the predicted value of the third distance;
when the first comparison module finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
when the first comparison module finds that the predicted value of the third distance is less than twice the actual value of the first distance, calculate a fourth distance, wherein the fourth distance is the distance between the sample point and the second cluster centre point;
determine whether the fourth distance calculated by the second computing module is less than the actual value of the first distance;
when the first determining module determines that the fourth distance is less than the actual value of the first distance, determine the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal;
when the first determining module determines that the fourth distance is greater than or equal to the actual value of the first distance, determine the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
Optionally, the computing module is further specifically configured to:
when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assign the second cluster centre point to the first cluster centre point after the last update, and assign the fourth distance to the actual value of the first distance;
when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assign the second cluster centre point to the first cluster centre point after the last update, assign the fourth distance to the actual value of the first distance, and continue to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point and the actual value of the first distance after assignment.
Optionally, the computing module is further specifically configured to:
when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determine the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal;
when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continue to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
Optionally, the processing unit is further specifically configured to:
before the second computing module calculates the fourth distance, calculate a fifth distance, wherein the fifth distance is the distance between the second cluster centre point and the first cluster centre point after the last update;
compare, according to the triangle inequality, the actual value of the first distance calculated by the first computing module with the fifth distance calculated by the third computing module;
when the second comparison module finds that the fifth distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that, during the clustering traversal, neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
when the second comparison module finds that the fifth distance is less than twice the actual value of the first distance, perform the calculation of the fourth distance.
Optionally, the second acquisition unit is specifically configured to:
obtain the value of the first cluster centre point before its last update and its value after the update, and calculate a first difference between the first cluster centre point before and after the update;
obtain the value of the second cluster centre point before its last update and its value after the update, and calculate a second difference between the second cluster centre point before and after the update;
subtract the first difference calculated by the first processing module and the second difference calculated by the second processing module from the second distance, to obtain the predicted value of the third distance.
Optionally, the device further comprises:
a judging unit, configured to judge, after the discarding unit discards the second cluster centre point, whether the current clustering distance traversal is complete;
a traversal unit, configured to continue to traverse the next cluster centre point in the current cluster centre set when the judging unit judges that the traversal is not complete;
a determining unit, configured to determine, when the judging unit judges that the traversal is complete, the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
The one or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
1. With the method and device for website clustering provided by the embodiments of the present invention, the cluster result obtained includes clustering information obtained by clustering the websites in a website cluster with field information, structural information and visitor information as the reference dimensions, so the cluster result can provide data support for subsequent website construction.
2. With the method and device for website clustering provided by the embodiments of the present invention, in the current clustering distance traversal, based on the cluster centre set after the last update, a predicted value of the first distance is obtained according to the difference of the first cluster centre point before and after its last update, the first distance being the distance between a sample point to be clustered and the cluster centre point nearest to that sample point. A predicted value of the third distance is obtained according to the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update, the second distance being the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point being the cluster centre point to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the former is greater than or equal to twice the latter, the second cluster centre point is discarded. In the present invention, based on the triangle inequality, every second cluster centre point in the cluster centre set whose predicted third distance is greater than or equal to twice the predicted first distance is filtered out, so neither the distance between the second cluster centre point and the sample point nor the distance between the second cluster centre point and the other cluster centre points to be traversed needs to be calculated. This reduces the time and computation consumed by those distance calculations and improves the computational efficiency of data clustering.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the website clustering method provided in an embodiment of the present invention;
Fig. 2 is a schematic diagram, provided in an embodiment of the present invention, of the case in which the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance;
Fig. 3 is a flowchart of the method, provided in an embodiment of the present invention, for performing data clustering processing on the second cluster centre point Mj' according to the first cluster centre point Mu' after the last update;
Fig. 4 is a flowchart of the method, provided in an embodiment of the present invention, for determining the cluster centre point corresponding to sample point Si;
Fig. 5 is a functional block diagram of the website clustering device provided in an embodiment of the present invention.
Detailed description of the embodiments
By providing a website clustering method and device, the embodiments of the present invention address a technical problem of prior-art clustering algorithms: every iteration must calculate the pairwise distances between cluster centre points, which leads to a heavy computational load and long running time.
A first aspect of the embodiments of the present invention provides a website clustering method. Referring to Fig. 1, which is a schematic flowchart of the website clustering method provided in an embodiment of the present invention, the method includes:
101: Obtain a sample set for website clustering and a cluster centre set of the sample set, where each sample point in the sample set includes the description information of one website in the website cluster, and the description information includes at least realm information, structural information and visitor information;
To describe the technical solution of the embodiments in detail, the description information here comprises the three factors named above: realm information, structural information and visitor information. In other embodiments, the visitor information in the description information may be further divided into the visitors' income information, education information, region information, religious belief information, credit rating information and so on, which is not elaborated here.
In this embodiment, let the sample set for website clustering be S = {S1, S2, ..., Sn} and the initial cluster centre set be M = {M1, M2, ..., Mj, ..., Mk}. The sample set may be user data collected by an online e-commerce website or an advertising website. The initial cluster centre set may be obtained by randomly selecting a preset number of centre points from the sample set, or by selecting the initial cluster centre points from the sample set with an algorithm such as an initial-distance optimization algorithm or a density estimation algorithm; this is not elaborated here.
102: For each sample point in the sample set, traverse each cluster centre point in the cluster centre set in turn, determine the cluster centre point in the cluster centre set that is closest to that sample point, and assign the sample point to the set corresponding to that closest cluster centre point, thereby obtaining the cluster set corresponding to each cluster centre point in the cluster centre set;
In this step, the pairwise distances between the cluster centre points in the initial cluster centre set M may be calculated first: d11, d12, ..., d(k-1)k. Then, for any sample point Si in the sample set S, where i is greater than or equal to 1 and less than or equal to n, each cluster centre point in the cluster centre set M is traversed in turn to determine the cluster centre point Mu closest to Si. Si is assigned to the set corresponding to Mu, and the first distance Liu between sample point Si and cluster centre point Mu is saved. Proceeding in the same way for every sample point yields the cluster set corresponding to each cluster centre point; for example, the cluster sets corresponding to cluster centre points M1, M2, ..., Mj, ..., Mk are N1, N2, ..., Nj, ..., Nk respectively.
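The assignment in step 102 can be sketched as follows. This is a minimal illustration, assuming sample points and cluster centre points are numeric tuples and that Euclidean distance is used; the function names `assign_samples` and `pairwise_centre_distances` are hypothetical and do not come from the patent.

```python
import math

def pairwise_centre_distances(centres):
    """Distances d[a][b] between every pair of cluster centre points."""
    k = len(centres)
    return [[math.dist(centres[a], centres[b]) for b in range(k)] for a in range(k)]

def assign_samples(samples, centres):
    """Assign each sample Si to its closest centre Mu and keep the first distance Liu."""
    labels = []          # index u of the closest centre for each sample
    first_dist = []      # saved first distance Liu for each sample
    clusters = [[] for _ in centres]  # cluster sets N1..Nk
    for s in samples:
        dists = [math.dist(s, m) for m in centres]
        u = min(range(len(centres)), key=dists.__getitem__)
        labels.append(u)
        first_dist.append(dists[u])
        clusters[u].append(s)
    return labels, first_dist, clusters
```

Saving both the centre-to-centre distances and each sample's first distance is what allows the later steps to predict distances instead of recomputing them.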
103: Obtain the average value of the sample points in each cluster set, and update the cluster centre set according to the average values;
In this step, the average values of the sample points in cluster sets N1, N2, ..., Nj, ..., Nk are calculated as M1', M2', ..., Mj', ..., Mk', and these are used to update M1, M2, ..., Mj, ..., Mk, so that the updated cluster centre set M is {M1', M2', ..., Mj', ..., Mk'}.
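A minimal sketch of the update in step 103, assuming each cluster set is a list of numeric tuples. The name `update_centres` is hypothetical, and keeping the old centre for an empty cluster set is an assumption made here for illustration; the patent does not say how empty sets are handled.

```python
def update_centres(clusters, old_centres):
    """Replace each centre Mj by the mean Mj' of the sample points in its cluster set Nj."""
    new_centres = []
    for members, old in zip(clusters, old_centres):
        if not members:
            # Assumption: an empty cluster set leaves its centre unchanged.
            new_centres.append(tuple(old))
            continue
        dim = len(old)
        mean = tuple(sum(p[d] for p in members) / len(members) for d in range(dim))
        new_centres.append(mean)
    return new_centres
```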
104: Obtain a predicted value of the first distance according to the difference of the first cluster centre point before and after its last update, where the first distance is the distance between a sample point to be clustered and the first cluster centre point, and the first cluster centre point is the cluster centre point closest to the sample point in the clustering distance traversal;
To improve the accuracy of data clustering, the calculation must be iterated. When the current round of the clustering algorithm is performed, the calculation is based on the updated cluster centre set M = {M1', M2', ..., Mj', ..., Mk'}. Here, the first distance Liu is the distance between the sample point Si to be clustered and the first cluster centre point Mu' after the last update, and Mu' is the cluster centre point closest to the sample point in the clustering distance traversal.
The predicted value of the first distance for sample point Si is set to Liu = Liu + Tu, where Tu is the difference of the first cluster centre point Mu' before and after its last update, that is, the difference between Mu' and Mu. In the embodiments of the present invention, the purpose of setting the predicted value of the first distance to Liu = Liu + Tu is to guarantee an upper bound on the distance between sample point Si and the first cluster centre point Mu' after the last update. The current clustering distance traversal is then carried out on the basis of the reset value Liu = Liu + Tu.
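The following sketch illustrates why Liu + Tu serves as an upper bound. It assumes Euclidean distance and assumes that Tu is the distance the centre point moved in its last update; the function name `predict_first_distance` is hypothetical.

```python
import math

def predict_first_distance(l_iu, t_u):
    """Predicted first distance: the saved distance Liu plus the centre shift Tu."""
    return l_iu + t_u

# Worked example: Si at the origin, Mu at (3, 0), Mu' at (3, 4).
sample = (0.0, 0.0)
mu_old = (3.0, 0.0)
mu_new = (3.0, 4.0)
l_iu = math.dist(sample, mu_old)      # saved first distance
t_u = math.dist(mu_old, mu_new)       # shift of the centre in its last update
predicted = predict_first_distance(l_iu, t_u)
actual = math.dist(sample, mu_new)    # true distance to the updated centre
```

By the triangle inequality the true distance to Mu' can never exceed the predicted value, so the prediction is safe to use in a pruning test.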
In the embodiments of the present invention, the distance between sample point Si and the first cluster centre point Mu' after the last update, and the pairwise distances d11, d12, ..., d(k-1)k between the cluster centre points in the initial cluster centre set, may be calculated with, but are not limited to, the following measures: Euclidean distance, Manhattan distance, Chebyshev distance, power distance, cosine similarity, Pearson similarity, adjusted cosine similarity, Jaccard similarity, Hamming distance, weighted Euclidean distance, correlation distance, Mahalanobis distance, and so on. The embodiments of the present invention do not restrict the specific method used to calculate distances.
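A few of the listed measures can be sketched as below for numeric vectors of equal length. The function names are illustrative only and are not taken from the patent.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

One point worth keeping in mind: the pruning in the later steps relies on the triangle inequality, which holds for true metrics such as the Euclidean, Manhattan and Chebyshev distances. Similarity scores such as cosine similarity are not distances in that sense, so how they would be combined with the pruning test is not specified by the text.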
105: Obtain a predicted value of the third distance according to the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update, where the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal;
Here, the second distance duj is the distance calculated between cluster centre point Mu and cluster centre point Mj in the initial cluster centre set. Cluster centre point Mj is the second cluster centre point Mj' before its update, and Mj' is a cluster centre point to be traversed in the current clustering distance traversal. Tu is the difference of the first cluster centre point Mu' before and after its last update, that is, the difference between Mu' and Mu; Tj is the difference of the second cluster centre point Mj' before and after its last update, that is, the difference between Mj' and Mj. Subtracting Tu and Tj from the second distance duj gives the predicted value of the third distance, (duj-Tu-Tj).
It should be noted that calculating the predicted value (duj-Tu-Tj) of the third distance only requires the difference of the first cluster centre point Mu' before and after its last update and the difference of the second cluster centre point Mj' before and after its last update. The pairwise distances between the cluster centre points in the updated cluster centre set M = {M1', M2', ..., Mj', ..., Mk'} do not need to be calculated, which reduces the computation involved in data clustering and improves computational efficiency.
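As a sketch of step 105, the predicted third distance can be derived from the stored distance duj and one shift per centre point. The example assumes Euclidean distance and assumes that Tu and Tj are the distances the centre points moved; `centre_shifts` and `predict_third_distance` are hypothetical names.

```python
import math

def centre_shifts(old_centres, new_centres):
    """Shift T of every centre point between its old and updated position."""
    return [math.dist(o, n) for o, n in zip(old_centres, new_centres)]

def predict_third_distance(d_uj, t_u, t_j):
    """Predicted third distance (duj - Tu - Tj), a lower bound on d(Mu', Mj')."""
    return d_uj - t_u - t_j

old = [(0.0, 0.0), (10.0, 0.0)]
new = [(1.0, 0.0), (9.0, 0.0)]
shifts = centre_shifts(old, new)
d_uj = math.dist(old[0], old[1])                     # stored second distance
predicted = predict_third_distance(d_uj, shifts[0], shifts[1])
actual = math.dist(new[0], new[1])                   # shown only to check the bound
```

Computing the k shifts costs k distance evaluations per iteration, whereas recomputing all pairwise centre distances would cost on the order of k squared.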
106: Compare the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality;
Based on the triangle inequality, namely that in a triangle the sum of any two sides is greater than the third side, the obtained predicted value Liu of the first distance is compared with the obtained predicted value of the third distance.
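The bounds behind this comparison can be written out explicitly. The following is a sketch under two assumptions: the distance d is a metric, and Tu and Tj are the distances moved by the two centre points in their last update. The symbols d and the hat notation are introduced here only for illustration.

```latex
\begin{align*}
d(S_i, M_u') &\le d(S_i, M_u) + d(M_u, M_u') = L_{iu} + T_u =: \hat{L}_{iu}, \\
d(M_u', M_j') &\ge d(M_u, M_j) - d(M_u, M_u') - d(M_j, M_j') = d_{uj} - T_u - T_j, \\
d(S_i, M_j') &\ge d(M_u', M_j') - d(S_i, M_u').
\end{align*}
% Hence, if d_{uj} - T_u - T_j \ge 2\hat{L}_{iu}, then
\begin{align*}
d(S_i, M_j') \ge 2\hat{L}_{iu} - \hat{L}_{iu} = \hat{L}_{iu} \ge d(S_i, M_u').
\end{align*}
```

Under these assumptions, a candidate centre point that passes the factor-of-two test cannot be closer to the sample point than the current first cluster centre point.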
107: If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discard the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
When the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, that is, when (duj-Tu-Tj) is greater than or equal to 2*Liu, the distance Lij' between sample point Si and the second cluster centre point Mj' is greater than or equal to the predicted value Liu of the first distance of sample point Si. The second cluster centre point Mj' is therefore discarded, which amounts to filtering out of the cluster centre set every second cluster centre point whose predicted third distance is greater than or equal to twice the predicted first distance. Consequently, during the current clustering distance traversal there is no need to calculate the distance between sample point Si and the second cluster centre point Mj', nor the distance between the second cluster centre point and the other cluster centre points to be traversed. Fig. 2 is a schematic diagram, provided in an embodiment of the present invention, of the case in which the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance.
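The filter in step 107 can be sketched as a predicate applied to each candidate centre point. The data layout (a matrix `d_old` of stored centre distances and a list `shifts` of per-centre differences T) and the function names are assumptions made for illustration.

```python
def can_discard(pred_third, pred_first):
    """Step 107 test: True when the candidate centre point Mj' can be skipped."""
    return pred_third >= 2 * pred_first

def filter_candidates(u, pred_first, d_old, shifts):
    """Indices j != u of centre points that survive the step 107 filter."""
    keep = []
    for j in range(len(shifts)):
        if j == u:
            continue
        pred_third = d_old[u][j] - shifts[u] - shifts[j]
        if not can_discard(pred_third, pred_first):
            keep.append(j)
    return keep
```

For instance, with stored distances of 10 and 3 from centre 0 to centres 1 and 2, shifts of 0.5 each and a predicted first distance of 1.5, centre 1 is discarded (9 >= 3) while centre 2 remains a candidate (2 < 3).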
108: Perform the distance traversal on the cluster centre set from which the second cluster centre point has been discarded, and obtain the cluster result of the sample set, where the cluster result includes clustering information produced by clustering each website in the website cluster along the baseline dimensions of realm information, structural information and visitor information.
Of course, the cluster result obtained by the distance traversal on the cluster centre set from which the second cluster centre point has been discarded may not meet the requirement. In that case, data clustering can be repeated with the method provided in the embodiments of the present invention until a cluster result that meets the requirement is obtained; this is not elaborated here.
In this embodiment, once a cluster result that meets the requirement has been obtained, the cluster result can include clustering information produced by clustering each website in the website cluster along the baseline dimensions of realm information, structural information and visitor information, so the cluster result can provide data support for subsequent website construction. For example, suppose that websites whose realm information is current-affairs news, whose structural information is a 3-level flat structure and whose visitor information exceeds 100,000 account for 68% of the websites in the website cluster whose main content is current-affairs news. A newly built current-affairs news website can then adopt the 3-level flat structure as far as possible, so that it matches the habits of its users quickly and is readily accepted by them.
It can thus be seen that, during the current clustering distance traversal and based on the most recently updated cluster centre set, a predicted value of the first distance is obtained from the difference of the first cluster centre point before and after its last update, the first distance being the distance between a sample point to be clustered and the cluster centre point closest to that sample point. A predicted value of the third distance is obtained from the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update. The second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, the second cluster centre point is discarded. In the embodiments of the present invention, based on the triangle inequality, every second cluster centre point in the cluster centre set whose predicted third distance is greater than or equal to twice the predicted first distance is filtered out. There is then no need to calculate the distance between that second cluster centre point and the sample point, nor the distance between that second cluster centre point and the other cluster centre points to be traversed, which reduces the time and computation spent on those distance calculations and improves the computational efficiency of data clustering.
In this embodiment, after the cluster result of the sample set is obtained, the method provided in the embodiments of the present invention further includes:
Analysing the cluster result in order to evaluate the clustering method.
In a specific implementation, analysing the cluster result in order to evaluate the clustering method specifically includes:
Analysing the cluster result with an entropy verification algorithm or a purity verification algorithm;
In practice, taking analysis of the cluster result with the entropy verification algorithm as an example: for a cluster i, $P_{ij}$ is calculated first, where $P_{ij}$ is the probability that a member of cluster i belongs to class j, $P_{ij} = m_{ij}/m_i$. Here $m_i$ is the number of all members in cluster i, and $m_{ij}$ is the number of members of cluster i that belong to class j. The entropy of each cluster can be expressed as $e_i = -\sum_{j=1}^{L} P_{ij}\log_2 P_{ij}$, where L is the number of classes. The entropy of the whole clustering is $e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$, where K is the number of clusters and m is the total number of members involved in the whole clustering. In this embodiment, when the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, the clustering method is determined to meet the preset requirement;
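The entropy check can be sketched as below. The example assumes that each member has a cluster label and a known class label, and that the logarithm is taken to base 2; the function name `cluster_entropy` is hypothetical.

```python
import math
from collections import Counter

def cluster_entropy(cluster_labels, class_labels):
    """Weighted entropy of a clustering: sum over clusters of (m_i / m) * e_i."""
    m = len(cluster_labels)
    by_cluster = {}
    for c, k in zip(cluster_labels, class_labels):
        by_cluster.setdefault(c, Counter())[k] += 1   # m_ij counts
    total = 0.0
    for counts in by_cluster.values():
        m_i = sum(counts.values())
        # e_i = -sum_j P_ij * log2(P_ij), with P_ij = m_ij / m_i
        e_i = -sum((n / m_i) * math.log2(n / m_i) for n in counts.values())
        total += (m_i / m) * e_i
    return total
```

A clustering whose clusters each contain a single class has entropy 0, and a lower value indicates a better match between clusters and classes, which is consistent with comparing the entropy against a first preset value.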
Of course, the cluster result can also be analysed with the purity verification algorithm. Similarly, for a cluster i, $p_{ij}$ is calculated first, where $p_{ij}$ is the probability that a member of cluster i belongs to class j, $p_{ij} = m_{ij}/m_i$. The purity of cluster i is defined as $p_i = \max_j(p_{ij})$. The purity of the whole clustering is $purity = \sum_{i=1}^{K} \frac{m_i}{m} p_i$, where K is the number of clusters, $m_i$ is the number of all members in cluster i, and m is the total number of members involved in the whole clustering. In this embodiment, when the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value, the clustering method is determined to meet the preset requirement.
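A matching sketch for the purity check, under the same assumption that each member has a cluster label and a known class label. The name `cluster_purity` is hypothetical.

```python
from collections import Counter

def cluster_purity(cluster_labels, class_labels):
    """Purity of a clustering: sum over clusters of (m_i / m) * max_j(m_ij / m_i)."""
    m = len(cluster_labels)
    by_cluster = {}
    for c, k in zip(cluster_labels, class_labels):
        by_cluster.setdefault(c, Counter())[k] += 1   # m_ij counts
    # (m_i / m) * (max_j m_ij / m_i) simplifies to max_j m_ij / m
    return sum(max(counts.values()) for counts in by_cluster.values()) / m
```

For example, with clusters [0, 0, 0, 1, 1] and classes a, a, b, b, b, the dominant classes cover 2 + 2 of the 5 members, giving a purity of 0.8.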
When step 106 compares the predicted value Liu of the first distance with the predicted value (duj-Tu-Tj) of the third distance according to the triangle inequality, and the predicted value (duj-Tu-Tj) of the third distance is less than twice the predicted value Liu of the first distance, the distance Lij' between sample point Si and the second cluster centre point Mj' may be less than the predicted value Liu of the first distance between sample point Si and the first cluster centre point Mu'. In that case, data clustering processing is performed on the second cluster centre point Mj' according to the first cluster centre point Mu' after the last update, to determine whether the cluster centre point corresponding to sample point Si is the first cluster centre point Mu' after the last update or the second cluster centre point Mj'. Fig. 3 is a flowchart of this method as provided in an embodiment of the present invention; the method includes:
301. Calculate the distance between the first cluster centre point after the last update and the sample point, obtaining the actual value of the first distance.
The distance Liu' between the first cluster centre point Mu' after the last update and sample point Si is calculated; Liu' is the actual value of the first distance in the current clustering distance traversal. For the algorithms that may be used to calculate the actual value Liu' of the first distance between Mu' and Si, refer to the description of distance calculation under step 104 above, which is not repeated here.
302. Compare the actual value of the first distance with the predicted value of the third distance according to the triangle inequality.
Based on the triangle inequality, the actual value Liu' of the first distance is compared with the predicted value (duj-Tu-Tj) of the third distance. If the predicted value (duj-Tu-Tj) of the third distance is greater than or equal to twice the actual value Liu' of the first distance, step 303 is performed; if the predicted value (duj-Tu-Tj) of the third distance is less than twice the actual value Liu' of the first distance, step 304 is performed.
303. Discard the second cluster centre point.
When the predicted value (duj-Tu-Tj) of the third distance is greater than or equal to twice the actual value Liu' of the first distance, then in the current clustering distance traversal the actual distance from sample point Si to the second cluster centre point Mj' is greater than or equal to the actual distance from sample point Si to the first cluster centre point Mu'. In other words, the cluster centre point corresponding to sample point Si cannot be the second cluster centre point Mj'. The second cluster centre point Mj' is therefore discarded, and neither the distance between sample point Si and Mj' nor the distance between Mj' and the other cluster centre points to be traversed is calculated.
304. Calculate the fourth distance, and determine whether the fourth distance is less than the actual value of the first distance.
When the predicted value (duj-Tu-Tj) of the third distance is less than twice the actual value Liu' of the first distance, then in the current clustering distance traversal the actual distance from sample point Si to the second cluster centre point Mj' may be less than the actual distance from sample point Si to the first cluster centre point Mu'. In other words, the cluster centre point corresponding to sample point Si may be the second cluster centre point Mj'.
To determine whether the cluster centre point corresponding to sample point Si is the first cluster centre point Mu' or the second cluster centre point Mj', the fourth distance Lij' must be calculated, where Lij' is the distance between sample point Si and the second cluster centre point Mj'. If the fourth distance Lij' is less than the actual value Liu' of the first distance, step 305 is performed; if the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance, step 306 is performed.
305. Determine the second cluster centre point to be the cluster centre point closest to the sample point in the current distance traversal.
When the fourth distance Lij' is less than the actual value Liu' of the first distance, the second cluster centre point Mj' is determined to be the cluster centre point closest to sample point Si in the current distance traversal. In one implementation of the embodiments of the present invention, when the fourth distance Lij' is less than the actual value Liu' of the first distance and the current clustering distance traversal is complete, the second cluster centre point Mj' is assigned to the first cluster centre point Mu' after the last update, and the fourth distance Lij' is assigned to the actual value Liu' of the first distance, that is, Liu' = Lij' and Mu' = Mj'. In another implementation, when the fourth distance Lij' is less than the actual value Liu' of the first distance and the current clustering distance traversal is not complete, the same assignments Liu' = Lij' and Mu' = Mj' are made, and traversal continues with the next cluster centre point in the current cluster centre set on the basis of the reassigned Mu' and Liu', until the whole current cluster centre set has been traversed.
306. Determine the first cluster centre point after the last update to be the cluster centre point closest to the sample point in the current distance traversal.
When the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance, the first cluster centre point Mu' after the last update is determined to be the cluster centre point closest to sample point Si in the current distance traversal. In one implementation of the embodiments of the present invention, when the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance and the current clustering distance traversal is complete, the first cluster centre point Mu' after the last update is determined to be the cluster centre point closest to sample point Si in the current distance traversal. When the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance and the current clustering distance traversal is not complete, traversal continues with the next cluster centre point in the current cluster centre set on the basis of the first cluster centre point Mu' after the last update and the actual value Liu' of the first distance.
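Steps 301 to 306 for a single candidate centre point can be sketched as follows, assuming Euclidean distance and assuming that the predicted third distance has already been computed. The function name `refine_with_candidate` and its return layout are hypothetical.

```python
import math

def refine_with_candidate(sample, centre_u, centre_j, pred_third):
    """Return (closest centre, its distance, whether the fourth distance was computed)."""
    l_iu = math.dist(sample, centre_u)       # step 301: actual first distance Liu'
    if pred_third >= 2 * l_iu:               # step 302 comparison
        return centre_u, l_iu, False         # step 303: discard Mj'
    l_ij = math.dist(sample, centre_j)       # step 304: fourth distance Lij'
    if l_ij < l_iu:
        return centre_j, l_ij, True          # step 305: Mj' becomes the closest centre
    return centre_u, l_iu, True              # step 306: Mu' stays the closest centre
```

Calling the function repeatedly while carrying the returned centre and distance forward corresponds to the reassignment Liu' = Lij', Mu' = Mj' described for an unfinished traversal.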
In a specific implementation, the embodiments of the present invention combine Fig. 1 and Fig. 3 to determine the cluster centre point corresponding to sample point Si. Fig. 4 is a flowchart of the method, provided in an embodiment of the present invention, for determining the cluster centre point corresponding to sample point Si; the method includes:
401. Obtain a predicted value of the first distance according to the difference of the first cluster centre point before and after its last update.
402. Obtain a predicted value of the third distance according to the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update.
403. Compare the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality.
If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, step 404 is performed; if the predicted value of the third distance is less than twice the predicted value of the first distance, step 405 is performed.
404. Discard the second cluster centre point.
405. Perform data clustering processing on the second cluster centre point according to the first cluster centre point after the last update.
For the implementation of performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update, refer to the detailed description of Fig. 3, which is not repeated here.
Further, before the fourth distance is calculated in step 304, a fifth distance duj' is calculated, where the fifth distance is the distance between the second cluster centre point Mj' and the first cluster centre point Mu' after the last update. The actual value Liu' of the first distance is compared with the fifth distance duj' according to the triangle inequality. When the fifth distance duj' is greater than or equal to twice the actual value Liu' of the first distance, the second cluster centre point is discarded, and neither the distance between sample point Si and the second cluster centre point Mj' nor the distance between Mj' and the other cluster centre points to be traversed is calculated. When the fifth distance duj' is less than twice the actual value Liu' of the first distance, step 304 is then performed.
It should be noted that, in actual operation, steps 301 to 303 can discard most of the cluster centre points in cluster centre set M whose distance from sample point Si is greater than or equal to the actual value Liu' of the first distance, so that the cluster centre points remaining in M are those whose distance from sample point Si may be less than Liu'. As an example, suppose cluster centre set M contains 1000 cluster centre points. Steps 301 to 303 may discard 800 cluster centre points whose distance from sample point Si is greater than or equal to the actual value Liu' of the first distance, leaving 200 cluster centre points in M. The fifth distance between each of the remaining 200 second cluster centre points Mj' and the first cluster centre point Mu' after the last update is then calculated, and where the fifth distance duj' is greater than or equal to twice the actual value Liu' of the first distance, a further 150 second cluster centre points Mj' are discarded, leaving 50 cluster centre points in M. Finally, the distances between sample point Si and the remaining 50 cluster centre points are calculated to determine the cluster centre point closest to sample point Si. It should be noted that, in actual operation, calculating the fifth distance duj' between two cluster centre points in M requires less computation and less time than calculating the fourth distance Lij' between sample point Si and the second cluster centre point Mj'. Based on the triangle inequality, the embodiments of the present invention thus discard, in two passes, the cluster centre points in M whose distance from sample point Si is greater than or equal to the actual value Liu' of the first distance, which further reduces, to some extent, the computation spent on distances between sample point Si and second cluster centre points Mj'.
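Putting the filters together, the search for one sample point might look like the sketch below. It assumes Euclidean distance, that `d_old[a][b]` holds the centre distances from the previous traversal, that `shifts[a]` holds the distance each centre point moved in its last update, and that `l_iu_old` is the saved first distance; the function name `nearest_centre` and the evaluation counter are illustrative additions, not part of the patent.

```python
import math

def nearest_centre(sample, centres, u, d_old, shifts, l_iu_old):
    """Return (index of closest centre, its distance, sample-to-centre evaluations)."""
    best = u
    bound = l_iu_old + shifts[u]      # predicted first distance (upper bound)
    exact = False                     # True once bound is an actual distance
    evaluations = 0
    for j in range(len(centres)):
        if j == best:
            continue
        pred_third = d_old[best][j] - shifts[best] - shifts[j]
        if pred_third >= 2 * bound:   # steps 106-107: discard using predictions only
            continue
        if not exact:
            bound = math.dist(sample, centres[best])   # step 301
            evaluations += 1
            exact = True
            if pred_third >= 2 * bound:                # steps 302-303
                continue
        # Fifth distance: centre-to-centre distance after the update.
        if math.dist(centres[best], centres[j]) >= 2 * bound:
            continue
        l_ij = math.dist(sample, centres[j])           # step 304: fourth distance
        evaluations += 1
        if l_ij < bound:                               # step 305
            best, bound = j, l_ij
    if not exact:
        # Every candidate was discarded; compute the actual distance once.
        bound = math.dist(sample, centres[best])
        evaluations += 1
    return best, bound, evaluations
```

In the first test case below, both other centre points are discarded from predictions alone, so only one sample-to-centre distance is evaluated instead of three.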
Further, as a refinement and extension of the above embodiment, when the predicted value (duj-Tu-Tj) of the third distance is obtained in step 105 above, the following approach may be used, without limitation. The value Mu of the first cluster centre point before its last update and the value Mu' after the update are obtained, and the first difference Tu is calculated, where Tu is the difference between Mu' and Mu. The value Mj of the second cluster centre point before its last update and the value Mj' after the update are obtained, and the second difference Tj is calculated, where Tj is the difference between Mj' and Mj. The first difference Tu and the second difference Tj are subtracted from the second distance duj, giving the predicted value (duj-Tu-Tj) of the third distance.
Further, after step 404 is performed, it is judged whether the current clustering distance traversal is complete. If it is not complete, traversal continues with the next cluster centre point in the current cluster centre set; if it is complete, the first cluster centre point Mu' after the last update is determined to be the cluster centre point closest to the sample point in the current distance traversal.
After the cluster centre point corresponding to sample point Si has been determined, the cluster sets N1', N2', ..., Nj', ..., Nk'
corresponding to the cluster centre points M1', M2', ..., Mj', ..., Mk' in the cluster centre set M are obtained in the same way.
The differences O1, O2, ..., Oj, ..., Ok between the cluster sets N1, N2, ..., Nj, ..., Nk from step 101 and the cluster sets
N1', N2', ..., Nj', ..., Nk' determined by the current clustering distance traversal are then calculated, and it is determined
whether the differences O1, O2, ..., Oj, ..., Ok meet a preset error threshold. If they do, the cluster sets N1', N2', ..., Nj', ..., Nk'
determined by the current clustering distance traversal are taken as the final data clustering result. If they do not, data
clustering is repeated according to the method of the embodiment of the present invention described above until the final
data clustering result is determined. In this embodiment, the preset error threshold is set according to actual
requirements. Where fine-grained data clustering is required, a smaller threshold is used, for example 1 or 0.
The embodiment of the present invention does not limit the specific setting of the preset error threshold.
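The stopping test described above might look like the sketch below. The text does not say how the difference Oj between two cluster sets is measured, so this example assumes it is the number of samples that belong to exactly one of Nj and Nj', and that "meeting" the threshold means not exceeding it. Both choices, and all names, are illustrative assumptions.

```python
def cluster_set_differences(previous_sets, current_sets):
    """Oj for each cluster, assumed here to be the count of sample ids
    that appear in exactly one of Nj (previous) and Nj' (current)."""
    return [len(prev ^ curr) for prev, curr in zip(previous_sets, current_sets)]

def has_converged(previous_sets, current_sets, error_threshold=0):
    """True when every Oj is within the preset error threshold."""
    diffs = cluster_set_differences(previous_sets, current_sets)
    return all(d <= error_threshold for d in diffs)

# Hypothetical sample ids: sample 2 moved from cluster 0 to cluster 1.
previous = [{0, 1, 2}, {3, 4}]
current = [{0, 1}, {2, 3, 4}]
diffs = cluster_set_differences(previous, current)
print(diffs)                                  # [1, 1]
print(has_converged(previous, current, 0))    # False
print(has_converged(previous, current, 1))    # True
```

A threshold of 0 demands that no sample changes cluster between two consecutive passes, which corresponds to the strictest setting mentioned in the text.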
Based on the same inventive concept, a second aspect of the embodiments of the present invention further provides a website
clustering device. Referring to Fig. 5, which is a functional block diagram of the website clustering device provided by an
embodiment of the present invention, the device includes:
Obtaining unit 501, configured to obtain a sample set for website clustering and a cluster centre set of the sample set, where each sample point in the sample set includes corresponding personal description information in the website clustering, the description information including at least age information, gender information, preference information and spending amount information;
Cluster set obtaining unit 502, configured to, for each sample point in the sample set, traverse in turn each cluster centre point in the cluster centre set, determine the cluster centre point in the cluster centre set nearest to that sample point, and assign the sample point to the set corresponding to that nearest cluster centre point, so as to obtain the cluster set corresponding to each cluster centre point in the cluster centre set;
Average value obtaining unit 503, configured to obtain the average value of the sample points in each cluster set and to update the cluster centre set according to the average value;
First acquisition unit 504, configured to obtain a predicted value of a first distance according to the difference between the values of a first cluster centre point before and after its last update, where the first distance is the distance between a sample point to be clustered and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
Second acquisition unit 505, configured to obtain a predicted value of a third distance according to a second distance, the difference between the values of the first cluster centre point before and after its last update, and the difference between the values of a second cluster centre point before and after its last update, where the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point still to be traversed in the current clustering distance traversal;
Comparing unit 506, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquisition unit 504 with the predicted value of the third distance obtained by the second acquisition unit 505;
Discarding unit 507, configured to discard the second cluster centre point when the comparing unit 506 finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distances between the second cluster centre point and the other cluster centre points still to be traversed are calculated;
Cluster result obtaining unit 508, configured to perform the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded and to obtain a clustering result of the sample set, the clustering result including clustering information obtained by clustering each website in the website clustering with the age information, the gender information, the preference information and the spending amount information as reference dimensions.
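The decision taken by units 504 to 507 can be sketched as below. How the predicted value of the first distance is derived is not spelled out in this passage, so the sketch assumes it is the sample's previous distance to the first cluster centre point plus that centre's shift Tu, which is an upper bound by the triangle inequality. The function names and numbers are hypothetical.

```python
def predicted_first_distance(prev_d1, t_u):
    """Assumed form of the predicted first distance:
    previous sample-to-centre distance plus the centre's shift Tu."""
    return prev_d1 + t_u

def should_discard(pred_d3, pred_d1):
    """Discard rule: drop the second cluster centre point when the predicted
    third distance is at least twice the predicted first distance."""
    return pred_d3 >= 2.0 * pred_d1

pred_d1 = predicted_first_distance(prev_d1=2.0, t_u=1.0)
print(pred_d1)                        # 3.0
print(should_discard(7.0, pred_d1))   # True: 7.0 >= 6.0, centre is skipped
print(should_discard(5.0, pred_d1))   # False: the centre must still be examined
```

Only cheap arithmetic on stored values is needed here; no distance involving the sample's feature vector is computed for a discarded centre.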
Further, the device also includes:
Analytic unit 509, configured to analyse the clustering result after the cluster result obtaining unit 508 has obtained it, so as to evaluate the clustering method.
Further, the analytic unit 509 is specifically configured to analyse the clustering result using an entropy verification algorithm or a purity verification algorithm, where the clustering method is determined to meet a preset requirement when the entropy of the clustering result obtained by the entropy verification algorithm is less than a first preset value, or when the purity of the clustering result obtained by the purity verification algorithm is greater than a second preset value.
Further, the device also includes:
Processing unit 510, configured to perform data clustering processing on the second cluster centre point according to the first cluster centre point after the last update when the comparing unit 506 finds that the predicted value of the third distance is less than twice the predicted value of the first distance.
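The two checks named above could be computed as in the following sketch. It uses the common definitions of cluster purity and size-weighted cluster entropy, which the text does not itself define, and it assumes that a reference label is available for every clustered website. The labels and the two preset values are hypothetical.

```python
import math
from collections import Counter

def purity(clusters):
    """Share of samples that carry the majority reference label of their cluster."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters if c)
    return majority / total

def entropy(clusters):
    """Size-weighted average of the label entropy (base 2) inside each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in Counter(c).values())
        result += (len(c) / total) * h
    return result

def meets_requirement(clusters, first_preset=0.6, second_preset=0.8):
    """Entropy below the first preset value, or purity above the second."""
    return entropy(clusters) < first_preset or purity(clusters) > second_preset

# Each inner list holds the reference labels of the websites in one cluster.
clusters = [["news", "news", "news", "shop"], ["shop", "shop"]]
print(round(purity(clusters), 4))    # 0.8333
print(round(entropy(clusters), 4))   # 0.5409
print(meets_requirement(clusters))   # True
```

Lower entropy and higher purity both indicate that each cluster is dominated by one reference class, so either test can serve as the acceptance criterion.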
Further, the processing unit 510 is specifically configured to: calculate the distance between the first cluster centre point after the last update and the sample point, to obtain the actual value of the first distance;
compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first processing unit 510 with the predicted value of the third distance;
when the first comparison module finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distances between the second cluster centre point and the other cluster centre points still to be traversed are calculated;
when the first comparison module finds that the predicted value of the third distance is less than twice the actual value of the first distance, calculate a fourth distance, where the fourth distance is the distance between the sample point and the second cluster centre point;
determine whether the fourth distance calculated by the second processing unit 510 is less than the actual value of the first distance;
when the first determining module determines that the fourth distance is less than the actual value of the first distance, determine the second cluster centre point to be the cluster centre point nearest to the sample point in the current distance traversal;
when the first determining module determines that the fourth distance is greater than or equal to the actual value of the first distance, determine the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal.
Further, the processing unit 510 is also specifically configured to:
when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assign the second cluster centre point to the first cluster centre point after the last update, and assign the fourth distance to the actual value of the first distance;
when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assign the second cluster centre point to the first cluster centre point after the last update, assign the fourth distance to the actual value of the first distance, and continue by traversing the next cluster centre point in the current cluster centre set based on the reassigned first cluster centre point and the reassigned actual value of the first distance.
Further, the processing unit 510 is also specifically configured to:
when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determine the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal;
when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continue by traversing the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
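The traversal performed by processing unit 510 for a single sample point can be sketched as a loop. This is an illustrative reading, not the patented code: it assumes Euclidean distance and a table `pred_d3[u][j]` holding a lower-bound estimate of the distance between centres u and j (for example duj - Tu - Tj). For brevity the demo fills that table with exact centre-to-centre distances. All names are hypothetical.

```python
import math

def nearest_centre(sample, centres, first_idx, pred_d3):
    """Return (index of nearest centre, its distance, number of distances computed)."""
    best = first_idx
    d1 = math.dist(sample, centres[best])      # actual value of the first distance
    evaluated = 1
    for j in range(len(centres)):
        if j == first_idx:
            continue
        if pred_d3[best][j] >= 2.0 * d1:
            continue                            # second centre discarded, no distance computed
        d4 = math.dist(sample, centres[j])      # fourth distance
        evaluated += 1
        if d4 < d1:
            best, d1 = j, d4                    # second centre becomes the first centre
    return best, d1, evaluated

centres = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
# Demo only: exact pairwise distances stand in for the predicted third distances.
pred_d3 = [[math.dist(a, b) for b in centres] for a in centres]
best, dist, evaluated = nearest_centre((0.9, 0.0), centres, first_idx=0, pred_d3=pred_d3)
print(best, evaluated)  # 1 2
```

In this run only two of the four sample-to-centre distances are computed. Indexing the estimates by the current best centre matters: after a reassignment, the bound must refer to the new first cluster centre point for the discard test to remain valid.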
Further, the processing unit 510 is also specifically configured to:
before the second processing unit 510 calculates the fourth distance, calculate a fifth distance, where the fifth distance is the distance between the second cluster centre point and the first cluster centre point after the last update;
compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first processing unit 510 with the fifth distance calculated by the third processing unit 510;
when the second comparison module finds that the fifth distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that during the clustering traversal neither the distance between the sample point and the second cluster centre point nor the distances between the second cluster centre point and the other cluster centre points still to be traversed are calculated;
when the second comparison module finds that the fifth distance is less than twice the actual value of the first distance, proceed to calculate the fourth distance.
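The additional filter based on the fifth distance can be sketched as a single check made just before the fourth distance would be computed. Euclidean distance is assumed, and the function name and coordinates are hypothetical.

```python
import math

def passes_fifth_distance_check(first_centre, second_centre, d1_actual):
    """Return True when the fourth distance still has to be computed,
    False when the second cluster centre point can be discarded."""
    d5 = math.dist(first_centre, second_centre)   # fifth distance, centre to centre
    return d5 < 2.0 * d1_actual

first_centre = (0.0, 0.0)
d1_actual = 3.0   # actual distance from the sample to the first centre
print(passes_fifth_distance_check(first_centre, (10.0, 0.0), d1_actual))  # False: 10 >= 6
print(passes_fifth_distance_check(first_centre, (4.0, 0.0), d1_actual))   # True: 4 < 6
```

The fifth distance depends only on the two centres, so once computed it can be shared by every sample point whose current nearest centre is the same first cluster centre point.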
Further, the second acquisition unit 505 is specifically configured to:
obtain the value of the first cluster centre point before its last update and its value after the update, and calculate a first difference between the values of the first cluster centre point before and after the update;
obtain the value of the second cluster centre point before its last update and its value after the update, and calculate a second difference between the values of the second cluster centre point before and after the update;
subtract the first difference calculated by the first processing module and the second difference calculated by the second processing module from the second distance, to obtain the predicted value of the third distance.
Further, the device also includes:
Judging unit 511, configured to determine, after the discarding unit 507 has discarded the second cluster centre point, whether the current clustering distance traversal is complete;
Traversal unit 512, configured to continue by traversing the next cluster centre point in the current cluster centre set when the judging unit 511 determines that the traversal is not complete;
Determining unit 513, configured to determine the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal when the judging unit 511 determines that the traversal is complete.
With the website clustering device provided by the embodiment of the present invention, during the current clustering distance traversal and based on the cluster centre set from the last update, a predicted value of the first distance is obtained according to the difference between the values of the first cluster centre point before and after its last update. The first distance is the distance between the sample point to be clustered and the cluster centre point nearest to that sample point. A predicted value of the third distance is obtained according to the second distance, the difference between the values of the first cluster centre point before and after its last update, and the difference between the values of the second cluster centre point before and after its last update. The second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point still to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, the second cluster centre point is discarded. In the present invention, based on the triangle inequality rule, every second cluster centre point in the cluster centre set whose predicted third distance is greater than or equal to twice the predicted first distance is filtered out. Neither the distance between such a second cluster centre point and the sample point nor the distances between it and the other cluster centre points still to be traversed need to be calculated. This reduces the time and computation spent on those distance calculations and improves the computational efficiency of data clustering.
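The reason the discard rule is safe can be written out as a short derivation. It is a sketch under two assumptions that this passage does not state explicitly: the distance $d$ is a metric (for example Euclidean), and the predicted first distance $\hat{d}_1$ is an upper bound on the true distance from the sample point $s$ to the updated first cluster centre point $M_u'$. The symbols $s$, $d$ and $\hat{d}_1$ are introduced here only for illustration.

```latex
% T_u = d(M_u, M_u'), T_j = d(M_j, M_j'), d_{uj} = d(M_u, M_j)
\begin{align*}
d(M_u', M_j') &\ge d(M_u, M_j) - d(M_u, M_u') - d(M_j, M_j')
              = d_{uj} - T_u - T_j, \\
d(s, M_j')    &\ge d(M_u', M_j') - d(s, M_u'). \\
\intertext{If $d_{uj} - T_u - T_j \ge 2\hat{d}_1$ and $\hat{d}_1 \ge d(s, M_u')$, then}
d(s, M_j')    &\ge 2\hat{d}_1 - \hat{d}_1 = \hat{d}_1 \ge d(s, M_u').
\end{align*}
```

Under these assumptions the second cluster centre point $M_j'$ cannot be strictly closer to the sample point than $M_u'$, so skipping it does not change which centre is selected as nearest.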
The technical solutions in the embodiments of the present invention have at least the following technical effects or advantages:
1. With the website clustering method and device provided by the embodiments of the present invention, the clustering result obtained can include clustering information produced by clustering each website in the website clustering with field information, structural information and visitor information as reference dimensions, so that the clustering result can provide data support for subsequent website construction.
2. With the website clustering method and device provided by the embodiments of the present invention, during the current clustering distance traversal and based on the cluster centre set from the last update, a predicted value of the first distance is obtained according to the difference between the values of the first cluster centre point before and after its last update. The first distance is the distance between the sample point to be clustered and the cluster centre point nearest to that sample point. A predicted value of the third distance is obtained according to the second distance, the difference between the values of the first cluster centre point before and after its last update, and the difference between the values of the second cluster centre point before and after its last update. The second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point still to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, the second cluster centre point is discarded. In the present invention, based on the triangle inequality rule, every second cluster centre point in the cluster centre set whose predicted third distance is greater than or equal to twice the predicted first distance is filtered out. Neither the distance between such a second cluster centre point and the sample point nor the distances between it and the other cluster centre points still to be traversed need to be calculated. This reduces the time and computation spent on those distance calculations and improves the computational efficiency of data clustering.
The embodiments of the present invention disclose:
A1. A website clustering method, characterised by comprising:
obtaining a sample set for website clustering and a cluster centre set of the sample set, wherein each sample point in the sample set includes description information of a website in the website clustering, the description information including at least field information, structural information and visitor information;
for each sample point in the sample set, traversing in turn each cluster centre point in the cluster centre set, determining the cluster centre point in the cluster centre set nearest to that sample point, and assigning the sample point to the set corresponding to that nearest cluster centre point, so as to obtain the cluster set corresponding to each cluster centre point in the cluster centre set;
obtaining the average value of the sample points in each cluster set, and updating the cluster centre set according to the average value;
obtaining a predicted value of a first distance according to the difference between the values of a first cluster centre point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
obtaining a predicted value of a third distance according to a second distance, the difference between the values of the first cluster centre point before and after its last update, and the difference between the values of a second cluster centre point before and after its last update, wherein the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point still to be traversed in the current clustering distance traversal;
comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discarding the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distances between the second cluster centre point and the other cluster centre points still to be traversed are calculated;
performing the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded, to obtain a clustering result of the sample set, the clustering result including clustering information obtained by clustering each website in the website clustering with the field information, the structural information and the visitor information as reference dimensions.
A2. The method according to A1, characterised in that, after the clustering result of the sample set is obtained, the method further comprises:
analysing the clustering result so as to evaluate the clustering method.
A3. The method according to A2, characterised in that analysing the clustering result so as to evaluate the clustering method specifically comprises:
analysing the clustering result using an entropy verification algorithm or a purity verification algorithm;
when the entropy of the clustering result obtained by the entropy verification algorithm is less than a first preset value, determining that the clustering method meets a preset requirement; or
when the purity of the clustering result obtained by the purity verification algorithm is greater than a second preset value, determining that the clustering method meets the preset requirement.
A4. The method according to A1, characterised in that the method further comprises:
if the predicted value of the third distance is less than twice the predicted value of the first distance, performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update.
A5. The method according to A3, characterised in that performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update comprises:
calculating the distance between the first cluster centre point after the last update and the sample point, to obtain the actual value of the first distance;
comparing the actual value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
if the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distances between the second cluster centre point and the other cluster centre points still to be traversed are calculated;
if the predicted value of the third distance is less than twice the actual value of the first distance, calculating a fourth distance and determining whether the fourth distance is less than the actual value of the first distance, wherein the fourth distance is the distance between the sample point and the second cluster centre point;
if the fourth distance is less than the actual value of the first distance, determining the second cluster centre point to be the cluster centre point nearest to the sample point in the current distance traversal;
if the fourth distance is greater than or equal to the actual value of the first distance, determining the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal.
A6. The method according to A5, characterised in that determining the second cluster centre point to be the cluster centre point nearest to the sample point in the current distance traversal comprises:
if the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assigning the second cluster centre point to the first cluster centre point after the last update, and assigning the fourth distance to the actual value of the first distance;
if the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assigning the second cluster centre point to the first cluster centre point after the last update, assigning the fourth distance to the actual value of the first distance, and continuing by traversing the next cluster centre point in the current cluster centre set based on the reassigned first cluster centre point and the reassigned actual value of the first distance.
A7. The method according to A5, characterised in that determining the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal comprises:
if the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determining the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal;
if the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continuing by traversing the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
A8. The method according to A6 or A7, characterised in that, before the fourth distance is calculated, the method further comprises:
calculating a fifth distance, wherein the fifth distance is the distance between the second cluster centre point and the first cluster centre point after the last update;
comparing the actual value of the first distance with the fifth distance according to the triangle inequality rule;
if the fifth distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that during the clustering traversal neither the distance between the sample point and the second cluster centre point nor the distances between the second cluster centre point and the other cluster centre points still to be traversed are calculated;
and calculating the fourth distance comprises:
if the fifth distance is less than twice the actual value of the first distance, performing the calculation of the fourth distance.
A9. The method according to any one of A1 to A7, characterised in that obtaining the predicted value of the third distance according to the second distance, the difference between the values of the first cluster centre point before and after its last update, and the difference between the values of the second cluster centre point before and after its last update comprises:
obtaining the value of the first cluster centre point before its last update and its value after the update, and calculating a first difference between the values of the first cluster centre point before and after the update;
obtaining the value of the second cluster centre point before its last update and its value after the update, and calculating a second difference between the values of the second cluster centre point before and after the update;
subtracting the first difference and the second difference from the second distance, to obtain the predicted value of the third distance.
A10. The method according to A7, characterised in that, after the second cluster centre point is discarded and before the distance traversal is performed based on the cluster centre set from which the second cluster centre point has been discarded to obtain the clustering result of the sample set, the method further comprises:
determining whether the current clustering distance traversal is complete;
if the traversal is not complete, continuing by traversing the next cluster centre point in the current cluster centre set;
if the traversal is complete, determining the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal.
B11. A website clustering device, characterised by comprising:
an obtaining unit, configured to obtain a sample set for website clustering and a cluster centre set of the sample set, wherein each sample point in the sample set includes description information of a website in the website clustering, the description information including at least field information, structural information and visitor information;
a cluster set obtaining unit, configured to, for each sample point in the sample set, traverse in turn each cluster centre point in the cluster centre set, determine the cluster centre point in the cluster centre set nearest to that sample point, and assign the sample point to the set corresponding to that nearest cluster centre point, so as to obtain the cluster set corresponding to each cluster centre point in the cluster centre set;
an average value obtaining unit, configured to obtain the average value of the sample points in each cluster set and to update the cluster centre set according to the average value;
a first acquisition unit, configured to obtain a predicted value of a first distance according to the difference between the values of a first cluster centre point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
a second acquisition unit, configured to obtain a predicted value of a third distance according to a second distance, the difference between the values of the first cluster centre point before and after its last update, and the difference between the values of a second cluster centre point before and after its last update, wherein the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point still to be traversed in the current clustering distance traversal;
a comparing unit, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquisition unit with the predicted value of the third distance obtained by the second acquisition unit;
a discarding unit, configured to discard the second cluster centre point when the comparing unit finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distances between the second cluster centre point and the other cluster centre points still to be traversed are calculated;
a cluster result obtaining unit, configured to perform the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded and to obtain a clustering result of the sample set, the clustering result including clustering information obtained by clustering each website in the website clustering with the field information, the structural information and the visitor information as reference dimensions.
B12. The device according to B11, characterised in that the device further comprises:
an analytic unit, configured to analyse the clustering result after the cluster result obtaining unit has obtained it, so as to evaluate the clustering method.
B13. The device according to B12, characterised in that the analytic unit is specifically configured to analyse the clustering result using an entropy verification algorithm or a purity verification algorithm, wherein the clustering method is determined to meet a preset requirement when the entropy of the clustering result obtained by the entropy verification algorithm is less than a first preset value, or when the purity of the clustering result obtained by the purity verification algorithm is greater than a second preset value.
B14. The device according to B11, characterised in that the device further comprises:
a processing unit, configured to perform data clustering processing on the second cluster centre point according to the first cluster centre point after the last update when the comparing unit finds that the predicted value of the third distance is less than twice the predicted value of the first distance.
B15. The device according to B14, characterized in that the processing unit is specifically configured to:
Calculate the distance between the first cluster centre point after the last update and the sample point, to obtain the actual value of the first distance;
Compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module with the predicted value of the third distance;
When the first comparison module finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
When the first comparison module finds that the predicted value of the third distance is less than twice the actual value of the first distance, calculate a fourth distance, wherein the fourth distance is the distance between the sample point and the second cluster centre point;
Determine whether the fourth distance calculated by the second computing module is less than the actual value of the first distance;
When the first determining module determines that the fourth distance is less than the actual value of the first distance, determine the second cluster centre point as the cluster centre point closest to the sample point in the current distance traversal;
When the first determining module determines that the fourth distance is greater than or equal to the actual value of the first distance, determine the first cluster centre point after the last update as the cluster centre point closest to the sample point in the current distance traversal.
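The discard test rests on the triangle inequality: the distance between two centres is at most the sum of their distances to the sample, so if the centre-to-centre distance is at least twice the sample's distance to the first centre, the second centre cannot be closer. The Python sketch below checks this numerically. The function name `can_discard` is hypothetical, and the sketch assumes that the value used for the third distance never exceeds the true centre-to-centre distance; here the exact distance is used in its place.

```python
import math
import random

def can_discard(first_distance, third_distance_lower_bound):
    """True when the second centre cannot be closer to the sample than the first."""
    return third_distance_lower_bound >= 2.0 * first_distance

random.seed(0)
checked = 0
for _ in range(10_000):
    # Random 2-D sample and two random centres (illustrative data only).
    x, c1, c2 = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(3)]
    d_x_c1 = math.dist(x, c1)    # first distance
    d_c1_c2 = math.dist(c1, c2)  # exact centre-to-centre distance
    if can_discard(d_x_c1, d_c1_c2):
        checked += 1
        # Whenever the test fires, the skipped centre is no closer than the first.
        # The small tolerance absorbs floating-point rounding.
        assert math.dist(x, c2) >= d_x_c1 - 1e-12

print("discard test fired", checked, "times without a wrong skip")
```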
B16. The device according to B14, characterized in that the computing module is further specifically configured to:
When the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assign the second cluster centre point to the first cluster centre point after the last update, and assign the fourth distance to the actual value of the first distance;
When the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assign the second cluster centre point to the first cluster centre point after the last update, assign the fourth distance to the actual value of the first distance, and continue to traverse the next cluster centre point in the current cluster centre set based on the assigned first cluster centre point and the assigned actual value of the first distance.
B17. The device according to B14, characterized in that the computing module is further specifically configured to:
When the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determine the first cluster centre point after the last update as the cluster centre point closest to the sample point in the current distance traversal;
When the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continue to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
B18. The device according to B16 or B17, characterized in that the processing unit is further specifically configured to:
Before the second computing module calculates the fourth distance, calculate a fifth distance, the fifth distance being the distance between the second cluster centre point and the first cluster centre point after the last update;
Compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module with the fifth distance calculated by the third computing module;
When the second comparison module finds that the fifth distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that during the cluster traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
When the second comparison module finds that the fifth distance is less than twice the actual value of the first distance, perform the calculation of the fourth distance.
B19. The device according to any one of B11 to B17, characterized in that the second acquisition unit is specifically configured to:
Obtain the value of the first cluster centre point before its last update and its value after the update, and calculate a first difference between the first cluster centre point before and after the update;
Obtain the value of the second cluster centre point before its last update and its value after the update, and calculate a second difference between the second cluster centre point before and after the update;
Perform a subtraction operation on the second distance, the first difference calculated by the first processing module and the second difference calculated by the second processing module, to obtain the predicted value of the third distance.
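One plausible reading of this subtraction is a triangle-inequality lower bound on the distance between the two updated centres. The derivation below is a sketch under two assumptions that the source does not state explicitly: that each "difference" is the Euclidean distance a centre moved during its last update, and that both differences are subtracted from the second distance. All symbols are introduced here for illustration.

```latex
% Assumed notation (not from the source): c_1, c_2 are the centres before the
% update, c_1', c_2' the centres after it, and \lVert\cdot\rVert is the Euclidean norm.
\begin{align*}
d_2 &= \lVert c_1 - c_2 \rVert, \qquad
\delta_1 = \lVert c_1' - c_1 \rVert, \qquad
\delta_2 = \lVert c_2' - c_2 \rVert, \\
d_2 &\le \lVert c_1 - c_1' \rVert + \lVert c_1' - c_2' \rVert + \lVert c_2' - c_2 \rVert
     = \delta_1 + \lVert c_1' - c_2' \rVert + \delta_2, \\
\lVert c_1' - c_2' \rVert &\ge d_2 - \delta_1 - \delta_2 =: \hat{d}_3 .
\end{align*}
```

Under this reading the predicted value never exceeds the true distance between the updated centres, so a discard decision based on it being at least twice the first distance remains safe.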
B20. The device according to B17, characterized in that the device further comprises:
A judging unit, configured to judge, after the discarding unit discards the second cluster centre point, whether the current clustering distance traversal is complete;
A traversal unit, configured to continue to traverse the next cluster centre point in the current cluster centre set when the judging unit judges that the traversal is not complete;
A determining unit, configured to, when the judging unit judges that the traversal is complete, determine the first cluster centre point after the last update as the cluster centre point closest to the sample point in the current distance traversal.
It should be understood by those skilled in the art that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.
Claims (10)
1. A method for clustering websites, characterized by comprising:
Obtaining a sample set for website clustering and a cluster centre set of the sample set, wherein each sample point in the sample set includes description information of a website in the website cluster, the description information including at least field information, structural information and visitor information;
For each sample point in the sample set, traversing in turn each cluster centre point in the cluster centre set, determining the cluster centre point in the cluster centre set that is closest to the sample point, and assigning the sample point to the set corresponding to that closest cluster centre point, to obtain the cluster set corresponding to each cluster centre point in the cluster centre set;
Obtaining the average value of the sample points in each cluster set, and updating the cluster centre set according to the average value;
Obtaining a predicted value of a first distance according to the difference of a first cluster centre point itself before and after its last update, wherein the first distance is the distance between a sample point on which data clustering is to be performed and the first cluster centre point, and the first cluster centre point is the cluster centre point closest to the sample point in the clustering distance traversal;
Obtaining a predicted value of a third distance according to a second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of a second cluster centre point itself before and after its last update, wherein the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal;
Comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discarding the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
Performing the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded, to obtain a clustering result of the sample set, the clustering result including clustering information obtained by clustering the websites in the website cluster with the field information, the structural information and the visitor information as reference dimensions.
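The steps of claim 1 can be sketched in Python as one re-assignment pass after a centre update. This is a minimal illustration, not the patented implementation. It assumes that the predicted first distance is the previous sample-to-centre distance plus the distance the centre moved (an upper bound), and that the predicted third distance is the previous centre-to-centre distance minus both centre movements (a lower bound). The function names and the toy data are hypothetical, and the website features are reduced to 2-D points.

```python
import math

def brute_force_assign(samples, centres):
    """Plain nearest-centre assignment: every sample is compared with every centre."""
    return [min(range(len(centres)), key=lambda j: math.dist(x, centres[j]))
            for x in samples]

def update_centres(samples, assign, centres):
    """Replace each centre by the mean of its cluster (unchanged if the cluster is empty)."""
    updated = []
    for j, centre in enumerate(centres):
        members = [x for x, a in zip(samples, assign) if a == j]
        if members:
            centre = tuple(sum(axis) / len(members) for axis in zip(*members))
        updated.append(centre)
    return updated

def assign_with_predictions(samples, old_centres, new_centres, old_assign):
    """Re-assign samples, skipping centres ruled out by the predicted distances."""
    k = len(new_centres)
    shift = [math.dist(o, n) for o, n in zip(old_centres, new_centres)]
    old_cc = [[math.dist(a, b) for b in old_centres] for a in old_centres]  # second distances
    assign, skipped = [], 0
    for x, a in zip(samples, old_assign):
        # Predicted first distance (assumed upper bound). The old distance is
        # recomputed here for brevity; a real implementation would cache it.
        first_pred = math.dist(x, old_centres[a]) + shift[a]
        survivors = []
        for j in range(k):
            if j == a:
                continue
            # Predicted third distance (assumed lower bound on the new centre distance).
            third_pred = old_cc[a][j] - shift[a] - shift[j]
            if third_pred >= 2.0 * first_pred:
                skipped += 1  # discard centre j for this sample
            else:
                survivors.append(j)
        best = a
        if survivors:
            best_dist = math.dist(x, new_centres[a])  # actual first distance
            for j in survivors:
                fourth = math.dist(x, new_centres[j])
                if fourth < best_dist:
                    best, best_dist = j, fourth
        assign.append(best)
    return assign, skipped

# Toy data: two well-separated groups of points.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
old_centres = [(1.0, 1.0), (9.0, 9.0)]
old_assign = brute_force_assign(samples, old_centres)
new_centres = update_centres(samples, old_assign, old_centres)
pruned_assign, skipped = assign_with_predictions(samples, old_centres, new_centres, old_assign)
print(pruned_assign, "sample-to-centre comparisons skipped:", skipped)
```

The pruned pass should give the same assignment as the brute-force pass on the updated centres, because a centre is only skipped when the bounds guarantee it cannot be closer.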
2. The method according to claim 1, characterized in that, after the clustering result of the sample set is obtained, the method further comprises:
Analyzing the clustering result, so as to evaluate the clustering method.
3. The method according to claim 2, characterized in that analyzing the clustering result so as to evaluate the clustering method specifically comprises:
Analyzing the clustering result by an entropy verification algorithm or a purity verification algorithm;
When the entropy of the clustering result obtained by the entropy verification algorithm is less than a first preset value, determining that the clustering method meets a preset requirement; or
When the purity of the clustering result obtained by the purity verification algorithm is greater than a second preset value, determining that the clustering method meets the preset requirement.
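The entropy check can be illustrated with a short Python sketch. It uses a common definition of clustering entropy, the size-weighted average of the label entropy inside each cluster, which is an assumption here because the source does not define the entropy verification algorithm. The reference labels, the function name and the threshold are likewise hypothetical.

```python
import math
from collections import Counter

def clustering_entropy(cluster_ids, reference_labels):
    """Size-weighted average of the per-cluster label entropy, in bits."""
    n = len(cluster_ids)
    if n == 0 or n != len(reference_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    by_cluster = {}
    for cid, label in zip(cluster_ids, reference_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    total = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        # Shannon entropy of the label distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total

FIRST_PRESET_VALUE = 0.6  # assumed threshold, not taken from the source

# Hypothetical data: cluster 0 mixes two categories evenly, cluster 1 is pure.
entropy = clustering_entropy([0, 0, 1, 1], ["news", "shop", "blog", "blog"])
meets_requirement = entropy < FIRST_PRESET_VALUE
print(entropy, meets_requirement)
```

Lower entropy means the clusters are less mixed, which is why the check passes when the value falls below the threshold.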
4. The method according to claim 1, characterized in that the method further comprises:
If the predicted value of the third distance is less than twice the predicted value of the first distance, performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update.
5. The method according to claim 3, characterized in that performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update comprises:
Calculating the distance between the first cluster centre point after the last update and the sample point, to obtain the actual value of the first distance;
Comparing the actual value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
If the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
If the predicted value of the third distance is less than twice the actual value of the first distance, calculating a fourth distance and determining whether the fourth distance is less than the actual value of the first distance, wherein the fourth distance is the distance between the sample point and the second cluster centre point;
If the fourth distance is less than the actual value of the first distance, determining the second cluster centre point as the cluster centre point closest to the sample point in the current distance traversal;
If the fourth distance is greater than or equal to the actual value of the first distance, determining the first cluster centre point after the last update as the cluster centre point closest to the sample point in the current distance traversal.
6. The method according to claim 5, characterized in that determining the second cluster centre point as the cluster centre point closest to the sample point in the current distance traversal comprises:
If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assigning the second cluster centre point to the first cluster centre point after the last update, and assigning the fourth distance to the actual value of the first distance;
If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assigning the second cluster centre point to the first cluster centre point after the last update, assigning the fourth distance to the actual value of the first distance, and continuing to traverse the next cluster centre point in the current cluster centre set based on the assigned first cluster centre point and the assigned actual value of the first distance.
7. The method according to claim 5, characterized in that determining the first cluster centre point after the last update as the cluster centre point closest to the sample point in the current distance traversal comprises:
If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determining the first cluster centre point after the last update as the cluster centre point closest to the sample point in the current distance traversal;
If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continuing to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
8. The method according to claim 6 or 7, characterized in that, before the fourth distance is calculated, the method further comprises:
Calculating a fifth distance, the fifth distance being the distance between the second cluster centre point and the first cluster centre point after the last update;
Comparing the actual value of the first distance with the fifth distance according to the triangle inequality rule;
If the fifth distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that during the cluster traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
Calculating the fourth distance comprises:
If the fifth distance is less than twice the actual value of the first distance, performing the calculation of the fourth distance.
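The traversal described in claims 5 to 8 can be sketched as a nearest-centre search for a single sample, in which the fifth distance (centre to centre) decides whether the fourth distance (sample to candidate centre) needs to be computed at all. This is an illustrative Python sketch under assumed names. It recomputes the centre-to-centre distance on demand, whereas an implementation would more likely look it up from a table shared by all samples.

```python
import math
import random

def nearest_centre(sample, centres, start_index=0):
    """Return (index, distance, evaluations) for the centre closest to the sample.

    start_index is the centre the sample was assigned to previously (assumed known).
    evaluations counts how many sample-to-centre distances were computed.
    """
    best = start_index
    best_dist = math.dist(sample, centres[best])  # actual value of the first distance
    evaluations = 1
    for j in range(len(centres)):
        if j == start_index:
            continue
        fifth = math.dist(centres[best], centres[j])
        if fifth >= 2.0 * best_dist:
            continue  # discard: centre j cannot be closer than the current best
        fourth = math.dist(sample, centres[j])
        evaluations += 1
        if fourth < best_dist:
            # The candidate becomes the new "first" centre for the rest of the traversal.
            best, best_dist = j, fourth
    return best, best_dist, evaluations

# Hypothetical example: four centres, sample previously assigned to centre 0.
centres = [(0.0, 0.0), (5.0, 0.0), (0.15, 0.0), (9.0, 9.0)]
index, distance, evaluations = nearest_centre((0.1, 0.0), centres)
print(index, distance, evaluations)
```

In this example only two of the four sample-to-centre distances are evaluated; the two far centres are discarded by the fifth-distance test.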
9. The method according to any one of claims 1 to 7, characterized in that obtaining the predicted value of the third distance according to the second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of the second cluster centre point itself before and after its last update comprises:
Obtaining the value of the first cluster centre point before its last update and its value after the update, and calculating a first difference between the first cluster centre point before and after the update;
Obtaining the value of the second cluster centre point before its last update and its value after the update, and calculating a second difference between the second cluster centre point before and after the update;
Performing a subtraction operation on the second distance, the first difference and the second difference, to obtain the predicted value of the third distance.
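A small Python sketch of this calculation follows. It assumes that each difference is the Euclidean distance a centre moved during its last update and that both differences are subtracted from the second distance; the claim wording does not spell out either point. The function name and the coordinates are hypothetical.

```python
import math

def predicted_third_distance(second_distance, c1_old, c1_new, c2_old, c2_new):
    """Estimate the distance between two updated centres without measuring it directly."""
    first_difference = math.dist(c1_old, c1_new)   # how far the first centre moved
    second_difference = math.dist(c2_old, c2_new)  # how far the second centre moved
    return second_distance - first_difference - second_difference

# Hypothetical centres before and after one update.
c1_old, c1_new = (0.0, 0.0), (0.3, 0.4)    # moved by about 0.5
c2_old, c2_new = (10.0, 0.0), (10.0, 1.0)  # moved by 1.0
second_distance = math.dist(c1_old, c2_old)  # distance from the previous traversal

estimate = predicted_third_distance(second_distance, c1_old, c1_new, c2_old, c2_new)
actual = math.dist(c1_new, c2_new)
print(estimate, actual)
```

Under these assumptions the estimate is a lower bound on the actual distance, so it can be compared with twice the first distance without risking a wrong discard.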
10. A device for clustering websites, characterized by comprising:
An obtaining unit, configured to obtain a sample set for website clustering and a cluster centre set of the sample set, wherein each sample point in the sample set includes description information of a website in the website cluster, the description information including at least field information, structural information and visitor information;
A cluster set obtaining unit, configured to, for each sample point in the sample set, traverse in turn each cluster centre point in the cluster centre set, determine the cluster centre point in the cluster centre set that is closest to the sample point, and assign the sample point to the set corresponding to that closest cluster centre point, to obtain the cluster set corresponding to each cluster centre point in the cluster centre set;
An average value obtaining unit, configured to obtain the average value of the sample points in each cluster set and update the cluster centre set according to the average value;
A first acquisition unit, configured to obtain a predicted value of a first distance according to the difference of a first cluster centre point itself before and after its last update, wherein the first distance is the distance between a sample point on which data clustering is to be performed and the first cluster centre point, and the first cluster centre point is the cluster centre point closest to the sample point in the clustering distance traversal;
A second acquisition unit, configured to obtain a predicted value of a third distance according to a second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of a second cluster centre point itself before and after its last update, wherein the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal;
A comparing unit, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquisition unit with the predicted value of the third distance obtained by the second acquisition unit;
A discarding unit, configured to, when the comparing unit finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discard the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
A clustering result obtaining unit, configured to perform the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded, to obtain a clustering result of the sample set, the clustering result including clustering information obtained by clustering the websites in the website cluster with the field information, the structural information and the visitor information as reference dimensions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510982364.9A CN106909932A (en) | 2015-12-23 | 2015-12-23 | A kind of method and device of website cluster |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510982364.9A CN106909932A (en) | 2015-12-23 | 2015-12-23 | A kind of method and device of website cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106909932A (en) | 2017-06-30 |
Family
ID=59206042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510982364.9A Pending CN106909932A (en) | 2015-12-23 | 2015-12-23 | A kind of method and device of website cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909932A (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101996197A (en) * | 2009-08-31 | 2011-03-30 | ***通信集团公司 | Cluster realizing method and system |
CN102750647A (en) * | 2012-06-29 | 2012-10-24 | 南京大学 | Merchant recommendation method based on transaction network |
CN104101902A (en) * | 2013-04-10 | 2014-10-15 | 中国石油天然气股份有限公司 | Earthquake attribute cluster method and apparatus |
CN103412948A (en) * | 2013-08-27 | 2013-11-27 | 北京交通大学 | Cluster-based collaborative filtering commodity recommendation method and system |
CN103605794A (en) * | 2013-12-05 | 2014-02-26 | 国家计算机网络与信息安全管理中心 | Website classifying method |
CN104765776A (en) * | 2015-03-18 | 2015-07-08 | 华为技术有限公司 | Data sample clustering method and device |
CN105095912A (en) * | 2015-08-06 | 2015-11-25 | 北京奇虎科技有限公司 | Data clustering method and device |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
De Boer et al. | A tutorial on the cross-entropy method | |
US20160357845A1 (en) | Method and Apparatus for Classifying Object Based on Social Networking Service, and Storage Medium | |
CN108268931A (en) | The methods, devices and systems of data processing | |
CN104735166B (en) | The Skyline method for service selection annealed based on MapReduce and multi-target simulation | |
CN109214337A (en) | A kind of Demographics' method, apparatus, equipment and computer readable storage medium | |
US9147009B2 (en) | Method of temporal bipartite projection | |
CN110110237A (en) | User interest information recommended method, storage medium | |
CN112819157B (en) | Neural network training method and device, intelligent driving control method and device | |
CN110267206A (en) | User location prediction technique and device | |
CN112687266B (en) | Speech recognition method, device, computer equipment and storage medium | |
Qian et al. | Kernel estimation and model combination in a bandit problem with covariates | |
CN111178486A (en) | Hyper-parameter asynchronous parallel search method based on population evolution | |
CN110462638A (en) | Training neural network is sharpened using posteriority | |
CN109063041A (en) | The method and device of relational network figure insertion | |
CN111967964B (en) | Intelligent recommending method and device for bank client sites | |
CN106803092B (en) | Method and device for determining standard problem data | |
CN106910079A (en) | A kind of method and device of crowd's cluster | |
Lin et al. | Currency exchange rates prediction based on linear regression analysis using cloud computing | |
Feuer et al. | TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks | |
Joest et al. | A user-aware tour proposal framework using a hybrid optimization approach | |
CN106909932A (en) | A kind of method and device of website cluster | |
CN106910080A (en) | A kind of method and device being analyzed according to crowd's cluster result | |
CN115907262A (en) | Tour route planning method and device, electronic equipment and storage medium | |
CN106909569A (en) | A kind of method and device being analyzed according to website cluster result | |
CN115661861A (en) | Skeleton behavior identification method based on dynamic time sequence multidimensional adaptive graph convolution network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170630 |