CN106909932A - Method and device for website clustering - Google Patents

Method and device for website clustering

Info

Publication number
CN106909932A
Authority
CN
China
Prior art keywords
distance
cluster centre
point
cluster
centre point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510982364.9A
Other languages
Chinese (zh)
Inventor
杨诗
向园
洪春晓
吕俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201510982364.9A
Publication of CN106909932A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a method and device for website clustering. The method filters out any second cluster center point for which the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance. The resulting cluster result contains clustering information obtained by clustering each website in the website group with field information, structural information, and visitor information as base dimensions, so the cluster result can provide data support for subsequent website construction. During the current clustering distance traversal, neither the distance between the filtered second cluster center point and the sample point nor the distance between that second cluster center point and the other cluster center points to be traversed needs to be calculated. This reduces the time and computation those distance calculations would consume and improves the computational efficiency of data clustering.

Description

Method and device for website clustering
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for website clustering.
Background
With the development of the times, websites have become an important channel through which people obtain information, and websites of all kinds present a wide variety of information. For example, music websites present music, video websites present videos, and news websites present news. Different websites also use different structures: some use a flat structure, while others use a diversified structure. These differences give visitors different experiences, and people choose websites according to their own preferences, so the number of visits differs from website to website. Access data for these websites can be recorded in the corresponding big data and later retrieved, so that the information contained in the big data can be analyzed, for example to determine which types of websites users prefer, providing data support for subsequent website construction.
At present, such big data is usually analyzed with a clustering algorithm. For example, when the samples in a sample set S {S1, S2, S3 ... Sn} are clustered, the following first scheme is used. In the K-th iteration, for any sample Si, compute its distance to each cluster center point in the cluster center set M {M1, M2 ... Mj ... Mk} and assign Si to the class set of the nearest cluster center point. Update the cluster center points in M by taking means. Then compute the difference between the class sets produced by the current iteration and those produced by the previous iteration, and repeat until the difference satisfies a preset error condition.
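The first scheme is ordinary k-means. A minimal Python sketch of it follows, assuming Euclidean distance and using "assignments stop changing" as the stopping rule in place of the preset error condition, which the text does not specify. The function name `kmeans` and the `max_iter` cap are illustrative and do not come from the patent.

```python
import math

def kmeans(samples, centers, max_iter=100):
    """Plain k-means: each pass computes all n*k sample-to-center distances."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: every sample is compared with every center.
        new_labels = []
        for s in samples:
            dists = [math.dist(s, c) for c in centers]
            new_labels.append(dists.index(min(dists)))
        # Assumed stopping rule: the class sets did not change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

With n samples and k centers, the assignment step alone performs n*k distance calculations per pass, which is the cost the following paragraphs set out to reduce.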
When computing the cluster set of each cluster center point, this method must compute the distance between every sample in the sample set S and every cluster center point in the cluster center set M, that is, n*k point-to-point distance calculations. The amount of computation is large and the running time is long.
To address the heavy computation and long running time of the first scheme, the prior art also provides a second scheme, which improves the step of assigning Si to the class set of its nearest cluster center point. The improved scheme is as follows. Compute and store the distance between every pair of cluster center points in the cluster center set M {M1, M2 ... Mj ... Mk}. Then apply the triangle inequality by comparing Luj with 2Lui, where Luj is the distance between cluster center points Mu and Mj, Mu is the cluster center point currently nearest to Si, Mj is the cluster center point to be traversed in the current traversal, and Lui is the distance between Si and Mu. If Luj is greater than or equal to 2Lui, ignore Mj and continue with the next cluster center point or, once the traversal is complete, assign Si to the class set of Mu. If Luj is less than 2Lui, compute Lij, the distance between sample point Si and cluster center point Mj. When Lij is less than Lui, set Lui = Lij and Mu = Mj, then continue with the next cluster center point or, once the traversal is complete, assign Si to the class set of Mu.
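The second scheme's pruning rule can be sketched as follows. This is an illustrative Python version assuming Euclidean distance; the names `pairwise` and `assign_with_pruning` and the returned call counter are hypothetical additions that make the saving visible.

```python
import math

def pairwise(centers):
    """Distance between every pair of centers (Luj), computed once per pass."""
    return [[math.dist(a, b) for b in centers] for a in centers]

def assign_with_pruning(sample, centers, center_dist):
    """Return (nearest index u, distance Lui, number of sample-to-center distances)."""
    u = 0
    l_ui = math.dist(sample, centers[0])
    calls = 1
    for j in range(1, len(centers)):
        # Triangle inequality: Lij >= Luj - Lui. If Luj >= 2*Lui then
        # Lij >= Lui, so Mj cannot be closer than Mu and is skipped.
        if center_dist[u][j] >= 2 * l_ui:
            continue
        l_ij = math.dist(sample, centers[j])
        calls += 1
        if l_ij < l_ui:
            u, l_ui = j, l_ij
    return u, l_ui, calls
```

The pruning saves sample-to-center distances, but `pairwise` still has to be recomputed after every center update, which is the remaining cost discussed next.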
Both schemes obtain clustering information by clustering the big data. However, in implementing the second scheme, the inventors found the following problem. When judging whether a given cluster center point is the cluster center point of a sample, once the cluster center point Mu nearest to sample Si in the cluster center set M has been determined, the triangle inequality allows cluster center points in M that cannot be the cluster center point of Si to be discarded without computing their distance to Si, which reduces computation and shortens running time to some extent. But when there are many cluster center points and the clustering requirement is fine-grained, every iteration must still compute the pairwise distances between cluster center points, so the amount of computation remains large and the running time long.
Therefore, clustering algorithms in the prior art suffer from the technical problem that every iteration must compute the pairwise distances between cluster center points, which makes the computation heavy and the running time long.
Summary of the invention
By providing a method and device for website clustering, embodiments of the present invention solve the technical problem in the prior art that clustering algorithms must compute the pairwise distances between cluster center points in every iteration, which makes the computation heavy and the running time long.
A first aspect of the embodiments of the present invention provides a method for website clustering, comprising:
Obtaining a sample set for website clustering and the cluster center set of the sample set, where each sample point in the sample set includes description information of a website in the website group, and the description information includes at least field information, structural information, and visitor information;
For each sample point in the sample set, traversing in turn each cluster center point in the cluster center set, determining the cluster center point in the cluster center set nearest to that sample point, and assigning the sample point to the set corresponding to that nearest cluster center point, thereby obtaining the cluster set corresponding to each cluster center point in the cluster center set;
Obtaining the mean of the sample points in each cluster set, and updating the cluster center set according to the mean;
Obtaining a predicted value of a first distance according to the change in a first cluster center point before and after its last update, where the first distance is the distance between a sample point to be clustered and the first cluster center point, and the first cluster center point is the cluster center point nearest to the sample point in the clustering distance traversal;
Obtaining a predicted value of a third distance according to a second distance, the change in the first cluster center point before and after its last update, and the change in a second cluster center point before and after its last update, where the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is the cluster center point to be traversed in the current clustering distance traversal;
Comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality;
If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discarding the second cluster center point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;
Performing the distance traversal based on the cluster center set from which the second cluster center point has been discarded, and obtaining a cluster result of the sample set, where the cluster result includes clustering information obtained by clustering each website in the website group with the field information, the structural information, and the visitor information as base dimensions.
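The discard test in the method above uses only values carried over from the previous pass. The Python sketch below shows one way to read it. It assumes that the predicted first distance is the previous sample-to-center distance plus the distance the first center moved (this section does not spell that formula out), and that the predicted third distance is the previous center-to-center distance minus both centers' movements. The function and parameter names are hypothetical.

```python
def can_discard(old_first_dist, old_center_dist, shift_first, shift_second):
    """Decide whether the second center can be skipped without any new distance.

    old_first_dist  -- sample to first center, from the previous traversal
    old_center_dist -- first center to second center, from the previous
                       traversal (the "second distance")
    shift_first     -- how far the first center moved in its last update
    shift_second    -- how far the second center moved in its last update
    """
    # Assumed upper bound on the current sample-to-first-center distance.
    pred_first = old_first_dist + shift_first
    # Lower bound on the current center-to-center distance.
    pred_third = old_center_dist - shift_first - shift_second
    return pred_third >= 2 * pred_first
```

For example, with a previous first distance of 1.0, a previous center-to-center distance of 10.0, and both centers having moved by 0.5, the predicted values are 1.5 and 9.0, so the second center is discarded. Under these assumptions the test is conservative: a lower bound is compared against twice an upper bound, so a discarded center cannot be the nearest one.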
Optionally, after the cluster result of the sample set is obtained, the method further comprises:
Analyzing the cluster result in order to evaluate the clustering method.
Optionally, analyzing the cluster result in order to evaluate the clustering method specifically comprises:
Analyzing the cluster result with an entropy verification algorithm or a purity verification algorithm;
When the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, determining that the clustering method meets a preset requirement; or
When the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value, determining that the clustering method meets the preset requirement.
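The text does not define the entropy or purity measures. As an illustration only, the Python sketch below uses the commonly cited definitions, which assume that a reference class label is available for each website: purity is the fraction of samples that belong to the majority class of their cluster, and entropy is the size-weighted average of each cluster's class entropy. The function names and the base-2 logarithm are assumptions.

```python
import math
from collections import Counter

def _class_counts(clusters, classes):
    """Count reference classes inside each cluster."""
    groups = {}
    for k, c in zip(clusters, classes):
        groups.setdefault(k, Counter())[c] += 1
    return groups

def purity(clusters, classes):
    """Higher is better; 1.0 means every cluster holds a single class."""
    groups = _class_counts(clusters, classes)
    return sum(max(cnt.values()) for cnt in groups.values()) / len(clusters)

def entropy(clusters, classes):
    """Lower is better; 0.0 means every cluster holds a single class."""
    groups = _class_counts(clusters, classes)
    n = len(clusters)
    total = 0.0
    for cnt in groups.values():
        n_k = sum(cnt.values())
        h = -sum((v / n_k) * math.log2(v / n_k) for v in cnt.values())
        total += (n_k / n) * h
    return total
```

Under these definitions the two thresholds in the claims point in opposite directions: a result passes when its entropy is below the first preset value or its purity is above the second.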
Optionally, the method further comprises:
If the predicted value of the third distance is less than twice the predicted value of the first distance, performing data clustering processing on the second cluster center point according to the first cluster center point after its last update.
Optionally, performing data clustering processing on the second cluster center point according to the first cluster center point after its last update comprises:
Calculating the distance between the first cluster center point after its last update and the sample point to obtain an actual value of the first distance;
Comparing the actual value of the first distance with the predicted value of the third distance according to the triangle inequality;
If the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster center point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;
If the predicted value of the third distance is less than twice the actual value of the first distance, calculating a fourth distance and determining whether the fourth distance is less than the actual value of the first distance, where the fourth distance is the distance between the sample point and the second cluster center point;
If the fourth distance is less than the actual value of the first distance, determining the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal;
If the fourth distance is greater than or equal to the actual value of the first distance, determining the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal.
Optionally, determining the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal comprises:
If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assigning the second cluster center point to the first cluster center point after its last update, and assigning the fourth distance to the actual value of the first distance;
If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assigning the second cluster center point to the first cluster center point after its last update, assigning the fourth distance to the actual value of the first distance, and continuing to traverse the next cluster center point in the current cluster center set based on the reassigned first cluster center point and the reassigned actual value of the first distance.
Optionally, determining the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal comprises:
If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determining the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal;
If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continuing to traverse the next cluster center point in the current cluster center set based on the first cluster center point after its last update and the actual value of the first distance.
Optionally, before the fourth distance is calculated, the method further comprises:
Calculating a fifth distance, where the fifth distance is the distance between the second cluster center point and the first cluster center point after its last update;
Comparing the actual value of the first distance with the fifth distance according to the triangle inequality;
If the fifth distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster center point, so that during the clustering traversal neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;
Calculating the fourth distance comprises:
If the fifth distance is less than twice the actual value of the first distance, performing the calculation of the fourth distance.
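The staged checks described in the optional steps above (predicted values first, then the actual first distance, then the fifth distance, and only then the fourth distance) can be combined into one traversal. The Python sketch below is one possible reading, not the patent's definitive procedure. It assumes Euclidean distance, assumes the predicted first distance is the previous distance plus the first center's movement, and, once the nearest center has been reassigned, falls back to the fifth-distance check alone because the predicted third distances were formed relative to the original first center. All names are hypothetical.

```python
import math

def predict(old_centers, new_centers, first, old_first_dist):
    """Build the predicted first distance and predicted third distances."""
    shift = [math.dist(o, n) for o, n in zip(old_centers, new_centers)]
    pred_first = old_first_dist + shift[first]  # assumed upper bound
    # In practice the old center-to-center distances (the "second distance")
    # would be stored from the previous pass; they are recomputed here only
    # to keep the demo short.
    pred_third = [math.dist(old_centers[first], o) - shift[first] - s
                  for o, s in zip(old_centers, shift)]
    return pred_first, pred_third

def nearest_center(sample, centers, first, pred_first, pred_third):
    """Index of the updated center nearest to `sample`."""
    u, d_u = first, None  # d_u: actual first distance, computed lazily
    for j in range(len(centers)):
        if j == first:
            continue
        # Stage 1: predicted values only, no distance is computed.
        if u == first and pred_third[j] >= 2 * pred_first:
            continue
        if d_u is None:
            d_u = math.dist(sample, centers[u])
        # Stage 2: actual first distance against the predicted third distance.
        if u == first and pred_third[j] >= 2 * d_u:
            continue
        # Stage 3: fifth distance, the actual center-to-center distance.
        if math.dist(centers[u], centers[j]) >= 2 * d_u:
            continue
        # Stage 4: fourth distance, sample to the candidate center.
        fourth = math.dist(sample, centers[j])
        if fourth < d_u:
            u, d_u = j, fourth
    return u
```

Each stage is cheaper than the next, so a candidate center reaches the fourth-distance calculation only when none of the bounds can rule it out.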
Optionally, obtaining the predicted value of the third distance according to the second distance, the change in the first cluster center point before and after its last update, and the change in the second cluster center point before and after its last update comprises:
Obtaining the value of the first cluster center point before its last update and its value after the update, and calculating a first difference between the first cluster center point before and after the update;
Obtaining the value of the second cluster center point before its last update and its value after the update, and calculating a second difference between the second cluster center point before and after the update;
Subtracting the first difference and the second difference from the second distance to obtain the predicted value of the third distance.
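Why the subtraction gives a safe prediction can be seen from the triangle inequality. The short derivation below assumes that each "difference" is the distance a center moved during its last update, which this section does not state explicitly; the symbols follow the Mu, Mj notation used in the background, with primes marking updated centers.

```latex
% M_u, M_j: centers before the last update; M_u', M_j': centers after it.
% d(.,.) is any distance that satisfies the triangle inequality.
\begin{aligned}
d(M_u, M_j) &\le d(M_u, M_u') + d(M_u', M_j') + d(M_j', M_j) \\
\Longrightarrow\quad
d(M_u', M_j') &\ge
  \underbrace{d(M_u, M_j)}_{\text{second distance}}
  - \underbrace{d(M_u, M_u')}_{\text{first difference}}
  - \underbrace{d(M_j, M_j')}_{\text{second difference}}
\end{aligned}
```

Under that assumption the predicted third distance never overestimates the true distance between the updated centers, so discarding on the basis of the prediction cannot remove a center that is actually nearer to the sample.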
Optionally, after the second cluster center point is discarded and before the distance traversal is performed based on the cluster center set from which the second cluster center point has been discarded to obtain the cluster result of the sample set, the method further comprises:
Judging whether the current clustering distance traversal is complete;
If the traversal is not complete, continuing to traverse the next cluster center point in the current cluster center set;
If the traversal is complete, determining the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal.
A second aspect of the embodiments of the present invention further provides a device for website clustering, comprising:
An obtaining unit, configured to obtain a sample set for website clustering and the cluster center set of the sample set, where each sample point in the sample set includes description information of a website in the website group, and the description information includes at least field information, structural information, and visitor information;
A cluster set obtaining unit, configured to, for each sample point in the sample set, traverse in turn each cluster center point in the cluster center set, determine the cluster center point in the cluster center set nearest to that sample point, and assign the sample point to the set corresponding to that nearest cluster center point, thereby obtaining the cluster set corresponding to each cluster center point in the cluster center set;
An average value obtaining unit, configured to obtain the mean of the sample points in each cluster set and update the cluster center set according to the mean;
A first acquisition unit, configured to obtain a predicted value of a first distance according to the change in a first cluster center point before and after its last update, where the first distance is the distance between a sample point to be clustered and the first cluster center point, and the first cluster center point is the cluster center point nearest to the sample point in the clustering distance traversal;
A second acquisition unit, configured to obtain a predicted value of a third distance according to a second distance, the change in the first cluster center point before and after its last update, and the change in a second cluster center point before and after its last update, where the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is the cluster center point to be traversed in the current clustering distance traversal;
A comparing unit, configured to compare, according to the triangle inequality, the predicted value of the first distance obtained by the first acquisition unit with the predicted value of the third distance obtained by the second acquisition unit;
A discarding unit, configured to discard the second cluster center point when the comparing unit finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;
A cluster result obtaining unit, configured to perform the distance traversal based on the cluster center set from which the second cluster center point has been discarded and obtain a cluster result of the sample set, where the cluster result includes clustering information obtained by clustering each website in the website group with the field information, the structural information, and the visitor information as base dimensions.
Optionally, the device further comprises:
An analysis unit, configured to analyze the cluster result after the cluster result obtaining unit obtains it, in order to evaluate the clustering method.
Optionally, the analysis unit is specifically configured to analyze the cluster result with an entropy verification algorithm or a purity verification algorithm, where the clustering method is determined to meet a preset requirement when the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, or when the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value.
Optionally, the device further comprises:
A processing unit, configured to perform data clustering processing on the second cluster center point according to the first cluster center point after its last update when the comparing unit finds that the predicted value of the third distance is less than twice the predicted value of the first distance.
Optionally, the processing unit is specifically configured to: calculate the distance between the first cluster center point after its last update and the sample point to obtain an actual value of the first distance;
Compare, according to the triangle inequality, the actual value of the first distance calculated by the first computing module with the predicted value of the third distance;
When the first comparison module finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discard the second cluster center point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;
When the first comparison module finds that the predicted value of the third distance is less than twice the actual value of the first distance, calculate a fourth distance, where the fourth distance is the distance between the sample point and the second cluster center point;
Determine whether the fourth distance calculated by the second computing module is less than the actual value of the first distance;
When the first determining module determines that the fourth distance is less than the actual value of the first distance, determine the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal;
When the first determining module determines that the fourth distance is greater than or equal to the actual value of the first distance, determine the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal.
Optionally, the computing module is further specifically configured to:
When the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assign the second cluster center point to the first cluster center point after its last update, and assign the fourth distance to the actual value of the first distance;
When the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assign the second cluster center point to the first cluster center point after its last update, assign the fourth distance to the actual value of the first distance, and continue to traverse the next cluster center point in the current cluster center set based on the reassigned first cluster center point and the reassigned actual value of the first distance.
Optionally, the computing module is further specifically configured to:
When the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determine the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal;
When the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continue to traverse the next cluster center point in the current cluster center set based on the first cluster center point after its last update and the actual value of the first distance.
Optionally, the processing unit is further specifically configured to:
Before the second computing module calculates the fourth distance, calculate a fifth distance, where the fifth distance is the distance between the second cluster center point and the first cluster center point after its last update;
Compare, according to the triangle inequality, the actual value of the first distance calculated by the first computing module with the fifth distance calculated by the third computing module;
When the second comparison module finds that the fifth distance is greater than or equal to twice the actual value of the first distance, discard the second cluster center point, so that during the clustering traversal neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;
When the second comparison module finds that the fifth distance is less than twice the actual value of the first distance, perform the calculation of the fourth distance.
Optionally, the second acquisition unit is specifically configured to:
Obtain the value of the first cluster center point before its last update and its value after the update, and calculate a first difference between the first cluster center point before and after the update;
Obtain the value of the second cluster center point before its last update and its value after the update, and calculate a second difference between the second cluster center point before and after the update;
Subtract the first difference calculated by the first processing module and the second difference calculated by the second processing module from the second distance to obtain the predicted value of the third distance.
Optionally, the device further comprises:
A judging unit, configured to judge whether the current clustering distance traversal is complete after the discarding unit discards the second cluster center point;
A traversal unit, configured to continue traversing the next cluster center point in the current cluster center set when the judging unit judges that the traversal is not complete;
A determining unit, configured to determine, when the judging unit judges that the traversal is complete, the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal.
The one or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
1. With the method and device for website clustering provided by the embodiments of the present invention, the cluster result obtained can include clustering information produced by clustering each website in the website group with field information, structural information, and visitor information as base dimensions, so the cluster result can provide data support for subsequent website construction.
2. With the method and device for website clustering provided by the embodiments of the present invention, during the current clustering distance traversal and based on the most recently updated cluster center set, the predicted value of the first distance is obtained according to the change in the first cluster center point before and after its last update, where the first distance is the distance between the sample point to be clustered and the cluster center point nearest to that sample point. The predicted value of the third distance is obtained according to the second distance, the change in the first cluster center point before and after its last update, and the change in the second cluster center point before and after its last update, where the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is the cluster center point to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the former is greater than or equal to twice the latter, the second cluster center point is discarded. In the present invention, the triangle inequality is thus used to filter out every second cluster center point in the cluster center set whose predicted third distance is greater than or equal to twice the predicted first distance, so neither the distance between that second cluster center point and the sample point nor the distance between that second cluster center point and the other cluster center points to be traversed has to be calculated. This reduces the time and computation those distance calculations would consume and improves the computational efficiency of data clustering.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the website clustering method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram, provided by an embodiment of the present invention, of the case in which the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance;
Fig. 3 is a flowchart, provided by an embodiment of the present invention, of the method of performing data clustering processing on the second cluster centre point Mj' according to the first cluster centre point Mu' after the last update;
Fig. 4 is a flowchart, provided by an embodiment of the present invention, of the method of determining the cluster centre point corresponding to sample point Si;
Fig. 5 is a functional block diagram of the website clustering device provided by an embodiment of the present invention.
Detailed description of the embodiments
By providing a website clustering method and device, the embodiments of the present invention solve the technical problem of prior-art clustering algorithms that every iteration must calculate the distance between every pair of cluster centre points, which makes the amount of computation large and the running time long.
A first aspect of the embodiments of the present invention provides a website clustering method. Referring to Fig. 1, which is a schematic flowchart of the website clustering method provided by an embodiment of the present invention, the method includes:
101: Obtain a sample set for website clustering and the cluster centre set of the sample set, where each sample point in the sample set includes the description information of one website in the website cluster, and the description information includes at least realm information, structural information and visitor information;
To describe the technical solution of the embodiments in detail, the description information here consists of the three factors above: realm information, structural information and visitor information. In other embodiments, the visitor information in the description information may be further divided into the visitors' income information, education information, region information, religious belief information, credit rating information and so on, which is not described further here.
In this embodiment, let the sample set for website clustering be S {S1, S2 ... Sn} and the initial cluster centre set be M {M1, M2 ... Mj ... Mk}. The sample set may be user data collected by an e-commerce website or an advertising website. The initial cluster centre set may be obtained by selecting a preset number of centre points from the sample set at random, or by selecting the initial cluster centre points from the sample set with an algorithm such as an initial-distance optimization algorithm or a density estimation method, which is not described further here.
102: For each sample point in the sample set, traverse each cluster centre point in the cluster centre set in turn, determine the cluster centre point in the cluster centre set nearest to that sample point, and assign the sample point to the set corresponding to that nearest cluster centre point, thereby obtaining the cluster set corresponding to each cluster centre point in the cluster centre set;
In this step, the pairwise distances between the cluster centre points in the initial cluster centre set M may be calculated first: d11, d12 ... d(k-1)k. Then, for any sample point Si in the sample set S, where i is greater than or equal to 1 and less than or equal to n, each cluster centre point in the cluster centre set M is traversed in turn to determine the cluster centre point Mu nearest to Si; Si is assigned to the set corresponding to Mu, and the first distance Liu between sample point Si and cluster centre point Mu is saved. Proceeding in the same way yields the cluster set of every cluster centre point; for example, the cluster sets corresponding to cluster centre points M1, M2 ... Mj ... Mk are N1, N2 ... Nj ... Nk respectively.
103: Obtain the mean of the sample points in each cluster set, and update the cluster centre set according to the means;
In this step, the means of the sample points in cluster sets N1, N2 ... Nj ... Nk are calculated as M1', M2' ... Mj' ... Mk', and M1', M2' ... Mj' ... Mk' are used to update M1, M2 ... Mj ... Mk, so the updated cluster centre set M is {M1', M2' ... Mj' ... Mk'}.
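Steps 102 and 103 can be illustrated with a minimal sketch. It assumes that each sample point has already been encoded as a numeric feature vector and that distances are Euclidean; the function names `assign` and `update`, and the choice to keep the centre of an empty cluster set unchanged, are illustrative assumptions rather than details from the patent.

```python
import math

def assign(samples, centres):
    """Step 102 (sketch): map each sample to its nearest centre.

    Returns the index u of the nearest centre and the saved first distance Liu
    for every sample point.
    """
    labels, first_distances = [], []
    for s in samples:
        d = [math.dist(s, c) for c in centres]
        u = min(range(len(centres)), key=d.__getitem__)
        labels.append(u)
        first_distances.append(d[u])
    return labels, first_distances

def update(samples, labels, centres):
    """Step 103 (sketch): replace each centre with the mean of its cluster set."""
    new_centres = []
    for j, c in enumerate(centres):
        members = [s for s, label in zip(samples, labels) if label == j]
        if not members:
            new_centres.append(c)  # assumption: an empty cluster keeps its centre
        else:
            new_centres.append(tuple(sum(x) / len(members) for x in zip(*members)))
    return new_centres

samples = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [(0.0, 0.0), (10.0, 0.0)]
labels, first_distances = assign(samples, centres)
new_centres = update(samples, labels, centres)
```

The saved `first_distances` play the role of Liu in the later steps, and the movement of each centre between `centres` and `new_centres` plays the role of Tu and Tj.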
104: Obtain the predicted value of the first distance from the difference of the first cluster centre point before and after its last update, where the first distance is the distance between a sample point to be clustered and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
To improve the accuracy of data clustering, the calculation must be iterated. The current round of the clustering algorithm is calculated on the updated cluster centre set M {M1', M2' ... Mj' ... Mk'} above. The first distance Liu is the distance between the sample point Si to be clustered and the first cluster centre point Mu' after the last update, and the first cluster centre point Mu' is the cluster centre point nearest to the sample point in the clustering distance traversal.
The predicted value of the first distance for sample point Si is set to Liu = Liu + Tu, where Tu is the difference of the first cluster centre point Mu' before and after its last update, i.e. the difference between Mu' and Mu. In the embodiment of the present invention, the purpose of setting the predicted value to Liu = Liu + Tu is to guarantee that it is the maximum possible distance (an upper bound) between sample point Si and the first cluster centre point Mu' after the last update. The current clustering distance traversal is then carried out with the reset Liu = Liu + Tu.
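A small numeric sketch can show why Liu + Tu bounds the new distance from above. It assumes that the "difference" Tu is the Euclidean distance the centre moved in the last update; the helper name `predicted_first_distance` is hypothetical.

```python
import math

def predicted_first_distance(l_iu, mu_old, mu_new):
    """Upper bound on d(Si, Mu') obtained from the saved Liu and the shift Tu."""
    t_u = math.dist(mu_old, mu_new)  # Tu: how far the centre moved (assumed Euclidean)
    return l_iu + t_u

s_i = (0.0, 0.0)
mu_old = (3.0, 0.0)   # Mu, before the last update
mu_new = (3.0, 4.0)   # Mu', after the last update

l_iu = math.dist(s_i, mu_old)                          # saved first distance
bound = predicted_first_distance(l_iu, mu_old, mu_new)
actual = math.dist(s_i, mu_new)                        # what the bound avoids computing
```

By the triangle inequality the actual distance can never exceed the bound, so the bound can stand in for it when candidate centres are screened.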
In the embodiment of the present invention, the distance between sample point Si and the first cluster centre point Mu' after the last update, and the pairwise distances d11, d12 ... d(k-1)k between the cluster centre points in the initial cluster centre set, may be calculated with, but are not limited to, the following: Euclidean distance, Manhattan distance, Chebyshev distance, power distance, cosine similarity, Pearson similarity, adjusted cosine similarity, Jaccard similarity, Hamming distance, weighted Euclidean distance, correlation distance, Mahalanobis distance and so on. The embodiment of the present invention does not limit the specific method used to calculate distances.
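For orientation, a few of the listed measures can be written in a handful of lines. This is only a sketch over plain numeric tuples; the function names are illustrative and the patent does not prescribe any of these implementations.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Undefined for all-zero vectors; callers are assumed to avoid them.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a, b = (0.0, 0.0), (3.0, 4.0)
distances = (euclidean(a, b), manhattan(a, b), chebyshev(a, b))
```

One point worth keeping in mind: the pruning in steps 106 and 107 relies on the triangle inequality, which Euclidean, Manhattan and Chebyshev distances satisfy. A similarity score such as cosine similarity is not a distance in that sense, so combining it with the pruning rule would presumably need extra care.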
105: Obtain the predicted value of the third distance from the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update, where the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point still to be traversed in the current clustering distance traversal;
Here the second distance duj is the distance, calculated for the initial cluster centre set, between cluster centre point Mu and cluster centre point Mj; cluster centre point Mj is the second cluster centre point Mj' before its update, and the second cluster centre point Mj' is a cluster centre point still to be traversed in the current clustering distance traversal. Tu is the difference of the first cluster centre point Mu' before and after its last update, i.e. the difference between Mu' and Mu; Tj is the difference of the second cluster centre point Mj' before and after its last update, i.e. the difference between Mj' and Mj. Subtracting Tu and Tj from the second distance duj gives the predicted value of the third distance, (duj-Tu-Tj).
Note that calculating the predicted value of the third distance, (duj-Tu-Tj), requires only the difference of the first cluster centre point Mu' before and after its last update and the difference of the second cluster centre point Mj' before and after its last update; the pairwise distances between the cluster centre points in the updated cluster centre set M {M1', M2' ... Mj' ... Mk'} need not be calculated. This reduces the amount of computation in data clustering and improves computational efficiency.
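The saving described here can be sketched as follows: the pairwise centre distances from the previous traversal are reused, and only one shift per centre is computed after the update. The sketch assumes Euclidean distances; `predicted_third_distances` and the variable names are hypothetical.

```python
import math

def predicted_third_distances(old_pairwise, shifts):
    """Lower bounds duj - Tu - Tj on the distances between updated centres."""
    k = len(shifts)
    return [
        [0.0 if u == j else old_pairwise[u][j] - shifts[u] - shifts[j] for j in range(k)]
        for u in range(k)
    ]

old = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]   # Mu, Mj before the last update
new = [(1.0, 0.0), (10.0, 1.0), (0.0, 9.0)]    # Mu', Mj' after the last update

old_pairwise = [[math.dist(a, b) for b in old] for a in old]  # kept from the last traversal
shifts = [math.dist(a, b) for a, b in zip(old, new)]          # Tu, Tj: k distances only
predicted = predicted_third_distances(old_pairwise, shifts)

# For comparison only: the exact distances the bound makes unnecessary.
actual = [[math.dist(a, b) for b in new] for a in new]
```

Each predicted value is at most the exact distance between the updated centres, so it is safe to use in place of that distance when deciding whether a centre can be skipped.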
106: Compare the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality;
By the triangle inequality, the sum of any two sides of a triangle is greater than the third side. On this basis, the obtained predicted value Liu of the first distance is compared with the obtained predicted value of the third distance.
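The reasoning behind this comparison can be written out as a short derivation. It assumes that d is a distance satisfying the triangle inequality and that Tu and Tj denote the distances d(Mu, Mu') and d(Mj, Mj') moved by the two centres; L_iu below is the saved first distance before Tu is added.

```latex
\begin{align*}
d(S_i, M_u') &\le d(S_i, M_u) + d(M_u, M_u') = L_{iu} + T_u, \\
d(M_u', M_j') &\ge d(M_u, M_j) - d(M_u, M_u') - d(M_j, M_j') = d_{uj} - T_u - T_j, \\
d(S_i, M_j') &\ge d(M_u', M_j') - d(S_i, M_u').
\end{align*}
Hence, if $d_{uj} - T_u - T_j \ge 2\,(L_{iu} + T_u)$, then
\begin{align*}
d(S_i, M_j') \;\ge\; 2\,(L_{iu} + T_u) - (L_{iu} + T_u) \;=\; L_{iu} + T_u \;\ge\; d(S_i, M_u').
\end{align*}
```

Under these assumptions the second cluster centre point cannot be closer to the sample point than the first cluster centre point, which is why it can be skipped without computing its distance.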
107: If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discard the second cluster centre point, so that the clustering distance traversal no longer calculates the distance between the sample point and the second cluster centre point, or the distance between the second cluster centre point and the other cluster centre points still to be traversed;
When the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, i.e. (duj-Tu-Tj) is greater than or equal to 2*Liu, the distance Lij' between sample point Si and the second cluster centre point Mj' is greater than or equal to the predicted value Liu of the first distance of sample point Si, and the second cluster centre point Mj' is discarded. This amounts to filtering out every second cluster centre point in the cluster centre set whose third-distance predicted value is greater than or equal to twice the first-distance predicted value. Consequently, the current clustering distance traversal need not calculate the distance between sample point Si and the second cluster centre point Mj', or the distance between the second cluster centre point and the other cluster centre points still to be traversed. Fig. 2 is a schematic diagram, provided by an embodiment of the present invention, of the case in which the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance.
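The discard test of step 107 reduces to a single comparison. The sketch below assumes Euclidean distances and takes the saved first distance before Tu is added; `can_discard` is a hypothetical helper name.

```python
import math

def can_discard(d_uj, t_u, t_j, l_iu):
    """Step 107 test (sketch): True if Mj' can be skipped for this sample point."""
    first_pred = l_iu + t_u          # predicted first distance, Liu + Tu
    third_pred = d_uj - t_u - t_j    # predicted third distance, duj - Tu - Tj
    return third_pred >= 2 * first_pred

s_i = (0.0, 0.0)
mu_old, mu_new = (1.0, 0.0), (1.0, 1.0)    # Mu and Mu'
mj_old, mj_new = (10.0, 0.0), (9.0, 0.0)   # Mj and Mj'

discard = can_discard(
    d_uj=math.dist(mu_old, mj_old),   # second distance from the previous traversal
    t_u=math.dist(mu_old, mu_new),
    t_j=math.dist(mj_old, mj_new),
    l_iu=math.dist(s_i, mu_old),      # first distance saved in the previous traversal
)
```

In this example the test succeeds without ever measuring the distance from the sample point to Mj'; computing it anyway confirms that Mj' is indeed farther away than Mu'.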
108: Perform distance traversal on the cluster centre set from which the second cluster centre point has been discarded to obtain the cluster result of the sample set, where the cluster result includes clustering information produced by clustering each website in the website cluster along the reference dimensions of realm information, structural information and visitor information.
Of course, the cluster result obtained by distance traversal on the cluster centre set from which the second cluster centre point has been discarded may not meet the requirements, so data clustering may be repeated with the method provided by the embodiment of the present invention until a cluster result that meets the requirements is obtained, which is not described further here.
In this embodiment, once a cluster result that meets the requirements is obtained, it can include clustering information produced by clustering each website in the website cluster along the reference dimensions of realm information, structural information and visitor information, so the cluster result can provide data support for subsequent website construction. For example, if websites whose realm information is current-affairs news, whose structural information is a 3-layer flat structure and whose visitor information is more than 100,000 account for 68% of the websites in the website cluster that mainly present current-affairs news, then a newly built current-affairs news website can adopt a 3-layer flat structure as far as possible, so that it matches the habits of the relevant users and is accepted by users quickly.
It can thus be seen that, during the current clustering distance traversal and based on the most recently updated cluster centre set, the predicted value of the first distance is obtained from the difference of the first cluster centre point before and after its last update, where the first distance is the distance between a sample point to be clustered and the cluster centre point nearest to that sample point. The predicted value of the third distance is obtained from the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update, where the second distance is the distance between the first cluster centre point and the second cluster centre point in the previous clustering distance traversal, and the second cluster centre point is a cluster centre point still to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance; if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, the second cluster centre point is discarded. In the embodiment of the present invention, based on the triangle inequality, every second cluster centre point in the cluster centre set whose third-distance predicted value is greater than or equal to twice the first-distance predicted value is filtered out, so there is no need to calculate the distance between that second cluster centre point and the sample point, or the distance between that second cluster centre point and the other cluster centre points still to be traversed. This reduces the time and computation spent on those distances and improves the computational efficiency of data clustering.
In this embodiment, after the cluster result of the sample set is obtained, the method provided by the embodiment of the present invention further includes:
Analysing the cluster result to evaluate the clustering method.
In a specific implementation, analysing the cluster result to evaluate the clustering method specifically includes:
Analysing the cluster result with an entropy verification algorithm or a purity verification algorithm;
In practice, taking analysis of the cluster result with the entropy verification algorithm as an example, for a cluster i, $p_{ij}$ is calculated first, where $p_{ij}$ is the probability that a member of cluster i belongs to class j: $p_{ij} = m_{ij} / m_i$, where $m_i$ is the number of all members in cluster i and $m_{ij}$ is the number of members of cluster i that belong to class j. The entropy of each cluster can be expressed as $e_i = -\sum_{j=1}^{L} p_{ij} \log_2 p_{ij}$, where L is the number of classes. The entropy of the whole clustering is $e = \sum_{i=1}^{K} \frac{m_i}{m} e_i$, where K is the number of clusters and m is the total number of members involved in the whole clustering. In this embodiment, when the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, it is determined that the clustering method meets the preset requirement;
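The entropy check can be sketched in a few lines. The sketch assumes the usual weighted cluster-entropy definition with a base-2 logarithm and assumes that each member carries a known class label; `clustering_entropy` and the sample labels are illustrative.

```python
import math
from collections import Counter

def clustering_entropy(clusters):
    """Weighted entropy of a clustering.

    clusters: one list of class labels per cluster (the class of each member).
    """
    m = sum(len(members) for members in clusters)   # total number of members
    total = 0.0
    for members in clusters:
        m_i = len(members)
        # e_i = -sum_j p_ij * log2(p_ij), with p_ij = m_ij / m_i
        e_i = -sum((m_ij / m_i) * math.log2(m_ij / m_i)
                   for m_ij in Counter(members).values())
        total += (m_i / m) * e_i
    return total

clusters = [["news", "news"], ["shop", "news"]]
entropy = clustering_entropy(clusters)
```

A lower value indicates purer clusters; the first cluster above contributes nothing because all of its members share one class.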
Of course, the cluster result may also be analysed with the purity verification algorithm. Similarly, for a cluster i, $p_{ij}$ is calculated first, where $p_{ij}$ is the probability that a member of cluster i belongs to class j: $p_{ij} = m_{ij} / m_i$. The purity of cluster i is defined as $p_i = \max_j (p_{ij})$. The purity of the whole clustering is $purity = \sum_{i=1}^{K} \frac{m_i}{m} p_i$, where K is the number of clusters, $m_i$ is the number of all members in cluster i, and m is the total number of members involved in the whole clustering. In this embodiment, when the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value, it is determined that the clustering method meets the preset requirement.
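Purity can be sketched the same way. Because the purity of cluster i is the largest class share in that cluster, the weighted sum simplifies to the total count of majority-class members divided by m; `clustering_purity` and the sample labels are illustrative assumptions.

```python
from collections import Counter

def clustering_purity(clusters):
    """Weighted purity of a clustering.

    clusters: one list of class labels per cluster (the class of each member).
    """
    m = sum(len(members) for members in clusters)   # total number of members
    # sum_i (m_i / m) * max_j (m_ij / m_i) simplifies to sum_i max_j m_ij / m
    majority = sum(max(Counter(members).values()) for members in clusters if members)
    return majority / m

clusters = [["news", "news"], ["shop", "news"]]
purity = clustering_purity(clusters)
```

Values close to 1 mean that most members of each cluster share a single class.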
When step 106 compares the predicted value Liu of the first distance with the predicted value (duj-Tu-Tj) of the third distance according to the triangle inequality, if the predicted value (duj-Tu-Tj) of the third distance is less than twice the predicted value Liu of the first distance, the distance Lij' between sample point Si and the second cluster centre point Mj' may be less than the predicted value Liu of the first distance between sample point Si and the first cluster centre point Mu'. In that case, data clustering processing is performed on the second cluster centre point Mj' according to the first cluster centre point Mu' after the last update, to determine whether the cluster centre point corresponding to sample point Si is the first cluster centre point Mu' after the last update or the second cluster centre point Mj'. Fig. 3 is a flowchart of this method, provided by an embodiment of the present invention; the method includes:
301: Calculate the distance between the first cluster centre point after the last update and the sample point to obtain the actual value of the first distance.
The distance Liu' between the first cluster centre point Mu' after the last update and sample point Si is calculated; Liu' is the actual value of the first distance in the current clustering distance traversal. For the algorithm used to calculate the actual value Liu' of the first distance between the first cluster centre point Mu' after the last update and sample point Si, refer to the description of distance calculation under step 104 above, which is not repeated here.
302: Compare the actual value of the first distance with the predicted value of the third distance according to the triangle inequality.
Based on the triangle inequality, the actual value Liu' of the first distance is compared with the predicted value (duj-Tu-Tj) of the third distance. If the predicted value (duj-Tu-Tj) of the third distance is greater than or equal to twice the actual value Liu' of the first distance, step 303 is performed; if it is less than twice the actual value Liu' of the first distance, step 304 is performed.
303: Discard the second cluster centre point.
When the predicted value (duj-Tu-Tj) of the third distance is greater than or equal to twice the actual value Liu' of the first distance, then in the current clustering distance traversal the actual distance from sample point Si to the second cluster centre point Mj' is greater than or equal to the actual distance from sample point Si to the first cluster centre point Mu'; that is, the second cluster centre point Mj' cannot be closer to sample point Si than the first cluster centre point Mu'. The second cluster centre point Mj' is therefore discarded, and neither the distance between sample point Si and the second cluster centre point Mj' nor the distance between the second cluster centre point Mj' and the other cluster centre points still to be traversed is calculated.
304: Calculate the fourth distance, and determine whether the fourth distance is less than the actual value of the first distance.
When the predicted value (duj-Tu-Tj) of the third distance is less than twice the actual value Liu' of the first distance, then in the current clustering distance traversal the actual distance from sample point Si to the second cluster centre point Mj' may be less than the actual distance from sample point Si to the first cluster centre point Mu'; that is, the cluster centre point corresponding to sample point Si may be the second cluster centre point Mj'.
To determine whether the cluster centre point corresponding to sample point Si is the first cluster centre point Mu' or the second cluster centre point Mj', the fourth distance Lij' must be calculated, where the fourth distance Lij' is the distance between sample point Si and the second cluster centre point Mj'. If the fourth distance Lij' is less than the actual value Liu' of the first distance, step 305 is performed; if the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance, step 306 is performed.
305: Determine the second cluster centre point to be the cluster centre point nearest to the sample point in the current distance traversal.
When the fourth distance Lij' is less than the actual value Liu' of the first distance, the second cluster centre point Mj' is determined to be the cluster centre point nearest to sample point Si in the current distance traversal. In one implementation of the embodiment of the present invention, when the fourth distance Lij' is less than the actual value Liu' of the first distance and the current clustering distance traversal is complete, the second cluster centre point Mj' is assigned to the first cluster centre point Mu' after the last update, and the fourth distance Lij' is assigned to the actual value Liu' of the first distance, i.e. Liu' = Lij', Mu' = Mj'. In another implementation of the embodiment of the present invention, when the fourth distance Lij' is less than the actual value Liu' of the first distance and the current clustering distance traversal is not complete, the same assignment Liu' = Lij', Mu' = Mj' is made, and the traversal continues with the next cluster centre point in the current cluster centre set, based on the assigned first cluster centre point Mu' and the assigned actual value Liu' of the first distance, until the current cluster centre set has been fully traversed.
306: Determine the first cluster centre point after the last update to be the cluster centre point nearest to the sample point in the current distance traversal.
When the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance, the first cluster centre point Mu' after the last update is determined to be the cluster centre point nearest to sample point Si in the current distance traversal. In one implementation of the embodiment of the present invention, when the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance and the current clustering distance traversal is complete, the first cluster centre point Mu' after the last update is determined to be the cluster centre point nearest to sample point Si in the current distance traversal; when the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance and the current clustering distance traversal is not complete, the traversal continues with the next cluster centre point in the current cluster centre set, based on the first cluster centre point Mu' after the last update and the actual value Liu' of the first distance.
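Steps 301 to 306 can be read as one loop over the candidate centres for a single sample point. The following is a sketch under stated assumptions: distances are Euclidean, `old_pairwise[u][j]` holds the second distance duj kept from the previous traversal, `shifts[j]` holds Tj, and the function name and the returned count of computed distances are illustrative additions.

```python
import math

def nearest_centre(s_i, u, new_centres, old_pairwise, shifts):
    """Find the updated centre nearest to s_i, skipping centres ruled out by the bound.

    Returns (index of nearest centre, its distance, number of sample-to-centre
    distances actually computed).
    """
    best, best_d = u, math.dist(s_i, new_centres[u])   # step 301: actual value Liu'
    computed = 1
    for j in range(len(new_centres)):
        if j == u:
            continue
        third_pred = old_pairwise[best][j] - shifts[best] - shifts[j]
        if third_pred >= 2 * best_d:        # steps 302-303: discard Mj'
            continue
        d = math.dist(s_i, new_centres[j])  # step 304: fourth distance Lij'
        computed += 1
        if d < best_d:                      # step 305: Mu' = Mj', Liu' = Lij'
            best, best_d = j, d
        # step 306: otherwise keep the current Mu' and continue the traversal
    return best, best_d, computed

old = [(0.0, 0.0), (10.0, 0.0), (2.0, 0.0)]
new = [(0.0, 1.0), (10.0, 1.0), (2.0, 1.0)]
old_pairwise = [[math.dist(a, b) for b in old] for a in old]
shifts = [math.dist(a, b) for a, b in zip(old, new)]

result = nearest_centre((0.5, 1.0), 0, new, old_pairwise, shifts)
```

In this example the distant centre at index 1 is discarded by the bound alone, so only two of the three sample-to-centre distances are computed.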
In a specific implementation, the embodiment of the present invention combines Fig. 1 and Fig. 3 to determine the cluster centre point corresponding to sample point Si. Fig. 4 is a flowchart, provided by an embodiment of the present invention, of the method of determining the cluster centre point corresponding to sample point Si; the method includes:
401: Obtain the predicted value of the first distance from the difference of the first cluster centre point before and after its last update.
402: Obtain the predicted value of the third distance from the second distance, the difference of the first cluster centre point before and after its last update, and the difference of the second cluster centre point before and after its last update.
403: Compare the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality.
If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, step 404 is performed; if the predicted value of the third distance is less than twice the predicted value of the first distance, step 405 is performed.
404: Discard the second cluster centre point.
405: Perform data clustering processing on the second cluster centre point according to the first cluster centre point after the last update.
For the implementation of performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update, refer to the detailed description of Fig. 3, which is not repeated here.
Further, before step 304 calculates the fourth distance, the fifth distance duj' is calculated, where the fifth distance is the distance between the second cluster centre point Mj' and the first cluster centre point Mu' after the last update. The actual value Liu' of the first distance is compared with the fifth distance duj' according to the triangle inequality. When the fifth distance duj' is greater than or equal to twice the actual value Liu' of the first distance, the second cluster centre point is discarded, and neither the distance between sample point Si and the second cluster centre point Mj' nor the distance between the second cluster centre point Mj' and the other cluster centre points still to be traversed is calculated; when the fifth distance duj' is less than twice the actual value Liu' of the first distance, step 304 continues.
Note that in actual operation, steps 301 to 303 can discard most of the cluster centre points in cluster centre set M, namely those whose distance to sample point Si is greater than or equal to the actual value Liu' of the first distance; the cluster centre points left in cluster centre set M are those whose distance to sample point Si may be less than the actual value Liu' of the first distance. For example, suppose cluster centre set M contains 1000 cluster centre points. Steps 301 to 303 can discard the 800 cluster centre points whose distance to sample point Si is greater than or equal to the actual value Liu' of the first distance, leaving 200 cluster centre points in cluster centre set M. The fifth distance between each of the remaining 200 second cluster centre points Mj' and the first cluster centre point Mu' after the last update is then calculated; where the fifth distance duj' is greater than or equal to twice the actual value Liu' of the first distance, the second cluster centre point Mj' is discarded, say 150 of them, leaving 50 cluster centre points in cluster centre set M. The distances between sample point Si and the remaining 50 cluster centre points are then calculated to determine the cluster centre point nearest to sample point Si. Note that in actual operation, calculating the fifth distance duj' between two cluster centre points in cluster centre set M takes less computation and less time than calculating the fourth distance Lij' between sample point Si and the second cluster centre point Mj'. Based on the triangle inequality, the embodiment of the present invention thus discards, in two passes, the cluster centre points in cluster centre set M whose distance to sample point Si is greater than or equal to the actual value Liu' of the first distance, which further reduces the computation of distances between sample point Si and the second cluster centre points Mj'.
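The second pass, which uses the exact fifth distance duj' between updated centres, can be sketched as a simple filter. The sketch assumes Euclidean distances; `surviving_candidates` and the sample coordinates are hypothetical.

```python
import math

def surviving_candidates(u, centres, candidates, l_iu_actual):
    """Keep the candidates j whose fifth distance duj' is below 2 * Liu'."""
    return [j for j in candidates
            if math.dist(centres[u], centres[j]) < 2 * l_iu_actual]

centres = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]   # updated centres
s_i = (0.6, 0.0)
u = 0                                        # current first cluster centre point
l_iu_actual = math.dist(s_i, centres[u])     # actual value Liu'

kept = surviving_candidates(u, centres, [1, 2, 3], l_iu_actual)

# Only the surviving candidates need a sample-to-centre distance.
nearest = min([u] + kept, key=lambda j: math.dist(s_i, centres[j]))
```

Here two of the three candidates are dropped by the centre-to-centre test, and the final comparison touches only the one that remains.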
Further, as a refinement and extension of the above embodiment, the predicted value (duj-Tu-Tj) of the third distance in step 105 above may be obtained in, but is not limited to, the following way: obtain the value Mu of the first cluster centre point Mu' before its last update and the value Mu' after the update, and calculate the first difference Tu, where Tu is the difference between Mu' and Mu; obtain the value Mj of the second cluster centre point Mj' before its last update and the value Mj' after the update, and calculate the second difference Tj, where Tj is the difference between Mj' and Mj; subtract the first difference Tu and the second difference Tj from the second distance duj to obtain the predicted value (duj-Tu-Tj) of the third distance.
Further, after step 404 is performed, it is determined whether the current clustering distance traversal is complete. If it is not complete, the traversal continues with the next cluster centre point in the current cluster centre set; if it is complete, the first cluster centre point Mu' after the last update is determined to be the cluster centre point nearest to the sample point in the current distance traversal.
After the cluster centre point corresponding to sample point Si is determined, proceeding in the same way yields the cluster sets N1', N2' ... Nj' ... Nk' corresponding to M1', M2' ... Mj' ... Mk' in cluster centre set M. The differences O1, O2 ... Oj ... Ok between the cluster sets N1, N2 ... Nj ... Nk obtained in step 102 and the cluster sets N1', N2' ... Nj' ... Nk' determined by the current clustering distance traversal are calculated, and it is judged whether the differences O1, O2 ... Oj ... Ok meet a preset error threshold. If they do, the cluster sets N1', N2' ... Nj' ... Nk' determined by the current clustering distance traversal are taken as the final data clustering result; if they do not, data clustering is repeated with the method of the embodiment of the present invention described above until the final data clustering result is determined. In this embodiment, the preset error threshold is set according to actual requirements; where fine-grained data clustering is required, a smaller preset error threshold is set, for example 1 or 0. The embodiment of the present invention does not limit the specific value of the preset error threshold.
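The stopping test can be sketched as below. The patent does not say how the difference Oj between two cluster sets is measured, so this sketch assumes it is the number of sample indices that joined or left cluster j, and it assumes that "meeting" the threshold means not exceeding it; both function names are hypothetical.

```python
def cluster_set_differences(prev_sets, curr_sets):
    """Oj for each centre: how many sample indices joined or left cluster j."""
    return [len(set(p) ^ set(c)) for p, c in zip(prev_sets, curr_sets)]

def converged(prev_sets, curr_sets, threshold=0):
    """True when every difference Oj is within the preset error threshold."""
    return all(o <= threshold
               for o in cluster_set_differences(prev_sets, curr_sets))

prev_sets = [{0, 1}, {2, 3}]      # N1, N2 from the previous traversal
curr_sets = [{0, 1, 2}, {3}]      # N1', N2' from the current traversal
diffs = cluster_set_differences(prev_sets, curr_sets)
```

With a threshold of 0 the iteration stops only when no sample point changes cluster, which corresponds to the strictest setting mentioned above.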
Based on the same inventive concept, a second aspect of the embodiments of the present invention further provides a device for website clustering. Referring to Fig. 5, which is a functional block diagram of the website clustering device provided by an embodiment of the present invention, the device includes:
Obtaining unit 501, configured to obtain a sample set for website clustering and the cluster centre set of the sample set, where each sample point in the sample set includes corresponding personal description information in the website clustering, and the description information includes at least age information, gender information, preference information and spending amount information;
Cluster set obtaining unit 502, configured to, for each sample point in the sample set, traverse each cluster centre point in the cluster centre set in turn, determine the cluster centre point in the cluster centre set nearest to that sample point, and assign the sample point to the set corresponding to that nearest cluster centre point, thereby obtaining the cluster set corresponding to each cluster centre point in the cluster centre set;
Average value obtaining unit 503, configured to obtain the average value of the sample points in each cluster set and update the cluster centre set according to the average value;
First acquisition unit 504, configured to obtain the predicted value of a first distance according to the difference of the first cluster centre point itself before and after its last update; where the first distance is the distance between a sample point on which data clustering is to be performed and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
Second acquisition unit 505, configured to obtain the predicted value of a third distance according to a second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of a second cluster centre point itself before and after its last update; where the second distance is the distance between the first cluster centre point and the second cluster centre point in the last clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal;
Comparing unit 506, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquisition unit 504 with the predicted value of the third distance obtained by the second acquisition unit 505;
Discarding unit 507, configured to discard the second cluster centre point when the comparing unit 506 finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
Cluster result obtaining unit 508, configured to perform the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded and obtain the cluster result of the sample set, where the cluster result includes the clustering information obtained by clustering each website in the website clustering with the age information, the gender information, the preference information and the spending amount information as reference dimensions.
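The discard test carried out by units 504 to 507 can be illustrated with a short Python sketch. It rests on assumptions that the text does not spell out: distances are taken to be Euclidean, and the predicted first distance is taken to be the previous sample-to-centre distance plus the distance the first centre moved, which is an upper bound by the triangle inequality. The text only states that the prediction is derived from the centre's difference before and after its last update. All function names are hypothetical.

```python
import math


def centre_shift(before, after):
    """Difference of a cluster centre point before and after its last update."""
    return math.dist(before, after)


def predict_first_distance(previous_first_distance, first_shift):
    """Assumed upper bound on the sample-to-first-centre distance after the update."""
    return previous_first_distance + first_shift


def should_discard(third_distance_prediction, first_distance_prediction):
    """Discard the second centre when the third distance is at least twice the first."""
    return third_distance_prediction >= 2 * first_distance_prediction
```

When `should_discard` returns True, the second cluster centre point cannot be closer to the sample than the first one, so its distance to the sample need not be computed.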
Further, the device also includes:
Analytic unit 509, configured to analyse the cluster result after the cluster result obtaining unit 508 obtains it, so as to evaluate the clustering method.
Further, the analytic unit 509 is specifically configured to analyse the cluster result by an entropy verification algorithm or a purity verification algorithm, where the clustering method is determined to meet a preset requirement when the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, or when the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value.
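As an illustration of the entropy and purity checks, the Python sketch below uses the commonly used definitions: purity is the fraction of samples that belong to the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy. It assumes reference class labels are available for the samples, which the text does not state, and the two preset values (0.5 and 0.8) are placeholder assumptions. All names are hypothetical.

```python
import math
from collections import Counter


def _class_counts_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to a Counter of the reference classes it contains."""
    counts = {}
    for cluster_id, label in zip(cluster_ids, true_labels):
        counts.setdefault(cluster_id, Counter())[label] += 1
    return counts


def purity(cluster_ids, true_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    counts = _class_counts_per_cluster(cluster_ids, true_labels)
    return sum(max(c.values()) for c in counts.values()) / len(cluster_ids)


def entropy(cluster_ids, true_labels):
    """Size-weighted average of the class entropy (in bits) of each cluster."""
    counts = _class_counts_per_cluster(cluster_ids, true_labels)
    total = len(cluster_ids)
    result = 0.0
    for class_counts in counts.values():
        size = sum(class_counts.values())
        cluster_entropy = -sum(
            (n / size) * math.log2(n / size) for n in class_counts.values()
        )
        result += (size / total) * cluster_entropy
    return result


def meets_preset_requirement(cluster_ids, true_labels,
                             first_preset=0.5, second_preset=0.8):
    """Entropy below the first preset value, or purity above the second one."""
    return (entropy(cluster_ids, true_labels) < first_preset
            or purity(cluster_ids, true_labels) > second_preset)
```

Lower entropy and higher purity both indicate clusters that align more closely with the reference classes.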
Further, the device also includes:
Processing unit 510, configured to perform data clustering processing on the second cluster centre point according to the first cluster centre point after the last update when the comparing unit 506 finds that the predicted value of the third distance is less than twice the predicted value of the first distance.
Further, the processing unit 510 is specifically configured to: calculate the distance between the first cluster centre point after the last update and the sample point to obtain the actual value of the first distance;
Compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first processing unit 510 with the predicted value of the third distance;
When the first comparison module finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
When the first comparison module finds that the predicted value of the third distance is less than twice the actual value of the first distance, calculate a fourth distance, where the fourth distance is the distance between the sample point and the second cluster centre point;
Determine whether the fourth distance calculated by the second processing unit 510 is less than the actual value of the first distance;
When the first determining module determines that the fourth distance is less than the actual value of the first distance, determine the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal;
When the first determining module determines that the fourth distance is greater than or equal to the actual value of the first distance, determine the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
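The decision flow of processing unit 510 for a single candidate centre can be sketched in Python as follows. This is a minimal sketch assuming Euclidean distance; the function name `process_candidate` and its return convention (nearest centre, its distance, whether the candidate was discarded) are hypothetical and chosen only for illustration.

```python
import math


def process_candidate(sample, first_centre, second_centre, third_distance_prediction):
    """Handle one candidate (second) centre once the predicted-value test has failed."""
    # Actual value of the first distance: sample to the updated first centre.
    first_actual = math.dist(sample, first_centre)

    # Second chance to discard, now using the actual first distance.
    if third_distance_prediction >= 2 * first_actual:
        return first_centre, first_actual, True

    # Fourth distance: sample to the candidate centre.
    fourth = math.dist(sample, second_centre)
    if fourth < first_actual:
        # The candidate becomes the nearest centre for this traversal.
        return second_centre, fourth, False
    return first_centre, first_actual, False
```

The actual first distance is never larger than its predicted upper bound, so this second comparison can discard candidates that the predicted-value test let through.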
Further, the processing unit 510 is further specifically configured to:
When the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assign the second cluster centre point to the first cluster centre point after the last update, and assign the fourth distance to the actual value of the first distance;
When the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assign the second cluster centre point to the first cluster centre point after the last update, assign the fourth distance to the actual value of the first distance, and continue to traverse the next cluster centre point in the current cluster centre set based on the reassigned first cluster centre point and the reassigned actual value of the first distance.
Further, the processing unit 510 is further specifically configured to:
When the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determine the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal;
When the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continue to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
Further, the processing unit 510 is further specifically configured to:
Before the second processing unit 510 calculates the fourth distance, calculate a fifth distance, where the fifth distance is the distance between the second cluster centre point and the first cluster centre point after the last update;
Compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first processing unit 510 with the fifth distance calculated by the third processing unit 510;
When the second comparison module finds that the fifth distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that during the clustering traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
When the second comparison module finds that the fifth distance is less than twice the actual value of the first distance, perform the calculation of the fourth distance.
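The fifth-distance test can be sketched in Python as below, assuming Euclidean distance; the function name is hypothetical. The test is safe because, by the triangle inequality, d(sample, second) >= d(first, second) - d(sample, first), so when d(first, second) >= 2 * d(sample, first) the second centre is at least as far from the sample as the first centre.

```python
import math


def fifth_distance_allows_discard(sample, first_centre, second_centre):
    """True when the second centre can be skipped without computing the fourth distance."""
    first_actual = math.dist(sample, first_centre)   # actual value of the first distance
    fifth = math.dist(second_centre, first_centre)   # fifth distance, centre to centre
    return fifth >= 2 * first_actual
```

In this sketch the fifth distance depends only on the two centres, not on the sample, so an implementation could compute it once per pair of centres and reuse it for every sample.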
Further, the second acquisition unit 505 is specifically configured to:
Obtain the value of the first cluster centre point before its last update and its value after the update, and calculate a first difference between the first cluster centre point before and after the update;
Obtain the value of the second cluster centre point before its last update and its value after the update, and calculate a second difference between the second cluster centre point before and after the update;
Subtract from the second distance the first difference calculated by the first processing module and the second difference calculated by the second processing module, to obtain the predicted value of the third distance.
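The subtraction described here can be sketched in Python as follows, assuming that the differences are Euclidean distances between a centre's old and new positions; the function names are hypothetical. Under that assumption the result is a lower bound on the true distance between the two updated centres, which is what makes the later "at least twice" comparison safe.

```python
import math


def predict_third_distance(second_distance, first_difference, second_difference):
    """Second distance minus both centre shifts: a lower bound on the new centre gap."""
    return second_distance - first_difference - second_difference


def predict_from_positions(first_old, first_new, second_old, second_new):
    """Convenience wrapper that derives the three inputs from centre coordinates."""
    second_distance = math.dist(first_old, second_old)
    first_difference = math.dist(first_old, first_new)
    second_difference = math.dist(second_old, second_new)
    return predict_third_distance(second_distance, first_difference, second_difference)
```

If the centres moved a lot, the prediction can become small or negative, in which case no candidate is discarded on its basis and the actual distances are computed instead.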
Further, the device also includes:
Judging unit 511, configured to judge, after the discarding unit 507 discards the second cluster centre point, whether the current clustering distance traversal is complete;
Traversal unit 512, configured to continue to traverse the next cluster centre point in the current cluster centre set when the judging unit 511 judges that the traversal is not complete;
Determining unit 513, configured to determine, when the judging unit 511 judges that the traversal is complete, the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
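The interplay of the discarding, judging, traversal and determining units can be sketched as a single loop in Python. This is a sketch under assumptions: distances are Euclidean, and `centre_lower_bounds[i][j]` is a hypothetical matrix holding a lower bound on the distance between centres i and j (for example the predicted third distance). The function name and return convention are also hypothetical.

```python
import math


def assign_with_pruning(sample, centres, first_index, centre_lower_bounds):
    """Find the nearest centre for one sample, skipping centres ruled out by the bounds."""
    best = first_index
    best_distance = math.dist(sample, centres[best])
    discarded = []
    for j in range(len(centres)):          # traverse the cluster centre set
        if j == best:
            continue
        if centre_lower_bounds[best][j] >= 2 * best_distance:
            discarded.append(j)            # discard: no sample-to-centre distance computed
            continue                       # traversal not complete, move to the next centre
        candidate_distance = math.dist(sample, centres[j])
        if candidate_distance < best_distance:
            best, best_distance = j, candidate_distance
    # Traversal complete: the current best is the nearest cluster centre point.
    return best, best_distance, discarded
```

Passing the exact centre-to-centre distances as `centre_lower_bounds` is always valid, since an exact distance is trivially a lower bound on itself.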
In the website clustering device provided by the embodiment of the present invention, during the current clustering distance traversal and based on the last updated cluster centre set, the predicted value of the first distance is obtained according to the difference of the first cluster centre point itself before and after its last update, the first distance being the distance between the sample point on which data clustering is to be performed and the cluster centre point nearest to that sample point. The predicted value of the third distance is obtained according to the second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of the second cluster centre point itself before and after its last update, where the second distance is the distance between the first cluster centre point and the second cluster centre point in the last clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the former is greater than or equal to twice the latter, the second cluster centre point is discarded. In the present invention, based on the triangle inequality rule, every second cluster centre point in the cluster centre set whose predicted third distance is greater than or equal to twice the predicted first distance is filtered out, so there is no need to calculate the distance between that second cluster centre point and the sample point, nor the distance between that second cluster centre point and the other cluster centre points to be traversed. This reduces the time and computation spent on those distance calculations and improves the computational efficiency of data clustering.
The technical solutions in the embodiments of the present invention have at least the following technical effects or advantages:
1. With the website clustering method and device provided by the embodiments of the present invention, the obtained cluster result can include the clustering information obtained by clustering each website in the website clustering with the field information, the structural information and the visitor information as reference dimensions, so that the obtained cluster result can provide data support for subsequent web hosting.
2. With the website clustering method and device provided by the embodiments of the present invention, during the current clustering distance traversal and based on the last updated cluster centre set, the predicted value of the first distance is obtained according to the difference of the first cluster centre point itself before and after its last update, the first distance being the distance between the sample point on which data clustering is to be performed and the cluster centre point nearest to that sample point. The predicted value of the third distance is obtained according to the second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of the second cluster centre point itself before and after its last update, where the second distance is the distance between the first cluster centre point and the second cluster centre point in the last clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the former is greater than or equal to twice the latter, the second cluster centre point is discarded. In the present invention, based on the triangle inequality rule, every second cluster centre point in the cluster centre set whose predicted third distance is greater than or equal to twice the predicted first distance is filtered out, so there is no need to calculate the distance between that second cluster centre point and the sample point, nor the distance between that second cluster centre point and the other cluster centre points to be traversed. This reduces the time and computation spent on those distance calculations and improves the computational efficiency of data clustering.
The embodiment of the invention discloses:
A1. A method for website clustering, characterised by comprising:
Obtaining a sample set for website clustering and the cluster centre set of the sample set, where each sample point in the sample set includes the description information of a website in the website clustering, and the description information includes at least field information, structural information and visitor information;
For each sample point in the sample set, traversing each cluster centre point in the cluster centre set in turn, determining the cluster centre point in the cluster centre set nearest to that sample point, and assigning the sample point to the set corresponding to that nearest cluster centre point, thereby obtaining the cluster set corresponding to each cluster centre point in the cluster centre set;
Obtaining the average value of the sample points in each cluster set, and updating the cluster centre set according to the average value;
Obtaining the predicted value of a first distance according to the difference of the first cluster centre point itself before and after its last update; where the first distance is the distance between a sample point on which data clustering is to be performed and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
Obtaining the predicted value of a third distance according to a second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of a second cluster centre point itself before and after its last update; where the second distance is the distance between the first cluster centre point and the second cluster centre point in the last clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal;
Comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discarding the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
Performing the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded, and obtaining the cluster result of the sample set, where the cluster result includes the clustering information obtained by clustering each website in the website clustering with the field information, the structural information and the visitor information as reference dimensions.
A2. The method according to A1, characterised in that, after the cluster result of the sample set is obtained, the method further comprises:
Analysing the cluster result so as to evaluate the clustering method.
A3. The method according to A2, characterised in that analysing the cluster result so as to evaluate the clustering method specifically comprises:
Analysing the cluster result by an entropy verification algorithm or a purity verification algorithm;
When the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, determining that the clustering method meets a preset requirement; or
When the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value, determining that the clustering method meets the preset requirement.
A4. The method according to A1, characterised in that the method further comprises:
If the predicted value of the third distance is less than twice the predicted value of the first distance, performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update.
A5. The method according to A3, characterised in that performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update comprises:
Calculating the distance between the first cluster centre point after the last update and the sample point to obtain the actual value of the first distance;
Comparing the actual value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
If the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
If the predicted value of the third distance is less than twice the actual value of the first distance, calculating a fourth distance and determining whether the fourth distance is less than the actual value of the first distance; where the fourth distance is the distance between the sample point and the second cluster centre point;
If the fourth distance is less than the actual value of the first distance, determining the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal;
If the fourth distance is greater than or equal to the actual value of the first distance, determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
A6. The method according to A5, characterised in that determining the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal comprises:
If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assigning the second cluster centre point to the first cluster centre point after the last update, and assigning the fourth distance to the actual value of the first distance;
If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assigning the second cluster centre point to the first cluster centre point after the last update, assigning the fourth distance to the actual value of the first distance, and continuing to traverse the next cluster centre point in the current cluster centre set based on the reassigned first cluster centre point and the reassigned actual value of the first distance.
A7. The method according to A5, characterised in that determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal comprises:
If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal;
If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continuing to traverse the next cluster centre point in the current cluster centre set based on the first cluster centre point after the last update and the actual value of the first distance.
A8. The method according to A6 or A7, characterised in that, before the fourth distance is calculated, the method further comprises:
Calculating a fifth distance, where the fifth distance is the distance between the second cluster centre point and the first cluster centre point after the last update;
Comparing the actual value of the first distance with the fifth distance according to the triangle inequality rule;
If the fifth distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster centre point, so that during the clustering traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
Calculating the fourth distance comprises:
If the fifth distance is less than twice the actual value of the first distance, performing the calculation of the fourth distance.
A9. The method according to any one of A1-A7, characterised in that obtaining the predicted value of the third distance according to the second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of the second cluster centre point itself before and after its last update comprises:
Obtaining the value of the first cluster centre point before its last update and its value after the update, and calculating a first difference between the first cluster centre point before and after the update;
Obtaining the value of the second cluster centre point before its last update and its value after the update, and calculating a second difference between the second cluster centre point before and after the update;
Subtracting the first difference and the second difference from the second distance to obtain the predicted value of the third distance.
A10. The method according to A7, characterised in that, after the second cluster centre point is discarded and before the distance traversal is performed based on the cluster centre set from which the second cluster centre point has been discarded to obtain the cluster result of the sample set, the method further comprises:
Judging whether the current clustering distance traversal is complete;
If the traversal is not complete, continuing to traverse the next cluster centre point in the current cluster centre set;
If the traversal is complete, determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
B11. A device for website clustering, characterised by comprising:
An obtaining unit, configured to obtain a sample set for website clustering and the cluster centre set of the sample set, where each sample point in the sample set includes the description information of a website in the website clustering, and the description information includes at least field information, structural information and visitor information;
A cluster set obtaining unit, configured to, for each sample point in the sample set, traverse each cluster centre point in the cluster centre set in turn, determine the cluster centre point in the cluster centre set nearest to that sample point, and assign the sample point to the set corresponding to that nearest cluster centre point, thereby obtaining the cluster set corresponding to each cluster centre point in the cluster centre set;
An average value obtaining unit, configured to obtain the average value of the sample points in each cluster set and update the cluster centre set according to the average value;
A first acquisition unit, configured to obtain the predicted value of a first distance according to the difference of the first cluster centre point itself before and after its last update; where the first distance is the distance between a sample point on which data clustering is to be performed and the first cluster centre point, and the first cluster centre point is the cluster centre point nearest to the sample point in the clustering distance traversal;
A second acquisition unit, configured to obtain the predicted value of a third distance according to a second distance, the difference of the first cluster centre point itself before and after its last update, and the difference of a second cluster centre point itself before and after its last update; where the second distance is the distance between the first cluster centre point and the second cluster centre point in the last clustering distance traversal, and the second cluster centre point is a cluster centre point to be traversed in the current clustering distance traversal;
A comparing unit, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquisition unit with the predicted value of the third distance obtained by the second acquisition unit;
A discarding unit, configured to discard the second cluster centre point when the comparing unit finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
A cluster result obtaining unit, configured to perform the distance traversal based on the cluster centre set from which the second cluster centre point has been discarded and obtain the cluster result of the sample set, where the cluster result includes the clustering information obtained by clustering each website in the website clustering with the field information, the structural information and the visitor information as reference dimensions.
B12. The device according to B11, characterised in that the device further comprises:
An analytic unit, configured to analyse the cluster result after the cluster result obtaining unit obtains it, so as to evaluate the clustering method.
B13. The device according to B12, characterised in that the analytic unit is specifically configured to analyse the cluster result by an entropy verification algorithm or a purity verification algorithm, where the clustering method is determined to meet a preset requirement when the entropy of the cluster result obtained by the entropy verification algorithm is less than a first preset value, or when the purity of the cluster result obtained by the purity verification algorithm is greater than a second preset value.
B14. The device according to B11, characterised in that the device further comprises:
A processing unit, configured to perform data clustering processing on the second cluster centre point according to the first cluster centre point after the last update when the comparing unit finds that the predicted value of the third distance is less than twice the predicted value of the first distance.
B15. The device according to B14, characterised in that the processing unit is specifically configured to: calculate the distance between the first cluster centre point after the last update and the sample point to obtain the actual value of the first distance;
Compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module with the predicted value of the third distance;
When the first comparison module finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discard the second cluster centre point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster centre point nor the distance between the second cluster centre point and the other cluster centre points to be traversed is calculated;
When the first comparison module finds that the predicted value of the third distance is less than twice the actual value of the first distance, calculate a fourth distance, where the fourth distance is the distance between the sample point and the second cluster centre point;
Determine whether the fourth distance calculated by the second computing module is less than the actual value of the first distance;
When the first determining module determines that the fourth distance is less than the actual value of the first distance, determine the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal;
When the first determining module determines that the fourth distance is greater than or equal to the actual value of the first distance, determine the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
B16. The device according to B14, characterized in that the computing module is further specifically configured to:
when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assign the second cluster center point to the first cluster center point after the last update, and assign the fourth distance to the actual value of the first distance;
when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assign the second cluster center point to the first cluster center point after the last update, assign the fourth distance to the actual value of the first distance, and continue, based on the first cluster center point and the actual value of the first distance after the assignment, to traverse the next cluster center point in the current cluster center set.
B17. The device according to B14, characterized in that the computing module is further specifically configured to:
when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determine the first cluster center point after the last update as the cluster center point nearest to the sample point in the current distance traversal;
when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continue, based on the first cluster center point after the last update and the actual value of the first distance, to traverse the next cluster center point in the current cluster center set.
B18. The device according to B16 or B17, characterized in that the processing unit is further specifically configured to:
before the second computing module calculates the fourth distance, calculate a fifth distance, the fifth distance being the distance between the second cluster center point and the first cluster center point after the last update;
compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module with the fifth distance calculated by the third computing module;
when the second comparison module finds that the fifth distance is greater than or equal to twice the actual value of the first distance, discard the second cluster center point, so that during the cluster traversal neither the distance between the sample point and the second cluster center point nor the distances between the second cluster center point and the other cluster center points still to be traversed are calculated;
when the second comparison module finds that the fifth distance is less than twice the actual value of the first distance, perform the calculation of the fourth distance.
B19. The device according to any one of B11-B17, characterized in that the second acquisition unit is specifically configured to:
obtain the value of the first cluster center point before its last update and its value after the update, and calculate a first difference between the first cluster center point before and after the update;
obtain the value of the second cluster center point before its last update and its value after the update, and calculate a second difference between the second cluster center point before and after the update;
subtract, from the second distance, the first difference calculated by the first processing module and the second difference calculated by the second processing module, to obtain the predicted value of the third distance.
B20. The device according to B17, characterized in that the device further comprises:
a judging unit, configured to judge, after the discarding unit discards the second cluster center point, whether the current clustering distance traversal is complete;
a traversal unit, configured to continue traversing the next cluster center point in the current cluster center set when the judging unit judges that the traversal is not complete;
a determining unit, configured to, when the judging unit judges that the traversal is complete, determine the first cluster center point after the last update as the cluster center point nearest to the sample point in the current distance traversal.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (10)

1. A website clustering method, characterized by comprising:
obtaining a sample set for website clustering and a cluster center set of the sample set, wherein each sample point in the sample set includes description information of a website involved in the website clustering, the description information including at least field information, structure information, and visitor information;
for each sample point in the sample set, traversing in turn each cluster center point in the cluster center set, determining the cluster center point in the cluster center set nearest to the sample point, and assigning the sample point to the set corresponding to that nearest cluster center point, thereby obtaining the cluster set corresponding to each cluster center point in the cluster center set;
obtaining the average value of the sample points in each cluster set, and updating the cluster center set according to the average value;
obtaining a predicted value of a first distance according to the difference of a first cluster center point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster center point, and the first cluster center point is the cluster center point nearest to the sample point in the clustering distance traversal;
obtaining a predicted value of a third distance according to a second distance, the difference of the first cluster center point before and after its last update, and the difference of a second cluster center point before and after its last update, wherein the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is a cluster center point still to be traversed in the current clustering distance traversal;
comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discarding the second cluster center point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distances between the second cluster center point and the other cluster center points still to be traversed are calculated;
performing the distance traversal based on the cluster center set from which the second cluster center point has been discarded, to obtain a clustering result of the sample set, the clustering result including clustering information obtained by clustering the websites with the field information, the structure information, and the visitor information as reference dimensions.
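The factor of two in the discard test follows from the triangle inequality. A brief sketch, using notation introduced here rather than taken from the claims ($x$ for the sample point, $c_1$ for the first cluster center point, $c_2$ for the second cluster center point, $d$ for the distance measure):

```latex
% x: sample point; c_1: current nearest center; c_2: candidate center; d: distance measure.
\begin{align*}
d(c_1, c_2) &\le d(x, c_1) + d(x, c_2) && \text{(triangle inequality)} \\
d(x, c_2) &\ge d(c_1, c_2) - d(x, c_1) \\
          &\ge 2\,d(x, c_1) - d(x, c_1) = d(x, c_1) && \text{if } d(c_1, c_2) \ge 2\,d(x, c_1).
\end{align*}
```

So when the distance between the two centers is at least twice the distance from the sample point to its current nearest center, the second center cannot be nearer and its distance to the sample point need not be computed. The conclusion still holds if $d(c_1, c_2)$ is replaced by any value no larger than it and $d(x, c_1)$ by any value no smaller than it, which is presumably why predicted values can stand in for exact distances. The claim does not state how the predictions are bounded, so that reading is an assumption.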
2. The method according to claim 1, characterized in that, after the clustering result of the sample set is obtained, the method further comprises:
analyzing the clustering result to evaluate the clustering method.
3. The method according to claim 2, characterized in that analyzing the clustering result to evaluate the clustering method specifically comprises:
analyzing the clustering result by an entropy verification algorithm or a purity verification algorithm;
when the entropy of the clustering result obtained by the entropy verification algorithm is less than a first preset value, determining that the clustering method meets a preset requirement; or
when the purity of the clustering result obtained by the purity verification algorithm is greater than a second preset value, determining that the clustering method meets the preset requirement.
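The claims do not define how entropy and purity are computed. As a minimal sketch, the commonly used definitions can be written as follows, assuming that a reference category is available for each website (the function names, the `true_labels` argument, and the base-2 logarithm are choices made here for illustration):

```python
import math
from collections import Counter, defaultdict


def _group_labels(cluster_ids, true_labels):
    """Collect the reference labels of the samples that fall in each cluster."""
    groups = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        groups[cluster_id].append(label)
    return groups


def purity(cluster_ids, true_labels):
    """Fraction of samples that belong to the majority label of their cluster."""
    groups = _group_labels(cluster_ids, true_labels)
    majority_total = sum(max(Counter(labels).values()) for labels in groups.values())
    return majority_total / len(cluster_ids)


def entropy(cluster_ids, true_labels):
    """Size-weighted average of the label entropy (in bits) inside each cluster."""
    groups = _group_labels(cluster_ids, true_labels)
    total = len(cluster_ids)
    result = 0.0
    for labels in groups.values():
        size = len(labels)
        cluster_entropy = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(labels).values()
        )
        result += (size / total) * cluster_entropy
    return result
```

Under these definitions a lower entropy and a higher purity both indicate clusters that agree more closely with the reference categories, which matches the direction of the two threshold tests in claim 3.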
4. The method according to claim 1, characterized in that the method further comprises:
if the predicted value of the third distance is less than twice the predicted value of the first distance, performing data clustering processing on the second cluster center point according to the first cluster center point after the last update.
5. The method according to claim 3, characterized in that performing data clustering processing on the second cluster center point according to the first cluster center point after the last update comprises:
calculating the distance between the first cluster center point after the last update and the sample point, to obtain the actual value of the first distance;
comparing the actual value of the first distance with the predicted value of the third distance according to the triangle inequality rule;
if the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster center point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distances between the second cluster center point and the other cluster center points still to be traversed are calculated;
if the predicted value of the third distance is less than twice the actual value of the first distance, calculating a fourth distance and determining whether the fourth distance is less than the actual value of the first distance, wherein the fourth distance is the distance between the sample point and the second cluster center point;
if the fourth distance is less than the actual value of the first distance, determining the second cluster center point as the cluster center point nearest to the sample point in the current distance traversal;
if the fourth distance is greater than or equal to the actual value of the first distance, determining the first cluster center point after the last update as the cluster center point nearest to the sample point in the current distance traversal.
6. The method according to claim 5, characterized in that determining the second cluster center point as the cluster center point nearest to the sample point in the current distance traversal comprises:
if the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is complete, assigning the second cluster center point to the first cluster center point after the last update, and assigning the fourth distance to the actual value of the first distance;
if the fourth distance is less than the actual value of the first distance and the current clustering distance traversal is not complete, assigning the second cluster center point to the first cluster center point after the last update, assigning the fourth distance to the actual value of the first distance, and continuing, based on the first cluster center point and the actual value of the first distance after the assignment, to traverse the next cluster center point in the current cluster center set.
7. The method according to claim 5, characterized in that determining the first cluster center point after the last update as the cluster center point nearest to the sample point in the current distance traversal comprises:
if the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is complete, determining the first cluster center point after the last update as the cluster center point nearest to the sample point in the current distance traversal;
if the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal is not complete, continuing, based on the first cluster center point after the last update and the actual value of the first distance, to traverse the next cluster center point in the current cluster center set.
8. The method according to claim 6 or 7, characterized in that, before the fourth distance is calculated, the method further comprises:
calculating a fifth distance, the fifth distance being the distance between the second cluster center point and the first cluster center point after the last update;
comparing the actual value of the first distance with the fifth distance according to the triangle inequality rule;
if the fifth distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster center point, so that during the cluster traversal neither the distance between the sample point and the second cluster center point nor the distances between the second cluster center point and the other cluster center points still to be traversed are calculated;
wherein calculating the fourth distance comprises:
if the fifth distance is less than twice the actual value of the first distance, performing the calculation of the fourth distance.
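The sequence of tests in claims 5 to 8 for a single candidate center can be sketched as below. This is an illustrative reading under stated assumptions, not the patented implementation: points are coordinate tuples, the distance is Euclidean (`math.dist`, Python 3.8+), and the function name, argument names, and reason strings are hypothetical.

```python
import math


def consider_candidate(sample, nearest, candidate, first_actual, third_predicted):
    """Decide whether `candidate` replaces `nearest` for `sample`.

    first_actual: actual first distance, i.e. distance from sample to nearest.
    third_predicted: predicted distance between nearest and candidate.
    Returns (chosen_center, distance_to_chosen_center, reason).
    """
    # Claim 5: discard using the predicted third distance, no new distance needed.
    if third_predicted >= 2 * first_actual:
        return nearest, first_actual, "discarded: predicted third distance"

    # Claim 8: compute the fifth distance (candidate to nearest) and test again.
    fifth = math.dist(candidate, nearest)
    if fifth >= 2 * first_actual:
        return nearest, first_actual, "discarded: fifth distance"

    # Claims 5 to 7: only now compute the fourth distance (sample to candidate).
    fourth = math.dist(sample, candidate)
    if fourth < first_actual:
        return candidate, fourth, "candidate is nearer"
    return nearest, first_actual, "nearest unchanged"
```

For example, with `sample = (0.0, 0.0)`, `nearest = (1.0, 0.0)`, and `first_actual = 1.0`, a candidate at `(5.0, 0.0)` with `third_predicted = 3.5` is discarded by the first test, so neither the fifth nor the fourth distance is computed for it.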
9. The method according to any one of claims 1-7, characterized in that obtaining the predicted value of the third distance according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update comprises:
obtaining the value of the first cluster center point before its last update and its value after the update, and calculating a first difference between the first cluster center point before and after the update;
obtaining the value of the second cluster center point before its last update and its value after the update, and calculating a second difference between the second cluster center point before and after the update;
subtracting the first difference and the second difference from the second distance, to obtain the predicted value of the third distance.
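The subtraction in claim 9 can be sketched as follows, assuming that each "difference" is the Euclidean distance a center moved during its last update (the claim does not specify the measure; the function and argument names are hypothetical):

```python
import math


def predict_third_distance(second_distance, c1_before, c1_after, c2_before, c2_after):
    """Predict the distance between two updated centers without recomputing it.

    second_distance: distance between the two centers in the previous traversal.
    The prediction subtracts how far each center moved in its last update.
    """
    first_difference = math.dist(c1_before, c1_after)    # movement of the first center
    second_difference = math.dist(c2_before, c2_after)   # movement of the second center
    return second_distance - first_difference - second_difference
```

Under this assumption the prediction never exceeds the true distance between the updated centers, by the triangle inequality. For instance, if the centers were 10 apart and moved by 1 and 2 respectively, the prediction is 7, while the true distance between `(1, 0)` and `(10, 2)` is about 9.22.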
10. A website clustering device, characterized by comprising:
an obtaining unit, configured to obtain a sample set for website clustering and a cluster center set of the sample set, wherein each sample point in the sample set includes description information of a website involved in the website clustering, the description information including at least field information, structure information, and visitor information;
a cluster set obtaining unit, configured to, for each sample point in the sample set, traverse in turn each cluster center point in the cluster center set, determine the cluster center point in the cluster center set nearest to the sample point, and assign the sample point to the set corresponding to that nearest cluster center point, thereby obtaining the cluster set corresponding to each cluster center point in the cluster center set;
an average value obtaining unit, configured to obtain the average value of the sample points in each cluster set, and to update the cluster center set according to the average value;
a first acquisition unit, configured to obtain a predicted value of a first distance according to the difference of a first cluster center point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster center point, and the first cluster center point is the cluster center point nearest to the sample point in the clustering distance traversal;
a second acquisition unit, configured to obtain a predicted value of a third distance according to a second distance, the difference of the first cluster center point before and after its last update, and the difference of a second cluster center point before and after its last update, wherein the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is a cluster center point still to be traversed in the current clustering distance traversal;
a comparing unit, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquisition unit with the predicted value of the third distance obtained by the second acquisition unit;
a discarding unit, configured to, when the comparing unit finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discard the second cluster center point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distances between the second cluster center point and the other cluster center points still to be traversed are calculated;
a clustering result obtaining unit, configured to perform the distance traversal based on the cluster center set from which the second cluster center point has been discarded, to obtain a clustering result of the sample set, the clustering result including clustering information obtained by clustering the websites with the field information, the structure information, and the visitor information as reference dimensions.
CN201510982364.9A 2015-12-23 2015-12-23 A kind of method and device of website cluster Pending CN106909932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510982364.9A CN106909932A (en) 2015-12-23 2015-12-23 A kind of method and device of website cluster

Publications (1)

Publication Number Publication Date
CN106909932A true CN106909932A (en) 2017-06-30

Family

ID=59206042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510982364.9A Pending CN106909932A (en) 2015-12-23 2015-12-23 A kind of method and device of website cluster

Country Status (1)

Country Link
CN (1) CN106909932A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101996197A (en) * 2009-08-31 2011-03-30 ***通信集团公司 Cluster realizing method and system
CN102750647A (en) * 2012-06-29 2012-10-24 南京大学 Merchant recommendation method based on transaction network
CN103412948A (en) * 2013-08-27 2013-11-27 北京交通大学 Cluster-based collaborative filtering commodity recommendation method and system
CN103605794A (en) * 2013-12-05 2014-02-26 国家计算机网络与信息安全管理中心 Website classifying method
CN104101902A (en) * 2013-04-10 2014-10-15 中国石油天然气股份有限公司 Earthquake attribute cluster method and apparatus
CN104765776A (en) * 2015-03-18 2015-07-08 华为技术有限公司 Data sample clustering method and device
CN105095912A (en) * 2015-08-06 2015-11-25 北京奇虎科技有限公司 Data clustering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20170630)