CN103198103B

CN103198103B - Microblog push method and device based on density-based word clustering

Info

Publication number: CN103198103B
Application number: CN201310090524.XA
Authority: CN
Inventors: 冯扬; 姜贵彬; 宋莉; 刘莹莹; 桑军
Original assignee: Weimeng Chuangke Network Technology China Co Ltd
Current assignee: Weimeng Chuangke Network Technology China Co Ltd
Priority date: 2013-03-20
Filing date: 2013-03-20
Publication date: 2016-06-29
Anticipated expiration: 2033-03-20
Also published as: CN103198103A

Abstract

The invention discloses a microblog push method and device based on density-based word clustering, intended to solve the prior-art problems of heavy server load and wasted network resources. In the method, a server determines the word distance between each pair of words, determines core words from those distances, and divides out a word set for each core word. It then clusters the words in each word set with the OPTICS clustering algorithm to obtain several to-be-merged word clusters, merges these into merged word clusters, and finally pushes a microblog to be pushed according to the merged word cluster containing the user's interest word and the content of that microblog. Because words with a general meaning are left outside the word sets when the sets are divided, and because the clustering is not affected by an initial value, clustering accuracy improves. The server can therefore push microblogs accurately according to the merged word clusters, which effectively reduces server load and saves network resources.

Description

Microblog push method and device based on density-based word clustering

Technical field

The present invention relates to the field of network technology, and in particular to a microblog push method and device based on density-based word clustering.

Background technology

At present, microblogs are an increasingly influential part of social media and have become one of the main means by which users publish, exchange, and obtain information.

For a microblog user, in addition to pushing the microblogs published by the other users that this user follows, the server can also push microblogs related to the user's interests.

Specifically, an interest word is first set for the user (either by the user, or by the server according to the microblogs the user has browsed, forwarded, bookmarked, or followed). The server then judges, from the content of a microblog to be pushed, whether that microblog is related to the user's interest word and, if so, pushes it to the user.

For example, if the user's interest word is "computer", the server judges from the content of the microblog to be pushed whether it is related to "computer" and, if so, pushes it to the user.

In practice, however, several different words often express the same or a similar meaning. In the example above, words close to the user's interest word "computer" include a synonym that also translates as "computer", "notebook", and so on. If the decision to push a microblog rests only on whether it is related to the user's literal interest word, push accuracy is bound to be low. The words in the dictionary therefore need to be clustered, that is, words expressing the same or similar meanings are grouped together to form word clusters. Then, when judging whether a microblog to be pushed should go to a user, the server can judge whether its content is related to the word cluster containing the user's interest word and push according to the result. The accuracy with which the dictionary words are clustered thus directly determines the accuracy of pushing microblogs.

In the prior art, the following two kinds of clustering algorithm are generally used to cluster words.

First, partition-based clustering algorithms such as k-means. The number of word clusters k is specified first, and k words are then randomly selected from the dictionary as the center words of the k word clusters. For every other word in the dictionary, the distance to each of the k center words is calculated, the nearest center word is determined, and the word is assigned to that center word's cluster. After all other words have been processed, the center word of each cluster is redetermined (again giving k center words), the distances from the other words to the k new center words are recalculated, and the clusters are repartitioned by distance. The iteration continues until a termination condition is met.

However, the accuracy of this first method depends on the initially specified number of word clusters k. Moreover, the clusters it produces are "spherical", and within a cluster the words farthest from the center word often have very low relevance to the cluster.

Second, agglomerative hierarchical clustering algorithms. Initially each word is treated as its own word cluster. The distances between clusters are calculated, the closer clusters are merged into larger ones, the distances between the resulting clusters are recalculated, and merging continues according to those distances. The iteration continues until a termination condition is met.

However, the accuracy of this second algorithm is affected by the many words with a general meaning, such as "company", "enterprise", and "experience". Such words have some relevance to words of many different categories, so during merging they often cause two fundamentally unrelated word clusters to be merged.

In summary, the prior art clusters words with low accuracy, so the server cannot push microblogs accurately. If the server fails to push a microblog related to a user's interests, the user will inevitably look for it by other means such as search, which increases server load. If the server pushes a microblog unrelated to the user's interests, network resources are wasted.

Summary of the invention

Embodiments of the present invention provide a microblog push method and device based on density-based word clustering, to solve the prior-art problems of heavy server load and wasted network resources.

A microblog push method based on density-based word clustering provided by an embodiment of the present invention includes:

determining, by a server, the word distance between each pair of words according to the co-occurrence word set of each word; and

determining core words according to the word distances between the words; and

for each determined core word, dividing the N words with the smallest word distance to the core word, together with the core word, into one word set, where N is a first preset quantity;

for each word set so divided, clustering the words in the word set with the OPTICS clustering algorithm to obtain several to-be-merged word clusters; and

merging the to-be-merged word clusters according to the words they contain, to obtain merged word clusters;

pushing, by the server, a microblog to be pushed according to the merged word cluster containing a user's interest word and the content of the microblog to be pushed.

A microblog push device based on density-based word clustering provided by an embodiment of the present invention includes:

a word distance determination module, configured to determine the word distance between each pair of words according to the co-occurrence word set of each word;

a core word determination module, configured to determine core words according to the word distances between the words;

a word set division module, configured to, for each determined core word, divide the N words with the smallest word distance to the core word, together with the core word, into one word set, where N is a first preset quantity;

a clustering module, configured to, for each word set so divided, cluster the words in the word set with the OPTICS clustering algorithm to obtain several to-be-merged word clusters;

a merging module, configured to merge the to-be-merged word clusters according to the words they contain, to obtain merged word clusters;

a push module, configured to push a microblog to be pushed according to the merged word cluster containing a user's interest word and the content of the microblog to be pushed.

Embodiments of the present invention provide a microblog push method and device based on density-based word clustering. In the method, the server first determines the word distance between each pair of words and determines core words from those distances. For each core word, it divides the N words with the smallest word distance to the core word, together with the core word, into one word set. It then clusters the words in each word set with the OPTICS clustering algorithm to obtain several to-be-merged word clusters, merges these into merged word clusters, and finally pushes a microblog to be pushed according to the merged word cluster containing the user's interest word and the content of that microblog. Because words with a general meaning are left outside the word sets when the sets are divided, and because the clustering is not affected by an initial value, clustering accuracy improves. The server can therefore push microblogs accurately according to the merged word clusters, which effectively reduces server load and saves network resources.

Brief description of the drawings

Fig. 1 shows the microblog push process based on density-based word clustering provided by an embodiment of the present invention;

Fig. 2 shows the process by which the server divides word sets, provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of a word set divided around word p, provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of two intersecting word sets, provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of two mutually exclusive word sets, provided by an embodiment of the present invention;

Fig. 6 is a schematic diagram of two excessively intersecting word sets, provided by an embodiment of the present invention;

Fig. 7 shows the process of clustering the words in one word set with the OPTICS clustering algorithm, provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of the microblog push device based on density-based word clustering, provided by an embodiment of the present invention.

Detailed description of the invention

To prevent words with a general meaning, and the initial value supplied for clustering (such as the number of word clusters k that must be specified up front in the k-means clustering algorithm), from affecting the clustering result, embodiments of the present invention first divide the words in the dictionary into several word sets. Words with a general meaning (such as "company", "enterprise", and "experience") are excluded when the word sets are divided. Each word set is then clustered with the OPTICS clustering algorithm, which requires no such initial value and can find word clusters of arbitrary "shape" from the density of the words, yielding to-be-merged word clusters. Finally the to-be-merged word clusters are merged into merged word clusters. This improves clustering accuracy and lets the server push microblogs accurately according to the merged word clusters, which effectively reduces server load and saves network resources.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 shows the microblog push process based on density-based word clustering provided by an embodiment of the present invention, which specifically includes the following steps:

S101: The server determines the word distance between each pair of words according to the co-occurrence word set of each word.

In embodiments of the present invention, for any two words in the dictionary, the server can determine the word distance between them from the similarity of their context distributions in the corpus. The more similar the context distributions of the two words, the more similar the meanings they express and the smaller the word distance between them; conversely, the less similar the distributions, the larger the word distance.

Specifically, denote any two words in the dictionary as the first word and the second word. For these two words the server can use a formula (the formula itself is missing from this text) to determine the word distance between them, where i denotes the first word, j denotes the second word, D(i, j) is the word distance between the first word and the second word, T_i is the co-occurrence word set of the first word, T_j is the co-occurrence word set of the second word, |T_i ∩ T_j| is the number of words in the intersection of the two co-occurrence word sets, |T_i| is the number of words in the co-occurrence word set of the first word, and |T_j| is the number of words in the co-occurrence word set of the second word.
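The quantities named in step S101 can be illustrated with a short sketch. Because the patent's own formula for D(i, j) is not available here, the function below substitutes a Jaccard-style distance built from the same three quantities |T_i ∩ T_j|, |T_i| and |T_j|. That choice, the function name `word_distance`, and the toy co-occurrence sets are all assumptions for illustration, not the patented formula.

```python
def word_distance(t_i: set, t_j: set) -> float:
    """Hypothetical stand-in for D(i, j): 1 - |T_i ∩ T_j| / |T_i ∪ T_j|.

    The real formula is not given in the text; this only shows how the
    three set sizes it mentions could combine into a distance.
    """
    if not t_i and not t_j:
        return 1.0  # assumption: no context at all means maximally distant
    shared = len(t_i & t_j)
    union = len(t_i) + len(t_j) - shared
    return 1.0 - shared / union


# Toy co-occurrence word sets (invented for the example).
co_occurrence = {
    "computer": {"screen", "keyboard", "cpu", "buy"},
    "notebook": {"screen", "keyboard", "battery", "buy"},
    "company": {"staff", "buy"},
}

d_close = word_distance(co_occurrence["computer"], co_occurrence["notebook"])
d_far = word_distance(co_occurrence["computer"], co_occurrence["company"])
```

Under this stand-in, words whose co-occurrence sets overlap more end up closer together, which matches the qualitative behavior the step describes.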

S102: Core words are determined according to the word distances between the words.

In embodiments of the present invention, a pending word can be determined to be a core word if many other words lie at a small word distance from it.

Specifically, after determining the word distances between the words in the dictionary, the server can treat every word in the dictionary as a pending word and, for each pending word, judge whether the number of other words whose word distance to the pending word is not greater than a preset neighborhood distance ε exceeds a second preset quantity M. If so, the pending word is determined to be a core word; otherwise it is determined not to be a core word. Both the preset neighborhood distance ε and the second quantity M can be set as required.
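The core word test in step S102 can be sketched as follows. The function name `find_core_words` and the one-dimensional toy positions standing in for word distances are assumptions made for the example; only the rule itself (more than M other words within ε) comes from the text.

```python
def find_core_words(words, dist, eps, m):
    """Return the words that have more than m other words within distance eps."""
    core = []
    for w in words:
        neighbors = sum(1 for other in words if other != w and dist(w, other) <= eps)
        if neighbors > m:
            core.append(w)
    return core


# Toy data: positions on a line stand in for real word distances (assumption).
positions = {"a": 0, "b": 1, "c": 2, "d": 50}
toy_dist = lambda u, v: abs(positions[u] - positions[v])

core_words = find_core_words(list(positions), toy_dist, eps=2, m=1)
```

Here "a", "b" and "c" each have two neighbors within ε = 2 and so qualify, while the isolated word "d" does not.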

S103: For each determined core word, the N words with the smallest word distance to the core word, together with the core word, are divided into one word set.

Here N is the first preset quantity.

It should be noted that a word set so divided does not contain only one core word. Moreover, because division places the N words nearest to the core word into the word set, words with a general meaning, whose word distance to any core word is mostly not small (not small enough for them to be placed in a word set), are left over. After the word sets are divided, some words therefore remain that belong to no word set, and these are the words with a general meaning. The subsequent clustering processes only the divided word sets and ignores the words not assigned to any word set, which prevents words with a general meaning from reducing the accuracy of the subsequent clustering.
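A minimal sketch of step S103 shows how words far from every core word end up outside all word sets. The helper name `divide_word_set`, the toy positions, and the choice of "a" and "d" as core words are assumptions for illustration.

```python
def divide_word_set(core, words, dist, n):
    """Word set for one core word: the core word plus its n nearest words."""
    others = sorted((w for w in words if w != core), key=lambda w: dist(core, w))
    return {core, *others[:n]}


# Toy data: positions on a line stand in for word distances (assumption).
pos = {"a": 0, "b": 1, "c": 2, "d": 3, "general": 50}
dist = lambda u, v: abs(pos[u] - pos[v])

# Assume "a" and "d" were already determined to be core words.
word_sets = [divide_word_set(c, pos, dist, n=2) for c in ("a", "d")]
unassigned = set(pos) - set().union(*word_sets)
```

The word "general" is far from both core words, so it lands in neither word set and is skipped by the later clustering, which is the exclusion effect the paragraph describes.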

S104: For each divided word set, the words in the word set are clustered with the OPTICS clustering algorithm to obtain several to-be-merged word clusters.

Because the density-based OPTICS clustering algorithm has higher clustering accuracy than partition-based clustering algorithms (such as k-means) and agglomerative hierarchical clustering algorithms, in embodiments of the present invention the server, after dividing the word sets, can cluster each word set with the OPTICS clustering algorithm to obtain several to-be-merged word clusters.

S105: The to-be-merged word clusters obtained are merged according to the words they contain, to obtain merged word clusters.

Specifically, the server can add each to-be-merged word cluster obtained to a cluster queue. For each to-be-merged word cluster in the cluster queue, the server extracts that cluster, determines the other to-be-merged word clusters in the queue that share at least one word with it, and merges them into the extracted cluster to obtain an intermediate to-be-merged word cluster. The server deletes the extracted cluster and the other clusters so determined from the cluster queue, then continues to determine the other to-be-merged word clusters in the current cluster queue that share at least one word with the intermediate cluster and merges them, until the number of words in the intermediate cluster no longer changes. At that point the intermediate to-be-merged word cluster is taken as a resulting merged word cluster.

For example, suppose there are four to-be-merged word clusters C, C1, C2, and C3, where C and C1 share a word, C1 and C2 share a word, and the words in C3 all differ from those in C, C1, and C2. After adding the four clusters to the cluster queue, the server extracts C and determines that the other to-be-merged word cluster in the queue sharing a word with C is C1. It therefore merges C1 into C to obtain the intermediate to-be-merged word cluster C ∪ C1, and deletes C and C1 from the cluster queue.

The cluster queue now contains two to-be-merged word clusters, C2 and C3. The server continues to determine the other to-be-merged word clusters that share a word with the intermediate cluster C ∪ C1, namely C2 (because C1 and C2 share a word, C2 and C ∪ C1 also share a word). C2 is therefore merged into C ∪ C1 to obtain the intermediate to-be-merged word cluster C ∪ C1 ∪ C2.

The server then deletes C2 from the cluster queue, leaving only C3. Because the words in C3 all differ from those in C ∪ C1 ∪ C2, the number of words in the intermediate cluster C ∪ C1 ∪ C2 no longer changes, so C ∪ C1 ∪ C2 is taken as one merged word cluster.
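The merging procedure of step S105 and the C, C1, C2, C3 example can be sketched as below. The function name `merge_clusters` and the concrete words placed in each cluster are assumptions; the queue-based merging until the intermediate cluster stops growing follows the description.

```python
def merge_clusters(clusters):
    """Merge clusters that share at least one word, transitively."""
    queue = [set(c) for c in clusters]
    merged = []
    while queue:
        current = queue.pop(0)           # extract one to-be-merged cluster
        changed = True
        while changed:                   # repeat until its size stops changing
            changed = False
            remaining = []
            for other in queue:
                if current & other:      # shares at least one word
                    current |= other
                    changed = True
                else:
                    remaining.append(other)
            queue = remaining            # merged clusters leave the queue
        merged.append(current)
    return merged


C = {"pc", "laptop"}
C1 = {"laptop", "notebook"}
C2 = {"notebook", "tablet"}
C3 = {"football", "goal"}

merged_clusters = merge_clusters([C, C1, C2, C3])
```

In this sketch C2 may be absorbed in the same pass as C1, rather than in a separate second pass as in the worked example, but the final merged clusters are the same: C ∪ C1 ∪ C2 and C3.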

S106: The server pushes a microblog to be pushed according to the merged word cluster containing the user's interest word and the content of the microblog.

After obtaining the merged word clusters, the server can, for a microblog to be pushed, push it according to the merged word cluster containing each user's interest word and the content of the microblog. For example, for a given user the server can determine the merged word cluster containing the user's interest word and then judge, from the content of the microblog to be pushed, whether the relevance between that content and the determined merged word cluster exceeds a set threshold. If so, the microblog is pushed to the user; otherwise it is not.
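The push decision in step S106 can be sketched as a threshold test. The text does not say how relevance is computed, so the measure below (the fraction of a microblog's tokens that fall inside the merged word cluster) is purely an assumption, as are the function name `should_push`, the pre-tokenized input, and the threshold value.

```python
def should_push(interest_word, merged_clusters, microblog_tokens, threshold):
    """Decide whether to push, using a hypothetical overlap-based relevance."""
    # Find the merged word cluster containing the interest word; fall back to
    # the interest word alone if it was left out of every cluster (assumption).
    cluster = next((c for c in merged_clusters if interest_word in c), {interest_word})
    if not microblog_tokens:
        return False
    hits = sum(1 for token in microblog_tokens if token in cluster)
    relevance = hits / len(microblog_tokens)   # assumed relevance measure
    return relevance > threshold


clusters = [{"computer", "notebook", "laptop"}]

push_a = should_push("computer", clusters, ["new", "notebook", "laptop", "sale"], 0.3)
push_b = should_push("computer", clusters, ["football", "match", "tonight"], 0.3)
```

The first microblog never mentions "computer" literally but still passes the threshold through the cluster's related words, which is the benefit the clustering is meant to provide.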

With the above method, the server leaves words with a general meaning outside the word sets when dividing them, and the clustering is not affected by an initial value, so clustering accuracy improves. Consequently, for a microblog to be pushed, if the server determines from the merged word clusters that the microblog is highly relevant to the merged word cluster containing a user's interest word, it can push the microblog to that user, who then has no need to look for it by other means such as search. This effectively reduces server load. If the server determines that the relevance is low, it does not push the microblog to the user, which saves network resources.

Preferably, to improve clustering efficiency in the above process as far as possible, in step S103 above the server can divide the word sets using the method shown in Fig. 2.

Fig. 2 shows the process by which the server divides word sets, provided by an embodiment of the present invention, which specifically includes the following steps:

S1031: The server adds the words to an original queue in random order.

S1032: For each word in the original queue, the server extracts the word and judges whether it is a core word. If so, step S1033 is performed; otherwise, step S1034 is performed.

S1033: The N words in the original queue with the smallest word distance to the extracted word, together with the extracted word, are divided into one word set; the extracted word, and the core words whose word distance to the extracted word is less than twice the preset neighborhood distance, are deleted from the original queue; and the process returns to step S1032.

The purpose of deleting from the original queue the core words whose word distance to the extracted word is less than twice the preset neighborhood distance is as follows:

Suppose word p is a core word. If p is regarded as a point in space, the word set divided around p (denoted S(p)) is as shown in Fig. 3. Fig. 3 is a schematic diagram of the word set divided around word p, provided by an embodiment of the present invention. In Fig. 3, the circle of radius ε represents the range within which the word distance to p is not greater than the preset neighborhood distance, that is, ε is the preset neighborhood distance. h(p) is the word distance to p of the word in S(p) that is farthest from p; h(p) is called the word set division distance, and the words within radius h(p) are the words included in S(p).

From the word set shown in Fig. 3, it follows that two word sets may either intersect or be mutually exclusive, as shown in Fig. 4 and Fig. 5.

Fig. 4 is a schematic diagram of two intersecting word sets, provided by an embodiment of the present invention. In Fig. 4, the word set S(p1) divided around word p1 and the word set S(p2) divided around word p2 both contain the same word r, so S(p1) and S(p2) are said to intersect.

Fig. 5 is a schematic diagram of two mutually exclusive word sets, provided by an embodiment of the present invention. In Fig. 5, the word set S(p1) divided around word p1 and the word set S(p2) divided around word p2 contain no word in common, so S(p1) and S(p2) are said to be mutually exclusive.

Further, intersection of two word sets is divided into ordinary intersection and excessive intersection. The case shown in Fig. 4 is ordinary intersection, and the case shown in Fig. 6 is excessive intersection.

Fig. 6 is a schematic diagram of two excessively intersecting word sets, provided by an embodiment of the present invention. In Fig. 6, the word set S(p1) divided around word p1 and the word set S(p2) divided around word p2 both contain the same word r, the word distance between r and p1 is less than the preset neighborhood distance ε, and the word distance between r and p2 is also less than ε, so S(p1) and S(p2) are said to intersect excessively.
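The three relationships between two word sets (mutually exclusive, ordinary intersection, excessive intersection) can be told apart with a small check. The function name `classify_word_sets`, the returned labels, and the one-dimensional toy positions are assumptions for illustration.

```python
def classify_word_sets(p1, s1, p2, s2, dist, eps):
    """Classify the relationship between word sets S(p1) and S(p2)."""
    shared = s1 & s2
    if not shared:
        return "mutually exclusive"
    # Excessive: some shared word r lies within eps of both core words.
    if any(dist(r, p1) < eps and dist(r, p2) < eps for r in shared):
        return "excessive"
    return "ordinary"


# Toy data: positions on a line stand in for word distances (assumption).
pos = {"p1": 0.0, "p2": 3.0, "p3": 0.8, "r": 1.5, "q": 0.4, "x": 9.0}
dist = lambda u, v: abs(pos[u] - pos[v])

rel_ordinary = classify_word_sets("p1", {"p1", "r"}, "p2", {"p2", "r"}, dist, 1.0)
rel_excessive = classify_word_sets("p1", {"p1", "q"}, "p3", {"p3", "q"}, dist, 1.0)
rel_exclusive = classify_word_sets("p1", {"p1", "q"}, "p2", {"p2", "x"}, dist, 1.0)
```

In the excessive case the shared word "q" sits inside the ε-neighborhood of both core words, so the two word sets cover largely the same dense region.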

Clearly, dividing a word set around every core word in the dictionary would sharply increase the number of word sets and reduce the efficiency of the subsequent OPTICS clustering. To improve that efficiency, excessive intersection must be avoided. If a word r lies within ε of both p1 and p2 and the word distance satisfies the triangle inequality, then p1 and p2 are less than 2ε apart. Therefore, in step S1033, after a word set has been divided around a core word, the other core words in the original queue whose word distance to that core word is less than 2ε need to be deleted. No word set divided afterwards will then intersect excessively with this word set.

S1034: The extracted word is put back into the original queue, and the process returns to step S1032.

A word set divided by the above method contains at least one core word, and in general a word set contains many core words.
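Steps S1031 to S1034 can be sketched as a loop over the original queue. The function name `divide_word_sets`, the `is_core` predicate, the fixed random seed, and the toy positions are assumptions. The text does not state when the loop ends, so this sketch stops once no core word remains in the queue, which is also an assumption.

```python
import random


def divide_word_sets(words, is_core, dist, n, eps, seed=0):
    """Divide word sets while deleting core words closer than 2 * eps."""
    queue = list(words)
    random.Random(seed).shuffle(queue)                 # S1031: random order
    word_sets = []
    while True:
        # S1032 / S1034: non-core words simply stay in the queue.
        core = next((w for w in queue if is_core(w)), None)
        if core is None:
            break                                      # assumed stop condition
        # S1033: the n nearest words still in the queue join the core word.
        others = sorted((w for w in queue if w != core), key=lambda w: dist(core, w))
        word_sets.append({core, *others[:n]})
        # Delete the extracted word and core words closer than 2 * eps to it.
        queue = [w for w in queue
                 if w != core and not (is_core(w) and dist(core, w) < 2 * eps)]
    return word_sets


# Toy data: two dense groups and one isolated word (assumption).
pos = {"a": 0, "b": 1, "c": 2, "d": 50, "e": 51, "f": 52, "g": 200}
dist = lambda u, v: abs(pos[u] - pos[v])
is_core = lambda w: w in {"a", "b", "c", "d", "e", "f"}

word_sets = divide_word_sets(list(pos), is_core, dist, n=2, eps=2)
```

Although there are six core words, only two word sets are produced, one per dense group, because the neighbors of each chosen core word are deleted before they can seed a near-duplicate word set.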

Further, the process in step S104 of Fig. 1 of clustering the words in one word set with the OPTICS clustering algorithm is shown in Fig. 7.

Fig. 7 shows the process of clustering the words in one word set with the OPTICS clustering algorithm, provided by an embodiment of the present invention, which specifically includes the following steps:

S1041: For each divided word set, all the words in the word set are added to an ordered seed queue.

For example, suppose the word set contains n words, p1 to pn. The server can initially add these n words to the ordered seed queue in random order.

S1042: The first word is extracted according to the order of the words in the ordered seed queue.

Continuing the example, suppose the n words were added in the order p1 to pn. The server then extracts the first word, p1, from the ordered seed queue.

S1043: The server judges whether the extracted word is a core word. If so, step S1044 is performed; otherwise, step S1045 is performed.

S1044: The other words in the ordered seed queue are taken as candidate words of the extracted word. The intermediate reachability distance value of each candidate word is updated according to the reachability distance from that candidate word to the extracted word. The extracted word is inserted at the end of the result queue and deleted from the ordered seed queue, the candidate words in the ordered seed queue are sorted in ascending order of their current intermediate reachability distance values, and step S1046 is performed.

Here, the intermediate reachability distance value of each candidate word is updated as follows. For each candidate word, the reachability distance from the candidate word to the extracted word is determined. If the candidate word's current intermediate reachability distance value is not greater than that reachability distance, the intermediate value is kept unchanged. If it is greater, the reachability distance is taken as the candidate word's new intermediate reachability distance value.

Continuing the example, after the first word p1 is extracted, the other words in the ordered seed queue, p2 to pn, are taken as candidate words of p1. The server determines the reachability distances from p2 to p1, from p3 to p1, ..., and from pn to p1.

The method by which the server determines the reachability distance between two words is illustrated below for the reachability distance from p2 to p1. If p1 and p2 are both core words, the reachability distance from p2 to p1 is the larger of the core distance of p1 and the word distance from p2 to p1.

The core distance of p1 is determined as follows: among the words of the dictionary, the M words (M being the second quantity above) nearest to p1 are determined in ascending order of word distance to p1, and the largest word distance from these M words to p1 is taken as the core distance of p1.
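The core distance and reachability distance definitions translate directly into two small functions. The function names and the toy positions are assumptions; the definitions (core distance as the distance to the M-th nearest word, reachability distance as the larger of the core distance and the word distance) follow the text.

```python
def core_distance(p, words, dist, m):
    """Largest distance among the m words nearest to p.

    Assumes the dictionary holds at least m words other than p.
    """
    nearest = sorted(dist(p, w) for w in words if w != p)
    return nearest[m - 1]


def reachability_distance(q, p, words, dist, m):
    """Reachability distance from q to p: max(core distance of p, D(q, p))."""
    return max(core_distance(p, words, dist, m), dist(q, p))


# Toy data: positions on a line stand in for word distances (assumption).
pos = {"p1": 0, "p2": 1, "p3": 3, "p4": 10}
dist = lambda u, v: abs(pos[u] - pos[v])

cd_p1 = core_distance("p1", pos, dist, m=2)
rd_near = reachability_distance("p2", "p1", pos, dist, m=2)
rd_far = reachability_distance("p4", "p1", pos, dist, m=2)
```

For the nearby word p2 the core distance of p1 dominates, while for the distant word p4 the word distance itself dominates.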

For p2, if the reach distance intermediate value of current p2 is not more than the reach distance of p2 to p1, the reach distance intermediate value then keeping current p2 is constant, if the reach distance that the reach distance intermediate value of current p2 is more than p2 to p1, then the reach distance intermediate value of p2 is updated to the reach distance of p2 to p1.Similar, the reach distance intermediate value of renewable p3～pn.Wherein, time initial, the reach distance intermediate value of p1～pn can be set to initial value by server, as, for core word, its reach distance intermediate value can be set to a bigger value (more than reach distance between any two core word in this word set) by server, for non-core word, then its reach distance intermediate value can be set to a less value (such as-1).

After the reach distance intermediate value of p2～pn has all been updated, the p1 of extraction is then inserted into the end of result queue (owing to now result queue is sky by server, therefore after inserting p1, p1 makes number one in result queue), and from orderly seed queue, delete p1, according still further to reach distance intermediate value order from small to large, the p2～pn in orderly seed queue is ranked up.

Assume that the order of p2～pn after sequence is p3, p2, p4, p5, p6 ... pn, after then performing step S1046, determining in current seed queue in order and there is also vocabulary, therefore returning step S1042, first vocabulary continuing to extract in orderly seed queue processes.

Owing to first vocabulary in now orderly seed queue is p3, therefore server extracts p3, and processes according to the method described above.

So, after all vocabulary in orderly seed queue are all disposed, for any one core word pi in result queue, the reach distance coming next core word pj to the pi after this core word pi that can ensure in result queue is: all core words after coming this core word pi in result queue are to minimum in the reach distance of this core word pi.

S1045: insert the extracted word at the end of the result queue, delete the extracted word from the ordered seed queue, and perform step S1046.

S1046: judge whether any word remains in the current ordered seed queue; if so, return to step S1042; otherwise, perform step S1047.

S1047: following the order of the words in the result queue, search for a word, other than the first word, whose current reach distance intermediate value is greater than the preset neighborhood distance.

S1048: when such a word is found, cluster all words ranked before the found word in the result queue into one word bunch to be combined, and delete all words in this word bunch to be combined from the result queue.

Based on the result queue obtained above, the server scans the result queue from front to back, in the order of its words, for words with a large reach distance intermediate value. A word found in this way lies in a region of lower density, that is, on the boundary of a word bunch. The server therefore searches the result queue for a word, other than the first word, whose current reach distance intermediate value is greater than the preset neighborhood distance ε. When such a word is found, all words ranked before it in the result queue are clustered into one word bunch to be combined and are deleted from the result queue. The server then continues to search for the next word, other than the first word, whose current reach distance intermediate value is greater than ε, and repeats the above processing until the result queue is empty.

S1049: judge whether any word remains in the current result queue; if so, return to step S1047; otherwise, perform step S1040.

S1040: the clustering of this word set ends.
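The ordering part of the steps above (extracting the first word, updating reach distance intermediate values, re-sorting the ordered seed queue) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `order_word_set`, the caller-supplied `reach_dist(candidate, extracted)` function and the use of infinity as the "relatively large" initial value are assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Word = Hashable


def order_word_set(
    words: Iterable[Word],
    core_words: Set[Word],
    reach_dist: Callable[[Word, Word], float],  # hypothetical: reach distance from candidate to extracted word
) -> Tuple[List[Word], Dict[Word, float]]:
    """Return the result queue and the final reach distance intermediate values."""
    seed = list(words)  # ordered seed queue
    # Initial values: large for core words (infinity is an assumption), -1 for non-core words.
    mid = {w: (float("inf") if w in core_words else -1.0) for w in seed}
    result: List[Word] = []  # result queue

    while seed:
        extracted = seed.pop(0)  # extract the first word and delete it from the seed queue
        if extracted in core_words:
            # Every remaining word is a candidate word of the extracted core word.
            for cand in seed:
                d = reach_dist(cand, extracted)
                if mid[cand] > d:  # keep the smaller of the current value and the new distance
                    mid[cand] = d
            # Re-sort the seed queue in ascending order of intermediate value.
            seed.sort(key=lambda w: mid[w])
        result.append(extracted)  # insert at the end of the result queue
    return result, mid
```

Because Python's `list.sort` is stable, words with equal intermediate values keep their previous relative order in the seed queue.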

By clustering the words in a word set with the above method, several word bunches to be combined are obtained. Subsequently, in step S105 shown in Fig. 1, the word bunches to be combined that were obtained for the respective word sets are merged to obtain combinatorial word bunches, and in step S106 the microblog to be pushed is pushed.
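The boundary search of steps S1047 to S1049 can be sketched as below. The function name `split_into_bunches` is hypothetical, and one detail is an assumption rather than something stated in the text: when no further boundary word is found, the words still in the result queue are taken as the last word bunch to be combined, so that the queue can become empty.

```python
from typing import Dict, Hashable, List

Word = Hashable


def split_into_bunches(result: List[Word], mid: Dict[Word, float], eps: float) -> List[List[Word]]:
    """Cut the result queue into word bunches to be combined at boundary words."""
    queue = list(result)
    bunches: List[List[Word]] = []
    while queue:
        # Search, skipping the first word, for a word whose intermediate value exceeds eps.
        boundary = next((i for i in range(1, len(queue)) if mid[queue[i]] > eps), None)
        if boundary is None:
            # Assumption: the remaining words form the final bunch.
            bunches.append(queue)
            break
        bunches.append(queue[:boundary])  # all words ranked before the found word
        queue = queue[boundary:]          # delete them from the result queue
    return bunches
```

The found boundary word becomes the first word of the remaining queue, which is why the first word is always skipped in the next search.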

The prior art also includes methods that directly apply the OPTICS clustering algorithm to all words in the dictionary, that is, the clustering process shown in Fig. 7 is performed directly on all words in the dictionary. However, because the process shown in Fig. 7 contains two iterative loops, the clustering has to be performed by a single server. Moreover, if that single server fails during clustering and the process is interrupted, all words in the dictionary must be clustered again. Directly applying the OPTICS clustering algorithm to all words in the dictionary therefore suffers from low stability and low efficiency.

In contrast, in the above method provided by the embodiments of the present invention, the words in the dictionary are first divided into word sets using the method shown in Fig. 2. The division does not alter the word spacing between words in the original dictionary, that is, it does not destroy the density relationships among the words in the dictionary. Each word set is then clustered using the method shown in Fig. 7. Besides achieving the same effect as directly applying the OPTICS clustering algorithm to all words in the dictionary, this approach also supports distributed processing: one server can cluster one or several word sets while another server clusters the other word sets. Moreover, when a server fails during clustering and the process is interrupted, clustering only needs to restart from the word set that was being processed at the time of the interruption, rather than covering all words in the dictionary again. The above clustering method provided by the embodiments of the present invention can therefore effectively improve the efficiency and stability of clustering.
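The restart behaviour described above can be illustrated with a small sketch. It assumes a hypothetical per-word-set clustering function `cluster_one` and a plain dictionary `checkpoint` that records finished word sets; the text does not specify how progress is actually stored, and the sketch runs sequentially rather than on several servers.

```python
from typing import Callable, Dict, Hashable, List

Word = Hashable
Bunches = List[List[Word]]


def cluster_word_sets(
    word_sets: Dict[str, List[Word]],
    cluster_one: Callable[[List[Word]], Bunches],  # hypothetical per-word-set clustering
    checkpoint: Dict[str, Bunches],                # hypothetical record of finished word sets
) -> Dict[str, Bunches]:
    """Cluster each word set independently, skipping sets that are already finished."""
    for set_id, words in word_sets.items():
        if set_id in checkpoint:
            continue  # finished before an interruption, no need to redo it
        checkpoint[set_id] = cluster_one(words)
    return checkpoint
```

Since each word set is handled independently, the same loop body could be dispatched to different servers; after a failure, calling the function again with the same `checkpoint` resumes from the interrupted word set.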

The above is the microblog pushing method based on density-based term clustering provided by the embodiments of the present invention. Based on the same inventive concept, the embodiments of the present invention also provide a microblog pushing device based on density-based term clustering, as shown in Fig. 8.

Fig. 8 is a schematic structural diagram of the microblog pushing device based on density-based term clustering provided by an embodiment of the present invention, which specifically includes:

A word spacing determining module 801, configured to determine the word spacing between words according to the co-occurrence word set of each word;

A core word determining module 802, configured to determine core words according to the word spacing between words;

A word set dividing module 803, configured to, for each determined core word, divide the N words with the smallest word spacing to the core word, together with the core word, into one word set, where N is a first preset quantity;

A cluster module 804, configured to, for each word set obtained by division, cluster the words in the word set using the OPTICS clustering algorithm to obtain several word bunches to be combined;

A merging module 805, configured to merge the obtained word bunches to be combined according to the words they contain, to obtain combinatorial word bunches;

A pushing module 806, configured to push the microblog to be pushed according to the combinatorial word bunch in which the user's interest word is located and the content of the microblog to be pushed.

The word spacing determining module 801 is specifically configured to: for a first word and a second word, determine the word spacing D(i, j) between the first word and the second word using a formula defined over |T_i ∩ T_j|, |T_i| and |T_j|, wherein the first word and the second word are any two words, i denotes the first word, j denotes the second word, D(i, j) is the word spacing between the first word and the second word, T_i is the co-occurrence word set of the first word, T_j is the co-occurrence word set of the second word, |T_i ∩ T_j| is the number of words in the intersection of the co-occurrence word set of the first word and the co-occurrence word set of the second word, |T_i| is the number of words in the co-occurrence word set of the first word, and |T_j| is the number of words in the co-occurrence word set of the second word.

The core word determining module 802 is specifically configured to: for each undetermined word, judge whether the number of other words whose word spacing to the undetermined word is not greater than the preset neighborhood distance is greater than a second preset quantity; if so, determine that the undetermined word is a core word; otherwise, determine that the undetermined word is not a core word.
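The core word test described above amounts to a neighbor count. A minimal sketch follows, in which the function name `find_core_words` and the caller-supplied `spacing(w1, w2)` function (standing in for the word spacing D(i, j)) are assumptions of the example:

```python
from typing import Callable, Hashable, Iterable, Set

Word = Hashable


def find_core_words(
    words: Iterable[Word],
    spacing: Callable[[Word, Word], float],  # hypothetical stand-in for the word spacing D(i, j)
    eps: float,                              # preset neighborhood distance
    min_count: int,                          # second preset quantity
) -> Set[Word]:
    """Return the words that have more than min_count other words within eps."""
    all_words = list(words)
    core: Set[Word] = set()
    for w in all_words:
        # Count the other words whose word spacing to w is not greater than eps.
        neighbors = sum(1 for o in all_words if o != w and spacing(w, o) <= eps)
        if neighbors > min_count:
            core.add(w)
    return core
```

Note that the comparison with the second preset quantity is strict ("greater than"), while the comparison with the neighborhood distance is not ("not greater than"), matching the wording of the text.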

The word set dividing module 803 is specifically configured to: add each word to an original queue in random order; for each word in the original queue, extract the word and judge whether the extracted word is a core word; if so, divide the N words in the original queue with the smallest word spacing to the extracted word, together with the extracted word, into one word set, and delete from the original queue the extracted word and the core words whose word spacing to the extracted word is less than twice the preset neighborhood distance; otherwise, put the extracted word back into the original queue.
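The division into word sets might look like the following sketch. The name `divide_word_sets`, the `spacing` callable and the optional `rng` parameter are hypothetical; visiting each word once over a snapshot of the shuffled queue is one reading of "for each word in the original queue" and is an assumption of this example.

```python
import random
from typing import Callable, Hashable, Iterable, List, Optional, Set

Word = Hashable


def divide_word_sets(
    words: Iterable[Word],
    core_words: Set[Word],
    spacing: Callable[[Word, Word], float],  # hypothetical stand-in for the word spacing
    eps: float,                              # preset neighborhood distance
    n: int,                                  # first preset quantity N
    rng: Optional[random.Random] = None,
) -> List[List[Word]]:
    """Build one word set per surviving core word: the core word plus its N nearest words."""
    rng = rng or random.Random()
    queue = list(words)
    rng.shuffle(queue)  # original queue in random order
    word_sets: List[List[Word]] = []

    for w in list(queue):  # iterate over a snapshot; `queue` itself shrinks below
        if w not in queue or w not in core_words:
            continue  # already deleted, or non-core (stays in the queue)
        others = [o for o in queue if o != w]
        nearest = sorted(others, key=lambda o: spacing(w, o))[:n]
        word_sets.append([w] + nearest)
        # Delete the extracted word and every core word closer than 2 * eps to it.
        queue = [
            o for o in queue
            if o != w and not (o in core_words and spacing(w, o) < 2 * eps)
        ]
    return word_sets
```

Non-core words are never deleted, so the same non-core word may appear in several word sets; that overlap is what the later merging step relies on.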

The cluster module 804 is specifically configured to: for each word set obtained by division, add all words in the word set to an ordered seed queue; extract the first word according to the order of the words in the ordered seed queue; judge whether the extracted word is a core word; if so, take the other words in the ordered seed queue as candidate words of the extracted word and, for each candidate word, determine the reach distance from the candidate word to the extracted word; if the current reach distance intermediate value of the candidate word is not greater than the reach distance from the candidate word to the extracted word, keep the reach distance intermediate value of the candidate word unchanged, and if the current reach distance intermediate value of the candidate word is greater than the reach distance from the candidate word to the extracted word, take the reach distance from the candidate word to the extracted word as the new reach distance intermediate value of the candidate word; insert the extracted word at the end of a result queue, delete the extracted word from the ordered seed queue, and sort the candidate words in the ordered seed queue in ascending order of their current reach distance intermediate values; continue to extract and process the first word in the ordered seed queue until no word remains in the ordered seed queue; otherwise, insert the extracted word at the end of the result queue, delete the extracted word from the ordered seed queue, and continue to extract and process the first word in the ordered seed queue until no word remains in the ordered seed queue; when no word remains in the ordered seed queue, search, following the order of the words in the result queue, for a word other than the first word whose current reach distance intermediate value is greater than the preset neighborhood distance; when such a word is found, cluster all words ranked before the found word in the result queue into one word bunch to be combined, delete all words in this word bunch to be combined from the result queue, and continue, following the order of the words in the result queue, to search for a word other than the first word whose current reach distance intermediate value is greater than the preset neighborhood distance and to cluster accordingly, until no word remains in the result queue.

The merging module 805 is specifically configured to: add each obtained word bunch to be combined to a bunch queue; for each word bunch to be combined in the bunch queue, extract the word bunch to be combined, determine the other word bunches to be combined in the bunch queue that share at least one word with the extracted word bunch to be combined, and merge the determined other word bunches to be combined into the extracted word bunch to be combined to obtain an intermediate word bunch to be combined; delete the extracted word bunch to be combined and the determined other word bunches to be combined from the bunch queue, and continue to determine, in the current bunch queue, other word bunches to be combined that share at least one word with the intermediate word bunch to be combined and merge them, until the number of words contained in the intermediate word bunch to be combined no longer changes; when the number of words contained in the intermediate word bunch to be combined no longer changes, take the intermediate word bunch to be combined as an obtained combinatorial word bunch.
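The merging procedure described above can be sketched as a repeated absorb-until-stable loop. The function name `merge_bunches` is hypothetical. The sketch stops when no remaining bunch overlaps the intermediate bunch, at which point its word count can no longer change.

```python
from typing import Hashable, Iterable, List, Set

Word = Hashable


def merge_bunches(bunches: Iterable[Iterable[Word]]) -> List[Set[Word]]:
    """Merge word bunches that share at least one word, directly or transitively."""
    queue = [set(b) for b in bunches]  # bunch queue
    merged: List[Set[Word]] = []
    while queue:
        current = queue.pop(0)  # extracted bunch, grows into the intermediate bunch
        while True:
            # Decide overlap against the current bunch before it grows in this round.
            overlapping = [b for b in queue if current & b]
            if not overlapping:
                break  # nothing left to absorb, the word count is stable
            queue = [b for b in queue if not (current & b)]
            current.update(*overlapping)
        merged.append(current)  # a combinatorial word bunch
    return merged
```

For example, the bunches {a, b}, {c, d}, {b, c} and {x, y} yield two combinatorial word bunches: {a, b, c, d}, because {b, c} links the first two, and {x, y}.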

Specifically, the above microblog pushing device may be located in a server.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.

Claims

1. A microblog pushing method, characterized by comprising:

2. the method for claim 1, it is characterised in that the server co-occurrence word set according to each vocabulary, it is determined that the word spacing between each vocabulary, specifically includes:

Described server, for the first vocabulary and the second vocabulary, adopts formulaDetermining the word spacing between described first vocabulary and the second vocabulary, wherein, the first vocabulary and the second vocabulary are any two vocabulary, i represents described first vocabulary, and j represents described second vocabulary, D (i, j) for the word spacing between described first vocabulary and the second vocabulary, T_iIt is the co-occurrence word set of the first vocabulary, T_jIt is the co-occurrence word set of the second vocabulary, | T_i∩T_j| for described first vocabulary co-occurrence word set and described second vocabulary co-occurrence word set common factor in the quantity of vocabulary that comprises, | T_i| it is that the co-occurrence word of the first vocabulary concentrates the quantity of vocabulary comprised, | T_j| it is that the co-occurrence word of the second vocabulary concentrates the quantity of vocabulary comprised.

3. the method for claim 1, it is characterised in that determine core word according to the word spacing between each vocabulary, specifically include:

Described server is for each vocabulary undetermined, judge that whether the word spacing with this vocabulary undetermined is not more than the quantity of other vocabulary of default neighborhood distance more than the second default quantity, if, it is determined that this vocabulary undetermined is core word, otherwise, it determines this vocabulary undetermined is not core word.

4. The method of claim 3, characterized in that, for each determined core word, dividing the N words with the smallest word spacing to the core word, together with the core word, into one word set specifically comprises:

Adding each word to an original queue in random order;

For each word in the original queue, extracting the word and judging whether the extracted word is a core word;

If so, dividing the N words in the original queue with the smallest word spacing to the extracted word, together with the extracted word, into one word set, and deleting from the original queue the extracted word and the core words whose word spacing to the extracted word is less than twice the preset neighborhood distance;

Otherwise, putting the extracted word back into the original queue.

5. the method for claim 1, it is characterised in that for each word set marked off, adopts OPTICS clustering algorithm that the vocabulary in this word set is clustered, obtains several words bunch to be combined, specifically include:

For each word set marked off, all vocabulary in this word set are added in orderly seed queue；

According to the sequencing of each vocabulary in orderly seed queue, extract first vocabulary；

Judge whether this vocabulary extracted is core word；

If so, then using other vocabulary in orderly seed queue as the candidate word of this vocabulary extracted, and for each candidate word, it is determined that this candidate word is to the reach distance of this vocabulary of extraction；If currently the reach distance intermediate value of this candidate word is not more than this candidate word reach distance to this vocabulary extracted, the reach distance intermediate value then keeping this candidate word is constant, if currently the reach distance intermediate value of this candidate word is more than this candidate word to the reach distance of this vocabulary extracted, then using the reach distance of this candidate word to this vocabulary of extraction again as the reach distance intermediate value of this candidate word；This vocabulary extracted is inserted into the end of result queue, from orderly seed queue, deletes this vocabulary of extraction, and according to the current reach distance intermediate value of each candidate word order from small to large, each candidate word in orderly seed queue is ranked up；First vocabulary continuing to extract in orderly seed queue processes, till being absent from any vocabulary in seed queue in order；

Otherwise, this vocabulary extracted being inserted into the end of result queue, delete this vocabulary of extraction from orderly seed queue, first vocabulary continuing to extract in orderly seed queue processes, till being absent from any vocabulary in seed queue in order；

When orderly seed queue is absent from any vocabulary, according to the sequencing of vocabulary each in result queue, search successively except first vocabulary and current reach distance intermediate value is more than the vocabulary of default neighborhood distance；

When finding, it is a word bunch to be combined by all words clusterings before coming this vocabulary found in result queue, the all vocabulary in this word bunch to be combined are deleted from result queue, and continue the sequencing according to vocabulary each in result queue, search successively except first vocabulary and current reach distance intermediate value clusters more than the vocabulary of default neighborhood distance, till being absent from any vocabulary in result queue.

6. the method for claim 1, it is characterised in that according to the vocabulary in the word bunch each to be combined obtained, the word bunch each to be combined obtained is merged process, obtains combinatorial word bunch, specifically include:

The word bunch each to be combined obtained is added in bunch queue；

For the word bunch each to be combined in bunch queue, extract this word bunch to be combined, determine other words bunch to be combined comprising at least one identical vocabulary with this word bunch to be combined extracted in bunch queue, other words bunch to be combined determined are incorporated in this word bunch to be combined of extraction, obtain middle word bunch to be combined；

This word bunch to be combined deleting extraction from bunch queue and other words bunch to be combined determined, continue to determine other words bunch to be combined comprising at least one identical vocabulary with this centre word bunch to be combined in current cluster queue, and merge, till the quantity of the vocabulary comprised in the word bunch to be combined of this centre no longer changes；

When the quantity of the vocabulary comprised in the word bunch to be combined of this centre no longer changes, using word bunch to be combined for this centre as the combinatorial word bunch obtained.

7. A microblog pushing device, characterized by comprising:

8. The device of claim 7, characterized in that the word spacing determining module is specifically configured to: for a first word and a second word, determine the word spacing D(i, j) between the first word and the second word using a formula defined over |T_i ∩ T_j|, |T_i| and |T_j|, wherein the first word and the second word are any two words, i denotes the first word, j denotes the second word, D(i, j) is the word spacing between the first word and the second word, T_i is the co-occurrence word set of the first word, T_j is the co-occurrence word set of the second word, |T_i ∩ T_j| is the number of words in the intersection of the co-occurrence word set of the first word and the co-occurrence word set of the second word, |T_i| is the number of words in the co-occurrence word set of the first word, and |T_j| is the number of words in the co-occurrence word set of the second word.

9. The device of claim 7, characterized in that the core word determining module is specifically configured to: for each undetermined word, judge whether the number of other words whose word spacing to the undetermined word is not greater than the preset neighborhood distance is greater than a second preset quantity; if so, determine that the undetermined word is a core word; otherwise, determine that the undetermined word is not a core word.

10. The device of claim 9, characterized in that the word set dividing module is specifically configured to: add each word to an original queue in random order; for each word in the original queue, extract the word and judge whether the extracted word is a core word; if so, divide the N words in the original queue with the smallest word spacing to the extracted word, together with the extracted word, into one word set, and delete from the original queue the extracted word and the core words whose word spacing to the extracted word is less than twice the preset neighborhood distance; otherwise, put the extracted word back into the original queue.

11. The device of claim 7, characterized in that the cluster module is specifically configured to: for each word set obtained by division, add all words in the word set to an ordered seed queue; extract the first word according to the order of the words in the ordered seed queue; judge whether the extracted word is a core word; if so, take the other words in the ordered seed queue as candidate words of the extracted word and, for each candidate word, determine the reach distance from the candidate word to the extracted word; if the current reach distance intermediate value of the candidate word is not greater than the reach distance from the candidate word to the extracted word, keep the reach distance intermediate value of the candidate word unchanged, and if the current reach distance intermediate value of the candidate word is greater than the reach distance from the candidate word to the extracted word, take the reach distance from the candidate word to the extracted word as the new reach distance intermediate value of the candidate word; insert the extracted word at the end of a result queue, delete the extracted word from the ordered seed queue, and sort the candidate words in the ordered seed queue in ascending order of their current reach distance intermediate values; continue to extract and process the first word in the ordered seed queue until no word remains in the ordered seed queue; otherwise, insert the extracted word at the end of the result queue, delete the extracted word from the ordered seed queue, and continue to extract and process the first word in the ordered seed queue until no word remains in the ordered seed queue; when no word remains in the ordered seed queue, search, following the order of the words in the result queue, for a word other than the first word whose current reach distance intermediate value is greater than the preset neighborhood distance; when such a word is found, cluster all words ranked before the found word in the result queue into one word bunch to be combined, delete all words in this word bunch to be combined from the result queue, and continue, following the order of the words in the result queue, to search for a word other than the first word whose current reach distance intermediate value is greater than the preset neighborhood distance and to cluster accordingly, until no word remains in the result queue.

12. The device of claim 7, characterized in that the merging module is specifically configured to: add each obtained word bunch to be combined to a bunch queue; for each word bunch to be combined in the bunch queue, extract the word bunch to be combined, determine the other word bunches to be combined in the bunch queue that share at least one word with the extracted word bunch to be combined, and merge the determined other word bunches to be combined into the extracted word bunch to be combined to obtain an intermediate word bunch to be combined; delete the extracted word bunch to be combined and the determined other word bunches to be combined from the bunch queue, and continue to determine, in the current bunch queue, other word bunches to be combined that share at least one word with the intermediate word bunch to be combined and merge them, until the number of words contained in the intermediate word bunch to be combined no longer changes; when the number of words contained in the intermediate word bunch to be combined no longer changes, take the intermediate word bunch to be combined as an obtained combinatorial word bunch.