CN101159704A - Anti-spam method based on micro-content similarity

Info

Publication number: CN101159704A
Application number: CNA2007101561840A
Authority: CN
Inventors: 胡天磊; 陈珂; 陈刚; 寿黎但; 汪源
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2007-10-23
Filing date: 2007-10-23
Publication date: 2008-04-09

Abstract

The invention discloses an anti-spam method based on micro-content similarity. Comments that have been manually identified as spam are clustered to produce a clustered spam file, and a spam discriminator then uses that file to classify unknown comments. The comment being processed is first scored for similarity against one randomly chosen sample from each spam class, and is then scored only against the members of the class whose sample scored highest. This avoids comparing the comment with every clustered spam comment, which effectively reduces the number of similarity comparisons, improves the efficiency of both spam discrimination and maintenance of the clustered spam file, and meets the performance requirements of large-scale spam discrimination on the Internet.

Description

Anti-spam method based on micro-content similarity

Technical field

The present invention relates to anti-spam methods for Internet micro-content, and in particular to an anti-spam method based on micro-content similarity.

Background technology

Blogs are the fourth form of Internet communication to emerge, after email, BBS, and ICQ. They have been described as the personal "Reader's Digest" of the Internet age: online diaries built on hyperlinks that represent a new way of living, a new way of working, and above all a new way of learning. However, as anti-spam techniques for email mature, blog comments are increasingly favored by merchants and ordinary Internet users as a channel for inserting advertisements and promotion. As a result, spam comments on blogs keep multiplying. They waste network bandwidth, the time of blog owners and readers, and system resources; they wear users out; and they have become a major obstacle to the wider adoption of blogs.

The techniques commonly used against spam comments at present are:

1) Keyword filtering: sensitive words are filtered or blocked. This kind of filtering defends poorly against variants of sensitive words, such as words whose characters have been split apart, and as the dictionary keeps growing, both maintenance and runtime efficiency suffer. It is nevertheless the most efficient method and takes effect immediately.

2) Verification codes: robot submissions are prevented by validating a verification code. However, robots can still get through by OCR or by exhaustive search, and even specially modified versions can be cracked with enough effort. Verification codes also put an obstacle in the way of normal users.

3) Referer checking: the Referer header field of the HTTP protocol is used to filter out requests that did not come through the comment page, that is, comments posted directly to the site. This is also one of the anti-hotlinking techniques, and it is very efficient. But if the submitter uses a specially modified tool to forge the HTTP request, this method is helpless.

4) Limiting the interval between consecutive submissions: this prevents malicious robots from mounting a saturation attack on the database and reduces server load, but it treats the symptom rather than the cause and is a purely passive defense.

5) Content-based scoring: a threshold on the score decides automatically whether a comment is spam. This is the most principled method, but it requires the server to do a large amount of processing and so increases server load; if the scoring runs on a remote server, service quality may not be guaranteed because of network conditions.

Therefore, none of the above methods fully meets the need to discriminate spam comments online in real time.

Summary of the invention

The object of the invention is to provide an anti-spam method based on micro-content similarity.

The technical solution adopted by the invention to solve its technical problem comprises the following steps:

1) Cluster the comments that have been manually identified as spam to produce a clustered spam file containing multiple spam classes;

2) Use a spam discriminator to classify unknown comments according to the clustered spam file.

The spam comment clustering process is as follows:

1) The clustered spam file is initially empty;

2) Each time a comment is newly identified as spam by hand, it is selectively added to the clustered spam file according to the following conditions and steps:

Step 1: from each spam class, arbitrarily choose one spam comment as the typical sample of that class;

Step 2: score the similarity between the new spam comment and the typical sample of every spam class;

Step 3: for the spam class whose typical sample scored highest in Step 2, score the similarity between the new spam comment and every spam comment in that class, and compute the highest similarity score;

Step 4: if the highest similarity score is below a specified threshold, make the new spam comment a new spam class; otherwise, if the highest similarity score is below a second specified threshold, add the new spam comment to the existing class as a new spam comment sample; otherwise, ignore the new spam comment.

The steps of the spam discrimination algorithm are as follows:

1) From each spam class, arbitrarily choose one spam comment as the typical sample of that class;

2) Score the similarity between the unknown comment and the typical sample of every spam class;

3) For the spam class whose typical sample scored highest against the unknown comment in step 2), score the similarity between the unknown comment and every spam comment in that class;

4) If the maximum similarity score between the unknown comment and the spam comments of that class exceeds a specified threshold, judge the unknown comment to be a spam comment.

The beneficial effects of the invention are:

The invention avoids comparing the comment being processed with every clustered spam comment, effectively reduces the number of comment similarity comparisons, improves the efficiency of spam discrimination and of maintaining the clustered spam file, and can meet the performance requirements of large-scale spam discrimination on the Internet.
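The saving in comparisons can be illustrated with a small count. The sketch below is only an illustration under assumed numbers: the function names and the class sizes are hypothetical and do not come from the patent. It compares a full scan of every stored spam comment with the two-stage lookup, which costs one comparison per class plus one per member of the selected class.

```python
def full_scan_comparisons(class_sizes):
    """Comparisons needed when the comment is scored against every stored spam comment."""
    return sum(class_sizes)


def two_stage_comparisons(class_sizes, best_class):
    """Comparisons needed by the two-stage lookup.

    Stage 1 scores one typical sample per class (len(class_sizes) comparisons).
    Stage 2 scores every member of the best-matching class.
    """
    return len(class_sizes) + class_sizes[best_class]


# Hypothetical file: 50 spam classes with 200 comments each.
sizes = [200] * 50
print(full_scan_comparisons(sizes))     # 10000
print(two_stage_comparisons(sizes, 0))  # 250
```

With these assumed sizes the two-stage lookup needs 250 comparisons instead of 10000; the actual ratio depends on how many classes exist and how evenly the spam comments are spread across them.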

Description of drawings

Fig. 1 is a flow chart of the anti-spam method based on spam similarity of the invention.

Fig. 2 is a flow chart of the algorithm of the invention for inserting a spam comment into the clustered spam file.

Fig. 3 is a flow chart of the algorithm by which the spam discriminator of the invention discriminates an unknown comment.

Embodiment

The invention defines the concepts used in comment similarity as follows:

Word: an indivisible semantic unit;

High-frequency word: a word such as " " or " " that carries no semantic content and needs to be filtered out;

Comment: a finite set of words, namely the result of segmenting the original comment into words and filtering out the high-frequency words;

Word count of a comment: the cardinality of the comment's word set, that is, the number of elements the set contains;

"Intersection" of comments: the intersection of their word sets;

"Union" of comments: the union of their word sets;

The similarity sim(a, b) of comment a and comment b is defined as the word count of the intersection of a and b divided by the word count of the union of a and b, that is:

$$\mathrm{sim}(a, b) = \frac{|a \cap b|}{|a \cup b|} = \frac{|a \cap b|}{|a| + |b| - |a \cap b|}$$
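The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: `to_word_set` and its whitespace splitting stand in for a real word segmenter, the stop-word set is supplied by the caller, and returning 0.0 for two empty comments is an assumption, since the text does not cover that case.

```python
def to_word_set(text, stop_words=frozenset()):
    """Turn a raw comment into its word set.

    Whitespace splitting is a placeholder for a real word segmenter
    (an assumption); words in stop_words play the role of the
    high-frequency words that are filtered out.
    """
    return {w for w in text.lower().split() if w not in stop_words}


def sim(a, b):
    """Similarity of two word sets: |a ∩ b| / (|a| + |b| - |a ∩ b|)."""
    common = len(a & b)
    total = len(a) + len(b) - common  # equals |a ∪ b|
    # Assumption: two empty comments are given similarity 0.0.
    return common / total if total else 0.0


a = to_word_set("buy cheap watches now", stop_words={"now"})
b = to_word_set("cheap watches online")
print(sim(a, b))  # 0.5: two shared words out of four distinct words
```

Because comments are treated as sets, repeated words and word order do not affect the score.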

Using the above concept of comment similarity, and taking a blog comment anti-spam system as an example, the concrete implementation steps are as follows:

As shown in Fig. 1, comments that have been manually identified as spam are clustered to produce a clustered spam file, which is fed to the spam discriminator for discriminating unknown comments. The spam discriminator decides whether an unknown comment is spam by computing its similarity to the typical samples in the clustered spam file, and outputs the result. New spam comments are likewise scored by computing their similarity to the typical samples in the clustered spam file; new spam classes are added to the file, existing classes are updated, and spam comments that add nothing are discarded.

The process of adding a spam comment is shown in Fig. 2. Suppose a spam comment $x$ is to be inserted and the clustered spam file already contains $s$ classes of spam, denoted $G_1, G_2, G_3, \dots, G_s$. From each class, take any one spam comment as the class feature of that class, denoted $g_1, g_2, g_3, \dots, g_s$ respectively. Compute the comment similarity between $x$ and each class feature $g_1, g_2, g_3, \dots, g_s$, and suppose $g_m$ attains the maximum with $x$:

$$\mathrm{sim}(g_m, x) = \max_{i = 1, \dots, s} \mathrm{sim}(g_i, x)$$

Suppose the clustered spam file contains $t$ spam comments belonging to class $m$, denoted $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$. Compute the comment similarity between $x$ and each of $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$, and suppose $g_{mn}$ attains the maximum with $x$:

$$\mathrm{sim}(g_{mn}, x) = \max_{i = 1, \dots, t} \mathrm{sim}(g_{mi}, x)$$

Two cases follow:

a) If $\mathrm{sim}(g_m, x)$ is less than a specific threshold $T_1$, this spam comment becomes a new class of spam on its own and is inserted into the database;

b) If $\mathrm{sim}(g_{mn}, x)$ is greater than or equal to $T_1$, this spam comment belongs to the existing class $m$, with two sub-cases:

b1) If $\mathrm{sim}(g_{mn}, x)$ is less than a specific threshold $T_2$ (that is, $T_1 \le \mathrm{sim}(g_{mn}, x) < T_2$), adding this spam comment to the clustered spam file can significantly improve the system's ability to discriminate spam, so $x$ is inserted into the clustered spam file as class-$m$ spam;

b2) If $\mathrm{sim}(g_{mn}, x)$ is greater than or equal to $T_2$ (that is, $\mathrm{sim}(g_{mn}, x) \ge T_2 > T_1$), a spam comment highly similar to $x$ already exists, so adding $x$ to the clustered spam file would not significantly improve the system's ability to discriminate spam; this spam comment is ignored and nothing further is done.

The process of inserting spam comments into the clustered spam file is at the same time the process of clustering them.
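The insertion procedure of Fig. 2 can be sketched as follows. This is a hedged illustration rather than the patented implementation: the clustered spam file is modelled as an in-memory list of classes, each a list of word sets, and the names `insert_spam`, `clusters`, `t1`, `t2` and the returned labels are hypothetical. Treating the first comment inserted into an empty file as a new class is also an assumption, as are the threshold values in the usage lines.

```python
import random


def sim(a, b):
    """Word-set similarity: |a ∩ b| / (|a| + |b| - |a ∩ b|)."""
    common = len(a & b)
    total = len(a) + len(b) - common
    return common / total if total else 0.0


def insert_spam(clusters, x, t1, t2, rng=random):
    """Insert spam comment x (a word set) into clusters (a list of lists of word sets).

    Returns "new_class", "added" or "ignored". Requires t1 < t2.
    """
    if not clusters:
        # Assumption: the first comment seeds the first class.
        clusters.append([x])
        return "new_class"

    # Stage 1: score x against one arbitrary member of each class.
    reps = [rng.choice(members) for members in clusters]
    rep_scores = [sim(g, x) for g in reps]
    m = max(range(len(clusters)), key=rep_scores.__getitem__)

    # Case a): no typical sample is close enough, so x starts a new class.
    if rep_scores[m] < t1:
        clusters.append([x])
        return "new_class"

    # Stage 2: score x against every member of the best class only.
    best = max(sim(g, x) for g in clusters[m])
    if best < t2:
        clusters[m].append(x)  # case b1): x adds new information to class m
        return "added"
    return "ignored"           # case b2): a near-duplicate is already stored


clusters = []
print(insert_spam(clusters, {"buy", "cheap", "watches"}, 0.3, 0.8))            # new_class
print(insert_spam(clusters, {"buy", "cheap", "watches", "online"}, 0.3, 0.8))  # added
print(insert_spam(clusters, {"buy", "cheap", "watches"}, 0.3, 0.8))            # ignored
print(insert_spam(clusters, {"free", "casino", "bonus"}, 0.3, 0.8))            # new_class
```

Because the typical sample is drawn arbitrarily, two runs over the same input may in general assign a borderline comment differently; the example inputs above are chosen so that every draw leads to the same outcome.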

As the number of spam comments in the clustered spam store keeps growing and reaches a certain scale, the clustered spam in the store can be exported to generate the clustered spam file. Once the spam discriminator has imported this clustered spam file, it can discriminate unknown comments:

Fig. 3 shows the process by which the spam discriminator discriminates an unknown comment. Let the unknown comment be $x$, and suppose the clustered spam file already contains $s$ classes of spam, denoted $G_1, G_2, G_3, \dots, G_s$. From each class, take any one spam comment as the class feature of that class, denoted $g_1, g_2, g_3, \dots, g_s$ respectively. Compute the comment similarity between $x$ and each of $g_1, g_2, g_3, \dots, g_s$, and suppose $g_m$ attains the maximum with $x$:

$$\mathrm{sim}(g_m, x) = \max_{i = 1, \dots, s} \mathrm{sim}(g_i, x)$$

Suppose the clustered spam file contains $t$ spam comments belonging to class $m$, denoted $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$. Compute the comment similarity between $x$ and each of $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$, and suppose $g_{mn}$ attains the maximum with $x$:

$$\mathrm{sim}(g_{mn}, x) = \max_{i = 1, \dots, t} \mathrm{sim}(g_{mi}, x)$$

If $\mathrm{sim}(g_{mn}, x) > T_3$ for a specific threshold $T_3$, the unknown comment $x$ is judged to be a spam comment;

If $\mathrm{sim}(g_{mn}, x) \le T_3$, the unknown comment $x$ is judged to be a non-spam comment;

If the unknown comment is a spam comment, it is deleted from the comment database; if it is not a spam comment, no action is taken.
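The discrimination procedure of Fig. 3 can be sketched in the same style. As before, this is an illustration under stated assumptions: the clustered spam file is modelled as a list of classes, each a list of word sets, the names `is_spam`, `clusters` and `t3` are hypothetical, and treating every comment as non-spam when the file is empty is an assumption.

```python
import random


def sim(a, b):
    """Word-set similarity: |a ∩ b| / (|a| + |b| - |a ∩ b|)."""
    common = len(a & b)
    total = len(a) + len(b) - common
    return common / total if total else 0.0


def is_spam(clusters, x, t3, rng=random):
    """Return True if unknown comment x (a word set) is judged to be spam."""
    if not clusters:
        return False  # assumption: with no stored spam, nothing is flagged

    # Stage 1: pick the class whose arbitrary typical sample is most similar to x.
    reps = [rng.choice(members) for members in clusters]
    m = max(range(len(clusters)), key=lambda i: sim(reps[i], x))

    # Stage 2: highest similarity between x and any member of that class.
    best = max(sim(g, x) for g in clusters[m])
    return best > t3


clusters = [
    [{"buy", "cheap", "watches"}],
    [{"free", "casino", "bonus"}],
]
comments = [
    {"buy", "cheap", "watches", "today"},  # similarity 0.75 to the first class
    {"great", "post", "thanks"},           # shares no words with any class
]
# Keep only the comments that are not judged to be spam (threshold 0.6 is an assumed value).
kept = [c for c in comments if not is_spam(clusters, c, 0.6)]
print(kept)  # [{'great', 'post', 'thanks'}] (set element order may vary)
```

The final list comprehension stands in for deleting spam from the comment database; a real system would issue the deletion against its own storage.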

Claims

1. An anti-spam method based on micro-content similarity, characterized in that the steps of the method are as follows:

2. The anti-spam method based on micro-content similarity according to claim 1, characterized in that the spam comment clustering process is:

1) The clustered spam file is initially empty;

3. The anti-spam method based on micro-content similarity according to claim 1, characterized in that the steps of the spam discrimination algorithm are as follows: