CN101159704A - Microcontent similarity based antirubbish method - Google Patents


Info

Publication number
CN101159704A
Authority
CN
China
Prior art keywords
rubbish
comment
similarity
unknown
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007101561840A
Other languages
Chinese (zh)
Inventor
胡天磊
陈珂
陈刚
寿黎但
汪源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CNA2007101561840A
Publication of CN101159704A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an anti-spam method based on micro-content similarity. The method clusters comments that have been manually identified as spam to generate a clustered spam file, and then uses a spam discriminator to classify unknown comments according to that file. A comment to be processed is first scored for similarity against one arbitrarily chosen sample from each spam class, and is then scored against the members of only the class whose sample scored highest. This avoids comparing the comment with every clustered spam comment, which effectively reduces the number of similarity comparisons, improves the efficiency of both spam discrimination and maintenance of the clustered spam file, and meets the performance requirements of large-scale spam discrimination on the Internet.

Description

Anti-spam method based on micro-content similarity
Technical field
The present invention relates to anti-spam methods for Internet micro-content, and in particular to an anti-spam method based on micro-content similarity.
Background art
The blog is the fourth form of Internet communication to emerge, after e-mail, BBS, and ICQ. It has been described as a personal "Reader's Digest" of the Internet age and as an online diary built on hyperlinks, and it represents a new way of living, working, and learning. However, now that anti-spam techniques for e-mail are increasingly mature, blog comments have become more and more popular with businesses and ordinary Internet users as a channel for inserting advertisements and promotion. As a result, spam comments on blogs keep increasing. They waste network bandwidth, the time of blog owners and readers, and system resources, they burden users, and they have become a major obstacle to the wider adoption of blogs.
The techniques commonly used against spam comments at present are:
1) Keyword filtering: sensitive words are filtered or blocked. This is the most efficient method and takes effect immediately. However, it defends poorly against variants of sensitive words, such as words whose characters have been split apart, and as the dictionary keeps growing, both maintenance and run-time efficiency suffer.
2) Verification codes: robot submissions are prevented by checking the validity of a verification code. However, robots can still get through by OCR or by exhaustive search, and even specially modified versions can be cracked with some effort. Verification codes also create an obstacle for normal users.
3) Referer checking: comments that are submitted directly, without visiting the page and following its links, are filtered out using the Referer field of the HTTP protocol. This is also one of the methods used against hotlinking, and it is very efficient. However, if a specially modified tool forges the HTTP request when submitting, this method is helpless.
4) Limiting the interval between consecutive submissions: this prevents malicious robots from flooding the database and reduces the load on the server, but it treats only the symptoms and is a purely passive defense.
5) Content-based scoring: a threshold on the score is used to decide automatically whether a comment is spam. This is the most principled method, but it requires the server to do a large amount of processing, which increases the load on the server. If a remote server is used, quality of service may not be guaranteed because of the network.
Therefore, none of the above methods fully meets the need to identify spam comments online in real time.
Summary of the invention
The object of the invention is to provide an anti-spam method based on micro-content similarity.
The technical solution adopted by the invention comprises the following steps:
1) comments manually identified as spam are clustered to produce a clustered spam file containing multiple spam classes;
2) a spam discriminator uses the clustered spam file to classify unknown comments.
The spam comment clustering process is as follows:
1) the clustered spam file is initially empty;
2) each time a new comment is manually identified as spam, it is selectively added to the clustered spam file according to the following steps and conditions:
Step 1: for each spam class, one spam comment is chosen arbitrarily as the typical sample of that class;
Step 2: the new spam comment is scored for similarity against the typical samples of all spam classes;
Step 3: the new spam comment is then scored for similarity against all spam comments in the class whose typical sample had the highest similarity score in step 2, and its highest similarity score is computed;
Step 4: if the highest similarity score is below a specified threshold, the new spam comment becomes a new spam class; otherwise, if the highest similarity score is below a second specified threshold, the new spam comment is added to the existing class as a new spam comment sample; otherwise the new spam comment is ignored.
The spam discrimination algorithm comprises the following steps:
1) for each spam class, one spam comment is chosen arbitrarily as the typical sample of that class;
2) the unknown comment is scored for similarity against the typical samples of all spam classes;
3) the unknown comment is then scored for similarity against all spam comments in the class whose typical sample had the highest similarity score with the unknown comment in step 2);
4) if the maximum of the similarity scores between the unknown comment and all spam comments of that class exceeds a specified threshold, the unknown comment is judged to be a spam comment.
The beneficial effects of the invention are:
The method avoids comparing the comment being processed with every clustered spam comment, which effectively reduces the number of comment similarity comparisons, improves the efficiency of spam discrimination and of maintaining the clustered spam file, and meets the performance requirements of large-scale spam discrimination on the Internet.
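To make the saving concrete, the following Python sketch counts similarity computations under the two-stage scheme and under an exhaustive comparison. The function names and the class sizes are hypothetical and chosen only for illustration; the count of one comparison per class plus one per member of the best-matching class follows the steps described in the summary.

```python
def comparisons_two_stage(class_sizes, best_class):
    """Similarity computations in the two-stage scheme.

    One comparison per class (against its typical sample), plus one
    comparison per comment in the best-matching class.
    """
    return len(class_sizes) + class_sizes[best_class]


def comparisons_exhaustive(class_sizes):
    """Similarity computations when comparing against every stored comment."""
    return sum(class_sizes)


# Hypothetical store: 100 spam classes with 50 comments each.
sizes = [50] * 100
print(comparisons_two_stage(sizes, best_class=0))  # 100 + 50 = 150
print(comparisons_exhaustive(sizes))               # 100 * 50 = 5000
```

In this made-up configuration the two-stage scheme needs 150 comparisons instead of 5000. The actual ratio depends on how many classes exist and how large the best-matching class is.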
Description of the drawings
Fig. 1 is a flow chart of the anti-spam method based on spam similarity according to the invention.
Fig. 2 is a flow chart of the algorithm for inserting a spam comment into the clustered spam file.
Fig. 3 is a flow chart of the algorithm by which the spam discriminator classifies an unknown comment.
Detailed description
The invention defines the concepts relating to comment similarity as follows:
Word: an indivisible semantic unit;
High-frequency word: a word that carries no semantic meaning and needs to be filtered out;
Comment: a finite set of words, namely the result of segmenting the original comment into words and filtering out the high-frequency words;
Word count of a comment: the cardinality of the comment's word set, i.e. the number of elements the set contains;
Intersection of comments: the intersection of their word sets;
Union of comments: the union of their word sets;
The similarity sim(a, b) of comment a and comment b is defined as the word count of the intersection of a and b divided by the word count of the union of a and b, that is,
$$\mathrm{sim}(a, b) = \frac{|a \cap b|}{|a \cup b|} = \frac{|a \cap b|}{|a| + |b| - |a \cap b|}$$
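The definitions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the stop-word list is hypothetical, whitespace splitting stands in for a real word segmenter (which the source does not specify), and the names `to_word_set` and `sim` are introduced here for illustration.

```python
# Hypothetical list of high-frequency words; the source only says that
# words without semantic meaning are filtered out.
STOP_WORDS = {"the", "a", "of", "to", "and"}


def to_word_set(text, stop_words=STOP_WORDS):
    """Turn a raw comment into its word set.

    Whitespace splitting is an assumption standing in for a real
    word segmenter; high-frequency words are dropped.
    """
    return {w for w in text.lower().split() if w not in stop_words}


def sim(a, b):
    """Similarity of two word sets: |a & b| / (|a| + |b| - |a & b|)."""
    common = len(a & b)
    union = len(a) + len(b) - common
    if union == 0:
        return 0.0  # both comments empty: treated as dissimilar (assumption)
    return common / union


a = to_word_set("buy cheap watches online")
b = to_word_set("buy cheap pills online")
print(sim(a, b))  # 3 shared words out of 5 distinct words
```

Because a comment is modeled as a set, repeated words and word order do not affect the score.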
Using the comment similarity concepts above and taking a blog comment anti-spam system as an example, the method is implemented as follows:
As shown in Fig. 1, comments manually identified as spam are clustered to produce the clustered spam file, which is fed into the spam discriminator for classifying unknown comments. The spam discriminator decides whether an unknown comment is spam by computing its similarity to the typical samples in the clustered spam file, and outputs the result. New spam comments are likewise scored for similarity against the typical samples in the clustered spam file; on that basis new spam classes are added to the file, existing spam classes are updated, and redundant spam comments are discarded.
The process of adding a spam comment is shown in Fig. 2. Suppose a spam comment x is to be inserted and the clustered spam file already contains s spam classes, denoted $G_1, G_2, G_3, \dots, G_s$. From each class, an arbitrary spam comment is taken as the class feature of that class, denoted $g_1, g_2, g_3, \dots, g_s$ respectively. The comment similarity between x and each of the class features $g_1, g_2, g_3, \dots, g_s$ is computed; suppose $g_m$ attains the maximum with x:
$$\mathrm{sim}(g_m, x) = \max_{i = 1, \dots, s} \mathrm{sim}(g_i, x)$$
Suppose the clustered spam file contains t spam comments belonging to class m, denoted $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$. The comment similarity between x and each of $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$ is computed; suppose $g_{mn}$ attains the maximum with x:
$$\mathrm{sim}(g_{mn}, x) = \max_{i = 1, \dots, t} \mathrm{sim}(g_{mi}, x)$$
The following 2 cases arise:
a) If $\mathrm{sim}(g_m, x)$ is less than a specific threshold T1, the spam comment becomes a new spam class on its own and is inserted into the database;
b) If $\mathrm{sim}(g_{mn}, x)$ is greater than or equal to T1, the spam comment belongs to the existing class m, with 2 sub-cases:
b1) If $\mathrm{sim}(g_{mn}, x)$ is less than a specific threshold T2, i.e. $T1 \le \mathrm{sim}(g_{mn}, x) < T2$, adding this spam comment to the clustered spam file can significantly improve the system's ability to identify spam, so x is inserted into the clustered spam file as class-m spam;
b2) If $\mathrm{sim}(g_{mn}, x)$ is greater than or equal to T2, i.e. $\mathrm{sim}(g_{mn}, x) \ge T2 > T1$, a spam comment highly similar to x already exists, and adding x to the clustered spam file would not significantly improve the system's ability to identify spam, so x is ignored and nothing is done.
The process of inserting spam comments into the clustered spam file is at the same time the process by which they are clustered.
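The insertion procedure can be sketched in Python as below. This is an illustration under assumptions, not the patent's implementation: the clustered spam file is modeled as an in-memory list of classes, each class being a list of word sets, and the names `insert_spam`, `t1`, `t2`, and `rng` are hypothetical. The sketch compares the in-class maximum with T1, following the wording of the summary; the embodiment text states case a) in terms of the class feature's score $\mathrm{sim}(g_m, x)$.

```python
import random


def sim(a, b):
    """Similarity of two word sets: size of intersection over size of union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def insert_spam(clusters, x, t1, t2, rng=random):
    """Selectively add spam comment x (a word set) to clusters.

    clusters: list of spam classes, each a non-empty list of word sets
              (an in-memory stand-in for the clustered spam file).
    Returns "new_class", "added", or "ignored".
    """
    if not clusters:
        clusters.append([x])
        return "new_class"
    # Stage 1: score x against one arbitrarily chosen sample per class.
    reps = [rng.choice(c) for c in clusters]
    m = max(range(len(clusters)), key=lambda i: sim(reps[i], x))
    # Stage 2: score x against every comment of the best class only.
    best = max(sim(g, x) for g in clusters[m])
    if best < t1:
        clusters.append([x])   # too different: x starts a new class
        return "new_class"
    if best < t2:
        clusters[m].append(x)  # similar but not redundant: keep as a sample
        return "added"
    return "ignored"           # near-duplicate already stored


clusters = []
print(insert_spam(clusters, {"buy", "cheap", "watches"}, 0.3, 0.8))
print(insert_spam(clusters, {"buy", "cheap", "watches", "now"}, 0.3, 0.8))
```

The thresholds 0.3 and 0.8 are made-up values; the source does not give concrete values for T1 and T2.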
As the number of spam comments in the store keeps growing and reaches a certain scale, the clustered spam in the store can be exported to generate the clustered spam file. Once the spam discriminator has imported this file, it can classify unknown comments.
Fig. 3 shows how the spam discriminator classifies an unknown comment. Let the unknown comment be x, and suppose the clustered spam file already contains s spam classes, denoted $G_1, G_2, G_3, \dots, G_s$. From each class, an arbitrary spam comment is taken as the class feature of that class, denoted $g_1, g_2, g_3, \dots, g_s$ respectively. The comment similarity between x and each of $g_1, g_2, g_3, \dots, g_s$ is computed; suppose $g_m$ attains the maximum with x:
$$\mathrm{sim}(g_m, x) = \max_{i = 1, \dots, s} \mathrm{sim}(g_i, x)$$
Suppose the clustered spam file contains t spam comments belonging to class m, denoted $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$. The comment similarity between x and each of $g_{m1}, g_{m2}, g_{m3}, \dots, g_{mt}$ is computed; suppose $g_{mn}$ attains the maximum with x:
$$\mathrm{sim}(g_{mn}, x) = \max_{i = 1, \dots, t} \mathrm{sim}(g_{mi}, x)$$
If $\mathrm{sim}(g_{mn}, x) > T3$ for a specific threshold T3, the unknown comment x is judged to be a spam comment;
If $\mathrm{sim}(g_{mn}, x) \le T3$, the unknown comment x is judged to be a non-spam comment;
If the unknown comment is a spam comment, it is deleted from the comment database; if it is not a spam comment, no action is taken.
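The discrimination procedure can be sketched in Python in the same style. Again this is only an illustration under assumptions: the clustered spam file is modeled as a list of classes of word sets, the comment database as a plain list, and the names `is_spam`, `purge_spam`, `t3`, and `rng` are hypothetical.

```python
import random


def sim(a, b):
    """Similarity of two word sets: size of intersection over size of union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_spam(clusters, x, t3, rng=random):
    """Judge unknown comment x (a word set) against the clustered spam.

    clusters: list of spam classes, each a non-empty list of word sets.
    """
    if not clusters:
        return False  # nothing to compare against (assumption)
    # Stage 1: score x against one arbitrarily chosen sample per class.
    reps = [rng.choice(c) for c in clusters]
    m = max(range(len(clusters)), key=lambda i: sim(reps[i], x))
    # Stage 2: score x against every comment of the best class only.
    best = max(sim(g, x) for g in clusters[m])
    return best > t3  # a score equal to T3 counts as non-spam


def purge_spam(comments, clusters, t3):
    """Return the comments with those judged to be spam removed."""
    return [c for c in comments if not is_spam(clusters, c, t3)]


clusters = [[{"buy", "cheap", "watches"}], [{"free", "casino", "bonus"}]]
inbox = [{"buy", "cheap", "watches", "today"}, {"nice", "post", "thanks"}]
print(purge_spam(inbox, clusters, 0.6))
```

With the made-up threshold 0.6, the first comment scores 0.75 against the first class and is removed, while the second shares no words with any stored spam and is kept.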

Claims (3)

1. An anti-spam method based on micro-content similarity, characterized in that the method comprises the following steps:
1) comments manually identified as spam are clustered to produce a clustered spam file containing multiple spam classes;
2) a spam discriminator uses the clustered spam file to classify unknown comments.
2. The anti-spam method based on micro-content similarity according to claim 1, characterized in that the spam comment clustering process is as follows:
1) the clustered spam file is initially empty;
2) each time a new comment is manually identified as spam, it is selectively added to the clustered spam file according to the following steps and conditions:
Step 1: for each spam class, one spam comment is chosen arbitrarily as the typical sample of that class;
Step 2: the new spam comment is scored for similarity against the typical samples of all spam classes;
Step 3: the new spam comment is then scored for similarity against all spam comments in the class whose typical sample had the highest similarity score in step 2, and its highest similarity score is computed;
Step 4: if the highest similarity score is below a specified threshold, the new spam comment becomes a new spam class; otherwise, if the highest similarity score is below a second specified threshold, the new spam comment is added to the existing class as a new spam comment sample; otherwise the new spam comment is ignored.
3. The anti-spam method based on micro-content similarity according to claim 1, characterized in that the spam discrimination algorithm comprises the following steps:
1) for each spam class, one spam comment is chosen arbitrarily as the typical sample of that class;
2) the unknown comment is scored for similarity against the typical samples of all spam classes;
3) the unknown comment is then scored for similarity against all spam comments in the class whose typical sample had the highest similarity score with the unknown comment in step 2);
4) if the maximum of the similarity scores between the unknown comment and all spam comments of that class exceeds a specified threshold, the unknown comment is judged to be a spam comment.
CNA2007101561840A 2007-10-23 2007-10-23 Microcontent similarity based antirubbish method Pending CN101159704A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007101561840A CN101159704A (en) 2007-10-23 2007-10-23 Microcontent similarity based antirubbish method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2007101561840A CN101159704A (en) 2007-10-23 2007-10-23 Microcontent similarity based antirubbish method

Publications (1)

Publication Number Publication Date
CN101159704A true CN101159704A (en) 2008-04-09

Family

ID=39307629

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007101561840A Pending CN101159704A (en) 2007-10-23 2007-10-23 Microcontent similarity based antirubbish method

Country Status (1)

Country Link
CN (1) CN101159704A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20080409