CN101159704A - Microcontent similarity based antirubbish method - Google Patents
- Publication number
- CN101159704A
- Authority
- CN
- China
- Prior art keywords
- rubbish
- comment
- similarity
- unknown
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses an anti-spam method based on micro-content similarity. The method clusters the comments that have been manually identified as spam to generate a clustered spam file, and then uses a spam discriminator to discriminate unknown comments according to that file. A comment to be processed is first scored for similarity against one arbitrarily chosen sample from each spam category, and is then scored against the comments of only the category whose sample scored highest. This avoids comparing the comment with all of the clustered spam and effectively reduces the number of similarity comparisons, thereby improving the efficiency of spam discrimination and of clustered spam file maintenance, and satisfying the performance requirement for massive spam discrimination on the internet.
Description
Technical field
The present invention relates to anti-spam methods for internet micro-content, and in particular to an anti-spam method based on micro-content similarity.
Background technology
The blog is the fourth mode of internet communication to appear, after Email, BBS, and ICQ. It is the personal "Reader's Digest" of the network era, a network diary that uses the hyperlink as its weapon, and it represents a new way of living, a new way of working, and above all a new way of learning. However, now that anti-spam technology for email is increasingly mature, blog comments are becoming more and more popular with merchants and ordinary internet users as a means of inserting advertising and publicity. As a result, spam comments on blogs keep increasing. They waste network bandwidth, the time of blog owners and readers, and system resources, they wear users out, and they have become a major obstacle to the spread of blogs.
The techniques and methods commonly used against spam comments at present are:
1) Keyword filtering: sensitive words are filtered or blocked. This kind of filtering guards poorly against variants of sensitive words, such as words whose characters have been split apart, and as the dictionary keeps growing, both maintenance and runtime efficiency suffer. It is nevertheless the most efficient method and gives immediate protection.
2) Verification codes: robot submissions are prevented by checking the validity of a verification code. At present, however, robots can still get through by OCR or by exhaustive search, and even a specially modified version can be cracked with enough effort. Verification codes also put obstacles in the way of normal users.
3) Referer checking: the Referer field of the HTTP protocol is used to filter out connections and comments that are submitted directly to the site without visiting the page. This is also one of the methods used for hotlink protection, and it is very efficient. But if a specially modified tool forges the HTTP request when submitting, the method is helpless.
4) Limiting the interval between consecutive submissions: this measure prevents malicious robots from carrying out saturation attacks on the database and reduces the server load, but it treats only the symptoms and is a purely passive defense.
5) Content-based scoring: a threshold is applied to the score to judge intelligently whether a comment is spam. This method is the most scientific and reasonable, but it requires the server to do a large amount of processing, which increases the server load. If a remote server is used, service quality may not be guaranteed because of the network.
Therefore none of the above methods can fully satisfy the need to discriminate spam comments online in real time.
Summary of the invention
The object of the invention is to provide an anti-spam method based on micro-content similarity.
The technical solution adopted by the invention to solve its technical problem comprises the following steps:
1) Cluster the comments that have been manually identified as spam to produce a clustered spam file containing multiple spam categories;
2) Use a spam discriminator to discriminate unknown comments according to the clustered spam file.
The spam comment clustering process is:
1) The clustered spam file is initially empty;
2) When a comment is newly identified manually as spam, it is selectively added to the clustered spam file according to the following conditions and steps:
Step 1: from each spam category, arbitrarily choose one spam comment as the representative sample of that category;
Step 2: score the similarity between the new spam comment and the representative sample of every spam category;
Step 3: for the spam category whose representative sample had the highest similarity score with the new spam comment in step 2, score the similarity between the new spam comment and every spam comment of that category, and compute the highest similarity score;
Step 4: if the highest similarity score is less than a specified threshold, the new spam comment becomes a new spam category; otherwise, if the highest similarity score is less than a second specified threshold, the new spam comment is added to the existing category as a new spam comment sample; otherwise the new spam comment is ignored.
The steps of the spam discrimination algorithm are as follows:
1) From each spam category, arbitrarily choose one spam comment as the representative sample of that category;
2) Score the similarity between the unknown comment and the representative sample of every spam category;
3) For the spam category whose representative sample had the highest similarity score with the unknown comment in step 2), score the similarity between the unknown comment and every spam comment of that category;
4) If the maximum similarity score between the unknown comment and the spam comments of that category exceeds a specified threshold, the unknown comment is judged to be a spam comment.
The beneficial effects of the invention are:
It avoids comparing the comment to be processed with all of the clustered spam, effectively reduces the number of comment similarity comparisons, and improves the efficiency of spam discrimination and of clustered spam file maintenance, so it can meet the performance requirements of massive spam discrimination on the Internet.
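The saving in comparisons can be illustrated with a small count. The sketch below assumes a clustered spam file with s categories and compares the two-stage cost (one comparison per category representative, then one per member of the best-matching category) with the cost of comparing against every stored spam comment. The function names and the category sizes are hypothetical and chosen only for illustration.

```python
def comparisons_two_stage(class_sizes, best_class):
    """Upper bound on comparisons for the two-stage search.

    Stage 1 compares against one representative per category;
    stage 2 compares against every member of the best-matching category.
    """
    return len(class_sizes) + class_sizes[best_class]


def comparisons_exhaustive(class_sizes):
    """Comparisons needed when every stored spam comment is checked."""
    return sum(class_sizes)


# Hypothetical file: 100 categories holding 50 spam comments each.
sizes = [50] * 100
print(comparisons_two_stage(sizes, best_class=0))  # 150
print(comparisons_exhaustive(sizes))               # 5000
```

With these assumed sizes the two-stage search needs at most s + t comparisons, where t is the size of the selected category, rather than one comparison per stored comment.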
Description of drawings
Fig. 1 is the flow chart of the anti-spam method of the invention, based on spam similarity.
Fig. 2 is the flow chart of the algorithm of the invention for inserting a spam comment into the clustered spam file.
Fig. 3 is the flow chart of the algorithm by which the spam discriminator of the invention discriminates an unknown comment.
Embodiment
The invention defines the concepts behind comment similarity as follows:
Word: an indivisible semantic unit;
High-frequency word: a word that carries no meaning of its own and needs to be filtered out;
Comment: a finite set of words, namely the result of segmenting the original comment into words and filtering out the high-frequency words;
Word count of a comment: the cardinality of the comment's word set, that is, the number of elements the set contains;
Intersection of comments: the intersection of their word sets;
Union of comments: the union of their word sets;
The similarity sim(a, b) of comment a and comment b is defined as the word count of the intersection of a and b divided by the word count of the union of a and b, that is:
$$\mathrm{sim}(a, b) = \frac{|a \cap b|}{|a \cup b|} = \frac{|a \cap b|}{|a| + |b| - |a \cap b|}$$
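The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the `tokenize` helper, the whitespace-based word splitting, and the `STOP_WORDS` list are assumptions made for the example (Chinese text would need a real word segmenter), while `sim` follows the set formula given above.

```python
def tokenize(text, stop_words):
    """Turn raw text into a comment: the set of its words minus high-frequency words.

    Whitespace splitting is an assumption made for this example only.
    """
    return {w for w in text.lower().split() if w not in stop_words}


def sim(a, b):
    """Similarity of two comments (word sets): |a ∩ b| / (|a| + |b| - |a ∩ b|)."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0  # two empty comments score 0 by convention


STOP_WORDS = {"the", "a", "of"}  # hypothetical high-frequency word list
a = tokenize("buy the cheap watches online", STOP_WORDS)
b = tokenize("buy cheap watches now", STOP_WORDS)
print(sim(a, b))  # 3 shared words out of 5 distinct words
```

Returning 0 for two empty comments is a choice made here to avoid division by zero; the source does not say how that case is handled.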
Using the comment similarity concept above, and taking a blog comment anti-spam system as the example, the concrete implementation steps are as follows:
As shown in Fig. 1, the comments that have been manually identified as spam are clustered to produce the clustered spam file, which is fed into the spam discriminator for discriminating unknown comments. The spam discriminator judges whether an unknown comment is spam by computing its similarity to the representative samples in the clustered spam file, and outputs the result. A new spam comment is scored in the same way against the representative samples in the clustered spam file; new spam categories are added to the file, existing categories are updated, and spam comments that add nothing are discarded.
The process of adding a spam comment is shown in Fig. 2. Suppose a spam comment $x$ is to be inserted and the clustered spam file already contains $s$ spam categories, denoted $G_1, G_2, G_3, \ldots, G_s$. From each category, take any one spam comment as the class feature of that category, denoted $g_1, g_2, g_3, \ldots, g_s$. Compute the comment similarity between the spam comment $x$ and each class feature $g_1, g_2, g_3, \ldots, g_s$, and suppose $g_m$ attains the maximum with $x$:
$$\mathrm{sim}(g_m, x) = \max_{i = 1, \ldots, s} \mathrm{sim}(g_i, x)$$
Suppose the clustered spam file contains $t$ spam comments belonging to category $m$, denoted $g_{m1}, g_{m2}, g_{m3}, \ldots, g_{mt}$. Compute the comment similarity between the spam comment $x$ and each of $g_{m1}, g_{m2}, g_{m3}, \ldots, g_{mt}$, and suppose $g_{mn}$ attains the maximum with $x$:
$$\mathrm{sim}(g_{mn}, x) = \max_{i = 1, \ldots, t} \mathrm{sim}(g_{mi}, x)$$
There are two cases:
a) If $\mathrm{sim}(g_m, x)$ is less than a specific threshold $T_1$, the spam comment becomes a new spam category of its own and is inserted into the database;
b) If $\mathrm{sim}(g_{mn}, x)$ is greater than or equal to $T_1$, the spam comment belongs to the existing category $m$, with two sub-cases:
b1) If $\mathrm{sim}(g_{mn}, x)$ is less than a specific threshold $T_2$ (that is, $T_1 \le \mathrm{sim}(g_{mn}, x) < T_2$), adding this spam comment to the clustered spam file can significantly improve the system's ability to discriminate spam, so the spam comment $x$ is inserted into the clustered spam file as category $m$ spam;
b2) If $\mathrm{sim}(g_{mn}, x)$ is greater than or equal to $T_2$ (that is, $\mathrm{sim}(g_{mn}, x) \ge T_2 > T_1$), a spam comment highly similar to $x$ already exists, and adding $x$ to the clustered spam file would not significantly improve the system's ability to discriminate spam, so the spam comment is ignored and nothing is done.
The process by which spam comments are inserted into the clustered spam file is at the same time the process by which they are clustered.
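The insertion procedure can be sketched as follows. This is an illustrative reading under stated assumptions, not the patent's code: the clustered spam file is modeled as an in-memory list of lists of word sets, the function name `insert_spam` and the threshold values are hypothetical, an empty file is assumed to accept the first comment as a new category, and both thresholds are applied to the highest within-category score, as the claims describe.

```python
import random


def sim(a, b):
    """Set-based comment similarity: |a ∩ b| / (|a| + |b| - |a ∩ b|)."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def insert_spam(classes, x, t1, t2, rng=random):
    """Selectively add spam comment x (a set of words) to classes.

    classes is a list of categories, each a non-empty list of word sets.
    Returns "new_class", "added", or "ignored".
    """
    if not classes:  # assumed behavior for the initially empty file
        classes.append([x])
        return "new_class"
    # Stage 1: one arbitrarily chosen representative per category.
    reps = [rng.choice(members) for members in classes]
    m = max(range(len(classes)), key=lambda i: sim(reps[i], x))
    # Stage 2: highest similarity within the best-matching category only.
    best = max(sim(g, x) for g in classes[m])
    if best < t1:
        classes.append([x])   # too different from everything: new category
        return "new_class"
    if best < t2:
        classes[m].append(x)  # similar but not redundant: extend category m
        return "added"
    return "ignored"          # near-duplicate of stored spam


classes = []
for text in ["buy cheap watches online", "buy cheap watches now",
             "buy cheap watches online", "free casino bonus today"]:
    print(insert_spam(classes, frozenset(text.split()), t1=0.3, t2=0.9))
```

With the assumed thresholds, the four comments are handled as a new category, an addition to that category, an ignored duplicate, and a second new category, in that order.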
As the number of spam comments in the store keeps growing and reaches a certain scale, the clustered spam in the store can be exported to generate the clustered spam file. Once the spam discriminator has imported this clustered spam file, it can discriminate unknown comments:
Fig. 3 shows the process by which the spam discriminator discriminates an unknown comment. Let the unknown comment be $x$, and suppose the clustered spam file already contains $s$ spam categories, denoted $G_1, G_2, G_3, \ldots, G_s$. From each category, take any one spam comment as the class feature of that category, denoted $g_1, g_2, g_3, \ldots, g_s$. Compute the comment similarity between the comment $x$ and each of $g_1, g_2, g_3, \ldots, g_s$, and suppose $g_m$ attains the maximum with $x$:
$$\mathrm{sim}(g_m, x) = \max_{i = 1, \ldots, s} \mathrm{sim}(g_i, x)$$
Suppose the clustered spam file contains $t$ spam comments belonging to category $m$, denoted $g_{m1}, g_{m2}, g_{m3}, \ldots, g_{mt}$. Compute the comment similarity between the comment $x$ and each of $g_{m1}, g_{m2}, g_{m3}, \ldots, g_{mt}$, and suppose $g_{mn}$ attains the maximum with $x$:
$$\mathrm{sim}(g_{mn}, x) = \max_{i = 1, \ldots, t} \mathrm{sim}(g_{mi}, x)$$
If $\mathrm{sim}(g_{mn}, x) > T_3$, where $T_3$ is a specific threshold, the unknown comment $x$ is judged to be a spam comment;
If $\mathrm{sim}(g_{mn}, x) \le T_3$, the unknown comment $x$ is judged to be a non-spam comment;
If the unknown comment is a spam comment, it is deleted from the comment database; if it is not a spam comment, it is left alone.
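The discrimination step can be sketched in the same style. Again this is a hedged illustration: the clustered spam file is modeled as a list of lists of word sets, the name `is_spam`, the sample comments, and the threshold value are hypothetical, and an empty file is assumed to classify everything as non-spam.

```python
import random


def sim(a, b):
    """Set-based comment similarity: |a ∩ b| / (|a| + |b| - |a ∩ b|)."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def is_spam(classes, x, t3, rng=random):
    """Return True if unknown comment x (a set of words) is judged to be spam."""
    if not classes:  # assumed behavior: nothing to compare against
        return False
    # Stage 1: compare x with one arbitrary representative per category.
    reps = [rng.choice(members) for members in classes]
    m = max(range(len(classes)), key=lambda i: sim(reps[i], x))
    # Stage 2: compare x with every spam comment of the best-matching category.
    best = max(sim(g, x) for g in classes[m])
    return best > t3


classes = [
    [frozenset("buy cheap watches online".split()),
     frozenset("buy cheap watches now".split())],
    [frozenset("free casino bonus today".split())],
]
print(is_spam(classes, frozenset("buy cheap watches today".split()), t3=0.5))
print(is_spam(classes, frozenset("great post thanks for sharing".split()), t3=0.5))
```

Because the representative is chosen arbitrarily, stage 1 may occasionally select a category that does not contain the closest stored comment; the source accepts this in exchange for fewer comparisons.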
Claims (3)
1. An anti-spam method based on micro-content similarity, characterized in that the steps of the method are as follows:
1) Cluster the comments that have been manually identified as spam to produce a clustered spam file containing multiple spam categories;
2) Use a spam discriminator to discriminate unknown comments according to the clustered spam file.
2. The anti-spam method based on micro-content similarity according to claim 1, characterized in that the spam comment clustering process is:
1) The clustered spam file is initially empty;
2) When a comment is newly identified manually as spam, it is selectively added to the clustered spam file according to the following conditions and steps:
Step 1: from each spam category, arbitrarily choose one spam comment as the representative sample of that category;
Step 2: score the similarity between the new spam comment and the representative sample of every spam category;
Step 3: for the spam category whose representative sample had the highest similarity score with the new spam comment in step 2, score the similarity between the new spam comment and every spam comment of that category, and compute the highest similarity score;
Step 4: if the highest similarity score is less than a specified threshold, the new spam comment becomes a new spam category; otherwise, if the highest similarity score is less than a second specified threshold, the new spam comment is added to the existing category as a new spam comment sample; otherwise the new spam comment is ignored.
3. The anti-spam method based on micro-content similarity according to claim 1, characterized in that the steps of the spam discrimination algorithm are as follows:
1) From each spam category, arbitrarily choose one spam comment as the representative sample of that category;
2) Score the similarity between the unknown comment and the representative sample of every spam category;
3) For the spam category whose representative sample had the highest similarity score with the unknown comment in step 2), score the similarity between the unknown comment and every spam comment of that category;
4) If the maximum similarity score between the unknown comment and the spam comments of that category exceeds a specified threshold, the unknown comment is judged to be a spam comment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2007101561840A CN101159704A (en) | 2007-10-23 | 2007-10-23 | Microcontent similarity based antirubbish method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2007101561840A CN101159704A (en) | 2007-10-23 | 2007-10-23 | Microcontent similarity based antirubbish method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101159704A true CN101159704A (en) | 2008-04-09 |
Family
ID=39307629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2007101561840A Pending CN101159704A (en) | 2007-10-23 | 2007-10-23 | Microcontent similarity based antirubbish method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101159704A (en) |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101446970B (en) * | 2008-12-15 | 2012-07-04 | 腾讯科技(深圳)有限公司 | Method for censoring and process text contents issued by user and device thereof |
CN102053992A (en) * | 2009-11-10 | 2011-05-11 | 阿里巴巴集团控股有限公司 | Clustering method and system |
CN102053992B (en) * | 2009-11-10 | 2014-12-10 | 阿里巴巴集团控股有限公司 | Clustering method and system |
WO2012116587A1 (en) * | 2011-03-03 | 2012-09-07 | 腾讯科技(深圳)有限公司 | Similar email processing system and method |
CN102655480B (en) * | 2011-03-03 | 2015-12-02 | 腾讯科技(深圳)有限公司 | Similar mail treatment system and method |
CN102655480A (en) * | 2011-03-03 | 2012-09-05 | 腾讯科技(深圳)有限公司 | Similar mail handling system and method |
CN103065027A (en) * | 2011-10-19 | 2013-04-24 | 腾讯科技(深圳)有限公司 | Message leaving method and device provided for third-party social network site (SNS) web game |
CN103065027B (en) * | 2011-10-19 | 2017-02-22 | 腾讯科技(深圳)有限公司 | Message leaving method and device provided for third-party social network site (SNS) web game |
CN103714049A (en) * | 2012-09-29 | 2014-04-09 | 百度在线网络技术(北京)有限公司 | Method and device for determining similarity of samples dynamically |
CN103714049B (en) * | 2012-09-29 | 2017-10-03 | 北京音之邦文化科技有限公司 | The similar method and device of dynamic validation sample |
CN103064971A (en) * | 2013-01-05 | 2013-04-24 | 南京邮电大学 | Scoring and Chinese sentiment analysis based review spam detection method |
CN103970801B (en) * | 2013-02-05 | 2019-03-26 | 腾讯科技(深圳)有限公司 | Microblogging advertisement blog article recognition methods and device |
CN103970801A (en) * | 2013-02-05 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and device for recognizing microblog advertisement blog articles |
CN104050195B (en) * | 2013-03-15 | 2017-11-03 | 暴风集团股份有限公司 | A kind of advertisement sticker processing method and system |
CN104050195A (en) * | 2013-03-15 | 2014-09-17 | 北京暴风科技股份有限公司 | Advertisement sticker processing method and system |
CN104252465B (en) * | 2013-06-26 | 2018-10-12 | 南宁明江智能科技有限公司 | A kind of method and apparatus filtering information using representation vector |
CN104252465A (en) * | 2013-06-26 | 2014-12-31 | 南宁明江智能科技有限公司 | Method and device utilizing representative vectors to filter information |
CN103745001B (en) * | 2014-01-24 | 2016-10-05 | 福州大学 | A kind of product comment spam person's detecting system |
CN103745001A (en) * | 2014-01-24 | 2014-04-23 | 福州大学 | System for detecting reviewers of negative comments on products |
CN103996130B (en) * | 2014-04-29 | 2016-04-27 | 北京京东尚科信息技术有限公司 | A kind of information on commodity comment filter method and system |
CN103996130A (en) * | 2014-04-29 | 2014-08-20 | 北京京东尚科信息技术有限公司 | Goods evaluation information filtering method and goods evaluation information filtering system |
AU2015252513B2 (en) * | 2014-04-29 | 2018-11-29 | Beijing Jingdong Century Trading Co., Ltd. | Method and system for filtering goods evaluation information |
CN104933191A (en) * | 2015-07-09 | 2015-09-23 | 广东欧珀移动通信有限公司 | Spam comment recognition method and system based on Bayesian algorithm and terminal |
CN105323153A (en) * | 2015-11-18 | 2016-02-10 | Tcl集团股份有限公司 | Spam mail filtering method and device |
CN106372236A (en) * | 2016-09-13 | 2017-02-01 | 东软集团股份有限公司 | Comment data processing method and device |
WO2018129978A1 (en) * | 2017-01-13 | 2018-07-19 | 广东欧珀移动通信有限公司 | Information processing method, device, storage medium and computer device |
CN112860643A (en) * | 2021-03-05 | 2021-05-28 | 中富通集团股份有限公司 | Method for improving cache cleaning speed of 5G mobile terminal and storage device |
CN112860643B (en) * | 2021-03-05 | 2022-07-08 | 中富通集团股份有限公司 | Method for improving cache cleaning speed of 5G mobile terminal and storage device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101159704A (en) | Microcontent similarity based antirubbish method | |
CN101408883B (en) | Method for collecting network public feelings viewpoint | |
CN103064970B (en) | Optimize the search method of interpreter | |
CN101184259B (en) | Keyword automatically learning and updating method in rubbish short message | |
CN107609121A (en) | Newsletter archive sorting technique based on LDA and word2vec algorithms | |
CN109165294A (en) | Short text classification method based on Bayesian classification | |
CN100589453C (en) | Processing device and method for anti-junk mails | |
CN103441924B (en) | A kind of rubbish mail filtering method based on short text and device | |
CN103037339B (en) | One kind is based on the short message filter method of " user's credit worthiness and short message spam degree " | |
CN102158428B (en) | Rapid and high-accuracy junk mail filtering method | |
CN103646080A (en) | Microblog duplication-eliminating method and system based on reverse-order index | |
CN103136266A (en) | Method and device for classification of mail | |
CN102194012B (en) | Microblog topic detecting method and system | |
CN101996241A (en) | Bayesian algorithm-based content filtering method | |
CN103336766A (en) | Short text garbage identification and modeling method and device | |
CN101021838A (en) | Text handling method and system | |
CN111639183A (en) | Financial industry consensus public opinion analysis method and system based on deep learning algorithm | |
CN101645069A (en) | Regular expression storage compacting method in multi-mode matching | |
CN101046858B (en) | Electronic information comparing system and method and anti-garbage mail system | |
CN101339560B (en) | Method and device for searching series data, and search engine system | |
CN102045268A (en) | Method and device for recovering email data | |
CN101441663B (en) | Chinese text classification characteristic dictionary generating method based on LZW compression algorithm | |
CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method | |
CN110674291A (en) | Chinese patent text effect category classification method based on multivariate neural network fusion | |
CN101329668A (en) | Method and apparatus for generating information regulation and method and system for judging information types |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20080409 |