CN102208992A - Internet-facing filtration system of unhealthy information and method thereof - Google Patents

Internet-facing filtration system of unhealthy information and method thereof

Info

Publication number
CN102208992A
Authority
CN
China
Prior art keywords
unhealthy information
information
filtration system
keyword
index engine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102005887A
Other languages
Chinese (zh)
Other versions
CN102208992B (en)
Inventor
陶鹏
宋传宝
罗侃
曹浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Haina media big data technology development Co. Ltd.
Original Assignee
TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Priority to CN201010200588.7A priority Critical patent/CN102208992B/en
Publication of CN102208992A publication Critical patent/CN102208992A/en
Application granted granted Critical
Publication of CN102208992B publication Critical patent/CN102208992B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an Internet-facing filtration system for unhealthy information and a method thereof. The filtration system comprises a user data submission module, a user service management system, a user interaction information audit platform, a purification service operation platform, a knowledge base, and at least one index engine. The user data submission module is connected to the user service management system, which is connected to the purification service operation platform. The purification service operation platform is connected to the user interaction information audit platform and to each index engine. Each index engine is connected to the knowledge base. The invention processes data with several intelligent techniques, including word segmentation, keyword matching and the vector space model, together with a number of high-performance processing algorithms. It provides Internet communities with indexing services for sensitive, pornographic and vulgar information, junk information, commercial advertisements and the like, and thereby gives the client an efficient means of information management.

Description

Internet-facing unhealthy information filtration system and method thereof
Technical field
The present invention relates to an unhealthy information filtration system and a filtering method thereof, and in particular to a filtration system and method that, based on the characteristics of Internet communities, accurately index and filter unhealthy information such as pornography, vulgar content, flooding (junk posts) and commercial advertisements. It belongs to the field of network information security technology.
Background art
With the growth of the Internet, websites (including portal sites, topic-specific sites and so on) have launched more and more community channels, such as topic forums, blogs and comment sections. These community channels attract a growing number of interactive users and benefit both the websites and the netizens. At the same time, however, some people use these channels to post all kinds of commercial advertisements without restraint, and even to post large numbers of pornographic, vulgar or profane posts and posts made in malicious competition with rivals. Such unhealthy information disrupts the normal operation of the website, damages its brand and reputation, and seriously affects other netizens' normal use.
At present, websites generally adopt the following technical measures against this situation:
Keyword restriction: a large keyword dictionary is maintained; if a post contains a keyword, the system issues a warning or deletes the post directly.
Posting frequency limit: the maximum number of posts that the same IP or the same ID may publish per unit of time is limited.
These two methods can filter out part of the bad data, but both have major shortcomings:
For keyword restriction, many bad posts cannot be judged from one or two keywords alone; the judgment has to rest on the semantics of the whole passage, the whole sentence, and the context around the keyword. The posting frequency limit, for its part, restricts the posting of some normal netizens and is easily circumvented by posting bots, so it rarely works in practice.
Chinese invention patent No. 200510048576.6 discloses a system for intercepting pornographic images and unhealthy information on the Internet. The system combines IP address filtering, keyword filtering and pornographic image detection. It builds a mathematical model of pornographic images through repeated decision feedback; builds a standard feature library of pornographic images as the basis for deciding whether a network image is pornographic; builds a similarity-matching decision model; and performs content-based image judgment on network information that has passed keyword comparison. Content filtering is performed at the application layer and URL filtering at the IP layer, so pornographic image information can be intercepted directly and the URL database updated in real time, moving from passive URL filtering to active content filtering. The system's multifunctional management platform integrates the complex relationships among the operating system, the browser, the network protocols and the image detector, solves the division of labor for pornographic image detection tasks and the data reassembly problem in the interaction between client and server, and is independent of the browser.
In addition, Chinese invention patent application No. 200410053683.3 discloses an Internet content filtration system and filtering method. The content filtration system has three parts: a content filtering agent (CFA), a query server (QS), and a content analysis and management server (CAMS). The filtering flow is as follows: when a user requests access to a URL, the CFA allows or forbids the request according to the black and white lists set by the user. If the URL is not in the CFA's black and white lists, the CFA sends a query request to the QS. The QS looks up the URL's rating information in its own URL library and returns the result to the CFA, which responds accordingly. The QS also regularly downloads updated URL rating information from the CAMS. This technical solution can identify unhealthy information on the network and actively prevent Internet users from accessing such objectionable websites.
Summary of the invention
The technical problem to be solved by the present invention is to provide an Internet-facing unhealthy information filtration system and method that can accurately index and filter unhealthy information such as pornography, vulgar content, flooding and commercial advertisements.
To achieve the above object, the present invention adopts the following technical solution:
An Internet-facing unhealthy information filtration system, characterized in that:
the unhealthy information filtration system comprises a user data submission module, a user service management system, a user interaction information audit platform, a purification service operation platform, a knowledge base and at least one index engine; wherein
the user data submission module is connected to the user service management system, and the user service management system is connected to the purification service operation platform;
the purification service operation platform is connected to the user interaction information audit platform and to each index engine;
the index engine is connected to the knowledge base.
The index engine comprises one or more of an advertisement index engine group, a flooding-post index engine group, a personalized feature index engine group, a behavior feature index engine group, a pornography index engine group, a vulgar content index engine group and a sensitive information index engine group.
The knowledge base comprises one or more of a keyword dictionary, a behavior pattern library, a rule base, a case library and a training feature library.
The unhealthy information filtration system further comprises an impurity feature library, a non-impurity feature library and a personalized impurity feature library, each of which is connected to the knowledge base on one side and to the purification service operation platform on the other.
The user interaction information audit platform comprises a publishing data module, a feedback data module and a system effect statistics module. The publishing data module receives data from the purification service operation platform; data marked as normal is published externally as a normal post, while data marked as erroneous is sent to the feedback data module for use as training corpus and fed back to the purification service operation platform.
An Internet-facing unhealthy information filtering method, implemented on the basis of the above filtration system, characterized by comprising the following steps:
(1) receive the messages published in the online community;
(2) call the case library in the knowledge base to filter each message and judge whether it is unhealthy information;
(3) if not, further call the client's personalized black and white lists, which cover keywords, keyword combinations, URLs, IP addresses and user IDs, and judge whether the message is unhealthy information;
(4) if not, further carry out common behavior pattern recognition;
(5) if not, further carry out characteristic behavior pattern recognition;
(6) if not, further call the various business rules to filter the message;
(7) combine the filtering results of steps (2) to (6) into a final unhealthy information filtering result and save it;
(8) return the final index result to the client.
During filtering, keyword matching is first performed on the message text. If no keyword is hit, the prediction is set to "no deletion needed". If a keyword is hit, the text is converted into a vector space model and a prediction is made on the resulting vector; the prediction result is a confidence value. Using the confidence and preset thresholds, each message is classified as "needs deletion", "suspected to need deletion" or "no deletion needed", and messages in the "suspected to need deletion" class are passed on for further manual review.
In step (5), characteristic behavior pattern recognition means analyzing the content published in the online community as a whole, extracting all characteristic contact information through semantic recognition, calculating how often that contact information appears within a given period, and comparing the frequency with a preset threshold. When the threshold is exceeded, the information is regarded as unhealthy.
The filtration system and method provided by the present invention process data with several intelligent techniques (word segmentation, keyword matching and the vector space model) and a number of high-performance processing algorithms. They provide Internet communities with indexing services for sensitive, pornographic and vulgar information, flooding, commercial advertisements and the like, and thereby give the client an efficient means of information management.
Description of drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the overall structure of the unhealthy information filtration system provided by the present invention;
Fig. 2 is a schematic flow diagram of the filtering operation performed by the filtration system;
Fig. 3 is an example diagram of a statistical model based on supervised learning;
Fig. 4 is a schematic flow diagram of using a statistical model based on supervised learning;
Fig. 5 is a schematic flow diagram of handling junk posts with the keyword-plus-statistical-model framework of the present invention.
Embodiment
To improve the filtering effect, the inventors analyzed a large amount of Internet community data and classified unhealthy information from both an operational perspective and a technical perspective.
Classification from the operational perspective: unhealthy information can be divided into commercial advertisements, pornography, vulgar content, flooding, and client-specific categories, and each category is further subdivided. Commercial advertisements, for example, can be divided into a numeric class (QQ numbers, telephone numbers, mobile numbers, invoices, quotations, etc.) and a domain-name class (MSN, URLs, e-mail addresses, etc.).
Classification from the technical perspective: rigid identification, behavior pattern recognition, flexible identification, keyword black and white list identification, and various business rules (different business rules use different algorithms; see the specific algorithm descriptions). Among them:
Rigid identification: data that the client reports as wrongly deleted or missed by the system is treated as rigid data, and later posts with identical content can be indexed against it. When such feedback is received, a unique value is computed over the entire message and saved in a store (called the rigid library). When messages to be filtered arrive later, the same algorithm is applied to each message and the resulting value is compared with the values in the rigid library; on a match, the message can be judged directly as normal information (or as unhealthy information).
Behavior pattern recognition: the behavior pattern of the data is analyzed by counting how many times the same IP, the same ID or the same key content appears within a period of time.
Flexible identification: also called near-duplicate text detection (see the algorithm description below). By training on the data from a period of time, data that appears frequently and has similar content can be identified.
Keyword black and white lists: each website can set black and white lists as required (keywords, keyword combinations, netizen IPs, netizen IDs, content URLs, etc.), and published content is matched against them.
Business rules: because different kinds of business data take different characteristic forms, different intelligent recognition methods can be used for each kind. These include semi-automatic flexible recognition of numbers, pattern-based recognition of domain names and e-mail addresses, recognition of vertically arranged text, the keyword-plus-statistical-model framework, and so on.
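The rigid identification item in the list above can be illustrated with a minimal sketch. The source only says that "a unique value" is computed over the whole message; using SHA-256 as that value is an assumption, and the class and method names (`RigidLibrary`, `add_feedback`, `lookup`) are hypothetical.

```python
import hashlib


class RigidLibrary:
    """Exact-match store of client-corrected messages (hypothetical name)."""

    def __init__(self):
        # Maps a message fingerprint to the verdict the client fed back.
        self._labels = {}

    @staticmethod
    def fingerprint(text: str) -> str:
        # Assumption: SHA-256 stands in for the unspecified "unique value".
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def add_feedback(self, text: str, label: str) -> None:
        # label is "normal" (wrongly deleted) or "unhealthy" (missed).
        self._labels[self.fingerprint(text)] = label

    def lookup(self, text: str):
        # Returns the stored verdict, or None if the message is unknown.
        return self._labels.get(self.fingerprint(text))


library = RigidLibrary()
library.add_feedback("Cheap invoices, call 13800000000", "unhealthy")
library.add_feedback("My phone is 13800000000, see you at the meetup", "normal")
```

Because the match is exact, a single changed character produces a different fingerprint; that gap is what the flexible (near-duplicate) identification in the same list is meant to cover.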
As shown in Fig. 1, the unhealthy information filtration system provided by the present invention comprises a user data submission module, a user service management system, a user interaction information audit platform, a purification service operation platform, index engines for various situations, and the corresponding knowledge bases. The user data submission module submits interactive text information and user identity information to the user service management system, which passes the relevant data to the purification service operation platform in UID-xml form. The purification service operation platform is the core of the filtration system. It connects to each index engine and obtains from them information reflecting knowledge and rules, and it also feeds information about knowledge and rules back to the impurity feature library, the non-impurity feature library and the personalized impurity feature library. The user interaction information audit platform comprises a List (publishing) data module, a feedback data module and a system effect statistics module. The List (publishing) data module receives data from the purification service operation platform; data marked as normal is published externally as a normal post, while data marked as erroneous is sent to the feedback data module for use as training corpus and fed back to the purification service operation platform. The purification service operation platform also sends statistical effect analysis results to the system effect statistics module.
The index engines used in the filtration system include an advertisement index engine group, a flooding-post index engine group, a personalized feature index engine group, a behavior feature index engine group, a pornography index engine group, a vulgar content index engine group, a sensitive information index engine group and so on, addressing commercial advertisements, pornography, vulgar content, flooding, client-specific categories and other situations. These engine groups can be extended as the needs of the online community require. Each index engine is connected to the knowledge base, from which it obtains the knowledge and rules used to filter unhealthy information. The knowledge base includes a keyword dictionary (logical implication library), a behavior pattern library, a rule base, a case library (rigid library), a training feature library and so on. On the basis of these index engines and knowledge bases, the filtration system unifies rules of multiple dimensions and offers clients different combinations of services according to their needs. Recognizing with multiple rules in this way improves the recognition effect and overcomes the weakness of relying on a single rule.
The filtration system provides the following four classes of functional interfaces:
1. Index interface
After receiving and parsing the client's request data, the purification service operation platform reads the filtering rules and personalized settings configured by the client, calls the corresponding filter algorithms (interfacing with the core algorithm service, which supports each rule and filtering mechanism), determines whether the post is junk, and returns the result to the client.
2. Feedback interface
When the client's editors perform a "delete" operation on data the system missed, or a "restore" operation on data the system wrongly deleted, the client side sends these data to the server through this interface, where they are saved. These data become rigid library data and take effect directly on subsequent data.
3. Settings interface
Receives the configuration data set by the client (the client can set personalized black and white lists, including keywords, IPs, IDs and image link addresses), saves it and applies it in real time.
4. Notification interface
When new filter words or rules are added, the filtration system re-indexes the historical normal data retained in the system (by default, the data of the current and previous month) and saves the data whose new index result is "junk". The client side can periodically fetch such data through the notification interface and delete it.
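The re-indexing behind the notification interface can be sketched as follows. This is only an illustration under simple assumptions: the retained history is a list of `(post_id, text)` pairs, new rules are plain keywords, and the function name `reindex_history` is hypothetical.

```python
def reindex_history(history, new_keywords):
    """Return IDs of retained 'normal' posts that now hit a new filter word.

    history: list of (post_id, text) pairs kept from recent months.
    new_keywords: filter words added since those posts were first indexed.
    """
    return [post_id for post_id, text in history
            if any(word in text for word in new_keywords)]


history = [
    (101, "great photos from the trip"),
    (102, "discount invoices, message me"),
    (103, "anyone selling concert tickets?"),
]
# IDs the client would later fetch through the notification interface.
pending_deletion = reindex_history(history, ["invoices"])
```

Keeping the result as a list of IDs, rather than deleting on the server, matches the description that the client fetches these data and performs the deletion itself.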
The process by which the filtration system handles unhealthy information is shown in Fig. 2. The system first receives the messages published in the online community, then calls the case library (rigid library) in the knowledge base to filter them and judge whether they are unhealthy information. Next, the client's personalized black and white lists are applied, that is, filtering by keyword, keyword combination, URL, IP address, user ID and so on. If a message does not fall within the scope of the personalized black and white lists, common behavior pattern recognition and characteristic behavior pattern recognition are carried out. After these checks, the various business rules (advertising, vulgar content, etc.) are called to filter the message. The final filtering result is then saved, and the index result is returned to the client.
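The staged flow of Fig. 2 can be sketched as a chain of filter stages. This is a simplified illustration, not the patented implementation: it assumes each stage either returns a verdict or declines to decide, and it stops at the first decisive stage, whereas the source also mentions combining the results of all stages. The stage functions, the toy data and the message fields (`text`, `ip`, `user_id`) are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A stage inspects a message and returns "unhealthy", "normal",
# or None when it cannot decide (all stage names here are hypothetical).
Stage = Callable[[dict], Optional[str]]


def case_library_stage(msg: dict) -> Optional[str]:
    # Exact-match verdicts learned from client feedback (toy data).
    known = {"buy fake invoices now": "unhealthy"}
    return known.get(msg["text"])


def black_white_list_stage(msg: dict) -> Optional[str]:
    # Client-specific lists (toy data): the white list wins over the black list.
    if msg["user_id"] in {"moderator-1"}:
        return "normal"
    if msg["ip"] in {"10.0.0.9"} or "invoice" in msg["text"]:
        return "unhealthy"
    return None


STAGES: List[Tuple[str, Stage]] = [
    ("case_library", case_library_stage),
    ("black_white_list", black_white_list_stage),
    # Behavior pattern and business rule stages would follow here.
]


def run_pipeline(msg: dict, stages: List[Tuple[str, Stage]] = STAGES) -> dict:
    """Return the first decisive verdict; default to "normal" if none decides."""
    for name, stage in stages:
        verdict = stage(msg)
        if verdict is not None:
            return {"verdict": verdict, "decided_by": name}
    return {"verdict": "normal", "decided_by": "default"}


result = run_pipeline({"text": "hello", "ip": "10.0.0.9", "user_id": "u42"})
print(result)
```

Ordering the cheap exact-match stages first means most messages are settled before the more expensive behavior and business-rule checks run.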
The filtration system uses a new class of technique that combines semantic recognition with behavior analysis, called characteristic behavior analysis (also called characteristic behavior pattern recognition). It analyzes the content published in the online community as a whole, extracts all characteristic contact information through semantic recognition, calculates how often that information appears within a given period, and compares the frequency with a preset threshold. When the threshold is exceeded, the information is regarded as unhealthy.
In the present invention, characteristic behavior analysis is mainly used to identify commercial advertisement posts. The technique is described below:
Many Internet communities allow individuals or organizations to publish contact details (such as QQ numbers and telephone numbers) to increase user interaction, but do not allow posts of an advertising nature. Whether such a post is junk therefore cannot be decided by a fixed criterion and is strongly subjective. If the judgment relied only on a semantic business rule (for example, delete everything containing contact details), many posts would inevitably be deleted by mistake.
Analysis of a large volume of community posts (mainly posts containing contact details, both normal and unhealthy) reveals a pattern: normal information is usually published only a few times, in one or a few boards, whereas unhealthy information is published repeatedly and at high frequency across many boards, usually with the same contact details. A threshold on the number of boards in which the same information is published, and a threshold on its frequency of appearance per unit time, can therefore be set; when a preset threshold is exceeded, the information is deemed unhealthy information of a commercial advertising nature.
Combining semantic recognition with behavior analysis thus solves the problem that the criterion for commercial advertisements cannot be fixed.
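A minimal sketch of characteristic behavior analysis follows. It makes several assumptions that are not in the source: a simple regular expression for 5 to 11 digit numbers stands in for the semantic extraction of contact details, timestamps are plain seconds, and the class name `ContactFrequencyMonitor` and its default thresholds are hypothetical.

```python
import re
from collections import defaultdict, deque

# Assumption: digit runs of 5 to 11 characters approximate QQ/phone numbers.
CONTACT_RE = re.compile(r"(?<!\d)\d{5,11}(?!\d)")


class ContactFrequencyMonitor:
    """Flags contact details that recur too often or across too many boards."""

    def __init__(self, window_seconds=3600, max_posts=5, max_boards=3):
        self.window = window_seconds
        self.max_posts = max_posts
        self.max_boards = max_boards
        self._seen = defaultdict(deque)  # contact -> deque of (timestamp, board)

    def observe(self, text, board, timestamp):
        """Record a post; return True if any contact in it exceeds a threshold."""
        flagged = False
        for contact in set(CONTACT_RE.findall(text)):
            events = self._seen[contact]
            events.append((timestamp, board))
            # Drop events that have fallen out of the time window.
            while events and events[0][0] <= timestamp - self.window:
                events.popleft()
            boards = {b for _, b in events}
            if len(events) > self.max_posts or len(boards) > self.max_boards:
                flagged = True
        return flagged


monitor = ContactFrequencyMonitor(window_seconds=60, max_posts=2, max_boards=5)
flags = [monitor.observe("add QQ 123456789", "board-a", t) for t in (0, 10, 20)]
```

In this run the first two posts pass and the third is flagged, because the same number has then appeared three times within the 60-second window, which exceeds `max_posts=2`. A single post containing a phone number is never flagged on its own, which reflects the point above that contact details alone are not a reliable criterion.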
In addition, the filtration system handles junk posts with a technical solution that combines keywords with a statistical model framework, described below:
Statistical models based on supervised learning are now widely used in fields such as text classification and image classification. Such a model follows the framework shown in Fig. 3: data for several categories is collected or labeled manually, and a learning algorithm then produces a model that can recognize those categories.
Commonly used statistical models include the SVM (support vector machine), the maximum entropy model, the logistic regression model and the naive Bayes model. For more on these models, see "Machine Learning" by T. M. Mitchell (China Machine Press, March 2008 edition, ISBN: 9787111109938); they are not described in detail here.
Automatic junk post recognition can be regarded as a special case of automatic text classification, which usually uses the statistical-model framework described above. After more than forty years of development, many researchers have reported in published papers that automatic text classification based on supervised statistical models gives the best prediction results. Taking automatic junk post recognition as an example, the flow of using a supervised statistical model is shown in Fig. 4: first a large number of comments, both junk and non-junk, are collected; the comment text is then converted into a vector space model; and the learning algorithm of the statistical model finally yields the corresponding prediction results.
The vector space model is a very common way of modeling text. Its main idea is to treat each distinct word as a distinct dimension. For a given document, the weight of each dimension is usually computed as TF × IDF, where TF is the number of times the word occurs in the document and IDF is the inverse document frequency of the word, usually computed as $\mathrm{IDF}_{Word} = \log\left(N / \mathrm{DF}_{Word}\right)$, where $N$ is the total number of documents and $\mathrm{DF}_{Word}$ is the number of different documents in which the word appears. This calculation is explained further in the textbook "Modern Information Retrieval" by Ricardo Baeza-Yates et al. (China Machine Press, April 2005 edition, ISBN: 7-111-15878-4).
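The TF × IDF weighting described above can be sketched in a few lines. The sketch assumes that word segmentation has already been done, so each document is a list of tokens, and it uses the common form IDF = log(N / DF) with the natural logarithm; the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(documents)
    # DF: number of different documents in which each word appears.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # TF: occurrences of each word in this document
        vectors.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return vectors


docs = [
    ["cheap", "invoice", "invoice"],
    ["nice", "weather"],
    ["cheap", "weather"],
]
vectors = tfidf_vectors(docs)
```

Here "invoice" appears twice in the first document and in no other, so it receives the largest weight there, while a word that appeared in every document would receive weight zero.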
In the present invention, the inventors further propose a technical solution that classifies junk posts automatically by combining keywords with a statistical model framework.
A keyword here means a manually compiled word used to screen posts and to distinguish junk posts from normal ones. In junk posts of the current-politics category, for example, "Falun Gong" is such a keyword. If a post hits this keyword, it can be classified directly as junk, or assigned to the appropriate category after manual review.
Many websites and forums currently screen comments or blog text by keyword and then use manual review to decide whether the text should be deleted. This approach, however, catches a large amount of text that does not need deleting; using the names of state leaders as keywords, for example, yields many texts that need not be deleted. Using keywords alone therefore still takes a great deal of manual work.
Using a statistical model alone has the following problems:
1) It cannot respond in real time to rapidly changing requirements. A statistical model needs a certain amount of labeled data to be collected, so a new deletion requirement takes time: data must be gathered and a new model trained. Suppose, for example, that advertisements for illegal goods such as invoices and guns appear in a forum and the existing system cannot recognize them. With a statistical approach, these advertisements must be collected and labeled, and a model then built and released. A purely supervised statistical approach therefore cannot meet a requirement to control posted content within a short time (for example, within one minute).
2) It is slow. Because of the algorithms involved, statistical approaches are tens to hundreds of times slower than keyword approaches. At very high data throughput (tens of MB/s), a statistical approach struggles to meet real demand, or does so only at great cost (requiring a distributed computing platform or another solution).
As shown in Fig. 5, the present invention assumes that a manually compiled set of keywords and a statistical model already exist. For a given text, the flow is as follows:
1. Match the text against the keywords. If no keyword is hit, set the prediction to "no deletion needed" and skip steps 2 to 5. If a keyword is hit, go to step 2.
2. Convert the text to a vector space model in the way described above.
3. Run the prediction on this vector. The prediction result is a confidence value (different statistical models give different confidence ranges, but every model yields a confidence value).
4. Using the confidence and preset thresholds, classify each message as "needs deletion", "suspected to need deletion" or "no deletion needed". For example, the standard for deleting current-politics content is usually broad, so the threshold for "needs deletion" is set low. The standard for deleting the profanity category is narrow, so its threshold for "needs deletion" can be set higher.
5. Manually review the "suspected to need deletion" class obtained in the previous step.
A concrete example is given in the following table:
[Table image BSA00000147932400101 is not reproduced in this text.]
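The keyword-then-model flow of Fig. 5 can be sketched as a small triage function. This is a sketch under stated assumptions: `toy_score` is a made-up stand-in for the statistical model (the source uses an SVM over the vector space model), the threshold values are arbitrary, and the labels "delete", "review" and "keep" abbreviate the three classes named in the steps above.

```python
def toy_score(text):
    # Hypothetical stand-in for a trained model: fraction of junk terms present.
    junk_terms = ("invoice", "cheap", "call now")
    hits = sum(term in text.lower() for term in junk_terms)
    return hits / len(junk_terms)


def triage(text, keywords, score_fn, delete_at, review_at):
    """Keyword gate first, then a model confidence compared with two thresholds."""
    lowered = text.lower()
    # Step 1: no keyword hit means no model call at all.
    if not any(k in lowered for k in keywords):
        return "keep"
    # Steps 2 and 3: vectorization and prediction are folded into score_fn here.
    confidence = score_fn(lowered)
    # Step 4: per-category thresholds split the result three ways.
    if confidence >= delete_at:
        return "delete"
    if confidence >= review_at:
        return "review"  # Step 5: handed to manual review
    return "keep"


KEYWORDS = ["invoice"]
labels = [
    triage(t, KEYWORDS, toy_score, delete_at=0.6, review_at=0.3)
    for t in (
        "Cheap invoice, call now 13800000000",
        "Where do I send the invoice for my order?",
        "Nice weather today",
    )
]
```

The keyword gate keeps the slow model off the bulk of traffic, and a new keyword takes effect immediately, which addresses the speed and real-time problems listed for the purely statistical approach. Lowering `delete_at` for a category corresponds to the broader deletion standard described in step 4.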
In the present invention, the SVM (support vector machine) algorithm is used for the statistical model part. Other statistical models, such as the maximum entropy model or the naive Bayes model mentioned above, can be substituted. The models differ in their modeling and prediction algorithms, and each performs differently in different application environments.
The vector space model is represented using the formula given earlier. Other formulas give different results and should be evaluated in practice before being adopted.
With the above filter algorithm, the filtration system provided by the present invention can effectively solve the three problems mentioned above: the poor accuracy of keyword-based systems, and the lack of real-time customization and the slow speed of statistics-based methods.
The filtration system also has the following features:
1) For highly sensitive content that must be deleted (for example, comments containing the names of state leaders), the system can bring in manual review to ensure the safety of the website. Compared with using keywords alone, it substantially reduces the amount of manual review (a saving of more than 70%).
2) As labeled data accumulates, the accuracy of the system rises and gradually converges to a fairly stable value.
Overall, the underlying platform of the filtration system embeds several intelligent techniques (word segmentation, keyword matching and the vector space model) together with a number of high-performance processing algorithms. It analyzes the formal features of the data (layout, use of symbols), content features (writing style; words, named entities, sentences, discourse) and behavior patterns, and has achieved world-class recognition results in open evaluations.
The Internet-facing unhealthy information filtration system and method provided by the present invention have been described in detail above. For those skilled in the art, any obvious modification made without departing from the substance of the present invention will constitute an infringement of the patent right of the present invention and will bear the corresponding legal liability.

Claims (8)

1. An Internet-oriented unhealthy-information filtration system, characterized in that:
The unhealthy-information filtration system comprises a user data submission module, a user service management system, a user interaction information review platform, a purification service operation platform, a knowledge base, and at least one index engine; wherein,
The user data submission module is connected to the user service management system, and the user service management system is connected to the purification service operation platform;
The purification service operation platform is connected to the user interaction information review platform and to each index engine;
The index engine is connected to the knowledge base.
2. The unhealthy-information filtration system as claimed in claim 1, characterized in that:
The index engine comprises one or more of an advertisement index engine group, a junk-post ("water post") engine group, a personalized feature index engine group, a behavior feature index engine group, a pornography index engine group, a vulgarity index engine group, and a sensitive information index engine group.
3. The unhealthy-information filtration system as claimed in claim 1, characterized in that:
The knowledge base comprises one or more of a keyword dictionary, a behavior pattern library, a rule base, a case library, and a training feature library.
4. The unhealthy-information filtration system as claimed in claim 1, characterized in that:
The unhealthy-information filtration system further comprises an impurity feature library, a non-impurity feature library, and a personalized impurity feature library; each of these libraries is connected to the knowledge base on one side and to the purification service operation platform on the other.
5. The unhealthy-information filtration system as claimed in claim 1, characterized in that:
The user interaction information review platform comprises a published data module, a feedback data module, and a system effect statistics module; wherein the published data module receives data from the purification service operation platform; if the data is correctly labeled, it is published externally as a normal post; if the data is incorrectly labeled, it is sent to the feedback data module for use as training corpus and fed back to the purification service operation platform.
6. An Internet-oriented unhealthy-information filter method, implemented on the basis of the unhealthy-information filtration system as claimed in claim 1, characterized by comprising the following steps:
(1) receiving the various messages published by a web community;
(2) calling the case library in the knowledge base to filter the message and judging whether it is unhealthy information;
(3) if not, further calling the customer's personalized "black and white lists", which contain keywords, keyword combinations, URLs, IP addresses, and user IDs, to filter the message and judging whether it is unhealthy information;
(4) if not, further performing general behavior pattern recognition;
(5) if not, further performing feature behavior pattern recognition;
(6) if not, further calling the various business rules to filter the message;
(7) combining the filter results obtained in steps (2) to (6) to obtain the final unhealthy-information filter result, and saving it to the database;
(8) returning the final index result to the client.
7. The unhealthy-information filter method as claimed in claim 6, characterized in that:
In the unhealthy-information filter process, keyword matching is first performed on the message text; if no keyword is hit, the prediction result is set to "no deletion needed"; if a keyword is hit, the text is converted into a vector space model representation and a prediction is run on this vector, the prediction result being a confidence value; each message is classified as "needs deletion", "suspected, needs deletion", or "no deletion needed" according to the confidence value and preset thresholds, wherein messages in the "suspected, needs deletion" class are passed on for further manual review.
8. The unhealthy-information filter method as claimed in claim 6, characterized in that:
In step (5), the feature behavior pattern recognition means performing a global analysis of the content published in the web community, extracting all feature contact details from it by semantic recognition, calculating the frequency with which each feature contact detail occurs within a certain time period, and comparing that frequency with a preset threshold; when the threshold is exceeded, the content is regarded as unhealthy information.
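The staged filtering of claim 6 and the frequency check of claim 8 can be sketched together in Python. This is only an illustration under stated assumptions: the message shape, the stage names, and the `ContactFrequencyStage` class are hypothetical; a digit-run regular expression stands in for the semantic extraction of contact details; and only the case library, the blacklist, and the contact-frequency stages are shown (the whitelist, general behavior patterns, and business rules are omitted).

```python
import re
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical message shape: {"text": str, "user_id": str, "timestamp": float}
Message = dict
# A stage returns a reason string when it flags the message, otherwise None.
Stage = Callable[[Message], Optional[str]]


def case_library_stage(known_bad_texts: Set[str]) -> Stage:
    """Step (2): flag messages that match a stored case exactly."""
    def stage(msg: Message) -> Optional[str]:
        return "case library" if msg["text"] in known_bad_texts else None
    return stage


def blacklist_stage(keywords: Iterable[str], user_ids: Set[str]) -> Stage:
    """Step (3), blacklist part only: flag blocked users and blocked keywords."""
    keyword_list = list(keywords)

    def stage(msg: Message) -> Optional[str]:
        if msg["user_id"] in user_ids:
            return "blacklisted user"
        if any(k in msg["text"] for k in keyword_list):
            return "blacklisted keyword"
        return None
    return stage


class ContactFrequencyStage:
    """Step (5) as described in claim 8: flag contact details seen too often in a time window."""

    CONTACT = re.compile(r"\d{7,}")  # crude stand-in for semantic extraction

    def __init__(self, window_seconds: float, threshold: int) -> None:
        self.window = window_seconds
        self.threshold = threshold
        self.seen: Dict[str, Deque[float]] = defaultdict(deque)

    def __call__(self, msg: Message) -> Optional[str]:
        now = msg["timestamp"]
        flagged = False
        for contact in set(self.CONTACT.findall(msg["text"])):
            times = self.seen[contact]
            times.append(now)
            while times and now - times[0] > self.window:
                times.popleft()  # drop sightings that fell out of the window
            flagged = flagged or len(times) > self.threshold
        return "contact detail repeated too often" if flagged else None


def run_cascade(msg: Message, stages: List[Stage]) -> Tuple[bool, Optional[str]]:
    """Run the stages in order and stop at the first one that flags the message."""
    for stage in stages:
        reason = stage(msg)
        if reason is not None:
            return True, reason
    return False, None
```

Ordering the stages from cheapest to most expensive means most messages are decided before the costlier analyses run, which is one plausible reading of the "if not, further" wording in steps (3) to (6).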
CN201010200588.7A 2010-06-13 2010-06-13 The malicious information filtering system of Internet and method thereof Expired - Fee Related CN102208992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010200588.7A CN102208992B (en) 2010-06-13 2010-06-13 The malicious information filtering system of Internet and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010200588.7A CN102208992B (en) 2010-06-13 2010-06-13 The malicious information filtering system of Internet and method thereof

Publications (2)

Publication Number Publication Date
CN102208992A true CN102208992A (en) 2011-10-05
CN102208992B CN102208992B (en) 2015-09-02

Family

ID=44697664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010200588.7A Expired - Fee Related CN102208992B (en) 2010-06-13 2010-06-13 The malicious information filtering system of Internet and method thereof

Country Status (1)

Country Link
CN (1) CN102208992B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158963A (en) * 2007-10-31 2008-04-09 中兴通讯股份有限公司 Information acquisition processing and retrieval system
CN101340308A (en) * 2008-08-19 2009-01-07 翁时锋 Network rubbish information filtering architecture, Network rubbish information cleaning system and method thereof

Also Published As

Publication number Publication date
CN102208992B (en) 2015-09-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee
CP03 Change of name, title or address

Address after: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat

Patentee after: Tianjin mass information technology Limited by Share Ltd

Address before: 300384 Tianjin City Huayuan Industrial Zone Rong Yuan Road No. 1 North B room 332-323

Patentee before: Tianjin Hylanda Information Technology Co.,Ltd.

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20170215

Address after: 300000 Tianjin Binhai New Area in the new eco city anime Middle Road, building C1, No. 126, 101-134

Patentee after: Tianjin Haina media big data technology development Co. Ltd.

Address before: 300020 Tianjin Heping District, South Road, No. 11 International Building 23 purchase of Wheat

Patentee before: Tianjin mass information technology Limited by Share Ltd

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150902

Termination date: 20200613