CN102208992A

CN102208992A - Internet-facing filtration system of unhealthy information and method thereof

Info

Publication number: CN102208992A
Application number: CN2010102005887A
Authority: CN
Inventors: 陶鹏; 宋传宝; 罗侃; 曹浩
Original assignee: TIANJIN HYLANDA INFORMATION TECHNOLOGY CO LTD
Current assignee: Tianjin Haina media big data technology development Co. Ltd.
Priority date: 2010-06-13
Filing date: 2010-06-13
Publication date: 2011-10-05
Anticipated expiration: 2030-06-13
Also published as: CN102208992B

Abstract

The invention discloses an Internet-facing filtration system for unhealthy information and a method thereof. The filtration system comprises a user data submission module, a user service management system, a customer interaction information auditing platform, a purification service operation platform, a knowledge base, and at least one index engine. The user data submission module is connected to the user service management system, which is connected to the purification service operation platform. The purification service operation platform is connected to the customer interaction information auditing platform and to each index engine. Each index engine is connected to the knowledge base. The invention processes data with several intelligent techniques, including word segmentation, keyword matching, and a vector model, together with several high-performance algorithms. It provides Internet communities with an indexing service for sensitive information, pornographic information, vulgar information, junk information, commercial advertisements, and the like, and thus gives customers an efficient means of information management.

Description

Internet-facing filtration system for unhealthy information and method thereof

Technical field

The present invention relates to a filtration system for unhealthy information and a filtering method thereof, and in particular to a system and method that, based on the characteristics of Internet communities, accurately index and filter unhealthy information such as pornography, vulgar content, flooding (spam posting), and commercial advertisements. The invention belongs to the field of network information security.

Background technology

With the growth of the Internet, websites of all kinds (including portals and special-topic websites) have launched more and more community channels, such as topical forums, blogs, and comment sections. These channels attract a growing number of interactive users and benefit both the websites and their users. At the same time, however, some people use these channels to post commercial advertisements without restraint, and even post large volumes of pornographic, vulgar, or profane content, or content that maliciously attacks competitors. Such unhealthy information disrupts the normal operation of a website, damages its brand and reputation, and seriously affects normal use by other users.

At present, websites generally respond to this situation with the following technical measures:

Keyword restriction: a large keyword dictionary is maintained; when a post contains a keyword, the system issues a warning or deletes the post directly.

Posting-frequency limit: the maximum number of posts that the same IP address or the same user ID may publish per unit time is limited.

These two methods can filter out part of the bad data, but they also have serious shortcomings:

For keyword restriction, many bad posts cannot be judged from one or two keywords alone; they must be judged from the whole passage, the whole sentence, and the context surrounding the keywords. For the posting-frequency limit, the drawback is that it restricts posting by some normal users and, at the same time, is easily circumvented by automated posting tools, so the method is hard to make effective in practice.

Chinese invention patent No. 200510048576.6 discloses a system for intercepting pornographic images and unhealthy information on the Internet. The system combines IP address filtering, keyword filtering, and pornographic image detection. It builds a mathematical model of pornographic images through repeated decision feedback; builds a feature library of standard pornographic images as the basis for deciding whether a network image is pornographic; builds a similarity-matching decision model; and performs content-based image judgment on network information that has passed keyword comparison. The system filters information content at the application layer and filters network addresses at the IP layer, so it can intercept pornographic image information directly and update its URL database in real time, moving from passive address filtering to active content filtering. Its multifunctional management platform integrates the complex relationships among the operating system, the browser, the network protocols, and the image detector, solves the problems of process interaction between client and server and of dividing and recombining the data of the pornographic image detection task, and is independent of the browser.

In addition, Chinese invention patent application No. 200410053683.3 discloses an Internet content filtering system and filtering method. The content filtering system comprises three parts: a content filtering agent (CFA), a query server (QS), and a content analysis and management server (CAMS). The filtering flow is as follows: when a user requests access to a URL, the CFA allows or forbids the request according to the black and white lists set by the user. If the URL is not in the CFA's black and white lists, the CFA sends a query request to the QS. The QS looks up the rating information of the URL in its own URL library and returns the result to the CFA, which responds accordingly. The QS also periodically downloads updated URL rating information from the CAMS. This technical solution can identify unhealthy information on the network and actively prevent Internet users from accessing objectionable websites.

Summary of the invention

The technical problem to be solved by the present invention is to provide an Internet-facing filtration system for unhealthy information and a method thereof that can accurately index and filter unhealthy information such as pornography, vulgar content, flooding, and commercial advertisements.

To achieve the above object, the present invention adopts the following technical solution:

An Internet-facing filtration system for unhealthy information, characterized in that:

The filtration system comprises a user data submission module, a user service management system, a customer interaction information auditing platform, a purification service operation platform, a knowledge base, and at least one index engine; wherein,

The user data submission module is connected to the user service management system, and the user service management system is connected to the purification service operation platform;

The purification service operation platform is connected to the customer interaction information auditing platform and to each index engine;

Each index engine is connected to the knowledge base.

The index engines comprise one or more of an advertisement index engine group, a flooding index engine group, a personalized feature index engine group, a behavioral feature index engine group, a pornography index engine group, a vulgarity index engine group, and a sensitive information index engine group.

The knowledge base comprises one or more of a keyword dictionary, a behavior pattern library, a rule library, a case library, and a training feature library.

The filtration system further comprises an impurity feature library, a non-impurity feature library, and a personalized impurity feature library, each of which is connected to the knowledge base on one side and to the purification service operation platform on the other.

The customer interaction information auditing platform comprises a publishing data module, a feedback data module, and a system effect statistics module. The publishing data module receives data from the purification service operation platform: data marked as normal are published externally as normal posts; data marked as erroneous are sent to the feedback data module for use as training corpus and fed back to the purification service operation platform.

An Internet-facing method for filtering unhealthy information, implemented on the above filtration system, characterized by comprising the following steps:

(1) receiving the messages published in a network community;

(2) calling the case library in the knowledge base to filter the message and judging whether it is unhealthy information;

(3) if not, further filtering with the customer's personalized black and white lists, which comprise keywords, keyword combinations, URLs, IP addresses, and user IDs, and judging whether the message is unhealthy information;

(4) if not, further performing common behavior pattern recognition;

(5) if not, further performing feature behavior pattern recognition;

(6) if not, further calling the various business rules to filter the message;

(7) combining the filtering results of steps (2) to (6) to obtain the final filtering result, and saving it to the database;

(8) returning the final indexing result to the client.

During filtering, keyword matching is first performed on the message text. If no keyword is hit, the prediction result is set to "no deletion needed". If a keyword is hit, the text is converted into a vector space model and a prediction is made on the resulting vector; the prediction result is a confidence score. According to the confidence score and preset thresholds, each message is classified as "delete", "suspected delete", or "no deletion needed", and messages in the "suspected delete" class are passed to manual review.

In step (5), feature behavior pattern recognition means analyzing the published content of the network community as a whole: all characteristic contact information is extracted by semantic recognition, its frequency of occurrence within a given time period is calculated and compared with a preset threshold, and content exceeding the threshold is regarded as unhealthy information.

The filtration system and method provided by the present invention process data with several intelligent techniques (word segmentation, keyword matching, and a vector model) and several high-performance algorithms. They provide Internet communities with an indexing service for sensitive, pornographic, vulgar, flooding, and commercial advertising information, and thus give customers an efficient means of information management.

Description of drawings

The present invention is described in further detail below in conjunction with the drawings and specific embodiments.

Fig. 1 is a schematic diagram of the overall structure of the unhealthy information filtration system provided by the present invention;

Fig. 2 is a schematic flow diagram of how the filtration system filters unhealthy information;

Fig. 3 is an example diagram of a statistical model based on supervised learning;

Fig. 4 is a schematic flow diagram of using a statistical model based on supervised learning;

Fig. 5 is a schematic flow diagram of handling spam posts with the keyword-plus-statistical-model framework of the present invention.

Embodiment

To improve the filtering effect, the inventors analyzed a large amount of Internet community data and classified unhealthy information from both an operational and a technical perspective.

Operational classification: unhealthy information is divided into commercial advertising, pornography, vulgar content, flooding, and customer-specific categories, and each category is further subdivided. For example, commercial advertising is divided into a numeric class (QQ numbers, telephone numbers, mobile numbers, invoices, price quotations, etc.) and a domain-name class (MSN accounts, web addresses, e-mail addresses, etc.).

Technical classification: exact-match (rigid) identification, behavior pattern recognition, flexible identification, keyword black and white list identification, and various business rules (different business rules use different algorithms; see the algorithm descriptions for details). Specifically:

Exact-match (rigid) identification: the false deletions and missed deletions reported by customers are treated as rigid data and used to index later posts with identical content. When such feedback is received, a unique value is computed over the whole message and saved to a database (called the rigid library). When new messages arrive for filtering, the same algorithm is applied to each message and the resulting value is compared with the values in the rigid library; on a match, the message can be judged directly as normal information (or as unhealthy information).
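The exact-match lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only says that "a unique value" is computed over the whole message, so the use of SHA-256 is an assumption, and the names `fingerprint`, `ExactMatchLibrary`, `add_feedback`, and `lookup` are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(text: str) -> str:
    """Digest of the whole message (SHA-256 is an assumed choice of hash)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ExactMatchLibrary:
    """In-memory stand-in for the rigid library fed by customer feedback."""

    def __init__(self) -> None:
        self._labels: Dict[str, str] = {}  # fingerprint -> "normal" | "unhealthy"

    def add_feedback(self, text: str, label: str) -> None:
        # A missed deletion is stored as "unhealthy", a false deletion as "normal".
        self._labels[fingerprint(text)] = label

    def lookup(self, text: str) -> Optional[str]:
        # Returns the stored label for identical content, or None if unseen.
        return self._labels.get(fingerprint(text))


library = ExactMatchLibrary()
library.add_feedback("Cheap invoices, call 13800000000", "unhealthy")
library.add_feedback("Selling my old bike, QQ 12345", "normal")
```

Because only identical content produces the same digest, this stage catches verbatim reposts and nothing else; near-duplicates are left to the flexible identification stage.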

Behavior pattern recognition: the behavior pattern of the data is analyzed by counting how often the same IP address, the same user ID, or the same key content appears within a period of time.
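One way to picture this counting is a sliding-window counter keyed by IP, user ID, or content. The sketch below is an assumption-laden illustration: the window length, the limit, and the names `FrequencyCounter` and `record` are hypothetical and do not come from the patent.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Hashable


class FrequencyCounter:
    """Counts events per key inside a sliding time window."""

    def __init__(self, window_seconds: float, max_events: int) -> None:
        self.window_seconds = window_seconds
        self.max_events = max_events
        self._events: Dict[Hashable, Deque[float]] = defaultdict(deque)

    def record(self, key: Hashable, timestamp: float) -> bool:
        """Record one event; return True if the key now exceeds max_events."""
        queue = self._events[key]
        queue.append(timestamp)
        # Drop events that have fallen out of the window.
        while queue and queue[0] <= timestamp - self.window_seconds:
            queue.popleft()
        return len(queue) > self.max_events


counter = FrequencyCounter(window_seconds=60, max_events=3)
# Five posts from the same IP at t = 0, 10, 20, 30 and 100 seconds.
flags = [counter.record(("ip", "203.0.113.7"), t) for t in (0, 10, 20, 30, 100)]
```

The same counter can be keyed by `("id", user_id)` or by a content fingerprint to cover the three dimensions named in the text.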

Flexible identification: also called near-duplicate text detection (see the algorithm descriptions for details). By training on the data from a period of time, it identifies content that appears frequently and is highly similar.
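The patent does not spell out its near-duplicate algorithm in this text, so the following sketch swaps in a well-known stand-in: character shingles compared by Jaccard similarity. The shingle length, the threshold, and the names `shingles`, `jaccard`, and `is_near_duplicate` are all assumptions made for illustration.

```python
from typing import Iterable, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of overlapping k-character substrings, ignoring whitespace."""
    cleaned = "".join(text.split())
    return {cleaned[i:i + k] for i in range(max(len(cleaned) - k + 1, 1))}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text: str, recent_texts: Iterable[str],
                      threshold: float = 0.6) -> bool:
    """True if the text is similar enough to any recently seen text."""
    candidate = shingles(text)
    return any(jaccard(candidate, shingles(old)) >= threshold
               for old in recent_texts)
```

For example, "abcdefgh" and "abcdefgx" share five of seven distinct shingles, so they count as near-duplicates at the assumed threshold of 0.6, while texts with no shared shingles do not.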

Keyword black and white lists: each website can configure black and white lists according to its needs (keywords, keyword combinations, user IPs, user IDs, content URLs, etc.), and published content is matched against them.
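A per-site list check might look like the sketch below. The field names, the rule that a white-list hit takes precedence, and the `check_lists` function are assumptions for illustration; the patent only states which kinds of entries the lists may contain.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class SiteLists:
    """Hypothetical container for one website's black and white lists."""
    black_keywords: Set[str] = field(default_factory=set)
    # Each combination matches only if all of its words appear in the text.
    black_combos: List[Tuple[str, ...]] = field(default_factory=list)
    black_ips: Set[str] = field(default_factory=set)
    black_user_ids: Set[str] = field(default_factory=set)
    white_user_ids: Set[str] = field(default_factory=set)


def check_lists(lists: SiteLists, text: str, ip: str,
                user_id: str) -> Optional[str]:
    """Return "white", "black", or None when no list entry matches."""
    if user_id in lists.white_user_ids:  # assumed precedence: white list wins
        return "white"
    if ip in lists.black_ips or user_id in lists.black_user_ids:
        return "black"
    if any(keyword in text for keyword in lists.black_keywords):
        return "black"
    if any(all(word in text for word in combo) for combo in lists.black_combos):
        return "black"
    return None


lists = SiteLists(
    black_keywords={"invoice"},
    black_combos=[("cheap", "call")],
    black_ips={"198.51.100.9"},
    white_user_ids={"moderator"},
)
```

A result of `None` means the message falls outside the lists and moves on to the later stages.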

Business rules: because different kinds of business data take different forms, a different intelligent identification method can be used for each kind. These include semi-automatic flexible identification of numbers, pattern-based identification of domain names and e-mail addresses, identification of vertically typeset text, and the keyword-plus-statistical-model framework, among others.

As shown in Fig. 1, the unhealthy information filtration system provided by the present invention comprises a user data submission module, a user service management system, a customer interaction information auditing platform, a purification service operation platform, index engines for the various cases, and the corresponding knowledge base. The user data submission module submits interactive text and user identity information to the user service management system, which passes the relevant data to the purification service operation platform in UID-xml form. The purification service operation platform is the core of the filtration system. It connects to each index engine and obtains knowledge and rule information from them, and it also feeds knowledge and rule information back to the impurity feature library, the non-impurity feature library, and the personalized impurity feature library. The customer interaction information auditing platform comprises a List (publishing) data module, a feedback data module, and a system effect statistics module. The List (publishing) data module receives data from the purification service operation platform: data marked as normal are published externally as normal posts; data marked as erroneous are sent to the feedback data module for use as training corpus and fed back to the purification service operation platform. The purification service operation platform also sends statistical effect analysis results to the system effect statistics module.

The index engines used in the filtration system include an advertisement index engine group, a flooding index engine group, a personalized feature index engine group, a behavioral feature index engine group, a pornography index engine group, a vulgarity index engine group, and a sensitive information index engine group, which handle commercial advertising, pornography, vulgar content, flooding, customer-specific content, and so on. The engine groups can be extended as the network community requires. Each index engine connects to the knowledge base and obtains from it the knowledge and rules used to filter unhealthy information. The knowledge base includes a keyword dictionary (logical implication library), a behavior pattern library, a rule library, a case library (rigid library), and a training feature library. On the basis of these index engines and the knowledge base, the filtration system unifies rules of multiple dimensions and offers each customer a service combination matched to its needs. Identification by multiple rules improves recognition and overcomes the weak results of any single rule.

The filtration system provides the following four classes of functional interfaces:

One. Index interface

After receiving and parsing the client's request data, the purification service operation platform reads the filtering rules and personalized settings configured by the customer, calls the corresponding filtering algorithms (interfacing with the core algorithm service, which supports each rule and filtering mechanism), determines whether the post is spam, and returns the result to the client.

Two. Feedback interface

When the customer performs a "delete" operation on data the system failed to delete, or a "restore" operation on data the system deleted by mistake, the client sends the data to the server through this interface, where they are saved to the database. These data become rigid-library data and take effect directly on subsequent data.

Three. Settings interface

This interface receives the configuration data set by the customer (the customer can set personalized black and white lists, including keywords, IPs, IDs, and image link addresses), saves them to the database, and applies them in real time.

Four. Notification interface

When new filter words or new rules are added, the filtration system re-indexes the historical normal data retained in the system (by default, data from the current and previous month) and saves the data whose new indexing result is "spam". The client can periodically retrieve these data through the notification interface and delete them.
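The re-indexing behind the notification interface can be illustrated with a very small sketch. Everything here is hypothetical: the record layout, the `reindex_retained` helper, and the keyword-only `is_spam` check stand in for whatever rules the real system re-applies.

```python
from typing import Callable, Dict, List

filter_words = {"invoice"}  # assumed current filter vocabulary


def is_spam(text: str) -> bool:
    """Toy filter: a post is spam if it contains any current filter word."""
    return any(word in text for word in filter_words)


def reindex_retained(posts: List[Dict], check: Callable[[str], bool]) -> List[int]:
    """Re-run the filter over retained normal posts; return IDs now judged spam."""
    return [post["id"] for post in posts if check(post["text"])]


# Posts that were judged normal when first published and are still retained.
retained = [
    {"id": 1, "text": "Nice photo"},
    {"id": 2, "text": "lottery tips inside"},
]

before = reindex_retained(retained, is_spam)
filter_words.add("lottery")  # a new filter word is added
pending_deletion = reindex_retained(retained, is_spam)
```

After the new word is added, `pending_deletion` holds the IDs the client would fetch through the notification interface and delete.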

Fig. 2 shows how the filtration system handles unhealthy information. The system first receives the messages published in the network community and calls the case library (rigid library) in the knowledge base to judge whether a message is unhealthy information. It then filters with the customer's personalized black and white lists, that is, by keywords, keyword combinations, URLs, IP addresses, user IDs, and so on. If the message is not covered by those lists, common behavior pattern recognition and feature behavior pattern recognition are performed. After these checks, the various business rules (advertising, vulgarity, etc.) are called, yielding the final filtering result, which is saved to the database; the indexing result is then returned to the client.
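The staged flow of Fig. 2 can be sketched as a chain in which each stage either returns a verdict or passes the message on. This is a simplified reading under stated assumptions: the sketch stops at the first stage that reaches a verdict, whereas the method also speaks of combining the stage results; the stage functions and the `run_pipeline` name are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Optional, Tuple

Stage = Tuple[str, Callable[[Dict], Optional[str]]]


def run_pipeline(message: Dict, stages: List[Stage]) -> Dict:
    """Run stages in order and stop at the first one that returns a verdict."""
    for name, check in stages:
        verdict = check(message)
        if verdict is not None:
            return {"verdict": verdict, "decided_by": name}
    return {"verdict": "normal", "decided_by": "no stage matched"}


# Toy stand-ins for the real stages.
def case_library(msg: Dict) -> Optional[str]:
    return "unhealthy" if msg["text"] == "known spam text" else None


def black_white_lists(msg: Dict) -> Optional[str]:
    return "unhealthy" if msg["ip"] == "198.51.100.9" else None


def behavior_patterns(msg: Dict) -> Optional[str]:
    return None  # placeholder: no behavioral evidence in this toy example


def business_rules(msg: Dict) -> Optional[str]:
    return "unhealthy" if "invoice" in msg["text"] else None


STAGES: List[Stage] = [
    ("case library", case_library),
    ("black/white lists", black_white_lists),
    ("behavior patterns", behavior_patterns),
    ("business rules", business_rules),
]
```

Ordering the cheap exact-match and list checks first means most traffic never reaches the more expensive stages.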

The filtration system uses a new technique that combines semantic recognition with behavior analysis, called feature behavior analysis (also feature behavior pattern recognition). Feature behavior analysis examines the published content of the network community as a whole: all characteristic contact information is extracted by semantic recognition, and its frequency of occurrence within a given time period is calculated and compared with a preset threshold. Content exceeding the threshold is regarded as unhealthy information.

In the present invention, feature behavior analysis is used mainly to identify commercial advertising posts. The technique is described below:

Many Internet communities allow individuals or organizations to post contact details (QQ numbers, telephone numbers, etc.) to increase user interaction, but do not allow posts of an advertising nature. Whether such a post is spam therefore cannot be judged by a fixed criterion and is highly subjective. A business rule based on semantic recognition alone (for example, deleting every post that contains contact details) would inevitably delete many posts by mistake.

Analysis of a large amount of community content (mainly posts containing contact details, both normal and unhealthy) reveals a pattern: normal information is generally posted only a few times in one or a few boards, whereas unhealthy information is posted continually and at high frequency across many boards, usually with the same contact details. Thresholds can therefore be set for the number of boards in which the same information is posted and for its frequency of occurrence per unit time; when a threshold is exceeded, the information is judged to be unhealthy information of a commercial advertising nature.
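The two thresholds described above (number of boards and frequency per unit time) can be illustrated as follows. This is only a sketch: a crude digit-run regular expression stands in for the semantic extraction of contact details, and the class name `ContactTracker`, its parameters, and the threshold values are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Crude stand-in for semantic extraction: runs of 5 to 11 digits.
CONTACT_PATTERN = re.compile(r"\d{5,11}")


class ContactTracker:
    """Tracks where and how often each contact detail has been posted."""

    def __init__(self, max_boards: int, max_posts: int,
                 window_seconds: float) -> None:
        self.max_boards = max_boards
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self._sightings: Dict[str, List[Tuple[float, str]]] = defaultdict(list)

    def observe(self, text: str, board: str, timestamp: float) -> Set[str]:
        """Return the contacts in this post that exceed either threshold."""
        flagged: Set[str] = set()
        for contact in set(CONTACT_PATTERN.findall(text)):
            sightings = self._sightings[contact]
            sightings.append((timestamp, board))
            # Keep only sightings inside the time window.
            recent = [(t, b) for t, b in sightings
                      if t > timestamp - self.window_seconds]
            self._sightings[contact] = recent
            boards = {b for _, b in recent}
            if len(boards) > self.max_boards or len(recent) > self.max_posts:
                flagged.add(contact)
        return flagged


tracker = ContactTracker(max_boards=2, max_posts=3, window_seconds=3600)
first = tracker.observe("Invoices, QQ 123456789", "cars", 0)
second = tracker.observe("Invoices, QQ 123456789", "jobs", 10)
third = tracker.observe("Invoices, QQ 123456789", "travel", 20)
```

A user who posts a phone number once or twice in a single board stays under both thresholds, which is the point of combining extraction with behavior counts rather than deleting every post that contains a contact detail.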

Combining semantic recognition with behavior analysis thus solves the problem that no fixed criterion exists for commercial advertising.

In addition, the filtration system applies a keyword-plus-statistical-model framework to spam posts, described below:

Statistical models based on supervised learning are now widely used in fields such as text classification and image classification. Such a model follows the framework shown in Fig. 3: data for a number of categories are collected or labeled manually, and a learning algorithm produces a model that can recognize those categories.

Commonly used statistical models include the SVM (support vector machine), the maximum entropy model, the logistic regression model, and the naive Bayes model. For more on these models, see Machine Learning by T. M. Mitchell (China Machine Press, March 2008 edition, ISBN 9787111109938); they are not described in detail here.

Automatic identification of spam posts can be regarded as a special case of automatic text classification, which usually uses the statistical-model framework described above. After more than forty years of development, many researchers have reported that automatic text classification with supervised statistical models gives the best prediction results. For spam post identification, the flow is shown in Fig. 4: a large number of comments labeled as spam and non-spam are first collected, the comment text is converted into a vector space model, and the learning algorithm of the statistical model produces the corresponding predictions.

The vector space model is a very common way of modeling text. Its main idea is to treat each distinct word as a separate dimension. For a given document, the weight of each dimension is usually computed as TF × IDF, where TF is the number of times the word occurs in the document and IDF is the inverse document frequency of the word, usually computed with the standard formula IDF_word = log(N / DF_word), where N is the total number of documents and DF_word is the number of documents in which the word appears. Further explanation can be found in the textbook Modern Information Retrieval by Ricardo Baeza-Yates et al. (China Machine Press, April 2005 edition, ISBN 7-111-15878-4).
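As a worked illustration of the TF × IDF weighting, the sketch below builds one sparse vector per document. It assumes the common form IDF = log(N / DF) with a natural logarithm, and it assumes the documents are already split into words (for Chinese text, a word segmentation step would come first). The function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_vectors(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: TF * IDF} mapping per tokenized document."""
    n = len(documents)
    # DF: number of documents in which each word appears at least once.
    df: Counter = Counter()
    for tokens in documents:
        df.update(set(tokens))
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # TF: raw count of the word in this document
        vectors.append({word: tf[word] * math.log(n / df[word]) for word in tf})
    return vectors


docs = [
    ["cheap", "invoice", "invoice"],
    ["cheap", "bike"],
]
vectors = tf_idf_vectors(docs)
```

In this toy corpus "cheap" appears in both documents, so its IDF is log(2/2) = 0 and it carries no weight, while "invoice" appears twice in one document only and gets weight 2 × log 2.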

The inventors further propose a technical solution that classifies spam posts automatically with a keyword-plus-statistical-model framework.

Keywords are manually compiled sets of words used to filter posts and to distinguish spam posts from normal posts. For example, among spam posts in the current-politics category, "Falun Gong" is a keyword. If a post hits this keyword, it can be classified directly as spam, or assigned to the corresponding category after manual review.

At present, many websites and forums screen comments or blog text by keyword and then decide by manual review whether the text should be deleted. This approach, however, returns a large amount of text that does not need to be deleted. Using the names of state leaders as keywords, for example, returns many texts that do not need deletion. Using keywords alone therefore still requires a great deal of manual work.

Using a statistical model alone has the following problems:

1) It cannot respond in real time to rapidly changing requirements. A statistical model needs a certain amount of labeled data, so for a new deletion requirement it takes time to collect data and train a new model. Suppose, for example, that advertising posts for illegal goods such as invoices and guns appear in a forum and the existing system cannot identify them. With the statistical approach, these posts must be collected and labeled, and a model must then be built and released. A purely supervised statistical approach therefore cannot meet a requirement to bring posted content under control within a short time (for example, within one minute).

2) It is slow. Because of what the algorithms require, the statistical approach is tens to hundreds of times slower than the keyword approach. At very high data throughput (tens of MB per second), the statistical approach struggles to meet actual demand, or does so only at great cost (a distributed computing platform or another solution is needed).

As shown in Fig. 5, the present invention assumes that a manually compiled keyword set and a statistical model already exist. For a given text, the flow is as follows:

1. Perform keyword matching on the text. If no keyword is hit, set the prediction result to "no deletion needed" and skip steps 2 to 5. If a keyword is hit, go to step 2.

2. Convert the text into a vector space model as described above.

3. Make a prediction on this vector. The prediction result is a confidence score (different statistical models give different confidence ranges, but every model yields a confidence value).

4. According to the confidence score and preset thresholds, classify each message as "delete", "suspected delete", or "no deletion needed". For example, the scope of deletion for current-politics content is usually broad, so the threshold for "delete" is set lower; the scope of deletion for the profanity category is narrow, so its threshold for "delete" can be set higher.

5. Manually review the "suspected delete" class obtained in the previous step.
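The five steps above can be sketched as a keyword gate followed by a thresholded confidence score. This is an illustration under assumptions: `toy_score` is a placeholder for the trained statistical model (the patent uses an SVM over the vector space model), and the threshold values and the names `classify` and `THRESHOLDS` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical (delete, suspect) thresholds per category. The text only says
# that politics gets a lower "delete" threshold than profanity.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "politics": (0.3, 0.1),
    "profanity": (0.8, 0.5),
}


def classify(text: str, keywords: Iterable[str],
             score_fn: Callable[[str], float],
             delete_threshold: float, suspect_threshold: float) -> str:
    """Keyword gate first; only keyword hits are scored by the model."""
    if not any(keyword in text for keyword in keywords):
        return "no deletion needed"  # step 1: skip the model entirely
    confidence = score_fn(text)      # steps 2-3: vectorize and predict
    if confidence >= delete_threshold:  # step 4: three-way split
        return "delete"
    if confidence >= suspect_threshold:
        return "suspected delete"    # step 5: goes to manual review
    return "no deletion needed"


def toy_score(text: str) -> float:
    """Placeholder for a trained model's confidence output."""
    return 0.9 if "call" in text else 0.6


delete_t, suspect_t = THRESHOLDS["profanity"]
```

Because the cheap keyword test runs first, the slower model only sees the small share of messages that hit a keyword, and only the middle band of scores reaches a human reviewer.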

A concrete example is given in the following table:

In the present invention, the SVM (support vector machine) algorithm is used for the statistical model part. Other statistical models, such as the maximum entropy model or the naive Bayes model mentioned above, can be substituted. The models differ in their modeling and prediction algorithms, and each performs differently in different application environments.

The vector space model is represented using the formula given earlier. Different formulas give different results, so they should be evaluated and weighed in practice before use.

With the above filtering algorithm, the filtration system can effectively solve the three problems mentioned above: the poor accuracy of keyword-based systems, and the lack of real-time customization and the slow speed of statistics-based methods.

The filtration system also has the following characteristics:

1) For highly sensitive content that must be deleted (for example, comments containing the names of state leaders), the system can bring in manual review to safeguard the website. Compared with using keywords alone, the system substantially reduces the amount of manual review (a saving of more than 70%).

2) As the amount of labeled data increases, the accuracy of the system rises and gradually converges to a fairly stable value.

In general, the underlying platform of the filtration system embeds several intelligent techniques (word segmentation, keyword matching, and a vector model) and several high-performance algorithms. It analyzes the formal features of the data (layout, use of symbols), content features (writing style; words, named entities, sentences, and discourse), and behavior patterns, and achieved world-class recognition results in public evaluations.

The Internet-facing filtration system for unhealthy information and its method have been described in detail above. For those skilled in the art, any obvious modification made without departing from the essence of the present invention constitutes an infringement of the patent right of the present invention and will bear the corresponding legal liability.

Claims

1. An Internet-facing filtration system for unhealthy information, characterized in that:

Each index engine is connected to the knowledge base.

2. The filtration system according to claim 1, characterized in that:

The index engines comprise one or more of an advertisement index engine group, a flooding index engine group, a personalized feature index engine group, a behavioral feature index engine group, a pornography index engine group, a vulgarity index engine group, and a sensitive information index engine group.

3. The filtration system according to claim 1, characterized in that:

4. The filtration system according to claim 1, characterized in that:

5. The filtration system according to claim 1, characterized in that:

6. An Internet-facing method for filtering unhealthy information, implemented on the filtration system according to claim 1, characterized by comprising the following steps:

(1) receiving the messages published in a network community;

(4) if not, further performing common behavior pattern recognition;

(5) if not, further performing feature behavior pattern recognition;

(6) if not, further calling the various business rules to filter the message;

(8) returning the final indexing result to the client.

7. The filtering method according to claim 6, characterized in that:

During filtering, keyword matching is first performed on the message text. If no keyword is hit, the prediction result is set to "no deletion needed". If a keyword is hit, the text is converted into a vector space model and a prediction is made on the resulting vector; the prediction result is a confidence score. According to the confidence score and preset thresholds, each message is classified as "delete", "suspected delete", or "no deletion needed", and messages in the "suspected delete" class are passed to manual review.

8. The filtering method according to claim 6, characterized in that: