A short message filtering method based on user creditworthiness and message spam degree
Technical field
The invention belongs to short message processing technology in the field of Internet communication. Specifically, it relates to a method, based on user creditworthiness and message spam degree, for supervising and filtering the content that users submit for distribution through the short message platform of an Internet communication system.
Background art
In recent years, with the rapid development of mailbox services, some senders of spam messages have deliberately exploited the free short message channels that certain mailboxes provide (such as the 139 mailbox) as a tool for illicit profit or for other concealed purposes. As a value-added service of mobile communication, short messaging gives people a cheap and convenient means of communication, but it has also given rise to a large volume of spam messages that spread harmful information such as pornography, commercial fraud, and commercial advertising. These spam messages seriously disrupt people's lives and endanger public safety, and the problem of regulating them has drawn wide attention from all sectors of society. Besides strengthening the regulation of published information through legislation, it is even more important to develop effective spam filtering techniques at the technical level.
In the prior art there are two main approaches to filtering spam messages: keyword-based filtering and content-based filtering.
In keyword-based filtering, the system is configured in advance with a set of keywords; any message whose content contains one of these keywords is classified as spam and intercepted. This method relies on a single criterion and therefore produces a large number of misjudgments.
Content-based filtering uses machine learning to classify messages as normal or spam. The machine learning methods currently used for short message classification mainly include Bayesian methods, SVM, KNN, and artificial neural networks. This approach also suffers from misjudgments.
Summary of the invention
The object of the present invention is to provide a method, based on user creditworthiness and message spam degree, for supervising and filtering the content that users submit for distribution.
To achieve the above object, the short message filtering method based on user creditworthiness and message spam degree of the present invention comprises the following steps:
A) assign each user an initial creditworthiness according to how active the user is;
B) text preprocessing: first remove ordinary punctuation marks from the text, identify the interference characters configured in the system, record their count and remove them, and replace specially encoded digits and pictographic codes;
C) extract phone numbers and URL addresses, and extract the behavioral features related to the message;
D) add a base spam degree attribute to each keyword, perform keyword matching on the text preprocessed in step B), and record every keyword matched;
E) define similar content, and calculate the message spam degree based on similarity;
F) combine the user creditworthiness and the message spam degree to decide whether to intercept.
The invention aims to score message content and user behavior together so that the two reinforce each other, and then to combine the result with the user's creditworthiness to determine whether a message is spam, intercepting as many spam messages as possible while reducing the impact of false positives on users with high creditworthiness.
The invention assigns each user an initial creditworthiness according to how active the user is, then uses Hadoop to extract, on a daily basis, counts of each user's activity in each service, and maintains user creditworthiness in real time.
Text preprocessing is then performed. Ordinary punctuation marks are first removed from the text; the interference characters configured in the system (such as ぁ) are identified, counted, and removed; and specially encoded digits and pictographic codes (such as 4., 〇) are replaced.
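The preprocessing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess` and the three lookup tables are assumptions, and only the characters ぁ, 〇 and ⒀ come from the text; every other table entry is hypothetical.

```python
# Assumed configuration. The source names "ぁ" as an interference character and
# "〇" and "⒀" as specially encoded digits; every other entry is hypothetical.
INTERFERENCE_CHARS = {"ぁ", "ㄨ"}
ENCODED_DIGITS = {"〇": "0", "①": "1", "②": "2", "④": "4", "⒀": "13"}
ORDINARY_PUNCTUATION = set(",!?;，。！？；、")


def preprocess(text):
    """Return (cleaned_text, interference_count) for one message."""
    cleaned = []
    interference_count = 0
    for ch in text:
        if ch in INTERFERENCE_CHARS:
            # Interference characters are counted, then dropped.
            interference_count += 1
            continue
        if ch in ORDINARY_PUNCTUATION:
            # Ordinary punctuation is dropped without being counted.
            continue
        # Specially encoded digits are replaced by plain digits.
        cleaned.append(ENCODED_DIGITS.get(ch, ch))
    return "".join(cleaned), interference_count
```

The interference count is returned alongside the cleaned text so that later scoring steps can use it as a feature. Characters such as "." and "/" are deliberately left out of the assumed punctuation set so that URL addresses survive preprocessing.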
Based on the content produced by the second step, phone numbers and URL addresses are extracted, and it is determined whether each phone number appears in the original string. Behavioral features of the sending user are extracted, for example: login from a different location, newly registered user, high message delivery failure rate (extensible). Features of similar content are extracted, for example: sender region distribution, sender login IP distribution, recipient region distribution, sending frequency (extensible). The spam degree is calculated from the extracted features, and spam identification is performed.
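A sketch of the phone number and URL extraction might look like the following. The regular expressions are assumptions (11-digit mobile numbers beginning with 1, and simple http/https/www URLs), as are the function and key names; the source does not specify the patterns it uses.

```python
import re

# Assumed patterns, not taken from the source.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
URL_RE = re.compile(r"(?:https?://|www\.)[A-Za-z0-9./?=&_%-]+")


def extract_contact_features(original, cleaned):
    """Extract phone numbers and URLs from the preprocessed text."""
    phones = PHONE_RE.findall(cleaned)
    return {
        "phones": phones,
        "urls": URL_RE.findall(cleaned),
        # A number absent from the original string was only revealed by
        # preprocessing, i.e. it was disguised with interference characters.
        "disguised_phones": [p for p in phones if p not in original],
    }
```

Comparing each extracted number against the original string is one way to realize the check on whether the phone number appears in the original content: a number that only shows up after cleaning is a strong hint of deliberate obfuscation.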
A base spam degree attribute is added to each keyword. Keyword matching is performed on the preprocessed text, and every matched keyword is recorded. The spam degree is calculated from the matched keywords and combined with the result computed in the third step, and spam identification is performed.
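The keyword step can be illustrated with a small sketch. The keyword table, its values, and the function name are hypothetical; the source only states that each keyword carries a base spam degree and that the keyword result is added to the total from the previous step.

```python
# Hypothetical keyword table: each keyword carries a base spam degree.
KEYWORD_SPAM_DEGREE = {"invoice": 30, "lottery": 40, "loan": 20}


def match_keywords(cleaned_text, previous_degree=0):
    """Return (matched_keywords, accumulated_spam_degree)."""
    # Record every keyword found in the preprocessed text.
    matched = [kw for kw in KEYWORD_SPAM_DEGREE if kw in cleaned_text]
    # Add the keyword contribution to the degree from earlier steps.
    degree = previous_degree + sum(KEYWORD_SPAM_DEGREE[kw] for kw in matched)
    return matched, degree
```

For example, `match_keywords("cheap loan and lottery prize", 10)` matches "lottery" and "loan" and returns an accumulated degree of 70.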
Similar content is then defined. The spam degree is calculated based on similarity and combined with the result of the fourth step, and spam identification is performed.
Finally, the user creditworthiness and the message spam degree are combined to decide whether to intercept. If the spam degree is moderate, the message is allowed to be delivered and the user's creditworthiness is deducted at the same time.
Because it is based on both user creditworthiness and message spam degree, the invention filters spam messages more accurately and reduces misjudgments.
Description of drawings
Fig. 1 is a flow chart of spam message filtering according to an embodiment of the present invention;
Fig. 2 is a flow chart of user creditworthiness maintenance according to an embodiment of the present invention;
Fig. 3 is a flow chart of an embodiment of the text preprocessing step shown in Fig. 1;
Fig. 4 is a flow chart of an embodiment of the behavioral feature processing step shown in Fig. 1;
Fig. 5 is a flow chart of an embodiment of the keyword matching step shown in Fig. 1;
Fig. 6 is a flow chart of an embodiment of the similarity determination step shown in Fig. 1;
Fig. 7 is a flow chart of an embodiment of the suspected spam message processing step shown in Fig. 1.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Figs. 1 to 7 are flow charts of spam message filtering according to an embodiment of the present invention. In this example, the spam filtering method of the invention is embodied in the feature processing step, the keyword processing step, and the similarity determination step, as well as in the normal message handling flow, the suspected spam handling flow, and the spam message handling flow. The normal message, suspected spam, and spam message handling flows mainly provide the data that supports user creditworthiness maintenance.
In this example, the spam filtering method scores a message according to its text and features to determine whether it is spam. Behavioral feature processing, keyword matching, and similarity determination are applied in turn, and combining the three methods improves the accuracy of spam identification.
In this example, the spam filtering method is also combined with blacklist/whitelist filtering: a blacklisted user has a creditworthiness of 0 and is forbidden from sending any message, while a whitelisted user has a creditworthiness of 1 and the messages that user sends are treated as normal by default.
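The blacklist/whitelist shortcut can be sketched as a check that runs before any scoring. The function name and the returned labels are assumptions; the values 0 and 1 are the ones stated above.

```python
def list_shortcut(creditworthiness):
    """Return a verdict for listed users, or None if scoring should continue."""
    if creditworthiness == 0:
        # Blacklisted: forbidden from sending any message.
        return "block"
    if creditworthiness == 1:
        # Whitelisted: messages are treated as normal by default.
        return "normal"
    # Any other value goes through the full scoring pipeline.
    return None
```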
The five processing flows are described in detail below.
User creditworthiness maintenance flow: This flow comprises three parts, namely creditworthiness initialization, deduction of creditworthiness for violations, and accumulation of creditworthiness for active behavior. Violations that deduct creditworthiness include submitting spam messages and sending suspected spam messages, and they are deducted in real time; creditworthiness for active behavior is accumulated through scheduled Hadoop analysis. Creditworthiness initialization rule:
Behavioral feature processing flow: This flow mainly extracts the behavioral features related to a message. For example, commercial messages generally contain a phone number or URL address, and their key content is interspersed with interference characters or uses specially encoded characters (such as 4./⒀). Spam messages are also typically sent in bulk, so it is also necessary to analyze similar content by IP distribution, recipient region distribution, sender region distribution, and so on. This information is combined to calculate the spam degree of the message, which is then used to determine whether it is spam. During intermediate processing, the only determination made is whether the message is spam, that is, whether the spam degree exceeds the predetermined spam threshold. If the message is judged to be spam, creditworthiness is deducted.
Keyword matching flow: Common keywords, combination keywords, and sensitive keywords are first each assigned a spam value attribute. This flow then performs keyword matching on the preprocessed text, calculates the spam degree for the matched keywords, and adds it to the previously accumulated total spam degree. Finally, the spam degree is used to determine whether the message is spam. During intermediate processing, the only determination made is whether the message is spam, that is, whether the spam degree exceeds the predetermined spam threshold. During keyword matching, regular expression matching may also be performed against the original content string.
Similarity determination flow: This flow performs fingerprint similarity matching against historical spam messages that have already been intercepted. It calculates the maximum similarity, accumulates the similarities that exceed a certain value, converts the result together with the extracted related attributes into a spam degree (a Bayesian algorithm may also be used to classify the text), and adds it to the previously accumulated total spam degree. If the total spam degree is below the suspected spam threshold, the message is handled directly as a normal message; if it is above the spam threshold, the message is judged to be spam; otherwise, the suspected spam handling flow is carried out.
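The similarity and threshold logic can be sketched as follows. The source does not say which fingerprint algorithm is used, so this example substitutes Jaccard similarity over character 3-grams as a stand-in; the cutoff, weight, and both threshold values are hypothetical.

```python
def shingles(text, n=3):
    """Character n-gram set, used here as a stand-in fingerprint."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of the two texts' n-gram sets, in [0, 1]."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def similarity_degree(text, blocked_history, min_similarity=0.5, weight=100):
    """Return (max_similarity, spam_degree) against intercepted messages."""
    sims = [jaccard(text, old) for old in blocked_history]
    max_sim = max(sims, default=0.0)
    # Only similarities above the cutoff contribute to the spam degree.
    accumulated = sum(s for s in sims if s > min_similarity)
    return max_sim, accumulated * weight


def triage(total_degree, suspected_threshold=50, spam_threshold=100):
    """Map the total spam degree to one of the three handling flows."""
    if total_degree < suspected_threshold:
        return "normal"
    if total_degree > spam_threshold:
        return "spam"
    return "suspected"
```

With these assumed thresholds, a total degree of 70 falls between the two limits and is routed to the suspected spam handling flow, where the user's creditworthiness decides the outcome.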
Suspected spam handling flow: The decision is based on user creditworthiness, and the message is processed as follows:
When a suspected spam message is authorized for delivery, creditworthiness is deducted according to the user's creditworthiness and the suspected spam degree. The formula is given below, where creditworthiness is assumed to be divided into n tiers denoted C1, C2, ..., Cn, with C1 the highest; T1, T2, ..., Tn denote the number of messages allowed to be sent between adjacent creditworthiness tiers; B1, B2, ..., Bn denote the baseline spam contribution value of each class; and G is the spam degree:
Creditworthiness deduction value = (C1 - C2) / T1 * (G / B1).
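The deduction formula can be checked with a short worked example. The parameter names and all numeric values are illustrative assumptions; only the formula itself comes from the text.

```python
def credit_deduction(c_upper, c_lower, allowed_sends, spam_degree, baseline):
    """Deduction = (C1 - C2) / T1 * (G / B1), for the user's current tier."""
    # (C1 - C2) / T1 is the cost of one message at the baseline spam degree.
    per_message = (c_upper - c_lower) / allowed_sends
    # G / B1 scales that cost by how far the message is from the baseline.
    return per_message * (spam_degree / baseline)


# Illustrative values: C1 = 100, C2 = 80, T1 = 10, B1 = 60.
baseline_case = credit_deduction(100, 80, 10, spam_degree=60, baseline=60)
heavier_case = credit_deduction(100, 80, 10, spam_degree=90, baseline=60)
```

Under these assumed values a message at the baseline spam degree costs 2 points, so a top-tier user drops to the second tier after 10 such messages, while a message with a spam degree of 90 costs 3 points.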
Because it is based on both user creditworthiness and message spam degree, the invention filters spam messages more accurately and reduces misjudgments.
The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make a number of simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the scope of protection of the present invention.