CN104050556B - Feature selection method for spam email and corresponding detection method - Google Patents

Feature selection method for spam email and corresponding detection method (Download PDF)

Info

Publication number
CN104050556B
CN104050556B (application CN201410228073.6A / CN201410228073A)
Authority
CN
China
Prior art keywords
feature
subset
classification
spam
mail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410228073.6A
Other languages
Chinese (zh)
Other versions
CN104050556A (en)
Inventor
孙广路
何勇军
刘广明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daqing Lehen Information Technology Co.,Ltd.
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201410228073.6A
Publication of CN104050556A
Application granted
Publication of CN104050556B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a feature selection method for spam email and a corresponding detection method, comprising: performing feature extraction on mail using a byte-level N-grams method; ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset; deleting redundant features from the initial feature subset according to an approximate Markov blanket algorithm to obtain a candidate feature subset; predicting on the candidate feature subsets with an online logistic regression classifier and evaluating them according to the prediction results to select an optimal feature subset; and detecting spam with the online logistic regression classifier using the optimal feature subset. With the proposed feature selection method and detection method, both the feature selection and the spam detection computations are simple and have low time complexity, and the accuracy of spam detection is greatly improved.

Description

Feature selection method for spam email and corresponding detection method
Technical field
The present invention relates to the field of computer network security technology, and in particular to a feature selection method for spam email and a corresponding detection method.
Background technology
With the rapid development of the Internet, e-mail has become a new medium of information. Being cheap, convenient, and fast, it is widely used in every field. However, its widespread use has also brought negative effects: large volumes of spam flood users' mailboxes, which not only interferes with normal use but also damages the image of operators. Many anti-spam systems have emerged in response, but they face the problems of large data volumes and low operational efficiency.
In traditional spam filtering, many machine learning methods, including Flexible Bayes, decision trees, SVM, and Boosting, have been applied. Current research results show that machine learning methods such as Flexible Bayes, SVM, Boosting, and Winnow can reach a practical level on some small-scale data sets. For large-scale data, however, training a classifier takes a great deal of time, and because the data are complex it is difficult to obtain an optimal training model.
Among current feature selection methods, research on feature selection for high-dimensional binary data is scarce, and no effective solution yet exists. Traditional methods can handle feature selection for binary data, but for high-dimensional data their complexity is usually very high, and it is difficult to obtain good results in practical applications.
Summary of the invention
(1) Technical problem to be solved
The object of the present invention is to provide a feature selection method for spam email and a corresponding detection method, in order to solve the problems of existing feature selection methods and related spam detection methods: high computational complexity, long running time, and difficulty in obtaining good results in practical applications.
(2) Technical scheme
To achieve the above object, the present invention proposes a feature selection method for spam email, comprising:
performing feature extraction on mail using a byte-level N-grams method;
ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset;
deleting redundant features from the initial feature subset according to an approximate Markov blanket algorithm to obtain a candidate feature subset;
predicting on the candidate feature subsets with an online logistic regression classifier and evaluating the candidate feature subsets according to the prediction results to select an optimal feature subset.
The present invention also proposes a spam detection method based on the above feature selection method for spam email, comprising:
detecting spam with the online logistic regression classifier using the optimal feature subset.
Preferably, performing feature extraction on mail using a byte-level N-grams method specifically comprises:
splitting the mail as a byte stream into byte segments of preset length to obtain a hash dictionary of the mail;
comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
Preferably, comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary is specifically:
when a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding position of the hash dictionary is set to 1; if it does not appear, the feature value at the corresponding position of the hash dictionary is set to 0, yielding a sparse binary feature data set.
Preferably, ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset specifically comprises:
calculating the relative density of each extracted feature with respect to the preset mail classes, specifically:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1;
judging the degree of correlation between the extracted features and the preset mail classes according to the relative density;
ranking the features by the degree of correlation to generate the initial feature subset.
Preferably, judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically comprises:
calculating the correlation from the relative density with the following formula:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

where the range of W(F_i)_diff is [0, 1], d_{F_i=1}^{C_1} denotes the relative density of the i-th feature with respect to class C_1 when its value is 1, and d_{F_i=1}^{C_0} denotes the relative density of the i-th feature with respect to class C_0 when its value is 1; when W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, feature F_i is most related to the class;
using W(F_i)_diff as the evaluation criterion, comparing W(F_i)_diff with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes.
Preferably, deleting redundant features from the initial feature subset according to the approximate Markov blanket algorithm to obtain a candidate feature subset specifically comprises:
initializing the feature subset and, for each feature F_i in the initial feature subset, selecting from the initial feature subset the K features most correlated with F_i according to the correlation coefficient, computed as:

ρ(F_i, C_i) = Σ_{j=1}^{n} (f_j - f̄)(c_j - c̄) / sqrt( Σ_{j=1}^{n} (f_j - f̄)^2 · Σ_{j=1}^{n} (c_j - c̄)^2 )

where f_j and c_j are the components of feature F_i and class C_i respectively, n is the number of samples, and f̄ and c̄ are the means of F_i and C_i:

f̄ = (1/n) Σ_{j=1}^{n} f_j,  c̄ = (1/n) Σ_{j=1}^{n} c_j;

forming the set M_i from these K features and, using M_i as the approximate Markov blanket of F_i, calculating the score δ_G(F_i | M_i) of feature F_i:

δ_G(F_i | M_i) = Σ_{m, f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) ∥ P(C | M_i = m) )

where D_KL denotes the relative entropy, an index of the similarity between variables, computed as:

D_KL(P ∥ Q) = Σ_x P(x) log( P(x) / Q(x) );

deleting redundant features from the initial feature subset according to the score δ_G(F_i | M_i) to obtain the candidate feature subset.
Preferably, deleting redundant features from the initial feature subset according to the score δ_G(F_i | M_i) to obtain the candidate feature subset specifically comprises:
sorting the features in the initial feature subset by the score δ_G(F_i | M_i) and deleting the feature corresponding to the smallest δ_G(F_i | M_i);
repeating the above steps to obtain the candidate feature subset with a preset number of features.
Preferably, predicting on the candidate feature subsets with the online logistic regression classifier and evaluating the candidate feature subsets according to the prediction results to select the optimal feature subset specifically comprises:
the online logistic regression classifier predicts on the candidate feature subset using the prediction function:

P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b)))

where w is the weight, b is the bias, x is the input, and the prediction P(Y | x) lies in the range [0, 1];
for each feature input from the candidate feature subset, obtaining the prediction of the prediction function: when the prediction P > 0.5, the mail is spam, and when P ≤ 0.5, the mail is normal;
evaluating the features in the candidate feature subsets according to the prediction results, and extracting the preset number of features with the best prediction effect on the online logistic regression classifier as the optimal feature subset.
Preferably, detecting spam with the online logistic regression classifier using the optimal feature subset is specifically:
when a mail needs to be detected, the online logistic regression classifier predicts on the optimal feature subset;
spam is detected according to the prediction result.
(3) Beneficial effects
The proposed feature selection method for spam email and its detection method perform spam feature selection with a wrapper-style feature selection algorithm, greatly reducing the feature dimensionality. Using an online logistic regression model, the large number of irrelevant and redundant features in the mail data is removed and an optimal feature subset is generated, which is then used for spam detection. This fundamentally improves detection accuracy and reduces the time consumed by the classification algorithm, so the method can be widely applied in spam detection.
Brief description of the drawings
Fig. 1 is a flow chart of the feature selection method for spam email of the present invention;
Fig. 2 is a flow chart of the spam detection method based on the feature selection method for spam email of the present invention.
Specific embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
The present invention proposes a feature selection method for spam email which, as shown in Fig. 1, comprises the following steps:
S101: Perform feature extraction on the mail using a byte-level N-grams method, which specifically comprises: splitting the mail as a byte stream into byte segments of preset length to obtain a hash dictionary of the mail; and comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
When a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding position of the hash dictionary is set to 1; if it does not appear, the feature value at the corresponding position is set to 0, yielding a sparse binary feature data set.
S102: Rank the extracted features by their degree of correlation with the preset mail classes to generate an initial feature subset, which specifically comprises: calculating the relative density of each extracted feature with respect to the preset mail classes, specifically:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1. The degree of correlation between the extracted features and the preset mail classes is judged according to the relative density, and the features are ranked by the degree of correlation to generate the initial feature subset.
Judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically comprises: calculating the correlation from the relative density with the following formula:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

where the range of W(F_i)_diff is [0, 1], d_{F_i=1}^{C_1} denotes the relative density of the i-th feature with respect to class C_1 when its value is 1, and d_{F_i=1}^{C_0} denotes the relative density of the i-th feature with respect to class C_0 when its value is 1; when W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, feature F_i is most related to the class. Using W(F_i)_diff as the evaluation criterion, W(F_i)_diff is compared with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes. In view of the characteristics of binary features, the present invention adopts a density-based method; especially for sparse binary data, the computation is simple, the time complexity is low, and the accuracy is greatly improved.
S103: Delete redundant features from the initial feature subset according to the approximate Markov blanket algorithm to obtain a candidate feature subset, which specifically comprises:
initializing the feature subset and, for each feature F_i in the initial feature subset, selecting from the initial feature subset the K features most correlated with F_i according to the correlation coefficient, computed as:

ρ(F_i, C_i) = Σ_{j=1}^{n} (f_j - f̄)(c_j - c̄) / sqrt( Σ_{j=1}^{n} (f_j - f̄)^2 · Σ_{j=1}^{n} (c_j - c̄)^2 )

where f_j and c_j are the components of feature F_i and class C_i respectively, n is the number of samples, and f̄ and c̄ are the means of F_i and C_i:

f̄ = (1/n) Σ_{j=1}^{n} f_j,  c̄ = (1/n) Σ_{j=1}^{n} c_j;

forming the set M_i from these K features and, using M_i as the approximate Markov blanket of F_i, calculating the score δ_G(F_i | M_i) of feature F_i:

δ_G(F_i | M_i) = Σ_{m, f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) ∥ P(C | M_i = m) )

where D_KL denotes the relative entropy, an index of the similarity between variables, computed as:

D_KL(P ∥ Q) = Σ_x P(x) log( P(x) / Q(x) )

Redundant features are then deleted from the initial feature subset according to the score δ_G(F_i | M_i) to obtain the candidate feature subset, specifically: the features in the initial feature subset are sorted by δ_G(F_i | M_i), the feature corresponding to the smallest δ_G(F_i | M_i) is deleted, and these steps are repeated to obtain the candidate feature subset with a preset number of features.
S104: Predict on the candidate feature subsets with the online logistic regression classifier and evaluate the candidate feature subsets according to the prediction results to select the optimal feature subset, which specifically comprises: the online logistic regression classifier predicts on the candidate feature subset using the prediction function:

P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b)))

where w is the weight, b is the bias, x is the input, and the prediction P(Y | x) lies in the range [0, 1];
for each feature input from the candidate feature subset, the prediction of the prediction function is obtained: when the prediction P > 0.5, the mail is spam, and when P ≤ 0.5, the mail is normal;
the features in the candidate feature subsets are evaluated according to the prediction results, and the preset number of features with the best prediction effect on the online logistic regression classifier is extracted as the optimal feature subset.
The present invention is described in detail below with specific embodiments.
With the development of anti-spam technology, spam-sending techniques have also improved: spammers disguise the characteristic words of spam through deliberate misspellings, character substitutions, inserted whitespace, and similar variants in order to evade detection systems. To overcome these problems, the present invention performs feature extraction on mail using a byte-level N-grams method. Byte-level n-grams feature extraction is very easy to use: it requires no dictionary support, no word segmentation of sentences, and no training on a corpus before use. When extracting features from a mail, no preprocessing is needed and mail encoding issues can be ignored; the mail is simply converted into an undifferentiated byte stream.
The n-grams feature extraction method splits the mail as a byte stream into segments of n bytes (where n = 1, 2, 3, 4, ...), obtaining byte strings of length n; each string is called a gram. For example, sliding-window segmentation of "information" with n = 4 yields the 8 4-gram features: info, nfor, form, orma, rmat, mati, atio, and tion.
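The sliding-window segmentation described above can be sketched in a few lines of Python (a minimal illustration only; the function name byte_ngrams is ours, not from the patent):

```python
def byte_ngrams(data: bytes, n: int = 4) -> list[bytes]:
    """Slide a window of n bytes over the raw byte stream of a mail."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

# The patent's example: "information" with n = 4 yields exactly 8 grams.
grams = byte_ngrams(b"information", n=4)
print(grams)
# [b'info', b'nfor', b'form', b'orma', b'rmat', b'mati', b'atio', b'tion']
```

Because the input is treated as an undifferentiated byte stream, the same function applies unchanged to any mail encoding.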
After feature extraction on all the training data, a high-dimensional hash dictionary is obtained, in which each position is a feature. A preset sample is compared against the hash dictionary: if a feature in the dictionary appears in the sample, the feature value at the corresponding position is 1; otherwise it is 0. The result is a high-dimensional sparse binary feature data set.
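The dictionary construction and 0/1 encoding can be sketched as follows (a hedged illustration on toy mails; the helper names build_dictionary and to_binary_vector are our own):

```python
def build_dictionary(mails: list[bytes], n: int = 4) -> dict[bytes, int]:
    """Map every n-gram seen in the training mails to a feature index."""
    index: dict[bytes, int] = {}
    for mail in mails:
        for i in range(len(mail) - n + 1):
            index.setdefault(mail[i:i + n], len(index))
    return index

def to_binary_vector(mail: bytes, index: dict[bytes, int], n: int = 4) -> list[int]:
    """Set 1 where a dictionary gram occurs in the mail, 0 elsewhere."""
    vec = [0] * len(index)
    for i in range(len(mail) - n + 1):
        j = index.get(mail[i:i + n])
        if j is not None:
            vec[j] = 1
    return vec

train = [b"free money", b"hello team"]
idx = build_dictionary(train)             # 14 distinct 4-grams
vec = to_binary_vector(b"free beer", idx)
print(sum(vec))                           # 2: only b'free' and b'ree ' are shared
```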
Features extracted by the n-grams method take only the values 0 and 1, and the data are quite sparse. Classical methods can process such data, but at a very high time cost. The present invention instead uses the relative density between a feature and a class to measure their degree of correlation, requiring no complicated computation or iteration. The relative density formula is:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1.
The present invention uses feature ranking as the first stage of feature selection. Each feature dimension is first scored by an evaluation criterion, and the features are sorted by score. For binary features, the invention uses the following measure of the correlation between a feature and the classes as the evaluation criterion:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

Because d_{F_i=1}^{C_l} ∈ [0, 1], the range of W(F_i)_diff is [0, 1]. When W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, it is most related; the higher the W(F_i)_diff score, the higher the correlation between the feature and the class. W(F_i)_diff can therefore serve as the evaluation criterion.
A threshold ω is preset according to actual demand. Features with W(F_i)_diff ≥ ω are considered highly correlated with the class and are retained to generate the initial feature subset F; the other, irrelevant features are deleted.
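Assuming a dense 0/1 NumPy matrix X (rows are mails, columns are features) and labels y in {0, 1}, the relative density, the W(F_i)_diff score, and the threshold step can be sketched as follows (the variable names and the toy threshold are ours):

```python
import numpy as np

def relative_density(X: np.ndarray, y: np.ndarray, cls: int) -> np.ndarray:
    """d_{F_i=1}^{C_l}: fraction of samples of class `cls` in which feature i is 1."""
    mask = (y == cls)
    return X[mask].sum(axis=0) / mask.sum()

def w_diff(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """W(F_i)_diff = |d^{C_1} - d^{C_0}|, the density-based relevance score."""
    return np.abs(relative_density(X, y, 1) - relative_density(X, y, 0))

# Toy data: 4 mails x 3 binary features; only feature 0 separates the classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])           # 1 = spam, 0 = normal
scores = w_diff(X, y)                # array([1., 0., 0.])
keep = np.where(scores >= 0.5)[0]    # threshold omega = 0.5 -> feature 0 kept
```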
Even after the correlation test, a large number of redundant features remains in the data. Such features bring unnecessary time consumption and can even harm classifier accuracy, so deleting redundant features is necessary.
On the basis of the initial feature subset, redundant features are deleted according to Markov blanket theory to select the optimal feature subset. Markov blanket theory states: assume the feature set is F and there is a subset M not containing feature F_i; if, conditioned on subset M, feature F_i is independent of the set F - M - {F_i}, then M is a Markov blanket of F_i. Formally:

P(F - M_i - {F_i}, C | F_i, M_i) = P(F - M_i - {F_i}, C | M_i)

If this holds, subset M contains all the information of feature F_i, so F_i is a redundant feature and can be deleted. In practical applications, however, searching for the optimal Markov blanket is an NP-hard problem, so the present invention adopts a heuristic algorithm and proposes an approximate Markov blanket model.
The embodiment of the present invention deletes redundant features one by one with a backward-deletion strategy. First, the feature subset is initialized as G = F. For each feature dimension F_i, the K features most correlated with F_i are chosen from the subset G - {F_i} according to the correlation coefficient:

ρ(F_i, C_i) = Σ_{j=1}^{n} (f_j - f̄)(c_j - c̄) / sqrt( Σ_{j=1}^{n} (f_j - f̄)^2 · Σ_{j=1}^{n} (c_j - c̄)^2 )

where f_j and c_j are the components of feature F_i and class C_i respectively, n is the number of samples, and f̄ and c̄ are the means of F_i and C_i:

f̄ = (1/n) Σ_{j=1}^{n} f_j,  c̄ = (1/n) Σ_{j=1}^{n} c_j

These K features form the set M_i, which is used as the approximate Markov blanket of F_i to calculate the score δ_G(F_i | M_i):

δ_G(F_i | M_i) = Σ_{m, f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) ∥ P(C | M_i = m) )

where D_KL denotes the KL distance, also called relative entropy, an index of the similarity between variables; the smaller D_KL, the higher the similarity. Because probability values are needed but the true distribution of the features is hard to obtain, the original definition of probability (empirical frequencies) is used here to compute the joint probability distribution. The KL distance formula is:

D_KL(P ∥ Q) = Σ_x P(x) log( P(x) / Q(x) )
From the computation of δ_G(F_i | M_i), it is easy to see that the smaller the score, the higher the similarity between M_i and F_i and the more information about F_i is contained in M_i, so M_i can be regarded approximately as the Markov blanket of F_i. The features are therefore sorted by δ_G(F_i | M_i), and the feature F_i corresponding to the smallest score is deleted. This loop continues, and the number of remaining features can be set as needed to retain the better features. In this way the n candidate feature subsets G_1, G_2, ..., G_n are obtained, from which the optimal subset is selected.
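The backward-deletion loop can be sketched as follows. This is a hedged reconstruction: the patent gives the score formula only as a figure, so the code uses a Koller-Sahami-style expected-KL score estimated from empirical frequencies, following the textual description; all names (delta_score, backward_delete, K, keep) are ours.

```python
import numpy as np
from itertools import product

def delta_score(X, y, i, blanket, eps=1e-12):
    """delta_G(F_i | M_i): expected KL divergence between P(C | M_i, F_i)
    and P(C | M_i), estimated from empirical frequencies."""
    total = 0.0
    cols = X[:, blanket]
    for assign in product([0, 1], repeat=len(blanket) + 1):
        m, fi = assign[:-1], assign[-1]
        sel = (cols == m).all(axis=1) & (X[:, i] == fi)
        if not sel.any():
            continue                      # unseen configuration, probability 0
        sel_m = (cols == m).all(axis=1)
        p_both = np.array([(y[sel] == c).mean() for c in (0, 1)])
        p_m = np.array([(y[sel_m] == c).mean() for c in (0, 1)])
        kl = np.sum(p_both * np.log((p_both + eps) / (p_m + eps)))
        total += sel.mean() * kl
    return total

def backward_delete(X, y, K=1, keep=2):
    """Repeatedly drop the feature whose approximate blanket explains it best."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > keep:
        scores = {}
        for i in remaining:
            others = [j for j in remaining if j != i]
            corr = [abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) for j in others]
            blanket = [others[k] for k in np.argsort(corr)[::-1][:K]]
            scores[i] = delta_score(X, y, i, blanket)
        remaining.remove(min(scores, key=scores.get))
    return remaining

# Feature 1 duplicates feature 0, so one of the pair is redundant.
X = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
res = backward_delete(X, y, K=1, keep=2)
print(res)   # one copy of the duplicated pair has been deleted
```

On the toy data, the duplicated feature's blanket carries all of its class information, so its score is minimal and one copy of the pair is removed first.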
This embodiment also proposes an evaluation method for the feature subsets. It is more targeted than other methods and can achieve good results on a specific data set. The feature selection method is wrapped together with a classifier, and the feature subsets are evaluated by the classifier. After the two-stage feature selection, the selected features are all significant features containing the main information of the corresponding classes, and thus have strong discriminative power. The classifier used in the present invention is an online logistic regression (LR) classifier, whose time complexity is low and classification efficiency is high, giving it a great advantage in processing high-dimensional data.
The idea of the logistic regression model is that there exists a hyperplane f(x) = w·x + b = 0, with the prediction function:

P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b)))

where w is the weight, b is the bias, x is the input, and P(Y | x) is a continuous output value in the range [0, 1]. For a given input example x, the formula yields a value P in [0, 1]: when P > 0.5, Y = 1 is taken, i.e., the prediction is spam; otherwise, when P ≤ 0.5, Y = 0 is taken and the prediction is normal mail.
The embodiment of the present invention uses the stochastic gradient descent update rule. Although traditional gradient descent can obtain the globally optimal solution, every iteration must traverse all the data, which is extremely inefficient when processing massive data. The idea of stochastic gradient descent is to train on only the current example, without traversing all samples; it is highly efficient and obtains a near-optimal solution. The stochastic gradient descent update is:

w_i ← w_i - α (f(x_i) - Y_i) x_i
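A minimal sketch of the online logistic regression classifier with this update rule (the class name, learning rate, and toy stream are our own choices, not the patent's):

```python
import numpy as np

class OnlineLR:
    """Online logistic regression trained one example at a time (SGD)."""
    def __init__(self, dim: int, alpha: float = 0.5):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.alpha = alpha

    def predict_proba(self, x: np.ndarray) -> float:
        """P(Y=1|x) = 1 / (1 + exp(-(w.x + b)))."""
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def update(self, x: np.ndarray, y: int) -> None:
        """w <- w - alpha * (f(x) - y) * x, and likewise for the bias."""
        err = self.predict_proba(x) - y
        self.w -= self.alpha * err * x
        self.b -= self.alpha * err

# Toy binary-feature stream: feature 0 marks spam (y = 1), feature 1 normal mail.
clf = OnlineLR(dim=2)
for x, y in [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), 0)] * 50:
    clf.update(x, y)
assert clf.predict_proba(np.array([1.0, 0.0])) > 0.5    # classified as spam
assert clf.predict_proba(np.array([0.0, 1.0])) <= 0.5   # classified as normal
```

Each update touches only the current example, which is what makes the model suitable for the high-dimensional mail streams the text describes.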
Through the training and classification of the online logistic regression model, each sample is given a score by the formula above: when the score is greater than 0.5, the predictor classifies the sample as spam; otherwise it is predicted to be normal mail. The feature subsets are then evaluated by comparing the predicted classes with the actual classes.
For n candidate subset G1,G2,...,Gn, we want to obtain best that character subset work of classifying quality For according to predicting the outcome that previous step is obtained, we will be evaluated each subset.By the classification of previous step, can To obtain some related datas, table 1 lists the statistic of Calculation Estimation function needs:
The evaluation function statistical form of table 1
From these data the following statistic can be computed:

BER = (1/2) · (FN / P + FP / N)

where BER is the balanced error rate. When the numbers of normal mails and spam differ greatly, BER evaluates the effect of a feature set on the classifier better than raw accuracy. Specifically, the feature set is classified by the online logistic regression classifier, where the number of normal mails is P and the number of spam mails is N; counting the number of correctly classified normal mails TP and correctly classified spam mails TN gives TP = P - FN and TN = N - FP.
Finally, a series of BER values BER_1, BER_2, ..., BER_n is obtained. The feature set G_opt corresponding to the smallest BER value is selected as the final feature subset, that is, the optimal feature subset: in the online logistic regression model, G_opt has the best classification effect.
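The BER computation can be sketched directly from the definitions above (a minimal illustration following the patent's convention that P counts normal mails, N counts spam, TP = P - FN, and TN = N - FP; the function name is ours):

```python
def balanced_error_rate(y_true: list[int], y_pred: list[int]) -> float:
    """BER = 0.5 * (FN/P + FP/N), with label 0 = normal mail and 1 = spam.
    FN: normal mails predicted as spam; FP: spam predicted as normal."""
    P = sum(1 for t in y_true if t == 0)
    N = sum(1 for t in y_true if t == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.5 * (FN / P + FP / N)

print(balanced_error_rate([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0 (perfect)
print(balanced_error_rate([0, 0, 1, 1], [1, 0, 0, 1]))  # 0.5 (one error per class)
```

The candidate subset with the smallest BER over G_1, ..., G_n would then be taken as G_opt.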
In addition, the present invention also proposes a spam detection method based on the above feature selection method for spam email which, as shown in Fig. 2, comprises the following steps:
S201: Perform feature extraction on the mail using a byte-level N-grams method, which specifically comprises: splitting the mail as a byte stream into byte segments of preset length to obtain a hash dictionary of the mail; and comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
When a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding position of the hash dictionary is set to 1; if it does not appear, the feature value at the corresponding position is set to 0, yielding a sparse binary feature data set.
S202: Rank the extracted features by their degree of correlation with the preset mail classes to generate an initial feature subset, which specifically comprises: calculating the relative density of each extracted feature with respect to the preset mail classes, specifically:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1. The degree of correlation between the extracted features and the preset mail classes is judged according to the relative density, and the features are ranked by the degree of correlation to generate the initial feature subset.
Judging the degree of correlation according to the relative density specifically comprises: calculating the correlation from the relative density with the following formula:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

where the range of W(F_i)_diff is [0, 1], d_{F_i=1}^{C_1} denotes the relative density of the i-th feature with respect to class C_1 when its value is 1, and d_{F_i=1}^{C_0} denotes the relative density of the i-th feature with respect to class C_0 when its value is 1; when W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, feature F_i is most related to the class. Using W(F_i)_diff as the evaluation criterion, W(F_i)_diff is compared with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes.
It is special that S203 obtains candidate according to the redundancy feature that approximate Markov blanket algorithm is deleted in the initial characteristicses subset Subset is levied, is specifically included:Initialization feature subset, for the feature F in the initial characteristicses subsetiAccording to coefficient correlation from institute State selection and the F in initial characteristicses subsetiMaximally related K feature, the computing formula of the coefficient correlation is as follows:
r(F_i, C) = Σ_{j=1}^{n} (f_j - f_mean)(c_j - c_mean) / sqrt( Σ_{j=1}^{n} (f_j - f_mean)^2 · Σ_{j=1}^{n} (c_j - c_mean)^2 ), where f_j and c_j are the components of feature F_i and the class label respectively, n is the number of samples, and f_mean and c_mean are the means of F_i and the class label: f_mean = (1/n) Σ_{j=1}^{n} f_j, c_mean = (1/n) Σ_{j=1}^{n} c_j.
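The correlation-coefficient step (the standard Pearson formula over a feature column and the class column) and the top-K selection can be sketched as follows; the toy columns and names are illustrative:

```python
import math

def pearson(f, c):
    """Standard Pearson correlation between two equal-length columns."""
    n = len(f)
    mf, mc = sum(f) / n, sum(c) / n
    cov = sum((fi - mf) * (ci - mc) for fi, ci in zip(f, c))
    sf = math.sqrt(sum((fi - mf) ** 2 for fi in f))
    sc = math.sqrt(sum((ci - mc) ** 2 for ci in c))
    return cov / (sf * sc)

target = [1, 0, 1, 0, 1]       # column for feature F_i
others = {                      # candidate columns (toy data)
    "A": [1, 0, 1, 0, 1],       # identical   -> r = 1
    "B": [0, 1, 0, 1, 0],       # inverted    -> r = -1
    "C": [1, 1, 0, 0, 1],       # weakly related
}
K = 2
top_k = sorted(others, key=lambda name: -abs(pearson(others[name], target)))[:K]
# top_k == ["A", "B"]: the K columns most correlated (in magnitude) with F_i
```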
These K features form the set M_i, and with M_i taken as the approximate Markov blanket of F_i the score δ_G(F_i|M_i) of feature F_i is computed as δ_G(F_i|M_i) = Σ_{m,f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) || P(C | M_i = m) ).
Here D_KL denotes the relative entropy (Kullback-Leibler divergence), an index of the similarity between two distributions, computed as D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) ).
The redundant features in the initial feature subset are deleted according to the score δ_G(F_i|M_i) to obtain the candidate feature subset, in the following steps: sort the features in the initial feature subset by δ_G(F_i|M_i); delete the feature with the smallest δ_G(F_i|M_i); and repeat these steps until the preset number of features remains, yielding the candidate feature subset.
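The delete-the-weakest loop of S203 can be sketched as follows. The per-feature scores here are hypothetical stand-ins for δ_G(F_i|M_i), each computed as a single KL divergence between two toy class-conditional distributions (the full score averages such divergences over the blanket's configurations):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy scores: KL between P(C | M_i, F_i) and P(C | M_i) per feature.
scores = {
    "F1": kl_divergence([0.9, 0.1], [0.5, 0.5]),  # adding F1 shifts P(C|M) a lot
    "F2": kl_divergence([0.5, 0.5], [0.5, 0.5]),  # F2 is redundant given M_2
    "F3": kl_divergence([0.7, 0.3], [0.5, 0.5]),
}
target_size = 2
subset = dict(scores)
while len(subset) > target_size:
    weakest = min(subset, key=subset.get)  # lowest delta_G => most redundant
    del subset[weakest]
# F2 (score 0) is deleted first; F1 and F3 survive.
```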
S204: predict on the candidate feature subset with the online logistic regression classifier and evaluate the candidate feature subset according to the prediction results to select the optimal feature subset. Specifically: the online logistic regression classifier predicts on the candidate feature subset with the prediction function P(Y | x) = 1 / (1 + exp(-(w·x + b))).
Here w is the weight, b is the bias, x is the input, and P(Y | x) is the prediction result, with range [0, 1]. The features in the candidate feature subset are taken as input to obtain the prediction of this function: when the prediction P > 0.5 the mail is judged to be spam, and when P ≤ 0.5 it is judged to be normal mail. According to the prediction results, the features in the candidate feature subset are evaluated, and the preset number of features with the best prediction performance on the online logistic regression classifier are extracted as the optimal feature subset.
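The predict-and-update cycle of an online logistic regression classifier can be sketched as follows. The stochastic-gradient update rule, learning rate, and toy mail stream are assumptions for illustration; the patent itself specifies only the sigmoid prediction function and the 0.5 decision threshold:

```python
import math

class OnlineLogisticRegression:
    """Sigmoid prediction P(Y|x) = 1/(1+exp(-(w.x+b))), one SGD step per mail."""
    def __init__(self, dim, lr=0.5):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):            # y in {0, 1}
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

clf = OnlineLogisticRegression(dim=2)
stream = [([1, 0], 1), ([0, 1], 0)] * 50   # toy stream: feature 0 marks spam
for x, y in stream:
    clf.update(x, y)

is_spam = clf.predict_proba([1, 0]) > 0.5  # P > 0.5 => flag as spam
```

The per-mail update is what makes the classifier "online": each arriving labeled mail costs one gradient step, so the model tracks a drifting mail stream without retraining from scratch.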
S205: detect spam with the online logistic regression classifier according to the optimal feature subset. Specifically: when a mail needs to be examined, the online logistic regression classifier predicts on the optimal feature subset, and spam is detected according to the prediction result.
Because a spam detection system must update and detect in real time, an online logistic regression model is chosen as the classifier; this not only improves recognition accuracy but also reduces the time complexity of training and recognition. Compared with traditional spam detection methods, the present invention greatly reduces the feature dimensionality through the spam-oriented feature selection method, obtains the optimal feature subset through the logistic regression model, and, in the classification phase, detects spam with the online logistic regression model.
Through the wrapper-style feature selection method, the final optimal feature subset G_opt is obtained; the features it contains are all highly relevant to the class and of low redundancy, giving it strong classification performance. The evaluation by the online logistic regression model ensures that G_opt performs best on that model, so using the logistic regression model in the detection phase yields the best prediction performance.
Whenever a mail needs to be examined, the logistic regression classifier assigns each mail a score by computing P(Y | x): when the score is greater than 0.5 the mail is judged to be spam; conversely, when the score is less than or equal to 0.5 it is judged to be normal mail.
The feature selection method for spam and the associated detection method proposed by the present invention perform spam feature selection with a wrapper-style feature selection algorithm, greatly reducing the feature dimensionality. Using the online logistic regression model, a large number of irrelevant and redundant features are removed from the mail data to generate the optimal feature subset, and spam detection is carried out with that optimal feature subset, fundamentally improving detection accuracy and reducing the time consumed by the classification algorithm. The method can be widely applied in spam detection.
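The byte-based N-grams extraction that feeds this pipeline (step S201, restated in claim 1) can be sketched as follows. The n-gram length n = 3, the toy corpus, and the dense 0/1 encoding are assumptions for illustration; a Python dict serves as the "hash dictionary", since dicts are hash tables:

```python
# Byte-level n-grams: slice the mail's byte stream into all windows of
# preset length n, then encode each mail as a sparse binary vector over
# the dictionary of n-grams seen in the training corpus.
def byte_ngrams(data: bytes, n: int = 3):
    return {data[i:i + n] for i in range(len(data) - n + 1)}

# Build the hash dictionary from a toy training corpus of raw mail bytes.
corpus = [b"cheap pills here", b"meeting at noon"]
dictionary = sorted(set().union(*(byte_ngrams(m) for m in corpus)))
index = {g: k for k, g in enumerate(dictionary)}   # n-gram -> position

def encode(mail: bytes):
    """1 where the dictionary n-gram occurs in the mail, 0 where it does not."""
    present = byte_ngrams(mail)
    return [1 if g in present else 0 for g in dictionary]

vec = encode(b"cheap meeting")
```

Because most mails contain only a tiny fraction of the dictionary's n-grams, the resulting binary data set is sparse, which is what makes the subsequent density and correlation computations cheap.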
The above embodiments are merely illustrative of the present invention and do not limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; therefore all equivalent technical solutions fall within the scope of the invention, whose scope of patent protection shall be defined by the claims.

Claims (7)

1. A feature selection method for spam, characterized by comprising:
performing feature extraction of mail with a byte-based N-grams method;
generating an initial feature subset by ranking features according to the degree of correlation between the extracted features and preset mail classes;
deleting redundant features in the initial feature subset according to an approximate Markov blanket algorithm to obtain a candidate feature subset; and
predicting on the candidate feature subset with an online logistic regression classifier and evaluating the candidate feature subset according to the prediction results to select an optimal feature subset;
wherein performing feature extraction of mail with the byte-based N-grams method specifically comprises:
slicing the mail, as a byte stream, into byte segments of preset length to obtain a hash dictionary of the mail; and
comparing preset samples with the hash dictionary to obtain a feature set corresponding to the hash dictionary;
wherein comparing the preset samples with the hash dictionary to obtain the feature set corresponding to the hash dictionary is specifically:
if a feature in the hash dictionary occurs in the preset sample, setting the feature value at the corresponding position of the hash dictionary to 1, and if it does not occur, setting the feature value at the corresponding position to 0, thereby obtaining a sparse binary feature data set;
wherein generating the initial feature subset by ranking features according to the degree of correlation between the extracted features and the preset mail classes specifically comprises:
computing the relative density of the extracted features with respect to the preset mail classes, as follows:
wherein F is the feature set, F_i is the i-th feature in the set, C is the preset set of mail classes, C_l is the l-th class in the set, N_{C_l} is the number of samples contained in class C_l, L is the total number of classes, M is the total number of features, and d_il denotes the relative density of the i-th feature with respect to class C_l when its value is 1, d_il = (number of samples in C_l in which F_i = 1) / N_{C_l};
judging the degree of correlation between the extracted features and the preset mail classes according to the relative density; and
generating the initial feature subset by ranking the features according to the degree of correlation.
2. The method of claim 1, characterized in that judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically comprises:
computing the correlation according to the relative density with the formula W(F_i)_diff = |d_i1 - d_i0|,
wherein the range of W(F_i)_diff is [0, 1], d_i1 denotes the relative density of the i-th feature (value 1) with respect to class C_1, d_i0 denotes the relative density with respect to class C_0, W(F_i)_diff = 0 indicates that feature F_i is least related to the class, and W(F_i)_diff = 1 indicates that F_i is most related to the class; and
taking W(F_i)_diff as the evaluation criterion and comparing W(F_i)_diff with a preset threshold ω to judge the degree of correlation between the feature F_i and the preset mail classes.
3. The method of claim 1, characterized in that deleting the redundant features in the initial feature subset according to the approximate Markov blanket algorithm to obtain the candidate feature subset specifically comprises:
initializing the feature subset, and for each feature F_i in the initial feature subset, selecting from the initial feature subset the K features most correlated with F_i according to the correlation coefficient, computed as
r(F_i, C) = Σ_{j=1}^{n} (f_j - f_mean)(c_j - c_mean) / sqrt( Σ_{j=1}^{n} (f_j - f_mean)^2 · Σ_{j=1}^{n} (c_j - c_mean)^2 ),
wherein f_j and c_j are the components of feature F_i and the class label respectively, n is the number of samples, and f_mean and c_mean are the means of F_i and the class label, f_mean = (1/n) Σ_{j=1}^{n} f_j and c_mean = (1/n) Σ_{j=1}^{n} c_j;
forming the set M_i from these K features and, taking M_i as the approximate Markov blanket of F_i, computing the score δ_G(F_i|M_i) of feature F_i, wherein D_KL denotes the relative entropy, an index of the similarity between variables, D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) ); and
deleting the redundant features in the initial feature subset according to the score δ_G(F_i|M_i) to obtain the candidate feature subset.
4. The method of claim 3, characterized in that deleting the redundant features in the initial feature subset according to the score δ_G(F_i|M_i) to obtain the candidate feature subset specifically comprises:
sorting the features in the initial feature subset by the score δ_G(F_i|M_i) and deleting the feature with the smallest δ_G(F_i|M_i); and
repeating the steps of claim 3 until the candidate feature subset with the preset number of features is obtained.
5. The method of claim 1, characterized in that predicting on the candidate feature subset with the online logistic regression classifier and selecting the optimal feature subset by evaluating the candidate feature subset according to the prediction results specifically comprises:
the online logistic regression classifier predicting on the candidate feature subset with the prediction function
P(Y | x) = 1 / (1 + exp(-(w·x + b))),
wherein w is the weight, b is the bias, x is the input, and P(Y | x) is the prediction result with range [0, 1];
taking the features in the candidate feature subset as input to obtain the prediction of the function, the mail being judged spam when the prediction P > 0.5 and normal mail when P ≤ 0.5; and
evaluating the features in the candidate feature subset according to the prediction results and extracting the preset number of features with the best prediction performance on the online logistic regression classifier as the optimal feature subset.
6. A spam detection method based on the feature selection method of claim 1, characterized by comprising:
detecting spam with the online logistic regression classifier according to the optimal feature subset.
7. The method of claim 6, characterized in that detecting spam with the online logistic regression classifier according to the optimal feature subset is specifically:
when a mail needs to be examined, the online logistic regression classifier predicting on the optimal feature subset; and
detecting spam according to the prediction result.
CN201410228073.6A 2014-05-27 2014-05-27 The feature selection approach and its detection method of a kind of spam Active CN104050556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410228073.6A CN104050556B (en) 2014-05-27 2014-05-27 The feature selection approach and its detection method of a kind of spam


Publications (2)

Publication Number Publication Date
CN104050556A CN104050556A (en) 2014-09-17
CN104050556B true CN104050556B (en) 2017-06-16

Family

ID=51503365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410228073.6A Active CN104050556B (en) 2014-05-27 2014-05-27 The feature selection approach and its detection method of a kind of spam

Country Status (1)

Country Link
CN (1) CN104050556B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205349B (en) * 2015-08-25 2018-08-03 合肥工业大学 The Embedded Gene Selection Method based on encapsulation of Markov blanket
CN105306296B (en) * 2015-10-21 2018-10-12 北京工业大学 A kind of data filtering processing method based on LTE signalings
CN106570178B (en) * 2016-11-10 2020-09-29 重庆邮电大学 High-dimensional text data feature selection method based on graph clustering
CN107193804B (en) * 2017-06-02 2019-03-29 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN107239447B (en) * 2017-06-05 2020-12-18 厦门美柚股份有限公司 Junk information identification method, device and system
CN109241523B (en) * 2018-08-10 2020-12-11 北京百度网讯科技有限公司 Method, device and equipment for identifying variant cheating fields
CN110119756B (en) * 2019-03-25 2021-08-10 北京天泽智云科技有限公司 Automatic trend data feature selection method based on voting method
CN110174106A (en) * 2019-04-01 2019-08-27 香港理工大学深圳研究院 A kind of healthy walking paths planning method and terminal device based on PM2.5
CN111312403A (en) * 2020-01-21 2020-06-19 山东师范大学 Disease prediction system, device and medium based on instance and feature sharing cascade
CN112561082A (en) * 2020-12-22 2021-03-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device
US8417783B1 (en) * 2006-05-31 2013-04-09 Proofpoint, Inc. System and method for improving feature selection for a spam filtering model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8417783B1 (en) * 2006-05-31 2013-04-09 Proofpoint, Inc. System and method for improving feature selection for a spam filtering model
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research and Implementation of a Bayesian-Classification-Based Spam Filtering ***"; Lin Wei; China Master's Theses Full-text Database, Information Science and Technology; 2010-02-15; pp. I139-96 *

Also Published As

Publication number Publication date
CN104050556A (en) 2014-09-17

Similar Documents

Publication Publication Date Title
CN104050556B (en) The feature selection approach and its detection method of a kind of spam
Barrón-Cedeno et al. Proppy: A system to unmask propaganda in online news
CN106599054B (en) Method and system for classifying and pushing questions
CN102411563B (en) Method, device and system for identifying target words
CN102929937B (en) Based on the data processing method of the commodity classification of text subject model
TWI438637B (en) Systems and methods for capturing and managing collective social intelligence information
CN104199965B (en) Semantic information retrieval method
CN109471942B (en) Chinese comment emotion classification method and device based on evidence reasoning rule
WO2021051518A1 (en) Text data classification method and apparatus based on neural network model, and storage medium
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
CN105335352A (en) Entity identification method based on Weibo emotion
CN105824922A (en) Emotion classifying method fusing intrinsic feature and shallow feature
CN106156372B (en) A kind of classification method and device of internet site
CN105279252A (en) Related word mining method, search method and search system
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN105574544A (en) Data processing method and device
BaygIn Classification of text documents based on Naive Bayes using N-Gram features
CN103955547A (en) Method and system for searching forum hot-posts
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
Singh et al. Sentiment analysis of Twitter data using TF-IDF and machine learning techniques
CN114139634A (en) Multi-label feature selection method based on paired label weights
CN108596205B (en) Microblog forwarding behavior prediction method based on region correlation factor and sparse representation
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221214

Address after: Room 1035 and Room 1036, Block C1, C2 and C3, Daqing Service Outsourcing Industrial Park, No. 6-1, Xinfeng Road, High tech Zone, Daqing City, Heilongjiang Province, 163711

Patentee after: Daqing Lehen Information Technology Co.,Ltd.

Address before: 150080 No. 52, Xuefu Road, Nangang District, Heilongjiang, Harbin

Patentee before: HARBIN University OF SCIENCE AND TECHNOLOGY

TR01 Transfer of patent right