CN104050556B - Spam email feature selection method and detection method - Google Patents
Abstract
The present invention relates to a spam email feature selection method and a corresponding detection method, comprising: performing feature extraction on mail with a byte-level N-grams method; ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset; deleting redundant features from the initial feature subset according to an approximate Markov blanket algorithm to obtain candidate feature subsets; predicting on the candidate feature subsets with an online logistic regression classifier and evaluating them according to the prediction results to select an optimal feature subset; and detecting spam with the online logistic regression classifier according to the optimal feature subset. With the proposed feature selection method and detection method, the computations for both spam feature selection and spam detection are simple, the time complexity is low, and the accuracy of spam detection is greatly improved.
Description
Technical field
The present invention relates to the field of computer network security technology, and in particular to a spam email feature selection method and a corresponding detection method.
Background technology
With the rapid development of the Internet, e-mail has become a new medium of information. Being cheap, convenient, and fast, it is widely used in every field. However, its widespread use has also brought negative effects: large amounts of spam flood users' mailboxes, which not only interferes with normal use but also damages the reputation of operators. Many anti-spam systems have emerged in response, but they face the problems of large data volumes and low operational efficiency.
Traditional spam filtering applies many machine learning methods, including Flexible Bayes, decision trees, SVM, and Boosting. Judging from current research results, machine learning methods such as Flexible Bayes, SVM, Boosting, and Winnow reach a practical level on some small-scale data sets. For large-scale data, however, training a classifier takes a great deal of time, and because the data are complex it is difficult to obtain an optimal training model.
Among current feature selection methods, research on feature selection for high-dimensional binary data is scarce, and no effective solution exists yet. Traditional methods can handle feature selection on binary data, but for high-dimensional data their complexity is often very high, and it is difficult to obtain good results in practical applications.
Content of the invention
(1) Technical problem to be solved
The object of the present invention is to provide a spam email feature selection method and a corresponding detection method, so as to solve the problems of existing feature selection methods and related spam detection methods: high computational complexity, high time cost, and difficulty in obtaining good results in practice.
(2) Technical scheme
To achieve the above object, the present invention proposes a spam email feature selection method, comprising:
performing feature extraction on mail with a byte-level N-grams method;
ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset;
deleting redundant features from the initial feature subset according to an approximate Markov blanket algorithm to obtain candidate feature subsets;
predicting on the candidate feature subsets with an online logistic regression classifier and evaluating them according to the prediction results to select an optimal feature subset.
The invention also proposes a spam detection method based on the above spam email feature selection method, comprising:
detecting spam with the online logistic regression classifier according to the optimal feature subset.
Preferably, performing feature extraction on mail with the byte-level N-grams method specifically includes:
cutting the mail's byte stream into byte strings of preset length to obtain a hash dictionary of the mail;
comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
Preferably, comparing preset samples against the hash dictionary to obtain the corresponding feature set is specifically: when a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding dictionary position is set to 1; if it does not appear, the feature value at that position is set to 0. This yields a sparse binary feature data set.
Preferably, ranking the extracted features by their degree of correlation with preset mail classes to generate the initial feature subset specifically includes:
calculating the relative density of each extracted feature with respect to the preset mail classes, where F is the feature set, F_i is the i-th feature in the set, C is the set of preset mail classes, C_l is the l-th class, n_{C_l} is the number of samples in class C_l, L is the total number of classes, M is the total number of features, and d^1_{il} denotes the relative density of the i-th feature with respect to class C_l when its value is 1, i.e. the fraction of the n_{C_l} samples of class C_l in which feature F_i equals 1;
judging the degree of correlation between the extracted features and the preset mail classes according to the relative density;
ranking the features by that degree of correlation to generate the initial feature subset.
Preferably, judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically includes:
computing the correlation W(F_i)_diff from the relative densities, whose range is [0, 1], where d^1_{i1} denotes the relative density of the i-th feature (value 1) with respect to class C_1 and d^1_{i0} the relative density with respect to class C_0; W(F_i)_diff = 0 indicates that feature F_i is least related to the class, and W(F_i)_diff = 1 that it is most related;
taking W(F_i)_diff as the evaluation criterion and comparing it with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes.
Preferably, deleting redundant features from the initial feature subset according to the approximate Markov blanket algorithm to obtain candidate feature subsets specifically includes:
initializing the feature subset, and for each feature F_i in the initial feature subset, selecting the K features most correlated with F_i from the initial feature subset according to the correlation coefficient, where f_i and c_i are the components of feature F_i and class C_i respectively, n is the number of samples, and the means of feature F_i and class C_i are computed from the samples;
forming the set M_i from these K features, taking M_i as the approximate Markov blanket of feature F_i, and computing the score δ_G(F_i|M_i) of feature F_i, where D_KL denotes relative entropy, an index of similarity between variables;
deleting redundant features from the initial feature subset according to the score δ_G(F_i|M_i) to obtain candidate feature subsets.
Preferably, deleting redundant features according to the score δ_G(F_i|M_i) to obtain candidate feature subsets specifically includes:
sorting the features in the initial feature subset by δ_G(F_i|M_i) and deleting the feature with the smallest δ_G(F_i|M_i);
repeating the above steps until candidate feature subsets of the preset feature numbers are obtained.
Preferably, predicting on the candidate feature subsets with the online logistic regression classifier and evaluating them according to the prediction results to select the optimal feature subset specifically includes:
the online logistic regression classifier predicts on the candidate feature subsets using the prediction function P(Y|x), where w is the weight, b the bias, x the input, and P(Y|x) the prediction result with range [0, 1];
for each feature vector in a candidate feature subset, obtaining the prediction result of the function: when the prediction P > 0.5 the mail is spam, and when P ≤ 0.5 it is normal mail;
evaluating the candidate feature subsets according to the prediction results and extracting, from the predetermined number of candidates, the optimal feature subset with the best prediction performance on the online logistic regression classifier.
Preferably, detecting spam with the online logistic regression classifier according to the optimal feature subset is specifically:
when a mail needs to be detected, the online logistic regression classifier predicts on the optimal feature subset;
spam is detected according to the prediction result.
(3) Beneficial effects
The proposed spam email feature selection method and detection method perform feature screening with a wrapper-type feature selection algorithm, greatly reducing the feature dimensionality, and use an online logistic regression model to remove the large numbers of irrelevant and redundant features in the mail data, generating an optimal feature subset. Detecting spam with this optimal feature subset fundamentally improves detection accuracy and reduces the time consumed by the classification algorithm, and the method can be widely applied to spam detection.
Brief description of the drawings
Fig. 1 is a flow chart of the spam email feature selection method of the invention;
Fig. 2 is a flow chart of the spam detection method based on the spam email feature selection method of the invention.
Specific embodiments
Specific embodiments of the invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, not to limit its scope.
The present invention proposes a spam email feature selection method which, as shown in Fig. 1, comprises the following steps:
S101: perform feature extraction on mail with a byte-level N-grams method. Specifically: cut the mail's byte stream into byte strings of preset length to obtain a hash dictionary of the mail, and compare preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
Here, comparing preset samples against the hash dictionary is specifically: when a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding dictionary position is set to 1; if it does not appear, the feature value at that position is set to 0. This yields a sparse binary feature data set.
S102: rank the extracted features by their degree of correlation with the preset mail classes to generate an initial feature subset. Specifically: calculate the relative density of each extracted feature with respect to the preset mail classes, where F is the feature set, F_i the i-th feature, C the set of preset mail classes, C_l the l-th class, n_{C_l} the number of samples in class C_l, L the total number of classes, M the total number of features, and d^1_{il} the relative density of the i-th feature with respect to class C_l when its value is 1; judge the degree of correlation between the extracted features and the preset mail classes according to the relative density; and rank the features by that degree of correlation to generate the initial feature subset.
Here, judging the degree of correlation according to the relative density specifically includes: computing the correlation W(F_i)_diff, whose range is [0, 1], where d^1_{i1} and d^1_{i0} denote the relative density of the i-th feature (value 1) with respect to classes C_1 and C_0 respectively; W(F_i)_diff = 0 indicates that feature F_i is least related to the class, and W(F_i)_diff = 1 that it is most related; and taking W(F_i)_diff as the evaluation criterion, comparing it with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes. Exploiting the characteristics of binary features, the invention uses this density-based method; particularly for sparse binary data the computation is simple, the time complexity low, and the accuracy greatly improved.
S103: delete redundant features from the initial feature subset according to the approximate Markov blanket algorithm to obtain candidate feature subsets. Specifically:
initialize the feature subset, and for each feature F_i in the initial feature subset, select the K features most correlated with F_i according to the correlation coefficient, where f_i and c_i are the components of feature F_i and class C_i, n is the number of samples, and the means of F_i and C_i are computed from the samples;
form the set M_i from these K features, take M_i as the approximate Markov blanket of feature F_i, and compute the score δ_G(F_i|M_i), where D_KL denotes relative entropy, an index of similarity between variables;
then delete redundant features from the initial feature subset according to the score δ_G(F_i|M_i): sort the features by δ_G(F_i|M_i), delete the feature with the smallest value, and repeat these steps until candidate feature subsets of the preset feature numbers are obtained.
S104: predict on the candidate feature subsets with the online logistic regression classifier and evaluate them according to the prediction results to select the optimal feature subset. Specifically: the online logistic regression classifier predicts on the candidate feature subsets using the prediction function P(Y|x), where w is the weight, b the bias, x the input, and P(Y|x) the prediction result with range [0, 1]; for each feature vector in a candidate feature subset, when the prediction P > 0.5 the mail is spam, and when P ≤ 0.5 it is normal mail; the candidate subsets are evaluated according to the prediction results, and from the predetermined number of candidates the subset with the best prediction performance on the online logistic regression classifier is extracted as the optimal feature subset.
The present invention is described in detail below with specific embodiments.
With the development of anti-spam technology, spamming techniques have also been improving: spammers mutate the words that characterize spam through deliberate misspellings, character substitutions, inserted blanks, and similar tricks, so as to escape detection systems. To overcome these problems, the present invention extracts mail features with a byte-level N-grams method. Byte-level n-grams feature extraction is very easy to use: it needs no dictionary support, no word segmentation of sentences, and no training on a corpus before use. When extracting features from a mail, no preprocessing is needed and mail encoding issues can be ignored; the mail is simply converted into an undifferentiated byte stream.
The n-grams feature extraction method cuts the mail's byte stream into pieces of size n bytes (where n = 1, 2, 3, 4, ...), giving strings of length n; each string is called a gram. For example, cutting "information" with a sliding window and n = 4 yields the 8 4-gram features: info, nfor, form, orma, rmat, mati, atio, and tion.
After feature extraction over all the training data, a high-dimensional hash dictionary is obtained in which each position is a feature. Preset samples are compared against the hash dictionary: when a feature in the dictionary appears in the sample, the feature value at the corresponding position is 1; otherwise it is 0. The final result is a sparse, high-dimensional binary feature data set.
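The extraction and binary encoding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function names and the plain dict standing in for the hash dictionary are assumptions.

```python
def extract_ngrams(data: bytes, n: int = 4) -> set:
    """Slide an n-byte window over the raw byte stream; each window is one gram."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def build_vocabulary(samples) -> dict:
    """The union of grams over all training mails plays the role of the hash dictionary:
    each gram is assigned a fixed position (feature index)."""
    vocab = {}
    for sample in samples:
        for gram in extract_ngrams(sample):
            vocab.setdefault(gram, len(vocab))
    return vocab

def to_binary_vector(sample: bytes, vocab: dict) -> list:
    """1 where the dictionary gram occurs in the mail, 0 otherwise."""
    grams = extract_ngrams(sample)
    vec = [0] * len(vocab)
    for gram in grams:
        if gram in vocab:
            vec[vocab[gram]] = 1
    return vec
```

For `b"information"` with n = 4 this yields exactly the 8 grams listed above, and each mail becomes one row of the sparse binary data set.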
Features extracted by the n-grams method take only the two values 0 and 1, and the data are quite sparse. Classical methods can handle such data but bring relatively high time consumption. The present invention uses the relative density between feature and class to measure their degree of correlation, requiring neither complicated operations nor iteration. In the relative-density formulation, F is the feature set, F_i the i-th feature, C the set of preset mail classes, C_l the l-th class, n_{C_l} the number of samples in class C_l, L the total number of classes, M the total number of features, and d^1_{il} the relative density of the i-th feature with respect to class C_l when its value is 1, i.e. the fraction of the samples of class C_l in which feature F_i equals 1.
The present invention uses feature ranking as the first stage of feature selection. Each feature dimension is first scored by an evaluation criterion, and the features are sorted by score. Exploiting the characteristics of binary features, the invention uses as its evaluation criterion a measure of the correlation between feature and class: W(F_i)_diff = |d^1_{i1} - d^1_{i0}|. Since each relative density lies in [0, 1], the range of W(F_i)_diff is [0, 1]; W(F_i)_diff = 0 means feature F_i is least related to the class, and W(F_i)_diff = 1 means it is most related. The higher the W(F_i)_diff score, the higher the correlation between feature and class, so W(F_i)_diff can serve as the evaluation criterion.
A threshold ω is preset according to actual demand. Features with W(F_i)_diff ≥ ω are considered highly correlated with the class and are retained to form the initial feature subset F; the other, uncorrelated features are deleted.
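A minimal sketch of this density-based ranking stage. The patent's formula images are not reproduced in the text, so reading W(F_i)_diff as the absolute gap between the two classes' relative densities, and the helper names below, are assumptions consistent with the stated properties (range [0, 1], 0 = least related).

```python
def relative_density(X, y, feature, label):
    """Fraction of samples of the given class in which the binary feature is 1."""
    rows = [x for x, c in zip(X, y) if c == label]
    return sum(x[feature] for x in rows) / len(rows)

def rank_features(X, y, omega):
    """Keep features whose density gap between the two classes is >= omega,
    ranked with the most class-relevant features first."""
    n_features = len(X[0])
    kept = []
    for i in range(n_features):
        w_diff = abs(relative_density(X, y, i, 1) - relative_density(X, y, i, 0))
        if w_diff >= omega:
            kept.append((i, w_diff))
    kept.sort(key=lambda t: -t[1])
    return kept
```

Because each feature needs only two counts over the data, the pass is a single scan with no iteration, which is the low-complexity property the text claims for sparse binary data.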
Even after the correlation screening, a large number of redundant features remain in the data. Such features bring unnecessary time consumption and can even affect classifier accuracy, so deleting redundant features is necessary.
On the basis of the initial feature subset, the redundant features are deleted according to Markov blanket theory to select the optimal feature subset. Markov blanket theory holds: suppose the feature set is F and there is a subset M not containing feature F_i; if, conditioned on subset M, feature F_i is independent of the set F - M - {F_i}, then M is a Markov blanket of F_i. Concretely:
P(F - M_i - {F_i}, C | F_i, M_i) = P(F - M_i - {F_i}, C | M_i)
If this equation holds, subset M is considered to contain all the information of feature F_i, so F_i is redundant and can be deleted. In practice, however, searching for an optimal Markov blanket is an NP-hard problem, so the present invention uses a heuristic algorithm and proposes an approximate Markov blanket model.
The embodiment deletes redundant features one by one with a backward-deletion strategy. First the feature subset is initialized as G = F. For each feature F_i, the K features most correlated with F_i are chosen from the subset G - {F_i} according to the correlation coefficient, where f_i and c_i are the components of feature F_i and class C_i, n is the number of samples, and the means of F_i and C_i are computed from the samples. These K features form the set M_i, which is taken as the approximate Markov blanket of feature F_i, and the score δ_G(F_i|M_i) of feature F_i is computed. Here D_KL denotes the KL distance, also called relative entropy, an index of similarity between variables: the smaller the D_KL value, the higher the similarity. Since the true distribution of the features is hard to obtain, the probability values needed are computed from the empirical definition of the joint probability distribution.
From the computation of δ_G(F_i|M_i) it is easy to see that the smaller the score, the higher the similarity between M_i and F_i, and the more of F_i's information M_i contains, so M_i can approximately be regarded as the Markov blanket of F_i. The features are therefore sorted by the size of δ_G(F_i|M_i), and the feature F_i with the smallest score is deleted. Repeating this loop, the number of remaining features can be set as needed, yielding the n candidate feature subsets G_1, G_2, ..., G_n, from which the optimal subset is selected.
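The backward-deletion loop above can be sketched as below. This is a hedged reconstruction: the patent's score formulas are images not reproduced here, so the Pearson helper and the Koller-Sahami-style empirical KL estimate of δ_G(F_i|M_i) are assumptions consistent with the surrounding text (a small δ means M_i already carries F_i's information, so F_i is redundant).

```python
from collections import defaultdict
from math import log, sqrt

def pearson(a, b):
    """Plain Pearson correlation between two columns; 0 if either is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sqrt(sum((x - ma) ** 2 for x in a))
    vb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb) if va and vb else 0.0

def delta_score(X, y, i, blanket):
    """Expected KL distance between P(C | F_i, M_i) and P(C | M_i),
    estimated from raw empirical counts; small = redundant."""
    joint = defaultdict(lambda: defaultdict(int))
    marg = defaultdict(lambda: defaultdict(int))
    for x, c in zip(X, y):
        key_m = tuple(x[j] for j in blanket)
        joint[(x[i], key_m)][c] += 1
        marg[key_m][c] += 1
    score = 0.0
    for (fi, key_m), counts in joint.items():
        n_joint = sum(counts.values())
        n_marg = sum(marg[key_m].values())
        for c, cnt in counts.items():
            p = cnt / n_joint
            q = marg[key_m][c] / n_marg
            score += (n_joint / len(y)) * p * log(p / q)
    return score

def backward_delete(X, y, features, k, target_size):
    """Repeatedly drop the feature whose approximate Markov blanket
    (its k most-correlated peers) best explains it away."""
    features = list(features)
    cols = {i: [x[i] for x in X] for i in features}
    while len(features) > target_size:
        deltas = {}
        for i in features:
            others = sorted((j for j in features if j != i),
                            key=lambda j: -abs(pearson(cols[i], cols[j])))[:k]
            deltas[i] = delta_score(X, y, i, others)
        features.remove(min(deltas, key=deltas.get))  # drop the most redundant
    return features
```

The loop is O(|G|) delta evaluations per deletion; stopping at several target sizes gives the candidate subsets G_1, ..., G_n.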
The feature-subset evaluation method proposed in this embodiment is more targeted than other methods and can achieve good results on specific data. The feature selection method is wrapped together with a classifier, and the feature subsets are evaluated by the classifier. After the two-stage feature selection, the selected features are all significant features containing the main information of the corresponding classes, and thus have strong discriminative ability. The classifier used by the present invention is an online logistic regression (LR) classifier: its time complexity is low, its classification efficiency is high, and it has great advantages in handling high-dimensional data.
The idea of the logistic regression model is that there exists a hyperplane f(x) = wx + b = 0, with prediction function P(Y|x), where w is the weight, b the bias, x the input, and P(Y|x) a continuous output value with range [0, 1]. For a given input example x, a value P in [0, 1] is obtained from the formula: when P > 0.5, Y = 1 is taken and the prediction is spam; otherwise, when P ≤ 0.5, Y = 0 is taken and the prediction is normal mail.
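As a small illustration of this thresholding step: P(Y=1|x) in the standard logistic model is the sigmoid of wx + b. The weights below are toy values chosen for illustration, not trained parameters.

```python
from math import exp

def predict(w, b, x):
    """P(Y = 1 | x) = sigmoid(w . x + b), a continuous value in [0, 1]."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + exp(-z))

def classify(w, b, x):
    """Threshold at 0.5: above is spam, at or below is normal mail."""
    return "spam" if predict(w, b, x) > 0.5 else "normal"
```

Scoring an example is a single dot product over the (sparse, binary) feature vector, which is why online LR scales well to high-dimensional mail data.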
The embodiment uses stochastic gradient descent as the update scheme. Although traditional gradient descent can obtain the globally optimal solution, every iteration must traverse all the data, which is extremely inefficient on massive data. The idea of stochastic gradient descent is to train on only the current example, without traversing all samples; it is highly efficient and obtains a near-optimal solution. The update rule of stochastic gradient descent is:
w_i ← w_i - α(f(x_i) - Y_i)x_i
Through the training and classification of the online logistic regression model, each sample is given a score by the formula: when the score is greater than 0.5 the predicted class is spam, otherwise the prediction is normal mail. The feature subsets are then evaluated by comparing the predicted classes with the actual classes.
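The per-example update above can be sketched as follows. This assumes f(x) is the model's sigmoid prediction; the learning rate α and epoch count are illustrative choices, not values from the patent.

```python
from math import exp

def sgd_train(X, y, alpha=0.5, epochs=50):
    """Online logistic regression: apply w <- w - alpha * (f(x_i) - Y_i) * x_i
    (and the matching bias update) once per example, example by example."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + exp(-z))        # f(x_i)
            err = p - target                 # f(x_i) - Y_i
            w = [wi - alpha * err * xi for wi, xi in zip(w, x)]
            b -= alpha * err
    return w, b
```

Since each update touches only one example, the model can be refreshed as new mail arrives, matching the real-time requirement stated later.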
For the n candidate subsets G_1, G_2, ..., G_n, we want the feature subset with the best classification performance, so each subset is evaluated according to the prediction results of the previous step. The classification of the previous step yields the relevant statistics; Table 1 lists the statistics needed to compute the evaluation function.
Table 1: statistics for the evaluation function
From these data the following statistic can be computed: BER, the balanced error rate, BER = (FN/P + FP/N)/2. When the numbers of normal mails and spam mails differ greatly, BER evaluates the effect of a feature set on the classifier better. Specifically, the feature set is classified by the online logistic regression classifier, where the number of normal mails is P and the number of spam mails is N; counting the numbers of correctly classified normal mails TP and spam mails TN, TP and TN are obtained from TP = P - FN and TN = N - FP.
Finally a series of BER values BER_1, BER_2, ..., BER_n is obtained, and the feature subset G_opt corresponding to the minimum BER value is selected as the final feature subset, i.e. the optimal feature subset: on the online logistic regression model, the optimal feature subset G_opt has the best classification performance.
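The selection step can be sketched as follows, assuming the standard definition BER = (FN/P + FP/N)/2 with the class convention 1 = spam, 0 = normal; the function names are illustrative.

```python
def balanced_error_rate(y_true, y_pred):
    """BER = (FN/P + FP/N) / 2 over 0/1 labels (1 = spam, 0 = normal)."""
    p = sum(1 for t in y_true if t == 1)           # number of spam mails
    n = len(y_true) - p                            # number of normal mails
    fn = sum(1 for t, pr in zip(y_true, y_pred) if t == 1 and pr == 0)
    fp = sum(1 for t, pr in zip(y_true, y_pred) if t == 0 and pr == 1)
    return 0.5 * (fn / p + fp / n)

def select_best_subset(subsets, predictions, y_true):
    """Return the candidate subset whose predictions give the lowest BER."""
    bers = [balanced_error_rate(y_true, pred) for pred in predictions]
    return subsets[bers.index(min(bers))]
```

Weighting each class's error rate equally is what makes BER robust when spam and normal mail counts are very unbalanced, which plain accuracy is not.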
In addition, the invention also proposes a spam detection method based on the spam email feature selection method which, as shown in Fig. 2, comprises the following steps:
S201: perform feature extraction on mail with a byte-level N-grams method. Specifically: cut the mail's byte stream into byte strings of preset length to obtain a hash dictionary of the mail, and compare preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
Here, comparing preset samples against the hash dictionary is specifically: when a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding dictionary position is set to 1; if it does not appear, the feature value at that position is set to 0. This yields a sparse binary feature data set.
S202: rank the extracted features by their degree of correlation with the preset mail classes to generate an initial feature subset. Specifically: calculate the relative density of each extracted feature with respect to the preset mail classes, where F is the feature set, F_i the i-th feature, C the set of preset mail classes, C_l the l-th class, n_{C_l} the number of samples in class C_l, L the total number of classes, M the total number of features, and d^1_{il} the relative density of the i-th feature with respect to class C_l when its value is 1; judge the degree of correlation between the extracted features and the preset mail classes according to the relative density; and rank the features by that degree of correlation to generate the initial feature subset.
Here, judging the degree of correlation according to the relative density specifically includes: computing the correlation W(F_i)_diff, whose range is [0, 1], where d^1_{i1} and d^1_{i0} denote the relative density of the i-th feature (value 1) with respect to classes C_1 and C_0 respectively; W(F_i)_diff = 0 indicates that feature F_i is least related to the class, and W(F_i)_diff = 1 that it is most related; and taking W(F_i)_diff as the evaluation criterion, comparing it with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes.
S203: delete redundant features from the initial feature subset according to the approximate Markov blanket algorithm to obtain candidate feature subsets. Specifically: initialize the feature subset, and for each feature F_i in the initial feature subset, select the K features most correlated with F_i from the initial feature subset according to the correlation coefficient, where f_i and c_i are the components of feature F_i and class C_i, n is the number of samples, and the means of F_i and C_i are computed from the samples; form the set M_i from these K features, take M_i as the approximate Markov blanket of feature F_i, and compute the score δ_G(F_i|M_i), where D_KL denotes relative entropy, an index of similarity between variables; then delete redundant features from the initial feature subset according to the score δ_G(F_i|M_i): sort the features by δ_G(F_i|M_i), delete the feature with the smallest value, and repeat these steps until candidate feature subsets of the preset feature numbers are obtained.
S204, predicting on the candidate feature subset with the online logistic regression classifier and evaluating the candidate feature subset according to the prediction results to select the optimal feature subset, specifically includes: the online logistic regression classifier predicts on the candidate feature subset with its prediction function P(Y|x), wherein w is the weight, b is the bias, x is the input, and the prediction result P(Y|x) has range [0,1]. The features in the candidate feature subset are taken as input to obtain the prediction result of the prediction function: when the prediction P > 0.5 the mail is spam, and when P ≤ 0.5 it is normal mail. According to the prediction results, the features in the candidate feature subset are evaluated so as to extract the optimal feature subset of a predetermined size with the best prediction performance on the online logistic regression classifier.
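The classifier itself can be sketched in plain Python. The text specifies only the sigmoid prediction function P(Y|x) with weight w, bias b, and the 0.5 decision threshold; the stochastic-gradient update rule below is an assumed (standard) way to make the model online, and the class name is illustrative.

```python
import math

class OnlineLogisticRegression:
    """Online logistic regression: predict with a sigmoid, update per mail."""
    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        # P(Y|x) = 1 / (1 + exp(-(w.x + b))), in [0, 1]
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # one SGD step on the log-loss; the gradient is (p - y) * x
        err = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi
        self.b -= self.lr * err

    def is_spam(self, x):
        return self.predict_proba(x) > 0.5   # P > 0.5 -> spam, else normal

clf = OnlineLogisticRegression(n_features=3)
# toy stream: feature 0 marks spam, feature 1 marks normal mail
stream = [([1, 0, 1], 1), ([1, 0, 0], 1), ([0, 1, 1], 0), ([0, 1, 0], 0)] * 200
for x, y in stream:
    clf.update(x, y)
```

Because each mail triggers a single constant-time update, the model can be refreshed continuously as new mail arrives, which is the real-time property the description relies on.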
S205, detecting spam with the online logistic regression classifier according to the optimal feature subset, specifically is: when a mail needs to be detected, the online logistic regression classifier predicts on the optimal feature subset; spam is detected according to the prediction result.
Because a spam detection system needs to update and detect in real time, an online logistic regression model is chosen as the classifier, which not only improves recognition accuracy but also reduces the time complexity of training and recognition. Compared with traditional spam detection methods, the present invention greatly reduces the feature dimensionality through the spam-oriented feature selection method and obtains the optimal feature subset through the logistic regression model; in the classification stage, spam is detected with the online logistic regression model.
Through the wrapper-style feature selection method, the final optimal feature subset G_opt is obtained; the features it contains are all highly relevant to the class and of low redundancy, so the subset has strong classification performance. The evaluation by the online logistic regression model ensures that the optimal subset G_opt performs best on the online logistic regression model, so using the logistic regression model in the detection stage yields the best prediction effect. Whenever a mail needs to be detected, the logistic regression classifier assigns each mail a score by computing P(Y|x); when the score is greater than 0.5 the mail is judged to be spam, and when the score is less than or equal to 0.5 it is judged to be normal mail.
The feature selection method for spam and the detection method based on it proposed by the present invention perform spam feature screening with a wrapper-style feature selection algorithm, greatly reducing the feature dimensionality; the online logistic regression model removes the large number of irrelevant and redundant features from the mail data and generates the optimal feature subset, which is then used for spam detection. This fundamentally improves detection accuracy, reduces the time consumed by the classification algorithm, and can be widely applied in spam detection.
The above embodiments merely illustrate the present invention and do not limit it. Those of ordinary skill in the art may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical schemes therefore fall within the scope of the invention, and the patent protection scope of the invention shall be defined by the claims.
Claims (7)
1. A feature selection method for spam, characterized by comprising:
performing feature extraction on mail with a byte-based N-grams method;
ranking features according to the degree of correlation between the extracted features and preset mail classes to generate an initial feature subset;
deleting redundant features in the initial feature subset according to an approximate Markov blanket algorithm to obtain a candidate feature subset;
predicting on the candidate feature subset with an online logistic regression classifier and evaluating the candidate feature subset according to the prediction results to select an optimal feature subset;
wherein performing feature extraction on mail with the byte-based N-grams method specifically comprises:
splitting the byte stream of the mail into segments of a preset byte length to obtain a hash dictionary of the mail;
comparing preset samples with the hash dictionary to obtain a feature set corresponding to the hash dictionary;
wherein comparing the preset samples with the hash dictionary to obtain the feature set corresponding to the hash dictionary specifically is: if a feature in the hash dictionary occurs in the preset sample, the feature value at the corresponding position of the hash dictionary is set to 1; if it does not occur, the feature value at the corresponding position is set to 0, yielding a sparse binary feature data set;
wherein ranking features according to the degree of correlation between the extracted features and the preset mail classes to generate the initial feature subset specifically comprises:
calculating the relative density of the extracted features with respect to the preset mail classes, wherein F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set with n_{C_l} samples, L is the total number of classes, M is the total number of features, and d̂_i^l denotes the relative density, with respect to class C_l, of the i-th feature taking value 1;
judging the degree of correlation between the extracted features and the preset mail classes according to the relative density;
ranking the features according to the degree of correlation to generate the initial feature subset.
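The byte-based N-grams extraction of claim 1 can be illustrated with a short sketch; n = 3, the plain-dict "hash dictionary" layout, and the function name are assumptions for illustration, not details fixed by the claim.

```python
def byte_ngram_features(mails, n=3):
    """Slice each mail's byte stream into length-n grams, build a hash
    dictionary, and emit one sparse binary feature row per mail."""
    vocab = {}                                   # gram -> feature position
    for raw in mails:
        for j in range(len(raw) - n + 1):
            gram = raw[j:j + n]
            if gram not in vocab:
                vocab[gram] = len(vocab)
    rows = []
    for raw in mails:
        row = [0] * len(vocab)                   # 0 = gram absent
        for j in range(len(raw) - n + 1):
            row[vocab[raw[j:j + n]]] = 1         # 1 = gram occurs in this mail
        rows.append(row)
    return vocab, rows

mails = [b"free money now", b"meeting at noon"]
vocab, X = byte_ngram_features(mails)
```

Operating on raw bytes rather than tokenized words keeps the extraction language-independent, which suits mail bodies in mixed encodings.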
2. The method of claim 1, characterized in that judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically comprises:
calculating the correlation W(F_i)diff from the relative density, wherein the range of W(F_i)diff is [0,1], d̂_i^1 denotes the relative density, with respect to class C_1, of the i-th feature taking value 1, and d̂_i^0 denotes the relative density, with respect to class C_0, of the i-th feature taking value 1; when W(F_i)diff = 0, feature F_i is least related to the class, and when W(F_i)diff = 1, feature F_i is most related to the class;
taking W(F_i)diff as the evaluation criterion, comparing W(F_i)diff with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes.
3. The method of claim 1, characterized in that deleting redundant features in the initial feature subset according to the approximate Markov blanket algorithm to obtain the candidate feature subset specifically comprises:
initializing the feature subset; for each feature F_i in the initial feature subset, selecting from the initial feature subset the K features most correlated with F_i according to the correlation coefficient, wherein f_i and c_i are the components of feature F_i and class C_i respectively, n is the number of samples, and f̄ and c̄ are the means of feature F_i and class C_i;
forming the set M_i from these K features, and taking M_i as the approximate Markov blanket of feature F_i to compute the score δ_G(F_i|M_i) of feature F_i from the relative entropy D_KL, an index of the similarity between variables;
deleting redundant features in the initial feature subset according to the score δ_G(F_i|M_i) to obtain the candidate feature subset.
4. The method of claim 3, characterized in that deleting redundant features in the initial feature subset according to the score δ_G(F_i|M_i) to obtain the candidate feature subset specifically comprises:
sorting the features in the initial feature subset by their score δ_G(F_i|M_i) and deleting the feature with the smallest δ_G(F_i|M_i);
repeating the steps of claim 3 until the candidate feature subset contains the preset number of features.
5. The method of claim 1, characterized in that predicting on the candidate feature subset with the online logistic regression classifier and evaluating the candidate feature subset according to the prediction results to select the optimal feature subset specifically comprises:
the online logistic regression classifier predicting on the candidate feature subset with its prediction function P(Y|x), wherein w is the weight, b is the bias, x is the input, and the prediction result P(Y|x) has range [0,1];
taking the features in the candidate feature subset as input to obtain the prediction result of the prediction function: when the prediction P > 0.5 the mail is spam, and when P ≤ 0.5 it is normal mail;
evaluating the features in the candidate feature subset according to the prediction results to extract the optimal feature subset of a predetermined size with the best prediction performance on the online logistic regression classifier.
6. A spam detection method based on the feature selection method of claim 1, characterized by comprising:
detecting spam with an online logistic regression classifier according to the optimal feature subset.
7. The method of claim 6, characterized in that detecting spam with the online logistic regression classifier according to the optimal feature subset specifically is:
when a mail needs to be detected, the online logistic regression classifier predicts on the optimal feature subset;
spam is detected according to the prediction result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410228073.6A CN104050556B (en) | 2014-05-27 | 2014-05-27 | The feature selection approach and its detection method of a kind of spam |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410228073.6A CN104050556B (en) | 2014-05-27 | 2014-05-27 | The feature selection approach and its detection method of a kind of spam |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104050556A CN104050556A (en) | 2014-09-17 |
CN104050556B true CN104050556B (en) | 2017-06-16 |
Family
ID=51503365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410228073.6A Active CN104050556B (en) | 2014-05-27 | 2014-05-27 | The feature selection approach and its detection method of a kind of spam |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104050556B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105205349B (en) * | 2015-08-25 | 2018-08-03 | 合肥工业大学 | The Embedded Gene Selection Method based on encapsulation of Markov blanket |
CN105306296B (en) * | 2015-10-21 | 2018-10-12 | 北京工业大学 | A kind of data filtering processing method based on LTE signalings |
CN106570178B (en) * | 2016-11-10 | 2020-09-29 | 重庆邮电大学 | High-dimensional text data feature selection method based on graph clustering |
CN107193804B (en) * | 2017-06-02 | 2019-03-29 | 河海大学 | A kind of refuse messages text feature selection method towards word and portmanteau word |
CN107239447B (en) * | 2017-06-05 | 2020-12-18 | 厦门美柚股份有限公司 | Junk information identification method, device and system |
CN109241523B (en) * | 2018-08-10 | 2020-12-11 | 北京百度网讯科技有限公司 | Method, device and equipment for identifying variant cheating fields |
CN110119756B (en) * | 2019-03-25 | 2021-08-10 | 北京天泽智云科技有限公司 | Automatic trend data feature selection method based on voting method |
CN110174106A (en) * | 2019-04-01 | 2019-08-27 | 香港理工大学深圳研究院 | A kind of healthy walking paths planning method and terminal device based on PM2.5 |
CN111312403A (en) * | 2020-01-21 | 2020-06-19 | 山东师范大学 | Disease prediction system, device and medium based on instance and feature sharing cascade |
CN112561082A (en) * | 2020-12-22 | 2021-03-26 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for generating model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101930561A (en) * | 2010-05-21 | 2010-12-29 | 电子科技大学 | N-Gram participle model-based reverse neural network junk mail filter device |
US8417783B1 (en) * | 2006-05-31 | 2013-04-09 | Proofpoint, Inc. | System and method for improving feature selection for a spam filtering model |
- 2014-05-27 CN CN201410228073.6A patent/CN104050556B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8417783B1 (en) * | 2006-05-31 | 2013-04-09 | Proofpoint, Inc. | System and method for improving feature selection for a spam filtering model |
CN101930561A (en) * | 2010-05-21 | 2010-12-29 | 电子科技大学 | N-Gram participle model-based reverse neural network junk mail filter device |
Non-Patent Citations (1)
Title |
---|
"Research and Implementation of a Spam Filtering System Based on Bayesian Classification"; Lin Wei; China Master's Theses Full-text Database (Information Science and Technology); 2010-02-15; pp. I139-96 * |
Also Published As
Publication number | Publication date |
---|---|
CN104050556A (en) | 2014-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104050556B (en) | The feature selection approach and its detection method of a kind of spam | |
Barrón-Cedeno et al. | Proppy: A system to unmask propaganda in online news | |
CN106599054B (en) | Method and system for classifying and pushing questions | |
CN102411563B (en) | Method, device and system for identifying target words | |
CN102929937B (en) | Based on the data processing method of the commodity classification of text subject model | |
TWI438637B (en) | Systems and methods for capturing and managing collective social intelligence information | |
CN104199965B (en) | Semantic information retrieval method | |
CN109471942B (en) | Chinese comment emotion classification method and device based on evidence reasoning rule | |
WO2021051518A1 (en) | Text data classification method and apparatus based on neural network model, and storage medium | |
CN107301171A (en) | A kind of text emotion analysis method and system learnt based on sentiment dictionary | |
CN105335352A (en) | Entity identification method based on Weibo emotion | |
CN105824922A (en) | Emotion classifying method fusing intrinsic feature and shallow feature | |
CN106156372B (en) | A kind of classification method and device of internet site | |
CN105279252A (en) | Related word mining method, search method and search system | |
CN105389379A (en) | Rubbish article classification method based on distributed feature representation of text | |
CN107688630B (en) | Semantic-based weakly supervised microbo multi-emotion dictionary expansion method | |
CN105574544A (en) | Data processing method and device | |
Baygın | Classification of text documents based on Naive Bayes using N-Gram features
CN103955547A (en) | Method and system for searching forum hot-posts | |
CN101702167A (en) | Method for extracting attribution and comment word with template based on internet | |
CN111753082A (en) | Text classification method and device based on comment data, equipment and medium | |
Singh et al. | Sentiment analysis of Twitter data using TF-IDF and machine learning techniques | |
CN114139634A (en) | Multi-label feature selection method based on paired label weights | |
CN108596205B (en) | Microblog forwarding behavior prediction method based on region correlation factor and sparse representation | |
CN104794209A (en) | Chinese microblog sentiment classification method and system based on Markov logic network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20221214 Address after: Room 1035 and Room 1036, Block C1, C2 and C3, Daqing Service Outsourcing Industrial Park, No. 6-1, Xinfeng Road, High tech Zone, Daqing City, Heilongjiang Province, 163711 Patentee after: Daqing Lehen Information Technology Co.,Ltd. Address before: 150080 No. 52, Xuefu Road, Nangang District, Heilongjiang, Harbin Patentee before: HARBIN University OF SCIENCE AND TECHNOLOGY |
|
TR01 | Transfer of patent right |