CN104050556B - Feature selection method for spam email and corresponding detection method - Google Patents

Feature selection method for spam email and corresponding detection method (Download PDF)

Info

Publication number
CN104050556B
CN104050556B (application CN201410228073.6A / CN201410228073A)
Authority
CN
China
Prior art keywords
feature
subset
classification
spam
mail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410228073.6A
Other languages
Chinese (zh)
Other versions
CN104050556A (en)
Inventor
孙广路
何勇军
刘广明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daqing Lehen Information Technology Co.,Ltd.
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN201410228073.6A
Publication of CN104050556A
Application granted
Publication of CN104050556B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to a feature selection method for spam email and a corresponding detection method, comprising: performing feature extraction on mail using a byte-level N-grams method; ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset; deleting redundant features from the initial feature subset according to an approximate Markov blanket algorithm to obtain a candidate feature subset; predicting on the candidate feature subsets with an online logistic regression classifier and evaluating them according to the prediction results to select an optimal feature subset; and detecting spam with the online logistic regression classifier using the optimal feature subset. With the proposed feature selection method and detection method, both the feature selection and the spam detection computations are simple and have low time complexity, and the accuracy of spam detection is greatly improved.

Description

Feature selection method for spam email and corresponding detection method
Technical field
The present invention relates to the field of computer network security technology, and in particular to a feature selection method for spam email and a corresponding detection method.
Background technology
With the rapid development of the Internet, e-mail has become a new medium of information. Being cheap, convenient, and fast, it is widely used in every field. However, its widespread use has also brought negative effects: large volumes of spam flood users' mailboxes, which not only interferes with normal use but also damages the image of operators. Many anti-spam systems have emerged in response, but they face the problems of large data volumes and low operational efficiency.
In traditional spam filtering, many machine learning methods, including Flexible Bayes, decision trees, SVM, and Boosting, have been applied. Current research results show that machine learning methods such as Flexible Bayes, SVM, Boosting, and Winnow can reach a practical level on some small-scale data sets. For large-scale data, however, training a classifier takes a great deal of time, and because the data are complex it is difficult to obtain an optimal training model.
Among current feature selection methods, research on feature selection for high-dimensional binary data is scarce, and no effective solution yet exists. Traditional methods can handle feature selection for binary data, but for high-dimensional data their complexity is usually very high, and it is difficult to obtain good results in practical applications.
Summary of the invention
(1) Technical problem to be solved
The object of the present invention is to provide a feature selection method for spam email and a corresponding detection method, in order to solve the problems of existing feature selection methods and related spam detection methods: high computational complexity, long running time, and difficulty in obtaining good results in practical applications.
(2) Technical scheme
To achieve the above object, the present invention proposes a feature selection method for spam email, comprising:
performing feature extraction on mail using a byte-level N-grams method;
ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset;
deleting redundant features from the initial feature subset according to an approximate Markov blanket algorithm to obtain a candidate feature subset;
predicting on the candidate feature subsets with an online logistic regression classifier and evaluating the candidate feature subsets according to the prediction results to select an optimal feature subset.
The present invention also proposes a spam detection method based on the above feature selection method for spam email, comprising:
detecting spam with the online logistic regression classifier using the optimal feature subset.
Preferably, performing feature extraction on mail using a byte-level N-grams method specifically comprises:
splitting the mail as a byte stream into byte segments of preset length to obtain a hash dictionary of the mail;
comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
Preferably, comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary is specifically:
when a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding position of the hash dictionary is set to 1; if it does not appear, the feature value at the corresponding position of the hash dictionary is set to 0, yielding a sparse binary feature data set.
Preferably, ranking the extracted features by their degree of correlation with preset mail classes to generate an initial feature subset specifically comprises:
calculating the relative density of each extracted feature with respect to the preset mail classes, specifically:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1;
judging the degree of correlation between the extracted features and the preset mail classes according to the relative density;
ranking the features by the degree of correlation to generate the initial feature subset.
Preferably, judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically comprises:
calculating the correlation from the relative density with the following formula:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

where the range of W(F_i)_diff is [0, 1], d_{F_i=1}^{C_1} denotes the relative density of the i-th feature with respect to class C_1 when its value is 1, and d_{F_i=1}^{C_0} denotes the relative density of the i-th feature with respect to class C_0 when its value is 1; when W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, feature F_i is most related to the class;
using W(F_i)_diff as the evaluation criterion, comparing W(F_i)_diff with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes.
Preferably, deleting redundant features from the initial feature subset according to the approximate Markov blanket algorithm to obtain a candidate feature subset specifically comprises:
initializing the feature subset and, for each feature F_i in the initial feature subset, selecting from the initial feature subset the K features most correlated with F_i according to the correlation coefficient, computed as:

ρ(F_i, C_i) = Σ_{j=1}^{n} (f_j - f̄)(c_j - c̄) / sqrt( Σ_{j=1}^{n} (f_j - f̄)^2 · Σ_{j=1}^{n} (c_j - c̄)^2 )

where f_j and c_j are the components of feature F_i and class C_i respectively, n is the number of samples, and f̄ and c̄ are the means of F_i and C_i:

f̄ = (1/n) Σ_{j=1}^{n} f_j,  c̄ = (1/n) Σ_{j=1}^{n} c_j;

forming the set M_i from these K features and, using M_i as the approximate Markov blanket of F_i, calculating the score δ_G(F_i | M_i) of feature F_i:

δ_G(F_i | M_i) = Σ_{m, f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) ∥ P(C | M_i = m) )

where D_KL denotes the relative entropy, an index of the similarity between variables, computed as:

D_KL(P ∥ Q) = Σ_x P(x) log( P(x) / Q(x) );

deleting redundant features from the initial feature subset according to the score δ_G(F_i | M_i) to obtain the candidate feature subset.
Preferably, deleting redundant features from the initial feature subset according to the score δ_G(F_i | M_i) to obtain the candidate feature subset specifically comprises:
sorting the features in the initial feature subset by the score δ_G(F_i | M_i) and deleting the feature corresponding to the smallest δ_G(F_i | M_i);
repeating the above steps to obtain the candidate feature subset with a preset number of features.
Preferably, predicting on the candidate feature subsets with the online logistic regression classifier and evaluating the candidate feature subsets according to the prediction results to select the optimal feature subset specifically comprises:
the online logistic regression classifier predicts on the candidate feature subset using the prediction function:

P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b)))

where w is the weight, b is the bias, x is the input, and the prediction P(Y | x) lies in the range [0, 1];
for each feature input from the candidate feature subset, obtaining the prediction of the prediction function: when the prediction P > 0.5, the mail is spam, and when P ≤ 0.5, the mail is normal;
evaluating the features in the candidate feature subsets according to the prediction results, and extracting the preset number of features with the best prediction effect on the online logistic regression classifier as the optimal feature subset.
Preferably, detecting spam with the online logistic regression classifier using the optimal feature subset is specifically:
when a mail needs to be detected, the online logistic regression classifier predicts on the optimal feature subset;
spam is detected according to the prediction result.
(3) Beneficial effects
The proposed feature selection method for spam email and its detection method perform spam feature selection with a wrapper-style feature selection algorithm, greatly reducing the feature dimensionality. Using an online logistic regression model, the large number of irrelevant and redundant features in the mail data is removed and an optimal feature subset is generated, which is then used for spam detection. This fundamentally improves detection accuracy and reduces the time consumed by the classification algorithm, so the method can be widely applied in spam detection.
Brief description of the drawings
Fig. 1 is a flow chart of the feature selection method for spam email of the present invention;
Fig. 2 is a flow chart of the spam detection method based on the feature selection method for spam email of the present invention.
Specific embodiments
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
The present invention proposes a feature selection method for spam email which, as shown in Fig. 1, comprises the following steps:
S101: Perform feature extraction on the mail using a byte-level N-grams method, which specifically comprises: splitting the mail as a byte stream into byte segments of preset length to obtain a hash dictionary of the mail; and comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
When a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding position of the hash dictionary is set to 1; if it does not appear, the feature value at the corresponding position is set to 0, yielding a sparse binary feature data set.
S102: Rank the extracted features by their degree of correlation with the preset mail classes to generate an initial feature subset, which specifically comprises: calculating the relative density of each extracted feature with respect to the preset mail classes, specifically:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1. The degree of correlation between the extracted features and the preset mail classes is judged according to the relative density, and the features are ranked by the degree of correlation to generate the initial feature subset.
Judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically comprises: calculating the correlation from the relative density with the following formula:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

where the range of W(F_i)_diff is [0, 1], d_{F_i=1}^{C_1} denotes the relative density of the i-th feature with respect to class C_1 when its value is 1, and d_{F_i=1}^{C_0} denotes the relative density of the i-th feature with respect to class C_0 when its value is 1; when W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, feature F_i is most related to the class. Using W(F_i)_diff as the evaluation criterion, W(F_i)_diff is compared with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes. In view of the characteristics of binary features, the present invention adopts a density-based method; especially for sparse binary data, the computation is simple, the time complexity is low, and the accuracy is greatly improved.
S103: Delete redundant features from the initial feature subset according to the approximate Markov blanket algorithm to obtain a candidate feature subset, which specifically comprises:
initializing the feature subset and, for each feature F_i in the initial feature subset, selecting from the initial feature subset the K features most correlated with F_i according to the correlation coefficient, computed as:

ρ(F_i, C_i) = Σ_{j=1}^{n} (f_j - f̄)(c_j - c̄) / sqrt( Σ_{j=1}^{n} (f_j - f̄)^2 · Σ_{j=1}^{n} (c_j - c̄)^2 )

where f_j and c_j are the components of feature F_i and class C_i respectively, n is the number of samples, and f̄ and c̄ are the means of F_i and C_i:

f̄ = (1/n) Σ_{j=1}^{n} f_j,  c̄ = (1/n) Σ_{j=1}^{n} c_j;

forming the set M_i from these K features and, using M_i as the approximate Markov blanket of F_i, calculating the score δ_G(F_i | M_i) of feature F_i:

δ_G(F_i | M_i) = Σ_{m, f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) ∥ P(C | M_i = m) )

where D_KL denotes the relative entropy, an index of the similarity between variables, computed as:

D_KL(P ∥ Q) = Σ_x P(x) log( P(x) / Q(x) )

Redundant features are then deleted from the initial feature subset according to the score δ_G(F_i | M_i) to obtain the candidate feature subset, specifically: the features in the initial feature subset are sorted by δ_G(F_i | M_i), the feature corresponding to the smallest δ_G(F_i | M_i) is deleted, and these steps are repeated to obtain the candidate feature subset with a preset number of features.
S104: Predict on the candidate feature subsets with the online logistic regression classifier and evaluate the candidate feature subsets according to the prediction results to select the optimal feature subset, which specifically comprises: the online logistic regression classifier predicts on the candidate feature subset using the prediction function:

P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b)))

where w is the weight, b is the bias, x is the input, and the prediction P(Y | x) lies in the range [0, 1];
for each feature input from the candidate feature subset, the prediction of the prediction function is obtained: when the prediction P > 0.5, the mail is spam, and when P ≤ 0.5, the mail is normal;
the features in the candidate feature subsets are evaluated according to the prediction results, and the preset number of features with the best prediction effect on the online logistic regression classifier is extracted as the optimal feature subset.
The present invention is described in detail below with specific embodiments.
With the development of anti-spam technology, spam-sending techniques have also improved: spammers disguise the characteristic words of spam through deliberate misspellings, character substitutions, inserted whitespace, and similar variants in order to evade detection systems. To overcome these problems, the present invention performs feature extraction on mail using a byte-level N-grams method. Byte-level n-grams feature extraction is very easy to use: it requires no dictionary support, no word segmentation of sentences, and no training on a corpus before use. When extracting features from a mail, no preprocessing is needed and mail encoding issues can be ignored; the mail is simply converted into an undifferentiated byte stream.
The n-grams feature extraction method splits the mail as a byte stream into segments of n bytes (where n = 1, 2, 3, 4, ...), obtaining byte strings of length n; each string is called a gram. For example, sliding-window segmentation of "information" with n = 4 yields the 8 4-gram features: info, nfor, form, orma, rmat, mati, atio, and tion.
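The sliding-window segmentation described above can be sketched in a few lines of Python (a minimal illustration only; the function name byte_ngrams is ours, not from the patent):

```python
def byte_ngrams(data: bytes, n: int = 4) -> list[bytes]:
    """Slide a window of n bytes over the raw byte stream of a mail."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

# The patent's example: "information" with n = 4 yields exactly 8 grams.
grams = byte_ngrams(b"information", n=4)
print(grams)
# [b'info', b'nfor', b'form', b'orma', b'rmat', b'mati', b'atio', b'tion']
```

Because the input is treated as an undifferentiated byte stream, the same function applies unchanged to any mail encoding.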
After feature extraction on all the training data, a high-dimensional hash dictionary is obtained, in which each position is a feature. A preset sample is compared against the hash dictionary: if a feature in the dictionary appears in the sample, the feature value at the corresponding position is 1; otherwise it is 0. The result is a high-dimensional sparse binary feature data set.
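The dictionary construction and 0/1 encoding can be sketched as follows (a hedged illustration on toy mails; the helper names build_dictionary and to_binary_vector are our own):

```python
def build_dictionary(mails: list[bytes], n: int = 4) -> dict[bytes, int]:
    """Map every n-gram seen in the training mails to a feature index."""
    index: dict[bytes, int] = {}
    for mail in mails:
        for i in range(len(mail) - n + 1):
            index.setdefault(mail[i:i + n], len(index))
    return index

def to_binary_vector(mail: bytes, index: dict[bytes, int], n: int = 4) -> list[int]:
    """Set 1 where a dictionary gram occurs in the mail, 0 elsewhere."""
    vec = [0] * len(index)
    for i in range(len(mail) - n + 1):
        j = index.get(mail[i:i + n])
        if j is not None:
            vec[j] = 1
    return vec

train = [b"free money", b"hello team"]
idx = build_dictionary(train)             # 14 distinct 4-grams
vec = to_binary_vector(b"free beer", idx)
print(sum(vec))                           # 2: only b'free' and b'ree ' are shared
```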
Features extracted by the n-grams method take only the values 0 and 1, and the data are quite sparse. Classical methods can process such data, but at a very high time cost. The present invention instead uses the relative density between a feature and a class to measure their degree of correlation, requiring no complicated computation or iteration. The relative density formula is:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1.
The present invention uses feature ranking as the first stage of feature selection. Each feature dimension is first scored by an evaluation criterion, and the features are sorted by score. For binary features, the invention uses the following measure of the correlation between a feature and the classes as the evaluation criterion:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

Because d_{F_i=1}^{C_l} ∈ [0, 1], the range of W(F_i)_diff is [0, 1]. When W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, it is most related; the higher the W(F_i)_diff score, the higher the correlation between the feature and the class. W(F_i)_diff can therefore serve as the evaluation criterion.
A threshold ω is preset according to actual demand. Features with W(F_i)_diff ≥ ω are considered highly correlated with the class and are retained to generate the initial feature subset F; the other, irrelevant features are deleted.
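Assuming a dense 0/1 NumPy matrix X (rows are mails, columns are features) and labels y in {0, 1}, the relative density, the W(F_i)_diff score, and the threshold step can be sketched as follows (the variable names and the toy threshold are ours):

```python
import numpy as np

def relative_density(X: np.ndarray, y: np.ndarray, cls: int) -> np.ndarray:
    """d_{F_i=1}^{C_l}: fraction of samples of class `cls` in which feature i is 1."""
    mask = (y == cls)
    return X[mask].sum(axis=0) / mask.sum()

def w_diff(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """W(F_i)_diff = |d^{C_1} - d^{C_0}|, the density-based relevance score."""
    return np.abs(relative_density(X, y, 1) - relative_density(X, y, 0))

# Toy data: 4 mails x 3 binary features; only feature 0 separates the classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])           # 1 = spam, 0 = normal
scores = w_diff(X, y)                # array([1., 0., 0.])
keep = np.where(scores >= 0.5)[0]    # threshold omega = 0.5 -> feature 0 kept
```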
Even after the correlation test, a large number of redundant features remains in the data. Such features bring unnecessary time consumption and can even harm classifier accuracy, so deleting redundant features is necessary.
On the basis of the initial feature subset, redundant features are deleted according to Markov blanket theory to select the optimal feature subset. Markov blanket theory states: assume the feature set is F and there is a subset M not containing feature F_i; if, conditioned on subset M, feature F_i is independent of the set F - M - {F_i}, then M is a Markov blanket of F_i. Formally:

P(F - M_i - {F_i}, C | F_i, M_i) = P(F - M_i - {F_i}, C | M_i)

If this holds, subset M contains all the information of feature F_i, so F_i is a redundant feature and can be deleted. In practical applications, however, searching for the optimal Markov blanket is an NP-hard problem, so the present invention adopts a heuristic algorithm and proposes an approximate Markov blanket model.
The embodiment of the present invention deletes redundant features one by one with a backward-deletion strategy. First, the feature subset is initialized as G = F. For each feature dimension F_i, the K features most correlated with F_i are chosen from the subset G - {F_i} according to the correlation coefficient:

ρ(F_i, C_i) = Σ_{j=1}^{n} (f_j - f̄)(c_j - c̄) / sqrt( Σ_{j=1}^{n} (f_j - f̄)^2 · Σ_{j=1}^{n} (c_j - c̄)^2 )

where f_j and c_j are the components of feature F_i and class C_i respectively, n is the number of samples, and f̄ and c̄ are the means of F_i and C_i:

f̄ = (1/n) Σ_{j=1}^{n} f_j,  c̄ = (1/n) Σ_{j=1}^{n} c_j

These K features form the set M_i, which is used as the approximate Markov blanket of F_i to calculate the score δ_G(F_i | M_i):

δ_G(F_i | M_i) = Σ_{m, f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) ∥ P(C | M_i = m) )

where D_KL denotes the KL distance, also called relative entropy, an index of the similarity between variables; the smaller D_KL, the higher the similarity. Because probability values are needed but the true distribution of the features is hard to obtain, the original definition of probability (empirical frequencies) is used here to compute the joint probability distribution. The KL distance formula is:

D_KL(P ∥ Q) = Σ_x P(x) log( P(x) / Q(x) )
From the computation of δ_G(F_i | M_i), it is easy to see that the smaller the score, the higher the similarity between M_i and F_i and the more information about F_i is contained in M_i, so M_i can be regarded approximately as the Markov blanket of F_i. The features are therefore sorted by δ_G(F_i | M_i), and the feature F_i corresponding to the smallest score is deleted. This loop continues, and the number of remaining features can be set as needed to retain the better features. In this way the n candidate feature subsets G_1, G_2, ..., G_n are obtained, from which the optimal subset is selected.
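The backward-deletion loop can be sketched as follows. This is a hedged reconstruction: the patent gives the score formula only as a figure, so the code uses a Koller-Sahami-style expected-KL score estimated from empirical frequencies, following the textual description; all names (delta_score, backward_delete, K, keep) are ours.

```python
import numpy as np
from itertools import product

def delta_score(X, y, i, blanket, eps=1e-12):
    """delta_G(F_i | M_i): expected KL divergence between P(C | M_i, F_i)
    and P(C | M_i), estimated from empirical frequencies."""
    total = 0.0
    cols = X[:, blanket]
    for assign in product([0, 1], repeat=len(blanket) + 1):
        m, fi = assign[:-1], assign[-1]
        sel = (cols == m).all(axis=1) & (X[:, i] == fi)
        if not sel.any():
            continue                      # unseen configuration, probability 0
        sel_m = (cols == m).all(axis=1)
        p_both = np.array([(y[sel] == c).mean() for c in (0, 1)])
        p_m = np.array([(y[sel_m] == c).mean() for c in (0, 1)])
        kl = np.sum(p_both * np.log((p_both + eps) / (p_m + eps)))
        total += sel.mean() * kl
    return total

def backward_delete(X, y, K=1, keep=2):
    """Repeatedly drop the feature whose approximate blanket explains it best."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > keep:
        scores = {}
        for i in remaining:
            others = [j for j in remaining if j != i]
            corr = [abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) for j in others]
            blanket = [others[k] for k in np.argsort(corr)[::-1][:K]]
            scores[i] = delta_score(X, y, i, blanket)
        remaining.remove(min(scores, key=scores.get))
    return remaining

# Feature 1 duplicates feature 0, so one of the pair is redundant.
X = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
res = backward_delete(X, y, K=1, keep=2)
print(res)   # one copy of the duplicated pair has been deleted
```

On the toy data, the duplicated feature's blanket carries all of its class information, so its score is minimal and one copy of the pair is removed first.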
This embodiment also proposes an evaluation method for the feature subsets. It is more targeted than other methods and can achieve good results on a specific data set. The feature selection method is wrapped together with a classifier, and the feature subsets are evaluated by the classifier. After the two-stage feature selection, the selected features are all significant features containing the main information of the corresponding classes, and thus have strong discriminative power. The classifier used in the present invention is an online logistic regression (LR) classifier, whose time complexity is low and classification efficiency is high, giving it a great advantage in processing high-dimensional data.
The idea of the logistic regression model is that there exists a hyperplane f(x) = w·x + b = 0, with the prediction function:

P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b)))

where w is the weight, b is the bias, x is the input, and P(Y | x) is a continuous output value in the range [0, 1]. For a given input example x, the formula yields a value P in [0, 1]: when P > 0.5, Y = 1 is taken, i.e., the prediction is spam; otherwise, when P ≤ 0.5, Y = 0 is taken and the prediction is normal mail.
The embodiment of the present invention uses the stochastic gradient descent update rule. Although traditional gradient descent can obtain the globally optimal solution, every iteration must traverse all the data, which is extremely inefficient when processing massive data. The idea of stochastic gradient descent is to train on only the current example, without traversing all samples; it is highly efficient and obtains a near-optimal solution. The stochastic gradient descent update is:

w_i ← w_i - α (f(x_i) - Y_i) x_i
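A minimal sketch of the online logistic regression classifier with this update rule (the class name, learning rate, and toy stream are our own choices, not the patent's):

```python
import numpy as np

class OnlineLR:
    """Online logistic regression trained one example at a time (SGD)."""
    def __init__(self, dim: int, alpha: float = 0.5):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.alpha = alpha

    def predict_proba(self, x: np.ndarray) -> float:
        """P(Y=1|x) = 1 / (1 + exp(-(w.x + b)))."""
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def update(self, x: np.ndarray, y: int) -> None:
        """w <- w - alpha * (f(x) - y) * x, and likewise for the bias."""
        err = self.predict_proba(x) - y
        self.w -= self.alpha * err * x
        self.b -= self.alpha * err

# Toy binary-feature stream: feature 0 marks spam (y = 1), feature 1 normal mail.
clf = OnlineLR(dim=2)
for x, y in [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), 0)] * 50:
    clf.update(x, y)
assert clf.predict_proba(np.array([1.0, 0.0])) > 0.5    # classified as spam
assert clf.predict_proba(np.array([0.0, 1.0])) <= 0.5   # classified as normal
```

Each update touches only the current example, which is what makes the model suitable for the high-dimensional mail streams the text describes.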
Through the training and classification of the online logistic regression model, each sample is given a score by the formula above: when the score is greater than 0.5, the predictor classifies the sample as spam; otherwise it is predicted to be normal mail. The feature subsets are then evaluated by comparing the predicted classes with the actual classes.
For n candidate subset G1,G2,...,Gn, we want to obtain best that character subset work of classifying quality For according to predicting the outcome that previous step is obtained, we will be evaluated each subset.By the classification of previous step, can To obtain some related datas, table 1 lists the statistic of Calculation Estimation function needs:
The evaluation function statistical form of table 1
From these data the following statistic can be computed:

BER = (1/2) · (FN / P + FP / N)

where BER is the balanced error rate. When the numbers of normal mails and spam differ greatly, BER evaluates the effect of a feature set on the classifier better than raw accuracy. Specifically, the feature set is classified by the online logistic regression classifier, where the number of normal mails is P and the number of spam mails is N; counting the number of correctly classified normal mails TP and correctly classified spam mails TN gives TP = P - FN and TN = N - FP.
Finally, a series of BER values BER_1, BER_2, ..., BER_n is obtained. The feature set G_opt corresponding to the smallest BER value is selected as the final feature subset, that is, the optimal feature subset: in the online logistic regression model, G_opt has the best classification effect.
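The BER computation can be sketched directly from the definitions above (a minimal illustration following the patent's convention that P counts normal mails, N counts spam, TP = P - FN, and TN = N - FP; the function name is ours):

```python
def balanced_error_rate(y_true: list[int], y_pred: list[int]) -> float:
    """BER = 0.5 * (FN/P + FP/N), with label 0 = normal mail and 1 = spam.
    FN: normal mails predicted as spam; FP: spam predicted as normal."""
    P = sum(1 for t in y_true if t == 0)
    N = sum(1 for t in y_true if t == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.5 * (FN / P + FP / N)

print(balanced_error_rate([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0 (perfect)
print(balanced_error_rate([0, 0, 1, 1], [1, 0, 0, 1]))  # 0.5 (one error per class)
```

The candidate subset with the smallest BER over G_1, ..., G_n would then be taken as G_opt.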
In addition, the present invention also proposes a spam detection method based on the above feature selection method for spam email which, as shown in Fig. 2, comprises the following steps:
S201: Perform feature extraction on the mail using a byte-level N-grams method, which specifically comprises: splitting the mail as a byte stream into byte segments of preset length to obtain a hash dictionary of the mail; and comparing preset samples against the hash dictionary to obtain a feature set corresponding to the hash dictionary.
When a feature in the hash dictionary appears in the preset sample, the feature value at the corresponding position of the hash dictionary is set to 1; if it does not appear, the feature value at the corresponding position is set to 0, yielding a sparse binary feature data set.
S202: Rank the extracted features by their degree of correlation with the preset mail classes to generate an initial feature subset, which specifically comprises: calculating the relative density of each extracted feature with respect to the preset mail classes, specifically:

d_{F_i=1}^{C_l} = n_{F_i=1}^{C_l} / S^{C_l}

where F is the feature set, F_i is the i-th feature in the set, C is the preset mail class set, C_l is the l-th class in the class set, S^{C_l} is the number of samples contained in class C_l, n_{F_i=1}^{C_l} is the number of samples of class C_l in which the i-th feature takes the value 1, L is the total number of classes, M is the total number of features, and d_{F_i=1}^{C_l} ∈ [0, 1] denotes the relative density of the i-th feature with respect to class C_l when its value is 1. The degree of correlation between the extracted features and the preset mail classes is judged according to the relative density, and the features are ranked by the degree of correlation to generate the initial feature subset.
Judging the degree of correlation according to the relative density specifically comprises: calculating the correlation from the relative density with the following formula:

W(F_i)_diff = | d_{F_i=1}^{C_1} - d_{F_i=1}^{C_0} |

where the range of W(F_i)_diff is [0, 1], d_{F_i=1}^{C_1} denotes the relative density of the i-th feature with respect to class C_1 when its value is 1, and d_{F_i=1}^{C_0} denotes the relative density of the i-th feature with respect to class C_0 when its value is 1; when W(F_i)_diff = 0, feature F_i is least related to the class, and when W(F_i)_diff = 1, feature F_i is most related to the class. Using W(F_i)_diff as the evaluation criterion, W(F_i)_diff is compared with a preset threshold ω to judge the degree of correlation between feature F_i and the preset mail classes.
It is special that S203 obtains candidate according to the redundancy feature that approximate Markov blanket algorithm is deleted in the initial characteristicses subset Subset is levied, is specifically included:Initialization feature subset, for the feature F in the initial characteristicses subsetiAccording to coefficient correlation from institute State selection and the F in initial characteristicses subsetiMaximally related K feature, the computing formula of the coefficient correlation is as follows:
r(F_i, C) = Σ_{j=1}^{n} (f_j - f_mean)(c_j - c_mean) / sqrt( Σ_{j=1}^{n} (f_j - f_mean)^2 · Σ_{j=1}^{n} (c_j - c_mean)^2 ), where f_j and c_j are the components of feature F_i and the class label respectively, n is the number of samples, and f_mean and c_mean are the means of F_i and the class label: f_mean = (1/n) Σ_{j=1}^{n} f_j, c_mean = (1/n) Σ_{j=1}^{n} c_j.
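The correlation-coefficient step (the standard Pearson formula over a feature column and the class column) and the top-K selection can be sketched as follows; the toy columns and names are illustrative:

```python
import math

def pearson(f, c):
    """Standard Pearson correlation between two equal-length columns."""
    n = len(f)
    mf, mc = sum(f) / n, sum(c) / n
    cov = sum((fi - mf) * (ci - mc) for fi, ci in zip(f, c))
    sf = math.sqrt(sum((fi - mf) ** 2 for fi in f))
    sc = math.sqrt(sum((ci - mc) ** 2 for ci in c))
    return cov / (sf * sc)

target = [1, 0, 1, 0, 1]       # column for feature F_i
others = {                      # candidate columns (toy data)
    "A": [1, 0, 1, 0, 1],       # identical   -> r = 1
    "B": [0, 1, 0, 1, 0],       # inverted    -> r = -1
    "C": [1, 1, 0, 0, 1],       # weakly related
}
K = 2
top_k = sorted(others, key=lambda name: -abs(pearson(others[name], target)))[:K]
# top_k == ["A", "B"]: the K columns most correlated (in magnitude) with F_i
```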
These K features form the set M_i, and with M_i taken as the approximate Markov blanket of F_i the score δ_G(F_i|M_i) of feature F_i is computed as δ_G(F_i|M_i) = Σ_{m,f} P(M_i = m, F_i = f) · D_KL( P(C | M_i = m, F_i = f) || P(C | M_i = m) ).
Here D_KL denotes the relative entropy (Kullback-Leibler divergence), an index of the similarity between two distributions, computed as D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) ).
The redundant features in the initial feature subset are deleted according to the score δ_G(F_i|M_i) to obtain the candidate feature subset, in the following steps: sort the features in the initial feature subset by δ_G(F_i|M_i); delete the feature with the smallest δ_G(F_i|M_i); and repeat these steps until the preset number of features remains, yielding the candidate feature subset.
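The delete-the-weakest loop of S203 can be sketched as follows. The per-feature scores here are hypothetical stand-ins for δ_G(F_i|M_i), each computed as a single KL divergence between two toy class-conditional distributions (the full score averages such divergences over the blanket's configurations):

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy scores: KL between P(C | M_i, F_i) and P(C | M_i) per feature.
scores = {
    "F1": kl_divergence([0.9, 0.1], [0.5, 0.5]),  # adding F1 shifts P(C|M) a lot
    "F2": kl_divergence([0.5, 0.5], [0.5, 0.5]),  # F2 is redundant given M_2
    "F3": kl_divergence([0.7, 0.3], [0.5, 0.5]),
}
target_size = 2
subset = dict(scores)
while len(subset) > target_size:
    weakest = min(subset, key=subset.get)  # lowest delta_G => most redundant
    del subset[weakest]
# F2 (score 0) is deleted first; F1 and F3 survive.
```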
S204: predict on the candidate feature subset with the online logistic regression classifier and evaluate the candidate feature subset according to the prediction results to select the optimal feature subset. Specifically: the online logistic regression classifier predicts on the candidate feature subset with the prediction function P(Y | x) = 1 / (1 + exp(-(w·x + b))).
Here w is the weight, b is the bias, x is the input, and P(Y | x) is the prediction result, with range [0, 1]. The features in the candidate feature subset are taken as input to obtain the prediction of this function: when the prediction P > 0.5 the mail is judged to be spam, and when P ≤ 0.5 it is judged to be normal mail. According to the prediction results, the features in the candidate feature subset are evaluated, and the preset number of features with the best prediction performance on the online logistic regression classifier are extracted as the optimal feature subset.
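The predict-and-update cycle of an online logistic regression classifier can be sketched as follows. The stochastic-gradient update rule, learning rate, and toy mail stream are assumptions for illustration; the patent itself specifies only the sigmoid prediction function and the 0.5 decision threshold:

```python
import math

class OnlineLogisticRegression:
    """Sigmoid prediction P(Y|x) = 1/(1+exp(-(w.x+b))), one SGD step per mail."""
    def __init__(self, dim, lr=0.5):
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):            # y in {0, 1}
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

clf = OnlineLogisticRegression(dim=2)
stream = [([1, 0], 1), ([0, 1], 0)] * 50   # toy stream: feature 0 marks spam
for x, y in stream:
    clf.update(x, y)

is_spam = clf.predict_proba([1, 0]) > 0.5  # P > 0.5 => flag as spam
```

The per-mail update is what makes the classifier "online": each arriving labeled mail costs one gradient step, so the model tracks a drifting mail stream without retraining from scratch.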
S205: detect spam with the online logistic regression classifier according to the optimal feature subset. Specifically: when a mail needs to be examined, the online logistic regression classifier predicts on the optimal feature subset, and spam is detected according to the prediction result.
Because a spam detection system must update and detect in real time, an online logistic regression model is chosen as the classifier; this not only improves recognition accuracy but also reduces the time complexity of training and recognition. Compared with traditional spam detection methods, the present invention greatly reduces the feature dimensionality through the spam-oriented feature selection method, obtains the optimal feature subset through the logistic regression model, and, in the classification phase, detects spam with the online logistic regression model.
Through the wrapper-style feature selection method, the final optimal feature subset G_opt is obtained; the features it contains are all highly relevant to the class and of low redundancy, giving it strong classification performance. The evaluation by the online logistic regression model ensures that G_opt performs best on that model, so using the logistic regression model in the detection phase yields the best prediction performance.
Whenever a mail needs to be examined, the logistic regression classifier assigns each mail a score by computing P(Y | x): when the score is greater than 0.5 the mail is judged to be spam; conversely, when the score is less than or equal to 0.5 it is judged to be normal mail.
The feature selection method for spam and the associated detection method proposed by the present invention perform spam feature selection with a wrapper-style feature selection algorithm, greatly reducing the feature dimensionality. Using the online logistic regression model, a large number of irrelevant and redundant features are removed from the mail data to generate the optimal feature subset, and spam detection is carried out with that optimal feature subset, fundamentally improving detection accuracy and reducing the time consumed by the classification algorithm. The method can be widely applied in spam detection.
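The byte-based N-grams extraction that feeds this pipeline (step S201, restated in claim 1) can be sketched as follows. The n-gram length n = 3, the toy corpus, and the dense 0/1 encoding are assumptions for illustration; a Python dict serves as the "hash dictionary", since dicts are hash tables:

```python
# Byte-level n-grams: slice the mail's byte stream into all windows of
# preset length n, then encode each mail as a sparse binary vector over
# the dictionary of n-grams seen in the training corpus.
def byte_ngrams(data: bytes, n: int = 3):
    return {data[i:i + n] for i in range(len(data) - n + 1)}

# Build the hash dictionary from a toy training corpus of raw mail bytes.
corpus = [b"cheap pills here", b"meeting at noon"]
dictionary = sorted(set().union(*(byte_ngrams(m) for m in corpus)))
index = {g: k for k, g in enumerate(dictionary)}   # n-gram -> position

def encode(mail: bytes):
    """1 where the dictionary n-gram occurs in the mail, 0 where it does not."""
    present = byte_ngrams(mail)
    return [1 if g in present else 0 for g in dictionary]

vec = encode(b"cheap meeting")
```

Because most mails contain only a tiny fraction of the dictionary's n-grams, the resulting binary data set is sparse, which is what makes the subsequent density and correlation computations cheap.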
The above embodiments are merely illustrative of the present invention and do not limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; therefore all equivalent technical solutions fall within the scope of the invention, whose scope of patent protection shall be defined by the claims.

Claims (7)

1. A feature selection method for spam, characterized by comprising:
performing feature extraction of mail with a byte-based N-grams method;
generating an initial feature subset by ranking features according to the degree of correlation between the extracted features and preset mail classes;
deleting redundant features in the initial feature subset according to an approximate Markov blanket algorithm to obtain a candidate feature subset; and
predicting on the candidate feature subset with an online logistic regression classifier and evaluating the candidate feature subset according to the prediction results to select an optimal feature subset;
wherein performing feature extraction of mail with the byte-based N-grams method specifically comprises:
slicing the mail, as a byte stream, into byte segments of preset length to obtain a hash dictionary of the mail; and
comparing preset samples with the hash dictionary to obtain a feature set corresponding to the hash dictionary;
wherein comparing the preset samples with the hash dictionary to obtain the feature set corresponding to the hash dictionary is specifically:
if a feature in the hash dictionary occurs in the preset sample, setting the feature value at the corresponding position of the hash dictionary to 1, and if it does not occur, setting the feature value at the corresponding position to 0, thereby obtaining a sparse binary feature data set;
wherein generating the initial feature subset by ranking features according to the degree of correlation between the extracted features and the preset mail classes specifically comprises:
computing the relative density of the extracted features with respect to the preset mail classes, as follows:
wherein F is the feature set, F_i is the i-th feature in the set, C is the preset set of mail classes, C_l is the l-th class in the set, N_{C_l} is the number of samples contained in class C_l, L is the total number of classes, M is the total number of features, and d_il denotes the relative density of the i-th feature with respect to class C_l when its value is 1, d_il = (number of samples in C_l in which F_i = 1) / N_{C_l};
judging the degree of correlation between the extracted features and the preset mail classes according to the relative density; and
generating the initial feature subset by ranking the features according to the degree of correlation.
2. The method of claim 1, characterized in that judging the degree of correlation between the extracted features and the preset mail classes according to the relative density specifically comprises:
computing the correlation according to the relative density with the formula W(F_i)_diff = |d_i1 - d_i0|,
wherein the range of W(F_i)_diff is [0, 1], d_i1 denotes the relative density of the i-th feature (value 1) with respect to class C_1, d_i0 denotes the relative density with respect to class C_0, W(F_i)_diff = 0 indicates that feature F_i is least related to the class, and W(F_i)_diff = 1 indicates that F_i is most related to the class; and
taking W(F_i)_diff as the evaluation criterion and comparing W(F_i)_diff with a preset threshold ω to judge the degree of correlation between the feature F_i and the preset mail classes.
3. The method of claim 1, characterized in that deleting the redundant features in the initial feature subset according to the approximate Markov blanket algorithm to obtain the candidate feature subset specifically comprises:
initializing the feature subset, and for each feature F_i in the initial feature subset, selecting from the initial feature subset the K features most correlated with F_i according to the correlation coefficient, computed as
r(F_i, C) = Σ_{j=1}^{n} (f_j - f_mean)(c_j - c_mean) / sqrt( Σ_{j=1}^{n} (f_j - f_mean)^2 · Σ_{j=1}^{n} (c_j - c_mean)^2 ),
wherein f_j and c_j are the components of feature F_i and the class label respectively, n is the number of samples, and f_mean and c_mean are the means of F_i and the class label, f_mean = (1/n) Σ_{j=1}^{n} f_j and c_mean = (1/n) Σ_{j=1}^{n} c_j;
forming the set M_i from these K features and, taking M_i as the approximate Markov blanket of F_i, computing the score δ_G(F_i|M_i) of feature F_i, wherein D_KL denotes the relative entropy, an index of the similarity between variables, D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) ); and
deleting the redundant features in the initial feature subset according to the score δ_G(F_i|M_i) to obtain the candidate feature subset.
4. The method of claim 3, characterized in that deleting the redundant features in the initial feature subset according to the score δ_G(F_i|M_i) to obtain the candidate feature subset specifically comprises:
sorting the features in the initial feature subset by the score δ_G(F_i|M_i) and deleting the feature with the smallest δ_G(F_i|M_i); and
repeating the steps of claim 3 until the candidate feature subset with the preset number of features is obtained.
5. The method of claim 1, characterized in that predicting on the candidate feature subset with the online logistic regression classifier and selecting the optimal feature subset by evaluating the candidate feature subset according to the prediction results specifically comprises:
the online logistic regression classifier predicting on the candidate feature subset with the prediction function
P(Y | x) = 1 / (1 + exp(-(w·x + b))),
wherein w is the weight, b is the bias, x is the input, and P(Y | x) is the prediction result with range [0, 1];
taking the features in the candidate feature subset as input to obtain the prediction of the function, the mail being judged spam when the prediction P > 0.5 and normal mail when P ≤ 0.5; and
evaluating the features in the candidate feature subset according to the prediction results and extracting the preset number of features with the best prediction performance on the online logistic regression classifier as the optimal feature subset.
6. A spam detection method based on the feature selection method of claim 1, characterized by comprising:
detecting spam with the online logistic regression classifier according to the optimal feature subset.
7. The method of claim 6, characterized in that detecting spam with the online logistic regression classifier according to the optimal feature subset is specifically:
when a mail needs to be examined, the online logistic regression classifier predicting on the optimal feature subset; and
detecting spam according to the prediction result.
CN201410228073.6A 2014-05-27 2014-05-27 The feature selection approach and its detection method of a kind of spam Active CN104050556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410228073.6A CN104050556B (en) 2014-05-27 2014-05-27 The feature selection approach and its detection method of a kind of spam


Publications (2)

Publication Number Publication Date
CN104050556A CN104050556A (en) 2014-09-17
CN104050556B true CN104050556B (en) 2017-06-16

Family

ID=51503365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410228073.6A Active CN104050556B (en) 2014-05-27 2014-05-27 The feature selection approach and its detection method of a kind of spam

Country Status (1)

Country Link
CN (1) CN104050556B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205349B (en) * 2015-08-25 2018-08-03 合肥工业大学 The Embedded Gene Selection Method based on encapsulation of Markov blanket
CN105306296B (en) * 2015-10-21 2018-10-12 北京工业大学 A kind of data filtering processing method based on LTE signalings
CN106570178B (en) * 2016-11-10 2020-09-29 重庆邮电大学 High-dimensional text data feature selection method based on graph clustering
CN107193804B (en) * 2017-06-02 2019-03-29 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN107239447B (en) * 2017-06-05 2020-12-18 厦门美柚股份有限公司 Junk information identification method, device and system
CN109241523B (en) * 2018-08-10 2020-12-11 北京百度网讯科技有限公司 Method, device and equipment for identifying variant cheating fields
CN110119756B (en) * 2019-03-25 2021-08-10 北京天泽智云科技有限公司 Automatic trend data feature selection method based on voting method
CN110174106A (en) * 2019-04-01 2019-08-27 香港理工大学深圳研究院 A kind of healthy walking paths planning method and terminal device based on PM2.5
CN111312403A (en) * 2020-01-21 2020-06-19 山东师范大学 Disease prediction system, device and medium based on instance and feature sharing cascade
CN112561082A (en) * 2020-12-22 2021-03-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for generating model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device
US8417783B1 (en) * 2006-05-31 2013-04-09 Proofpoint, Inc. System and method for improving feature selection for a spam filtering model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8417783B1 (en) * 2006-05-31 2013-04-09 Proofpoint, Inc. System and method for improving feature selection for a spam filtering model
CN101930561A (en) * 2010-05-21 2010-12-29 电子科技大学 N-Gram participle model-based reverse neural network junk mail filter device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research and Implementation of a Bayesian-Classification-Based Spam Filtering ***"; Lin Wei; China Master's Theses Full-text Database, Information Science and Technology; 2010-02-15; pp. I139-96 *

Also Published As

Publication number Publication date
CN104050556A (en) 2014-09-17

Similar Documents

Publication Publication Date Title
CN104050556B (en) The feature selection approach and its detection method of a kind of spam
Barrón-Cedeno et al. Proppy: A system to unmask propaganda in online news
CN106599054B (en) Method and system for classifying and pushing questions
CN102411563B (en) Method, device and system for identifying target words
CN102929937B (en) Based on the data processing method of the commodity classification of text subject model
TWI438637B (en) Systems and methods for capturing and managing collective social intelligence information
CN104199965B (en) Semantic information retrieval method
CN109471942B (en) Chinese comment emotion classification method and device based on evidence reasoning rule
WO2021051518A1 (en) Text data classification method and apparatus based on neural network model, and storage medium
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
CN105335352A (en) Entity identification method based on Weibo emotion
CN105824922A (en) Emotion classifying method fusing intrinsic feature and shallow feature
CN106156372B (en) A kind of classification method and device of internet site
CN105279252A (en) Related word mining method, search method and search system
CN105389379A (en) Rubbish article classification method based on distributed feature representation of text
CN107688630B (en) Semantic-based weakly supervised microbo multi-emotion dictionary expansion method
CN105574544A (en) Data processing method and device
BaygIn Classification of text documents based on Naive Bayes using N-Gram features
CN103955547A (en) Method and system for searching forum hot-posts
CN101702167A (en) Method for extracting attribution and comment word with template based on internet
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
Singh et al. Sentiment analysis of Twitter data using TF-IDF and machine learning techniques
CN114139634A (en) Multi-label feature selection method based on paired label weights
CN108596205B (en) Microblog forwarding behavior prediction method based on region correlation factor and sparse representation
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221214

Address after: Room 1035 and Room 1036, Block C1, C2 and C3, Daqing Service Outsourcing Industrial Park, No. 6-1, Xinfeng Road, High tech Zone, Daqing City, Heilongjiang Province, 163711

Patentee after: Daqing Lehen Information Technology Co.,Ltd.

Address before: 150080 No. 52, Xuefu Road, Nangang District, Heilongjiang, Harbin

Patentee before: HARBIN University OF SCIENCE AND TECHNOLOGY

TR01 Transfer of patent right