CN117474510A

CN117474510A - Feature selection-based spam filtering method

Info

Publication number: CN117474510A
Application number: CN202311791685.1A
Authority: CN
Inventors: 杨良志; 白琳; 汪志新; 卢业波; 白小刚; 瞿勇金; 张蔓
Original assignee: Richinfo Technology Co ltd
Current assignee: Richinfo Technology Co ltd
Priority date: 2023-12-25
Filing date: 2023-12-25
Publication date: 2024-01-30

Abstract

The invention provides a feature selection-based spam filtering method, which belongs to the technical field of text processing and comprises the following steps: preprocessing text information in a sample mail to obtain preprocessed text information; converting the preprocessed text information into a corresponding sample feature vector through a bag-of-words model; performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector; constructing a mail classification model according to the optimized sample feature vector; acquiring a mail to be processed; inputting the mail to be processed into the mail classification model to obtain a corresponding classification result; and filtering out junk mail according to the classification result while retaining normal mail. The method identifies the mail to be processed through the mail classification model, overcomes the defect of the traditional technology that junk mail must be identified manually, improves email processing efficiency, filters the identified junk mail, and thereby improves the safety of email use.

Description

Feature selection-based spam filtering method

Technical Field

The invention relates to the technical field of text processing, in particular to a feature selection-based spam filtering method.

Background

With the rapid development of network technology, email has become widely used in daily life and work. It is an important tool for everyday communication and information transfer, provides functions such as sending, receiving, storing, and searching mail, and brings great convenience to people's work and life.

However, unscrupulous merchants now use technical means to send advertising mail indiscriminately, and hackers maliciously spread virus files through email, which causes great trouble for users at work and at home. Advertising mail and virus mail are both classified as junk mail. In the traditional technology, users must judge and delete junk mail manually, one message at a time, which wastes time and energy and reduces mail processing efficiency; meanwhile, fraud and false information in junk mail may cause loss to the property and interests of users.

Accordingly, a feature selection-based spam filtering method is presented.

Disclosure of Invention

In order to solve the above technical problem, the invention provides a feature selection-based spam filtering method, which addresses the inconvenience of manually identifying junk mail in the conventional technology.

The embodiment of the invention provides a feature selection-based spam filtering method, which comprises the following steps:

acquiring a sample mail;

preprocessing the text information in the sample mail to obtain preprocessed text information;

converting the preprocessed text information into a corresponding sample feature vector through a bag-of-words model;

performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector;

constructing a mail classification model according to the optimized sample feature vector;

acquiring mail to be processed;

inputting the mail to be processed into the mail classification model to obtain a corresponding classification result;

and filtering the junk mails according to the classification result, and retaining the normal mails.

Preferably, in the feature selection-based spam filtering method, preprocessing the text information in the sample mail to obtain preprocessed text information comprises:

acquiring the text information in the sample mail;

identifying the language type of the text information, performing the corresponding word segmentation according to the language type, and constructing a feature phrase library, comprising:

when the language type is Chinese, performing word segmentation on the text information through the JIEBA word segmentation library to obtain Chinese feature phrases, and storing them in the feature phrase library;

when the language type is English, performing word segmentation on the text information at the spaces in the text information to obtain English feature phrases, and storing them in the feature phrase library;

and filtering out meaningless information in the feature phrase library to obtain the preprocessed text information.

Preferably, in the feature selection-based spam filtering method, the meaningless information that is filtered out of the feature phrase library includes function words, punctuation marks, and low-frequency phrases.
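The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the language check (any CJK ideograph means Chinese), the lowercasing of English text, the `FUNCTION_WORDS` list, and the `min_count` cutoff for low-frequency phrases are all hypothetical choices that the source does not specify. Only the use of the JIEBA library for Chinese and of spaces for English comes from the text.

```python
import re
import string
from collections import Counter

# Hypothetical function-word list; the source does not enumerate one.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "and", "is", "at", "的", "了", "是"}

def is_chinese(text):
    """Assumed language check: any CJK unified ideograph means Chinese."""
    return re.search(r"[\u4e00-\u9fff]", text) is not None

def segment(text):
    """Segment Chinese with jieba and English at whitespace."""
    if is_chinese(text):
        import jieba  # third-party library named in the source
        return jieba.lcut(text)
    return text.lower().split()  # lowercasing is an added assumption

def preprocess(mails, min_count=2):
    """Return one list of feature phrases per mail after filtering."""
    tokenized = []
    for mail in mails:
        # Strip ASCII punctuation stuck to a token, e.g. "now!" -> "now".
        tokens = [tok.strip(string.punctuation) for tok in segment(mail)]
        # Drop tokens with no word character left (pure punctuation, blanks).
        tokenized.append([tok for tok in tokens if re.search(r"\w", tok)])
    # Phrase counts over the whole library decide what is "low-frequency".
    counts = Counter(tok for tokens in tokenized for tok in tokens)
    return [
        [tok for tok in tokens
         if tok not in FUNCTION_WORDS and counts[tok] >= min_count]
        for tokens in tokenized
    ]
```

The low-frequency filter is applied across the whole sample library rather than per mail, since a phrase that is rare in one mail may still be common overall.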

Preferably, in the feature selection-based spam filtering method, converting the preprocessed text information into a corresponding sample feature vector through a bag-of-words model comprises:

calculating the weight of each phrase in the preprocessed text information, where the weight of the i-th phrase is α_i and the preprocessed text information contains n phrases in total;

[formula not reproduced in the source]

wherein f_i is the frequency of occurrence of the i-th phrase in the preprocessed text information, M is the number of all phrases in the text library, and m_i is the number of times the i-th phrase occurs in the text library;

constructing the sample feature vector T = [α_1, …, α_n] according to the weight of each phrase.
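The weight formula itself is not reproduced in this text, so the sketch below assumes a TF-IDF-style form, α_i = f_i · ln(M / m_i), built only from the three quantities the text defines. The function name `phrase_weights`, the natural logarithm, and the relative-frequency reading of f_i are assumptions for illustration.

```python
import math
from collections import Counter

def phrase_weights(mail_phrases, library):
    """Weight each distinct phrase of one mail.

    mail_phrases: preprocessed phrases of the mail.
    library: list of phrase lists forming the text library; it is assumed
             to contain the mail itself, so every m_i is at least 1.
    Assumed form (not confirmed by the source): alpha_i = f_i * ln(M / m_i).
    """
    M = sum(len(phrases) for phrases in library)          # all phrases in the library
    m = Counter(p for phrases in library for p in phrases)  # m_i per phrase
    counts = Counter(mail_phrases)
    total = len(mail_phrases)
    vocab = sorted(counts)
    weights = [(counts[p] / total) * math.log(M / m[p]) for p in vocab]
    return vocab, weights
```

The returned list plays the role of T = [α_1, …, α_n], with `vocab` recording which phrase each position refers to.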

Preferably, in the feature selection-based spam filtering method, performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector comprises:

performing gain analysis according to the preprocessed text information, and performing feature selection on feature vectors in the sample feature vectors to obtain initial optimized feature vectors;

and carrying out simplification processing on the initial optimized feature vector for a plurality of times, constructing a standard feature vector set, and carrying out threshold probability judgment processing on the test probability of all feature vectors in the standard feature vector set to obtain an optimized sample feature vector.

Preferably, in the feature selection-based spam filtering method, performing gain analysis according to the preprocessed text information and performing feature selection on the feature vectors in the sample feature vector to obtain an initial optimized feature vector comprises:

calculating the gain value of each phrase in the preprocessed text information, where the gain value of the i-th phrase is g_i;

[formula not reproduced in the source]

wherein P(mail_j) is the probability that a sample mail occurs in the j-th category, F_i is the probability that the i-th phrase appears in the sample mail, and the remaining three terms are, in order, the probability that the i-th phrase does not appear in the sample mail, the probability that the i-th phrase appears in the j-th category, and the probability that the i-th phrase does not appear in the j-th category;

setting a gain threshold, acquiring the phrases in the preprocessed text information that do not meet the gain threshold, and filtering the feature vectors corresponding to those phrases out of the sample feature vector to obtain the initial optimized feature vector, comprising:

when the gain value g_i of the i-th phrase is less than the gain threshold Tr_g, filtering α_i out of the sample feature vector T to obtain the initial optimized feature vector T' = [α_1', …, α_n1'].
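The gain formula is not reproduced in this text. The quantities it lists (category probabilities, phrase presence and absence, and their per-category counterparts) match the standard information gain of a binary term feature, so the sketch below assumes that standard definition. The function names and the base-2 logarithm are illustrative choices.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(mails, labels, phrase):
    """Assumed gain g_i: class entropy minus entropy conditioned on presence.

    mails: one set of phrases per sample mail; labels: category per mail.
    """
    n = len(mails)
    classes = sorted(set(labels))
    gain = entropy([labels.count(c) / n for c in classes])
    for present in (True, False):
        subset = [lab for mail, lab in zip(mails, labels)
                  if (phrase in mail) == present]
        if subset:
            cond = [subset.count(c) / len(subset) for c in classes]
            gain -= len(subset) / n * entropy(cond)
    return gain

def select_by_gain(mails, labels, vocab, gain_threshold):
    """Keep phrases whose gain is not below the threshold Tr_g."""
    return [p for p in vocab
            if information_gain(mails, labels, p) >= gain_threshold]
```

A phrase that appears in every spam sample and in no normal sample reaches the maximum gain for a two-class problem, while a phrase spread evenly over both classes scores close to zero and is filtered out.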

Preferably, in the feature selection-based spam filtering method, performing simplification processing on the initial optimized feature vector several times, constructing a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain an optimized sample feature vector comprises:

performing t rounds of simplification processing on the initial optimized feature vector to obtain t corresponding standard feature vector subsets;

combining the t standard feature vector subsets into a standard feature vector set;

obtaining the test probabilities of all feature vectors in the standard feature vector set and constructing a vector probability set, where the test probability of the i-th feature vector in the standard feature vector set is p_i;

[formula not reproduced in the source]

wherein the count term is the number of times the feature vector occurs over the t rounds of simplification processing performed on the initial optimized feature vector;

performing threshold probability judgment on the test probabilities of all feature vectors in the vector probability set, and, when the threshold probability condition is not met, removing the corresponding feature vector from the standard feature vector set, thereby constructing the optimized sample feature vector T'' = [α_1'', …, α_n2''].
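The test-probability step can be sketched as below. The formula is not reproduced in this text, so the sketch assumes p_i is the fraction of the t simplification rounds in which a feature survives, and that the threshold probability condition is p_i ≥ threshold. Both are assumptions, as are the names used.

```python
from collections import Counter

def stable_features(subsets, prob_threshold):
    """Keep features that survive often enough across t simplification rounds.

    subsets: the t standard feature vector subsets, one per round.
    Assumed test probability: p_i = (rounds containing feature i) / t.
    """
    t = len(subsets)
    # set() guards against a feature being counted twice within one round.
    counts = Counter(f for subset in subsets for f in set(subset))
    probs = {f: c / t for f, c in counts.items()}
    kept = sorted(f for f, p in probs.items() if p >= prob_threshold)
    return probs, kept
```

Because each round starts from a randomly chosen vector, a feature that appears in only a few of the t subsets is likely an artifact of that randomness, which is why repeated rounds and a probability cutoff are used.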

Preferably, in the feature selection-based spam filtering method, performing t rounds of simplification processing on the initial optimized feature vector to obtain t corresponding standard feature vector subsets comprises:

constructing a vector space from the initial optimized feature vector T' by means of a granular-ball neighborhood rough set algorithm;

randomly selecting any characteristic vector in the vector space as an initial particle;

obtaining a certain characteristic vector meeting a similarity condition with the initial particles in the vector space, constructing a growth vector subspace, and calculating a vector average value of the growth vector subspace;

according to the vector average value of the growth vector subspace, other eigenvectors meeting the similarity condition with the growth vector subspace in the vector space are obtained, the eigenvectors are added into the growth vector subspace, and the vector average value of the growth vector subspace is recalculated;

until all feature vectors in the vector space are calculated in a traversing way;

removing the feature vectors which are not added into the growth vector subspace in the vector space, and constructing an optimized vector space according to the feature vectors in the growth vector subspace;

analyzing the attribute of the optimized vector space, and filtering out the feature vector when the attribute of the optimized vector space is unchanged after a certain feature vector in the optimized vector space is moved out; when the attribute of the optimized vector space changes after a certain feature vector in the optimized vector space is moved out, the feature vector is reserved in a standard feature vector subset;

and traversing all feature vectors in the optimized vector space to obtain a standard feature vector subset.
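The growth of the vector subspace described above can be sketched as follows. This covers only the similarity-driven growth step, not the later attribute analysis, and it is not the full granular-ball neighborhood rough set algorithm. The similarity condition is assumed to be a Euclidean distance of at most `radius` from the current subspace mean; `radius`, `start`, and the function name are hypothetical.

```python
import math
import random

def grow_subspace(vectors, radius, start=None, seed=0):
    """Grow one subspace from an initial particle and return member indices.

    A vector joins when its Euclidean distance to the running mean of the
    subspace is at most `radius` (assumed similarity condition).
    """
    if start is None:
        # Random initial particle, as in the text; seeded for repeatability.
        start = random.Random(seed).randrange(len(vectors))
    members = [start]
    mean = list(vectors[start])
    changed = True
    while changed:  # repeat passes until no further vector can join
        changed = False
        for idx, vec in enumerate(vectors):
            if idx in members:
                continue
            if math.dist(vec, mean) <= radius:
                members.append(idx)
                # Recompute the vector average over all current members.
                mean = [sum(vectors[m][d] for m in members) / len(members)
                        for d in range(len(mean))]
                changed = True
    return sorted(members)
```

Vectors whose indices are not returned correspond to the feature vectors that were never added to the growth vector subspace and are therefore removed.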

Preferably, in the feature selection-based spam filtering method, constructing a mail classification model according to the optimized sample feature vector comprises:

acquiring the optimized sample feature vector and a classification result set;

calculating, from the sample mails, the learning parameters of each corresponding classification result, and constructing the mail classification model;

letting the number of sample mails be K and the classification result be C_S, and calculating the average degree of attachment, the average degree of non-attachment, and the average boundary degree of the sample mails;

[formula not reproduced in the source]

wherein the computed quantity is the average degree of attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S, and A_ik is the degree of attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;

[formula not reproduced in the source]

wherein l_ik is the number of occurrences of the feature vector α_i'' of the optimized sample feature vector in the k-th sample mail mail_k, and the other two quantities are the mean and the standard deviation of the feature vector α_i'';

[formula not reproduced in the source]

wherein D_S(mail_k) is the judgment value of whether the k-th sample mail mail_k falls into the classification result C_S;

[formula not reproduced in the source]

wherein the computed quantity is the average degree of non-attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S, and E_ik is the degree of non-attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;

[formula not reproduced in the source]

wherein the computed quantity is the average boundary degree with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;

the learning parameters SP include the average degree of attachment, the average degree of non-attachment, and the average boundary degree of the sample mails.

Preferably, in the feature selection-based spam filtering method, inputting the mail to be processed into the mail classification model to obtain a corresponding classification result comprises:

calculating the average degree of attachment, the average degree of non-attachment, and the average boundary degree of the mail to be processed newmail;

analyzing the average degree of attachment, the average degree of non-attachment, and the average boundary degree of the mail to be processed newmail against the learning parameters SP to obtain the classification result C_newmail;

[formula not reproduced in the source]

wherein sim[] is the homogeneity-degree function and argmax[] is the set classification function.
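The classification rule can be sketched as an argmax over a similarity score. The form of sim[] is not reproduced in this text, so cosine similarity is swapped in here purely as a stand-in. The sketch also simplifies the learning parameters SP to a single (attachment, non-attachment, boundary) triple per classification result, and the class names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as an assumed stand-in for sim[]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(mail_triple, learned):
    """Return the class whose learned triple is most similar to the mail's.

    mail_triple: (attachment, non_attachment, boundary) of the new mail.
    learned: {class_name: (attachment, non_attachment, boundary)}.
    """
    return max(learned, key=lambda name: cosine(mail_triple, learned[name]))
```

For example, with learned triples for "spam", "normal", and "other", a new mail whose triple points in nearly the same direction as the "spam" triple is assigned to "spam".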

Compared with the prior art, the invention has the following beneficial effects: the text information in the sample mail is preprocessed and converted into sample feature vectors. In actual processing, the converted sample feature vectors are high-dimensional and highly redundant; feature selection filters out the redundant vectors, which improves the classification performance of the constructed mail classification model.

The method is applied to the user's email processing. It identifies the mail to be processed through the mail classification model, overcomes the defect of the traditional technology that junk mail must be identified manually, saves time and energy, improves email processing efficiency, filters the identified junk mail, avoids the adverse effect of fraud and false information in junk mail on the user, and thereby improves the safety of email use.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

Fig. 1 is a schematic flow chart of a feature selection-based spam filtering method according to the present invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

Example 1:

the embodiment of the invention provides a feature selection-based spam filtering method, which comprises the following steps of:

acquiring a sample mail;

preprocessing text information in a sample mail to obtain preprocessed text information;

acquiring mail to be processed;

inputting the mail to be processed into a mail classification model to obtain a corresponding classification result;

In the above embodiment, the text information in the sample mail is preprocessed, and the preprocessed text information is converted into the corresponding sample feature vector by a bag-of-words model; feature selection is performed on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector, from which the mail classification model is constructed; after the mail classification model receives the mail to be processed, it classifies the mail, obtains a classification result, and junk mail is filtered out according to that result.

The beneficial effects of the above technology are: the text information in the sample mail is preprocessed and converted into sample feature vectors. In actual processing, the converted sample feature vectors are high-dimensional and highly redundant; feature selection filters out the redundant vectors, which improves the classification performance of the constructed mail classification model.

Example 2:

The embodiment of the invention provides a feature selection-based spam filtering method, in which preprocessing the text information in the sample mail to obtain preprocessed text information comprises:

acquiring text information in a sample mail;

identifying the language type in the text information, carrying out corresponding word segmentation processing according to the language type, and constructing a characteristic phrase library; comprising the following steps:

when the language type is Chinese, word segmentation processing is carried out on the text information through a JIEBA word segmentation library, so that Chinese characteristic phrases are obtained and stored in the characteristic phrase library;

when the language type is English, word segmentation is performed on the text information at the spaces in the text information, and the resulting English feature phrases are stored in the feature phrase library;

and filtering nonsensical information in the feature phrase library to obtain the preprocessed text information.

In the above embodiment, the corresponding word segmentation is performed according to the language type of the text information in the sample mail, the feature phrase library is constructed, and the meaningless information in the feature phrase library is filtered out to obtain the preprocessed text information.

In the above embodiment, when the language type of the text information in the sample mail is Chinese, the text information is segmented by the JIEBA word segmentation library to obtain Chinese feature phrases.

In the above embodiment, when the language type of the text information in the sample mail is English, the text information is segmented at its spaces to obtain English feature phrases.

The beneficial effects of the above technology are: word segmentation processing and meaningless information filtering are carried out on text information in the sample mail, so that preprocessing of the text information is realized, and preprocessed text information is obtained.

Example 3:

The embodiment of the invention provides a feature selection-based spam filtering method, in which the meaningless information filtered out of the feature phrase library includes function words, punctuation marks, and low-frequency phrases.

In the above embodiment, the text information in the sample mail is preprocessed by filtering out the function words, punctuation marks, and low-frequency phrases that make up the meaningless information.

Example 4:

The embodiment of the invention provides a feature selection-based spam filtering method, in which converting the preprocessed text information into a corresponding sample feature vector through a bag-of-words model comprises:

calculating the weight of each phrase in the preprocessed text information, where the weight of the i-th phrase is α_i and the preprocessed text information contains n phrases in total;

[formula not reproduced in the source]

wherein f_i is the frequency of occurrence of the i-th phrase in the preprocessed text information, M is the number of all phrases in the text library, and m_i is the number of times the i-th phrase occurs in the text library;

In the above embodiment, the sample feature vector is obtained by calculating the weight of each phrase in the preprocessed text information, so that the vectorized expression of the preprocessed text information is realized.

Example 5:

The embodiment of the invention provides a feature selection-based spam filtering method, in which performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector comprises:

In the above embodiment, gain analysis is performed according to the preprocessed text information, and feature vectors with smaller information content in the sample feature vectors are filtered out to obtain initial optimized feature vectors; and performing simplification processing on the initial optimized feature vector, filtering feature vectors with low similarity and consistent attributes, constructing a standard feature vector set, performing threshold probability judgment processing on test probabilities of all feature vectors in the standard feature vector set, and filtering feature vectors with non-advantageous classification performance to obtain optimized sample feature vectors.

The beneficial effects of the above technology are: performing feature selection on feature vectors in sample feature vectors through gain analysis on the preprocessed text information, and enhancing information expression of the feature vectors in the initial optimized feature vectors; and simplifying the initial optimized feature vector, judging the threshold probability of the acquired standard feature vector set calculation test probability, improving the classification performance of the feature vector in the optimized sample feature vector, and providing a reliable sample for training a mail classification model.

Example 6:

The embodiment of the invention provides a feature selection-based spam filtering method, in which performing gain analysis according to the preprocessed text information and performing feature selection on the feature vectors in the sample feature vector to obtain an initial optimized feature vector comprises:

[formula not reproduced in the source]

setting a gain threshold, acquiring the phrases in the preprocessed text information that do not meet the gain threshold, and filtering the feature vectors corresponding to those phrases out of the sample feature vector to obtain the initial optimized feature vector, comprising:

In the above embodiment, the gain value of each phrase in the preprocessed text information is calculated, and when the gain value of the phrase does not meet the gain threshold, the feature vector corresponding to the phrase is filtered out in the sample feature vector, so as to obtain the initial optimized feature vector.

The beneficial effects of the above technology are: the measurement of phrase-containing information is realized by calculating the gain value of each phrase in the preprocessed text information, and the higher the gain value is, the more important the feature is indicated; and filtering the feature vector with less information content by setting a gain threshold value to obtain the initial optimized feature vector after feature selection.

Example 7:

The embodiment of the invention provides a feature selection-based spam filtering method, in which performing simplification processing on the initial optimized feature vector several times, constructing a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain an optimized sample feature vector comprises:

performing t rounds of simplification processing on the initial optimized feature vector to obtain t corresponding standard feature vector subsets;

obtaining the test probabilities of all feature vectors in the standard feature vector set and constructing a vector probability set, where the test probability of the i-th feature vector in the standard feature vector set is p_i;

[formula not reproduced in the source]

performing threshold probability judgment on the test probabilities of all feature vectors in the vector probability set, and, when the threshold probability condition is not met, removing the corresponding feature vector from the standard feature vector set, thereby constructing the optimized sample feature vector T'' = [α_1'', …, α_n2''].

In the above embodiment, t rounds of simplification processing are performed on the initial optimized feature vector to obtain t corresponding standard feature vector subsets, and the t standard feature vector subsets are combined into a standard feature vector set; the test probabilities of all feature vectors in the standard feature vector set are then calculated and judged against the threshold probability, and when the threshold probability condition is not met, the corresponding feature vector is removed, thereby constructing the optimized sample feature vector.

The beneficial effects of the above technology are: because the simplification processing involves a degree of randomness, the initial optimized feature vector is simplified multiple times, relying on the law of large numbers, to obtain the standard feature vector subsets and construct the standard feature vector set. A threshold probability is then set for judgment: when the test probability of a feature vector does not meet the threshold probability condition, the feature vector is judged to offer no advantage in classification performance and is filtered out, so that an optimized sample feature vector with good classification performance is constructed.

Example 8:

The embodiment of the invention provides a feature selection-based spam filtering method, in which performing t rounds of simplification processing on the initial optimized feature vector to obtain t corresponding standard feature vector subsets comprises:

obtaining a certain characteristic vector meeting a similarity condition with the initial particles in a vector space, constructing a growth vector subspace, and calculating a vector average value of the growth vector subspace;

according to the vector average value of the growth vector subspace, other feature vectors meeting the similarity condition with the growth vector subspace in the vector space are obtained, the feature vectors are added into the growth vector subspace, and the vector average value of the growth vector subspace is recalculated;

until all feature vectors in the vector space are traversed;

analyzing the attribute of the optimized vector space, and filtering the feature vector when the attribute of the optimized vector space is unchanged after a certain feature vector in the optimized vector space is moved out; when the attribute of the optimized vector space changes after a certain feature vector in the optimized vector space is moved out, the feature vector is reserved in the standard feature vector subset;

In the above embodiment, the initial optimized feature vector is simplified as follows. Based on the granular-ball neighborhood rough set algorithm, a vector space is constructed from the initial optimized feature vector, a growth vector subspace is built from the feature vectors in the vector space that meet the similarity condition, and the feature vectors that were not added to the growth vector subspace are removed; this is the first simplification of the initial optimized feature vector and yields the optimized vector space. The overall attribute of the optimized vector space and the attributes of all its feature vectors are then analyzed: each feature vector is moved out of the optimized vector space in turn, and it is filtered out when the attribute of the optimized vector space is unchanged; this is the second simplification of the initial optimized feature vector and yields a standard feature vector subset.

The beneficial effects of the above technology are: through similarity detection and attribute analysis, twice simplification processing of feature vectors in the initial optimized feature vectors is realized, feature vectors with lower similarity and consistent attribute in the initial optimized feature vectors are filtered, feature selection of the feature vectors is realized, and repeated calculation of redundant vectors in the subsequent mail classification model processing process is avoided.

Example 9:

The embodiment of the invention provides a feature selection-based spam filtering method, in which constructing a mail classification model according to the optimized sample feature vector comprises:

acquiring the optimized sample feature vector and a classification result set;

[formula not reproduced in the source]

the learning parameters SP include the average degree of attachment, the average degree of non-attachment, and the average boundary degree of the sample mails.

In the above embodiment, the learning parameters of the mail classification model are calculated from the sample mails, the optimized sample feature vector, and the classification result set, where the learning parameters include the average degree of attachment, the average degree of non-attachment, and the average boundary degree of the sample mails.

In the above embodiments, the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mail are specifically a set of the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mail to all feature vectors in the optimized sample feature vector.

The beneficial effects of the above technology are: based on the three-way decision idea, the mail classification model divides mail into normal mail, junk mail, and other mail. Other mail lies at the decision boundary, where misclassification and mistaken deletion are most likely, so the model makes no judgment on it and leaves the decision to the user. This prevents the model from mistakenly identifying normal mail as junk mail and improves the robustness of the model.
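The three-way handling described above can be sketched as a small routing step. The label strings "spam", "normal", and "other", and the `classify` callable, are hypothetical placeholders for whatever the mail classification model returns.

```python
def route(mails, classify):
    """Split mails into an inbox list and a user-review list.

    classify: callable returning "spam", "normal", or "other" (assumed labels).
    Spam is filtered out, normal mail is retained, and boundary ("other")
    mail is neither kept nor deleted automatically but left to the user.
    """
    inbox, review = [], []
    for mail in mails:
        label = classify(mail)
        if label == "spam":
            continue  # filtered out
        if label == "normal":
            inbox.append(mail)
        else:
            review.append(mail)  # decision deferred to the user
    return inbox, review
```

Keeping a separate review list means a borderline message is never deleted on the model's judgment alone, which is the robustness benefit the paragraph above describes.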

Example 10:

The embodiment of the invention provides a feature selection-based spam filtering method, wherein inputting the mail to be processed into the mail classification model to obtain a corresponding classification result comprises the following steps:

analyzing the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the mail to be processed newmail against the learning parameters SP to obtain the classification result C_newmail;

；

In the above embodiment, the obtained mail to be processed is input into the mail classification model. The model calculates the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the mail to be processed, calculates the degree of homogeneity between the mail to be processed and the learning parameters obtained through training, and obtains the final classification result of the mail to be processed through the set classification function.

In one embodiment, the homogeneity-level function may be implemented as:

；

The beneficial effects of the above technology are: after the mail classification model receives the mail to be processed, the corresponding classification result is obtained through the processing of the above technical scheme, which specifically realizes the identification of junk mail and normal mail.
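The classification step of this embodiment, together with the three-way decision described for the model, can be sketched as follows. The homogeneity-level function and the classification function are not reproduced in this text, so this sketch substitutes a simple stand-in and should not be read as the patented formulas: the degrees of the mail to be processed are taken as the mean of the learned per-feature triples, and a class is assigned only when its degree leads the other two by a margin. The function name `classify`, the `margin` parameter, and the layout of `sp` are hypothetical.

```python
def classify(mail_features, sp, margin=0.2):
    """Three-way decision for one mail: "spam", "normal" or "other".

    mail_features: set of features present in the mail to be processed.
    sp: learned parameters, {feature: (attachment, non_attachment, boundary)}.
    margin: ASSUMED lead required before the model commits to a class.

    ASSUMPTION: this margin rule stands in for the homogeneity-level and
    classification functions, which are not available in this text.
    """
    triples = [sp[f] for f in mail_features if f in sp]
    if not triples:
        # No known feature: stay on the boundary and defer to the user.
        return "other"

    k = len(triples)
    attachment = sum(t[0] for t in triples) / k
    non_attachment = sum(t[1] for t in triples) / k
    boundary = sum(t[2] for t in triples) / k

    if attachment - max(non_attachment, boundary) >= margin:
        return "spam"
    if non_attachment - max(attachment, boundary) >= margin:
        return "normal"
    # Neither class dominates: the model does not judge this mail.
    return "other"
```

Returning "other" instead of forcing a binary choice reflects the design point made for the model: mail near the decision boundary is left to the user, so a normal mail is less likely to be filtered as junk.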

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A feature selection-based spam filtering method, comprising:

acquiring a sample mail;

acquiring mail to be processed;

2. The feature selection-based spam filtering method of claim 1, wherein preprocessing the text information in the sample mail to obtain preprocessed text information comprises:

acquiring text information in the sample mail;

3. The feature selection-based spam filtering method of claim 2, wherein the nonsensical information filtered out comprises: function words, punctuation marks, and low-frequency phrases in the characteristic phrase library.

4. The feature selection-based spam filtering method of claim 1, wherein converting the preprocessed text information into corresponding sample feature vectors through a word bag model comprises:

；

5. The feature selection-based spam filtering method of claim 1, wherein performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector comprises:

6. The feature selection-based spam filtering method of claim 5, wherein performing gain analysis according to the preprocessed text information and performing feature selection on the feature vectors in the sample feature vectors to obtain initial optimized feature vectors comprises:

；

7. The feature selection-based spam filtering method of claim 5, wherein performing a plurality of simplification processes on the initial optimized feature vector to construct a standard feature vector set, and performing a threshold probability judgment process on the test probabilities of all feature vectors in the standard feature vector set to obtain an optimized sample feature vector, comprises:

performing t rounds of simplification processing on the initial optimized feature vector to obtain t corresponding standard feature vector subsets;

；

8. The feature selection-based spam filtering method of claim 7, wherein performing t rounds of simplification processing on the initial optimized feature vector to obtain t corresponding standard feature vector subsets comprises:

9. The feature selection-based spam filtering method of claim 1, wherein constructing a mail classification model according to the optimized sample feature vector comprises:

acquiring the optimized sample feature vector and a classification result set;

；

10. The feature selection-based spam filtering method of claim 9, wherein inputting the mail to be processed into the mail classification model to obtain a corresponding classification result comprises:

；