CN117474510A - Feature selection-based spam filtering method - Google Patents

Info

Publication number
CN117474510A
Authority
CN
China
Prior art keywords
mail
feature
vector
sample
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311791685.1A
Other languages
Chinese (zh)
Inventor
杨良志
白琳
汪志新
卢业波
白小刚
瞿勇金
张蔓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Richinfo Technology Co ltd
Original Assignee
Richinfo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Richinfo Technology Co ltd
Priority to CN202311791685.1A
Publication of CN117474510A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/107 Computer-aided management of electronic mailing [e-mailing]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a feature selection-based spam filtering method in the technical field of text processing, comprising the following steps: preprocessing text information in a sample mail to obtain preprocessed text information; converting the preprocessed text information into a corresponding sample feature vector through a bag-of-words model; performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector; constructing a mail classification model according to the optimized sample feature vector; acquiring a mail to be processed; inputting the mail to be processed into the mail classification model to obtain a corresponding classification result; and filtering out spam according to the classification result while retaining normal mail. The method identifies the mail to be processed through the mail classification model, which removes the need in the traditional technology to identify spam manually and improves e-mail processing efficiency; filtering out the identified spam further improves the safety of e-mail use.

Description

Feature selection-based spam filtering method
Technical Field
The invention relates to the technical field of text processing, in particular to a feature selection-based spam filtering method.
Background
With the rapid development of network technology, e-mail is widely used in people's life and work and has become an important everyday tool for communication and information transfer. It provides functions such as sending, receiving, storing and searching mail, and brings great convenience to work and life.
However, unscrupulous merchants now use technical means to distribute advertising mail indiscriminately, and hackers maliciously spread virus files through e-mail, which causes great trouble for users at work and in daily life. Advertising mail and virus mail are both classified as spam. In the traditional technology, users have to judge and delete spam manually, one message at a time, which wastes time and energy and reduces mail processing efficiency; in addition, fraud and false information in spam may cause loss of users' property and interests.
Accordingly, a feature selection-based spam filtering method is proposed.
Disclosure of Invention
In order to solve the above technical problem, the invention provides a feature selection-based spam filtering method, which addresses the inconvenience of manually identifying spam in the conventional technology.
The embodiment of the invention provides a feature selection-based spam filtering method, which comprises the following steps:
acquiring a sample mail;
preprocessing the text information in the sample mail to obtain preprocessed text information;
converting the preprocessed text information into a corresponding sample feature vector through a bag-of-words model;
performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector;
constructing a mail classification model according to the optimized sample feature vector;
acquiring mail to be processed;
inputting the mail to be processed into the mail classification model to obtain a corresponding classification result;
and filtering out the spam according to the classification result while retaining the normal mail.
Preferably, in the feature selection-based spam filtering method, preprocessing the text information in the sample mail to obtain preprocessed text information comprises:
acquiring the text information in the sample mail;
identifying the language type of the text information, performing the corresponding word segmentation according to the language type, and constructing a feature phrase library, comprising:
when the language type is Chinese, performing word segmentation on the text information through the JIEBA word segmentation library to obtain Chinese feature phrases, and storing them in the feature phrase library;
when the language type is English, performing word segmentation on the text information at the spaces in the text information to obtain English feature phrases, and storing them in the feature phrase library;
and filtering out the meaningless information in the feature phrase library to obtain the preprocessed text information.
Preferably, in the feature selection-based spam filtering method, the meaningless information comprises the function words, punctuation marks and low-frequency phrases in the feature phrase library.
Preferably, in the feature selection-based spam filtering method, converting the preprocessed text information into the corresponding sample feature vector through a bag-of-words model comprises:
calculating the weight of each phrase in the preprocessed text information, the weight of the $i$-th phrase being denoted $\alpha_i$ and the preprocessed text information containing $n$ phrases in total;
wherein $f_i$ is the frequency of occurrence of the $i$-th phrase in the preprocessed text information, $M$ is the number of all phrases in the text library, and $m_i$ is the count of the $i$-th phrase in the text library;
constructing the sample feature vector $T = [\alpha_1, \dots, \alpha_n]$ from the weight corresponding to each phrase.
Preferably, in the feature selection-based spam filtering method, performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain the optimized sample feature vector comprises:
performing gain analysis according to the preprocessed text information, and performing feature selection on the feature vectors in the sample feature vector to obtain an initial optimized feature vector;
and performing simplification processing on the initial optimized feature vector several times to construct a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain the optimized sample feature vector.
Preferably, in the feature selection-based spam filtering method, performing gain analysis according to the preprocessed text information and performing feature selection on the feature vectors in the sample feature vector to obtain the initial optimized feature vector comprises:
calculating the gain value of each phrase in the preprocessed text information, the gain value of the $i$-th phrase being denoted $g_i$;
wherein $P(mail_j)$ is the probability that a sample mail falls in the $j$-th category, $F_i$ is the probability that the $i$-th phrase appears in the sample mail, and the remaining terms are, in order, the probability that the $i$-th phrase does not appear in the sample mail, the probability that the $i$-th phrase appears in the $j$-th category, and the probability that the $i$-th phrase does not appear in the $j$-th category;
setting a gain threshold, identifying the phrases in the preprocessed text information that do not meet the gain threshold, and filtering the feature vectors corresponding to those phrases out of the sample feature vector to obtain the initial optimized feature vector, comprising:
when the gain value $g_i$ of the $i$-th phrase is less than the gain threshold $Tr_g$, filtering $\alpha_i$ out of the sample feature vector $T$ to obtain the initial optimized feature vector $T' = [\alpha_1, \dots, \alpha_{n1}]$.
Preferably, in the feature selection-based spam filtering method, performing simplification processing on the initial optimized feature vector several times to construct a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain the optimized sample feature vector, comprises:
performing the simplification processing $t$ times on the initial optimized feature vector to obtain the corresponding $t$ standard feature vector subsets;
constructing the standard feature vector set from the $t$ standard feature vector subsets;
obtaining the test probabilities of all feature vectors in the standard feature vector set and constructing a vector probability set, the test probability of the $i$-th feature vector in the standard feature vector set being denoted $p_i$;
wherein the count term is the number of times the feature vector occurs over the $t$ simplification processes performed on the initial optimized feature vector;
performing threshold probability judgment on the test probabilities of all feature vectors in the vector probability set, removing from the standard feature vector set the feature vectors whose test probability does not meet the threshold probability condition, and thereby constructing the optimized sample feature vector $T'' = [\alpha_1'', \dots, \alpha_{n2}'']$.
Preferably, in the feature selection-based spam filtering method, performing the simplification processing $t$ times on the initial optimized feature vector to obtain the corresponding $t$ standard feature vector subsets comprises:
constructing a vector space from the initial optimized feature vector $T'$ by a granular-ball neighborhood rough set algorithm;
randomly selecting a feature vector in the vector space as the initial particle;
obtaining a feature vector in the vector space that meets the similarity condition with the initial particle, constructing a growth vector subspace, and calculating the vector mean of the growth vector subspace;
according to the vector mean of the growth vector subspace, obtaining the other feature vectors in the vector space that meet the similarity condition with the growth vector subspace, adding them to the growth vector subspace, and recalculating the vector mean of the growth vector subspace;
repeating until all feature vectors in the vector space have been traversed;
removing from the vector space the feature vectors that were not added to the growth vector subspace, and constructing an optimized vector space from the feature vectors in the growth vector subspace;
analyzing the attributes of the optimized vector space: when the attributes of the optimized vector space are unchanged after a feature vector is removed from it, filtering out that feature vector; when the attributes change after a feature vector is removed, retaining that feature vector in the standard feature vector subset;
and traversing all feature vectors in the optimized vector space to obtain the standard feature vector subset.
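The growth of the vector subspace described above can be sketched as follows. This is a minimal illustration and not the patented procedure: the similarity condition is assumed to be "Euclidean distance to the current vector mean is at most `radius`", the traversal is a single pass, and the names `grow_subspace`, `radius` and `start` are hypothetical. The attribute-dependency check that follows the growth step is not covered.

```python
import math
import random


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def grow_subspace(space, radius, start=None, seed=0):
    """Grow a subspace from one initial particle (assumed similarity rule).

    space  : list of numeric vectors (the vector space)
    radius : assumed similarity condition, distance to the mean <= radius
    start  : index of the initial particle; chosen at random when None
    Returns the sorted member indices and the final vector mean.
    """
    if start is None:
        start = random.Random(seed).randrange(len(space))
    members = [start]
    center = list(space[start])
    for index, vector in enumerate(space):  # single traversal of the space
        if index == start:
            continue
        if math.dist(vector, center) <= radius:
            members.append(index)
            # Recompute the vector mean after every addition.
            center = mean_vector([space[i] for i in members])
    return sorted(members), center


space = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
members, center = grow_subspace(space, radius=1.0, start=0)
# Vectors never added to the subspace would be removed from the vector space.
removed = [i for i in range(len(space)) if i not in members]
```

Because the initial particle is random, repeated runs can return different subspaces, which is consistent with the text repeating the simplification several times.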
Preferably, in the feature selection-based spam filtering method, constructing the mail classification model according to the optimized sample feature vector comprises:
acquiring the optimized sample feature vector and a classification result set;
calculating, from the sample mails, the learning parameters of each corresponding classification result, and constructing the mail classification model;
letting the number of sample mails be $K$ and the classification result be $C_S$, calculating the average membership degree, the average non-membership degree and the average boundary degree of the sample mails;
wherein the averaged quantity is the average membership degree with which the $i$-th feature vector $\alpha_i''$ in the $k$-th sample mail $mail_k$ belongs to classification result $C_S$, and $A_{ik}$ is the corresponding membership degree;
wherein $l_{ik}$ is the number of occurrences of feature vector $\alpha_i''$ of the optimized sample feature vector in the $k$-th sample mail $mail_k$, and the other two terms are the mean and the standard deviation of feature vector $\alpha_i''$;
wherein $D_S(mail_k)$ is the judgment value for the $k$-th sample mail $mail_k$ falling into classification result $C_S$;
wherein the averaged quantity is the average non-membership degree with which the $i$-th feature vector $\alpha_i''$ in the $k$-th sample mail $mail_k$ belongs to classification result $C_S$, and $E_{ik}$ is the corresponding non-membership degree;
wherein the last quantity is the average boundary degree with which the $i$-th feature vector $\alpha_i''$ in the $k$-th sample mail $mail_k$ belongs to classification result $C_S$;
the learning parameters $SP$ comprise the average membership degree, the average non-membership degree and the average boundary degree of the sample mails.
Preferably, in the feature selection-based spam filtering method, inputting the mail to be processed into the mail classification model to obtain the corresponding classification result comprises:
calculating the average membership degree, the average non-membership degree and the average boundary degree of the mail to be processed $newmail$;
analyzing the average membership degree, the average non-membership degree and the average boundary degree of the mail to be processed $newmail$ against the learning parameters $SP$ to obtain the classification result $C_{newmail}$;
wherein $sim[\,]$ is the similarity function and $argmax[\,]$ is the classification function over the set of classification results.
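The final decision step can be illustrated with a small sketch. The text names a similarity function and an argmax over classification results but does not reproduce their definitions here, so everything below is an assumption: each classification result is represented by a triple of three averaged degrees, and `similarity` is a hypothetical stand-in (one minus the mean absolute difference). The values in `SP` are invented for illustration only.

```python
def similarity(a, b):
    """Hypothetical similarity: 1 minus the mean absolute difference."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify(new_triple, learned):
    """Return the classification whose learned triple is most similar (argmax)."""
    return max(learned, key=lambda label: similarity(new_triple, learned[label]))


# Assumed learning parameters: one triple of averaged degrees per classification.
SP = {"spam": (0.8, 0.1, 0.1), "normal": (0.2, 0.7, 0.1)}
result = classify((0.7, 0.2, 0.1), SP)
```

Any other similarity measure could be substituted for `similarity` without changing the argmax structure of the decision.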
Compared with the prior art, the invention has the following beneficial effects: the text information in the sample mail is preprocessed and converted into a sample feature vector; in practice, the converted sample feature vector is high-dimensional and highly redundant, and feature selection filters the redundant vectors out of it, which improves the classification performance of the constructed mail classification model;
applied to a user's e-mail processing, the method identifies the mail to be processed through the mail classification model, which removes the need in the traditional technology to identify spam manually, saves time and energy, and improves e-mail processing efficiency; filtering out the identified spam protects the user from the fraud and false information it contains and further improves the safety of e-mail use.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
Fig. 1 is a schematic flow chart of a feature selection-based spam filtering method according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.
Example 1:
This embodiment of the invention provides a feature selection-based spam filtering method, comprising the following steps:
acquiring a sample mail;
preprocessing text information in the sample mail to obtain preprocessed text information;
converting the preprocessed text information into a corresponding sample feature vector through a bag-of-words model;
performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector;
constructing a mail classification model according to the optimized sample feature vector;
acquiring a mail to be processed;
inputting the mail to be processed into the mail classification model to obtain a corresponding classification result;
and filtering out the spam according to the classification result while retaining the normal mail.
In the above embodiment, the text information in the sample mail is preprocessed, and the preprocessed text information is converted into the corresponding sample feature vector by the bag-of-words model; feature selection based on gain analysis and simplification processing is performed on the sample feature vector to obtain the optimized sample feature vector, from which the mail classification model is constructed; when the mail classification model receives a mail to be processed, it classifies the mail, and spam is filtered out according to the classification result.
The beneficial effects of the above technical solution are: the text information in the sample mail is preprocessed and converted into a sample feature vector; in practice, the converted sample feature vector is high-dimensional and highly redundant, and feature selection filters the redundant vectors out of it, which improves the classification performance of the constructed mail classification model;
applied to a user's e-mail processing, the method identifies the mail to be processed through the mail classification model, which removes the need in the traditional technology to identify spam manually, saves time and energy, and improves e-mail processing efficiency; filtering out the identified spam protects the user from the fraud and false information it contains and further improves the safety of e-mail use.
Example 2:
This embodiment of the invention provides a feature selection-based spam filtering method, wherein preprocessing the text information in the sample mail to obtain preprocessed text information comprises:
acquiring the text information in the sample mail;
identifying the language type of the text information, performing the corresponding word segmentation according to the language type, and constructing a feature phrase library, comprising:
when the language type is Chinese, performing word segmentation on the text information through the JIEBA word segmentation library to obtain Chinese feature phrases, and storing them in the feature phrase library;
when the language type is English, performing word segmentation on the text information at the spaces in the text information to obtain English feature phrases, and storing them in the feature phrase library;
and filtering out the meaningless information in the feature phrase library to obtain the preprocessed text information.
In the above embodiment, the corresponding word segmentation is performed according to the language type of the text information in the sample mail, the feature phrase library is constructed, and the meaningless information in the feature phrase library is filtered out to obtain the preprocessed text information.
In the above embodiment, when the language type of the text information in the sample mail is Chinese, the text information is segmented by the JIEBA word segmentation library to obtain Chinese feature phrases.
In the above embodiment, when the language type of the text information in the sample mail is English, the text information is segmented at its spaces to obtain English feature phrases.
The beneficial effects of the above technical solution are: word segmentation and meaningless-information filtering are applied to the text information in the sample mail, which accomplishes the preprocessing and yields the preprocessed text information.
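The language-dependent word segmentation of this embodiment can be sketched as below. This is an illustrative sketch under stated assumptions: the language test (more CJK ideographs than Latin letters means Chinese) and the function names `detect_language` and `segment` are hypothetical, and `jieba` is treated as an optional third-party dependency with a per-character fallback so that the example runs without it.

```python
import re

try:
    import jieba  # third-party segmenter named in the text; optional here
except ImportError:
    jieba = None

CJK = re.compile(r"[\u4e00-\u9fff]")   # common CJK ideographs
LATIN = re.compile(r"[A-Za-z]")


def detect_language(text):
    """Assumed heuristic: 'zh' when ideographs outnumber Latin letters."""
    return "zh" if len(CJK.findall(text)) > len(LATIN.findall(text)) else "en"


def segment(text):
    """Split text into feature phrases according to its language type."""
    if detect_language(text) == "zh":
        if jieba is not None:
            return [word for word in jieba.lcut(text) if word.strip()]
        # Fallback when jieba is not installed: one token per ideograph.
        return CJK.findall(text)
    # English: split at whitespace, as described in the embodiment.
    return text.split()


phrase_library = segment("Win a free prize now")
```

The resulting `phrase_library` still contains function words and punctuation; filtering them out is a separate step.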
Example 3:
This embodiment of the invention provides a feature selection-based spam filtering method, wherein the meaningless information comprises the function words, punctuation marks and low-frequency phrases in the feature phrase library.
In the above embodiment, preprocessing of the text information in the sample mail is achieved by filtering out the function words, punctuation marks and low-frequency phrases that make up the meaningless information.
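A minimal sketch of this filtering step follows. The function-word list, the `min_count` cutoff for low-frequency phrases and the name `filter_meaningless` are assumptions made for illustration; the text does not specify them. Only ASCII punctuation is stripped here.

```python
import string
from collections import Counter

# Hypothetical function-word list; a real system would use a fuller one.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is"}


def filter_meaningless(tokens, min_count=2):
    """Drop punctuation, function words and phrases seen fewer than min_count times."""
    cleaned = [token.strip(string.punctuation).lower() for token in tokens]
    cleaned = [token for token in cleaned if token and token not in FUNCTION_WORDS]
    counts = Counter(cleaned)
    return [token for token in cleaned if counts[token] >= min_count]


tokens = "Free prize ! Claim the free prize now .".split()
kept = filter_meaningless(tokens)
```

In this toy input, `claim` and `now` each occur once and are dropped as low-frequency phrases, while `free` and `prize` survive.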
Example 4:
This embodiment of the invention provides a feature selection-based spam filtering method, wherein converting the preprocessed text information into the corresponding sample feature vector through a bag-of-words model comprises:
calculating the weight of each phrase in the preprocessed text information, the weight of the $i$-th phrase being denoted $\alpha_i$ and the preprocessed text information containing $n$ phrases in total;
wherein $f_i$ is the frequency of occurrence of the $i$-th phrase in the preprocessed text information, $M$ is the number of all phrases in the text library, and $m_i$ is the count of the $i$-th phrase in the text library;
constructing the sample feature vector $T = [\alpha_1, \dots, \alpha_n]$ from the weight corresponding to each phrase.
In the above embodiment, the sample feature vector is obtained by calculating the weight of each phrase in the preprocessed text information, which gives a vectorized representation of the preprocessed text information.
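The weight formula is not reproduced in this text, so the sketch below assumes a TF-IDF-style weight that is consistent with the symbol definitions, namely alpha_i = f_i * log(M / m_i). That formula, the function name `phrase_weights` and the sample data are assumptions for illustration, not the patent's stated computation.

```python
import math
from collections import Counter


def phrase_weights(doc_tokens, library_tokens):
    """Return (vocabulary, T) with T[i] = f_i * log(M / m_i)  (assumed formula).

    f_i : relative frequency of phrase i in the preprocessed text
    M   : number of all phrases in the text library
    m_i : count of phrase i in the text library (treated as at least 1)
    """
    doc_counts = Counter(doc_tokens)
    library_counts = Counter(library_tokens)
    M = len(library_tokens)
    vocabulary = sorted(doc_counts)
    T = []
    for phrase in vocabulary:
        f_i = doc_counts[phrase] / len(doc_tokens)
        m_i = library_counts[phrase] or 1  # guard against a phrase missing from the library
        T.append(f_i * math.log(M / m_i))
    return vocabulary, T


doc = ["free", "prize", "free"]
library = doc + ["meeting", "report", "free"]
vocabulary, T = phrase_weights(doc, library)
```

Under this assumed weighting, a phrase that is frequent in the mail but rare in the library receives a larger weight than one that is common everywhere.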
Example 5:
This embodiment of the invention provides a feature selection-based spam filtering method, wherein performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain the optimized sample feature vector comprises:
performing gain analysis according to the preprocessed text information, and performing feature selection on the feature vectors in the sample feature vector to obtain an initial optimized feature vector;
and performing simplification processing on the initial optimized feature vector several times to construct a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain the optimized sample feature vector.
In the above embodiment, gain analysis is performed according to the preprocessed text information, and the feature vectors that carry little information are filtered out of the sample feature vector to obtain the initial optimized feature vector; the initial optimized feature vector is then simplified, feature vectors with low similarity and consistent attributes are filtered out, and the standard feature vector set is constructed; finally, threshold probability judgment is applied to the test probabilities of all feature vectors in the standard feature vector set, and feature vectors with no advantage in classification performance are filtered out to obtain the optimized sample feature vector.
The beneficial effects of the above technical solution are: feature selection through gain analysis of the preprocessed text information strengthens the information carried by the feature vectors in the initial optimized feature vector; simplifying the initial optimized feature vector, calculating test probabilities over the resulting standard feature vector set and applying the threshold probability judgment improves the classification performance of the feature vectors in the optimized sample feature vector and provides reliable samples for training the mail classification model.
Example 6:
This embodiment of the invention provides a feature selection-based spam filtering method, wherein performing gain analysis according to the preprocessed text information and performing feature selection on the feature vectors in the sample feature vector to obtain the initial optimized feature vector comprises:
calculating the gain value of each phrase in the preprocessed text information, the gain value of the $i$-th phrase being denoted $g_i$;
wherein $P(mail_j)$ is the probability that a sample mail falls in the $j$-th category, $F_i$ is the probability that the $i$-th phrase appears in the sample mail, and the remaining terms are, in order, the probability that the $i$-th phrase does not appear in the sample mail, the probability that the $i$-th phrase appears in the $j$-th category, and the probability that the $i$-th phrase does not appear in the $j$-th category;
setting a gain threshold, identifying the phrases in the preprocessed text information that do not meet the gain threshold, and filtering the feature vectors corresponding to those phrases out of the sample feature vector to obtain the initial optimized feature vector, comprising:
when the gain value $g_i$ of the $i$-th phrase is less than the gain threshold $Tr_g$, filtering $\alpha_i$ out of the sample feature vector $T$ to obtain the initial optimized feature vector $T' = [\alpha_1, \dots, \alpha_{n1}]$.
In the above embodiment, the gain value of each phrase in the preprocessed text information is calculated, and when the gain value of a phrase does not meet the gain threshold, the feature vector corresponding to that phrase is filtered out of the sample feature vector to obtain the initial optimized feature vector.
The beneficial effects of the above technical solution are: calculating the gain value of each phrase in the preprocessed text information measures how much information the phrase carries, a higher gain value indicating a more important feature; setting a gain threshold filters out the feature vectors that carry little information and yields the initial optimized feature vector after feature selection.
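The gain formula is not reproduced in this text. The sketch below assumes the standard information gain used in text classification, which matches the listed probabilities: the class entropy minus the entropy conditioned on the phrase being present or absent. The function names, the representation of each mail as a set of phrases, and the sample data are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits, ignoring zero-probability terms."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(phrase, docs, labels):
    """Assumed gain g_i = H(C) - P(F_i) H(C | F_i) - P(not F_i) H(C | not F_i)."""
    classes = sorted(set(labels))
    present = [label for doc, label in zip(docs, labels) if phrase in doc]
    absent = [label for doc, label in zip(docs, labels) if phrase not in doc]

    def class_probabilities(subset):
        return [subset.count(c) / len(subset) for c in classes] if subset else []

    gain = entropy(class_probabilities(labels))
    gain -= len(present) / len(docs) * entropy(class_probabilities(present))
    gain -= len(absent) / len(docs) * entropy(class_probabilities(absent))
    return gain


def select_by_gain(vocabulary, docs, labels, threshold):
    """Keep phrases whose gain is not below the gain threshold Tr_g."""
    return [p for p in vocabulary if information_gain(p, docs, labels) >= threshold]


docs = [{"free", "prize"}, {"free", "offer"}, {"meeting", "report"}, {"report", "offer"}]
labels = ["spam", "spam", "normal", "normal"]
selected = select_by_gain(["free", "offer", "report"], docs, labels, threshold=0.5)
```

Here `offer` appears equally often in both classes, so its gain is zero and it is filtered out, while `free` and `report` separate the classes perfectly.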
Example 7:
The embodiment of the invention provides a feature selection-based spam filtering method, wherein performing multiple simplification passes on the initial optimized feature vector, constructing a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain an optimized sample feature vector comprises:
performing t simplification passes on the initial optimized feature vector to obtain t corresponding standard feature vector subsets;
combining the t standard feature vector subsets into a standard feature vector set;
obtaining the test probabilities of all feature vectors in the standard feature vector set and constructing a vector probability set; let the test probability of the i-th feature vector in the standard feature vector set be p_i;
where the count in the formula is the number of times the feature vector of the initial optimized feature vector appears across the t simplification passes;
performing threshold probability judgment on the test probabilities of all feature vectors in the vector probability set, and, when a test probability does not meet the threshold probability condition, removing the corresponding feature vector from the standard feature vector set, thereby constructing the optimized sample feature vector T'' = [α_1'', …, α_n2''].
In the above embodiment, t simplification passes are performed on the initial optimized feature vector to obtain t corresponding standard feature vector subsets, which are combined into a standard feature vector set; the test probabilities of all feature vectors in the standard feature vector set are then calculated and judged against the threshold probability, and any feature vector that does not meet the threshold probability condition is removed, thereby constructing the optimized sample feature vector.
The beneficial effects of the above technology are: because the simplification process involves a degree of randomness, the initial optimized feature vector is simplified multiple times, relying on the law of large numbers, to obtain the standard feature vector subsets and construct the standard feature vector set; a threshold probability is then applied, and when the test probability of a feature vector does not meet the threshold probability condition, that feature vector is judged to offer no advantage in classification performance and is filtered out, so that the resulting optimized sample feature vector retains the features with classification performance.
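The repeated simplification and threshold judgment can be sketched in Python as follows. The sketch assumes that the test probability p_i is the fraction of the t passes in which a feature survives, and that the threshold condition is p_i >= threshold; neither formula is reproduced in this text. The function `toy_reduce` is a hypothetical stand-in for one randomized simplification pass, not the method of Example 8.

```python
import random
from collections import Counter

def select_stable_features(features, reduce_once, t, prob_threshold, seed=0):
    """Run t randomized simplification passes and keep features whose assumed
    test probability (survival count / t) reaches prob_threshold."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(t):
        subset = reduce_once(features, rng)  # one standard feature vector subset
        counts.update(set(subset))
    probs = {f: counts[f] / t for f in features}  # assumed p_i
    kept = [f for f in features if probs[f] >= prob_threshold]
    return kept, probs

def toy_reduce(features, rng):
    """Hypothetical pass: always keeps 'a', keeps every other feature with probability 0.3."""
    return [f for f in features if f == "a" or rng.random() < 0.3]

kept, probs = select_stable_features(["a", "b", "c"], toy_reduce, t=200, prob_threshold=0.8)
```

Averaging over many passes is what makes the selection insensitive to the randomness of any single pass: a feature that survives only occasionally ends up with a low estimated probability and is dropped.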
Example 8:
The embodiment of the invention provides a feature selection-based spam filtering method, wherein performing t simplification passes on the initial optimized feature vector to obtain t corresponding standard feature vector subsets comprises:
constructing a vector space from the initial optimized feature vector T' by means of a granular-ball neighborhood rough set algorithm;
randomly selecting a feature vector in the vector space as the initial particle;
obtaining a feature vector in the vector space that meets a similarity condition with the initial particle, constructing a growth vector subspace, and calculating the vector average value of the growth vector subspace;
according to the vector average value of the growth vector subspace, obtaining the other feature vectors in the vector space that meet the similarity condition with the growth vector subspace, adding them to the growth vector subspace, and recalculating the vector average value of the growth vector subspace;
repeating until all feature vectors in the vector space have been traversed;
removing the feature vectors in the vector space that were not added to the growth vector subspace, and constructing an optimized vector space from the feature vectors in the growth vector subspace;
analyzing the attribute of the optimized vector space: when the attribute of the optimized vector space is unchanged after a feature vector is moved out of it, filtering out that feature vector; when the attribute of the optimized vector space changes after a feature vector is moved out of it, retaining that feature vector in the standard feature vector subset;
and traversing all feature vectors in the optimized vector space to obtain the standard feature vector subset.
In the above embodiment, the initial optimized feature vector is simplified in two stages. First, based on the granular-ball neighborhood rough set algorithm, a vector space is constructed from the initial optimized feature vector, a growth vector subspace is built from the feature vectors in the vector space that meet the similarity condition, and the feature vectors that were not added to the growth vector subspace are removed, yielding the optimized vector space. Second, the overall attribute of the optimized vector space and the attributes of all feature vectors in it are analyzed: each feature vector is moved out of the optimized vector space in turn, and it is filtered out if the attribute of the optimized vector space is unchanged, yielding the standard feature vector subset.
The beneficial effects of the above technology are: through similarity detection and attribute analysis, the feature vectors in the initial optimized feature vector are simplified twice; feature vectors with low similarity and feature vectors whose removal leaves the attribute unchanged are filtered out, which realizes feature selection and avoids repeated computation on redundant vectors during subsequent processing by the mail classification model.
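The first simplification stage (growing a subspace from a random initial particle) can be sketched as follows. This is only an illustration under stated assumptions: the similarity condition is taken to be "Euclidean distance to the current vector average is at most `radius`", which the text does not specify, and the second stage (attribute analysis) is not shown. The names `mean_vector`, `grow_subspace`, and `radius` are hypothetical.

```python
import math
import random

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def grow_subspace(space, radius, seed=0):
    """Pick a random initial particle, then traverse the vector space once,
    adding every vector close enough to the current average and updating
    the average after each addition."""
    rng = random.Random(seed)
    start = rng.randrange(len(space))
    members = [start]
    center = list(space[start])
    for idx, vec in enumerate(space):
        if idx == start:
            continue
        if math.dist(vec, center) <= radius:  # assumed similarity condition
            members.append(idx)
            center = mean_vector([space[m] for m in members])
    return sorted(members), center

# Three nearby vectors and one outlier.
space = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0)]
members, center = grow_subspace(space, radius=1.0)
```

Because the initial particle is chosen at random, a single run may start from the outlier and keep only it; this dependence on the starting point is the randomness that the repeated passes of Example 7 are meant to average out.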
Example 9:
The embodiment of the invention provides a feature selection-based spam filtering method, wherein constructing a mail classification model according to the optimized sample feature vector comprises:
obtaining the optimized sample feature vector and a classification result set;
calculating, from the sample mails, the learning parameters of each classification result, and constructing the mail classification model;
letting the number of sample mails be K and the classification result be C_S, calculating the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mails;
where the first quantity is the average degree of attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S, and A_ik is the degree of attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;
where l_ik is the number of occurrences of the feature vector α_i'' of the optimized sample feature vector in the k-th sample mail mail_k, and the other two quantities are the mean and the standard deviation of the feature vector α_i'';
where D_S(mail_k) is the judgment value for the k-th sample mail mail_k falling into the classification result C_S;
where the first quantity is the average degree of non-attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S, and E_ik is the degree of non-attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;
where the quantity is the average degree of boundary with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;
the learning parameters SP include the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mails.
In the above embodiment, the learning parameters of the mail classification model are calculated from the sample mails, the optimized sample feature vector, and the classification result set; the learning parameters include the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mails.
In the above embodiment, the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mails are, specifically, the sets of these three values taken over all feature vectors in the optimized sample feature vector.
The beneficial effects of the above technology are: following the three-way decision idea, the mail classification model divides mail into normal mail, junk mail, and other mail; other mail lies at the decision boundary, where misclassification and wrongful deletion are most likely, so the model does not judge such mail and leaves the decision to the user. This prevents the model from mistakenly identifying normal mail as junk mail and improves the robustness of the model.
Example 10:
The embodiment of the invention provides a feature selection-based spam filtering method, wherein inputting the mail to be processed into the mail classification model to obtain a corresponding classification result comprises:
calculating the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the mail to be processed, newmail;
analyzing the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the mail to be processed newmail against the learning parameters SP to obtain the classification result C_newmail;
where sim[] is the homogeneity-level function and argmax[] is the set classification function.
In the above embodiment, the mail to be processed is input into the mail classification model; the model calculates the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the mail to be processed, computes the homogeneity level between these values and the learning parameters obtained through training, and obtains the final classification result of the mail to be processed through the set classification function.
In one embodiment, the homogeneity-level function may be implemented as:
The beneficial effects of the above technology are: after the mail classification model receives the mail to be processed, the above processing yields the corresponding classification result, which specifically realizes the identification of junk mail and normal mail.
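The classification step of this example can be sketched in Python as an argmax over a similarity score. The homogeneity-level function itself is not reproduced in this text, so cosine similarity is used here purely as a stand-in; the representation of the learning parameters SP as one (attachment, non-attachment, boundary) triple per classification result is a simplifying assumption, and all numeric values are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Stand-in for sim[]; the patent's own homogeneity-level function is not shown here."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify(mail_params, learning_params, sim=cosine_similarity):
    """Return the classification result whose learned parameters are most
    similar to the parameters of the mail to be processed (argmax over classes)."""
    return max(learning_params, key=lambda label: sim(mail_params, learning_params[label]))

# Hypothetical learning parameters SP: (attachment, non-attachment, boundary) per class.
SP = {
    "normal": (0.8, 0.1, 0.1),
    "spam": (0.1, 0.8, 0.1),
    "other": (0.3, 0.3, 0.4),
}
result = classify((0.7, 0.2, 0.1), SP)
```

Passing `sim` as a parameter keeps the sketch independent of the particular homogeneity-level function, so a different function can be substituted without changing `classify`.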
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A feature selection-based spam filtering method, comprising:
acquiring a sample mail;
preprocessing the text information in the sample mail to obtain preprocessed text information;
converting the preprocessed text information into corresponding sample feature vectors through a bag-of-words model;
performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector;
constructing a mail classification model according to the optimized sample feature vector;
acquiring mail to be processed;
inputting the mail to be processed into the mail classification model to obtain a corresponding classification result;
and filtering the junk mails according to the classification result, and retaining the normal mails.
2. The feature selection-based spam filtering method of claim 1, wherein preprocessing the text information in the sample mail to obtain preprocessed text information comprises:
acquiring the text information in the sample mail;
identifying the language type of the text information, performing word segmentation according to the language type, and constructing a feature phrase library; comprising:
when the language type is Chinese, performing word segmentation on the text information with the JIEBA word segmentation library to obtain Chinese feature phrases, and storing them in the feature phrase library;
when the language type is English, performing word segmentation on the text information at the spaces in the text information to obtain English feature phrases, and storing them in the feature phrase library;
and filtering nonsensical information out of the feature phrase library to obtain the preprocessed text information.
3. The feature selection-based spam filtering method of claim 2, wherein the nonsensical information filtered out of the feature phrase library comprises: function words, punctuation marks, and low-frequency phrases.
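For illustration only, the preprocessing of claims 2 and 3 can be sketched in Python. The sketch assumes that language detection is a simple check for CJK characters, that function words are supplied as a caller-provided stop-word set, and that "low-frequency" means fewer than `min_count` occurrences in the text being processed; none of these details is stated in the claims. The Chinese tokenizer is passed in by the caller (for example `jieba.lcut` from the JIEBA library) so that the sketch itself has no third-party dependency.

```python
import re
import string
from collections import Counter

def detect_language(text):
    """Assumed rule: any CJK unified ideograph means the text is Chinese."""
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"

def segment(text, zh_tokenizer=None):
    """Segment Chinese text with the supplied tokenizer, English text at whitespace."""
    if detect_language(text) == "zh":
        if zh_tokenizer is None:
            raise ValueError("a Chinese tokenizer such as jieba.lcut must be supplied")
        return list(zh_tokenizer(text))
    return text.split()

def preprocess(text, stop_words, min_count=1, zh_tokenizer=None):
    """Strip ASCII punctuation, drop stop words, then drop low-frequency phrases."""
    tokens = [tok.strip(string.punctuation).lower() for tok in segment(text, zh_tokenizer)]
    tokens = [tok for tok in tokens if tok and tok not in stop_words]
    counts = Counter(tokens)
    return [tok for tok in tokens if counts[tok] >= min_count]

phrases = preprocess("Win a FREE prize, a free trip!", stop_words={"a", "the"}, min_count=2)
```

Note that `string.punctuation` covers ASCII punctuation only; full-width Chinese punctuation would need to be added to the stop-word set or handled separately.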
4. The feature selection-based spam filtering method of claim 1, wherein converting the preprocessed text information into corresponding sample feature vectors through a bag-of-words model comprises:
calculating the weight of each phrase in the preprocessed text information, where the weight corresponding to the i-th phrase is α_i and the preprocessed text information contains n phrases in total;
where f_i is the frequency with which the i-th phrase occurs in the preprocessed text information, M is the number of all phrases in the text library, and m_i is the number of occurrences of the i-th phrase in the text library;
constructing, from the weight corresponding to each phrase, the sample feature vector T = [α_1, …, α_n].
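For illustration only, the weighting of claim 4 can be sketched in Python. The weight formula is not reproduced in this text; the sketch assumes a TF-IDF-style form, α_i = f_i · ln(M / m_i), which uses exactly the three quantities defined in the claim but may differ from the actual formula. The function name `phrase_weights` and the numbers are hypothetical.

```python
import math

def phrase_weights(freqs, library_counts, library_size):
    """Assumed weight alpha_i = f_i * ln(M / m_i) for each phrase.

    freqs          -- f_i, frequency of each phrase in the preprocessed text
    library_counts -- m_i, count of each phrase in the text library
    library_size   -- M, number of all phrases in the text library
    """
    return [f * math.log(library_size / m) for f, m in zip(freqs, library_counts)]

# A rare phrase (10 of 1000) gets a high weight; a ubiquitous one (1000 of 1000) gets zero.
T = phrase_weights(freqs=[0.5, 0.25], library_counts=[10, 1000], library_size=1000)
```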
5. The feature selection-based spam filtering method of claim 1, wherein performing feature selection on the sample feature vector based on gain analysis and simplification processing to obtain an optimized sample feature vector comprises:
performing gain analysis according to the preprocessed text information, and performing feature selection on the feature vectors in the sample feature vector to obtain an initial optimized feature vector;
and performing multiple simplification passes on the initial optimized feature vector, constructing a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain the optimized sample feature vector.
6. The feature selection-based spam filtering method of claim 5, wherein performing gain analysis according to the preprocessed text information and performing feature selection on the feature vectors in the sample feature vector to obtain an initial optimized feature vector comprises:
calculating the gain value of each phrase in the preprocessed text information, where the gain value of the i-th phrase is g_i;
where P(mail_j) is the probability that a sample mail belongs to the j-th category, and F_i is the probability that the i-th phrase appears in the sample mail; the three remaining quantities are, in order, the probability that the i-th phrase does not appear in the sample mail, the probability that the i-th phrase appears in the j-th category, and the probability that the i-th phrase does not appear in the j-th category;
setting a gain threshold, identifying the phrases in the preprocessed text information that do not meet the gain threshold, and filtering the feature vectors corresponding to those phrases out of the sample feature vector to obtain the initial optimized feature vector; comprising:
when the gain value g_i of the i-th phrase is less than the gain threshold Tr_g, filtering α_i out of the sample feature vector T to obtain the initial optimized feature vector T' = [α_1, …, α_n1].
7. The feature selection-based spam filtering method of claim 5, wherein performing multiple simplification passes on the initial optimized feature vector, constructing a standard feature vector set, and performing threshold probability judgment on the test probabilities of all feature vectors in the standard feature vector set to obtain an optimized sample feature vector comprises:
performing t simplification passes on the initial optimized feature vector to obtain t corresponding standard feature vector subsets;
combining the t standard feature vector subsets into the standard feature vector set;
obtaining the test probabilities of all feature vectors in the standard feature vector set and constructing a vector probability set; let the test probability of the i-th feature vector in the standard feature vector set be p_i;
where the count in the formula is the number of times the feature vector of the initial optimized feature vector appears across the t simplification passes;
performing threshold probability judgment on the test probabilities of all feature vectors in the vector probability set, and, when a test probability does not meet the threshold probability condition, removing the corresponding feature vector from the standard feature vector set, thereby constructing the optimized sample feature vector T'' = [α_1'', …, α_n2''].
8. The feature selection-based spam filtering method of claim 7, wherein performing t simplification passes on the initial optimized feature vector to obtain t corresponding standard feature vector subsets comprises:
constructing a vector space from the initial optimized feature vector T' by means of a granular-ball neighborhood rough set algorithm;
randomly selecting a feature vector in the vector space as the initial particle;
obtaining a feature vector in the vector space that meets a similarity condition with the initial particle, constructing a growth vector subspace, and calculating the vector average value of the growth vector subspace;
according to the vector average value of the growth vector subspace, obtaining the other feature vectors in the vector space that meet the similarity condition with the growth vector subspace, adding them to the growth vector subspace, and recalculating the vector average value of the growth vector subspace;
repeating until all feature vectors in the vector space have been traversed;
removing the feature vectors in the vector space that were not added to the growth vector subspace, and constructing an optimized vector space from the feature vectors in the growth vector subspace;
analyzing the attribute of the optimized vector space: when the attribute of the optimized vector space is unchanged after a feature vector is moved out of it, filtering out that feature vector; when the attribute of the optimized vector space changes after a feature vector is moved out of it, retaining that feature vector in the standard feature vector subset;
and traversing all feature vectors in the optimized vector space to obtain the standard feature vector subset.
9. The feature selection-based spam filtering method of claim 1, wherein constructing a mail classification model according to the optimized sample feature vector comprises:
acquiring the optimized sample feature vector and a classification result set;
calculating, from the sample mails, the learning parameters of each classification result, and constructing the mail classification model;
letting the number of sample mails be K and the classification result be C_S, calculating the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mails;
where the first quantity is the average degree of attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S, and A_ik is the degree of attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;
where l_ik is the number of occurrences of the feature vector α_i'' of the optimized sample feature vector in the k-th sample mail mail_k, and the other two quantities are the mean and the standard deviation of the feature vector α_i'';
where D_S(mail_k) is the judgment value for the k-th sample mail mail_k falling into the classification result C_S;
where the first quantity is the average degree of non-attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S, and E_ik is the degree of non-attachment with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;
where the quantity is the average degree of boundary with which the i-th feature vector α_i'' in the k-th sample mail mail_k belongs to the classification result C_S;
the learning parameters SP include the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the sample mails.
10. The feature selection-based spam filtering method of claim 9, wherein inputting the mail to be processed into the mail classification model to obtain a corresponding classification result comprises:
calculating the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the mail to be processed, newmail;
analyzing the average degree of attachment, the average degree of non-attachment, and the average degree of boundary of the mail to be processed newmail against the learning parameters SP to obtain the classification result C_newmail;
where sim[] is the homogeneity-level function and argmax[] is the set classification function.
CN202311791685.1A 2023-12-25 2023-12-25 Feature selection-based spam filtering method Pending CN117474510A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311791685.1A CN117474510A (en) 2023-12-25 2023-12-25 Feature selection-based spam filtering method

Publications (1)

Publication Number Publication Date
CN117474510A true CN117474510A (en) 2024-01-30

Family

ID=89639889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311791685.1A Pending CN117474510A (en) 2023-12-25 2023-12-25 Feature selection-based spam filtering method

Country Status (1)

Country Link
CN (1) CN117474510A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118227797A (en) * 2024-05-24 2024-06-21 北京数字一百信息技术有限公司 Data processing method and system based on text large model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101227435A (en) * 2008-01-28 2008-07-23 浙江大学 Method for filtering Chinese junk mail based on Logistic regression
EP2028806A1 (en) * 2007-08-24 2009-02-25 Symantec Corporation Bayesian surety check to reduce false positives in filtering of content in non-trained languages
WO2013097327A1 (en) * 2011-12-29 2013-07-04 盈世信息科技(北京)有限公司 Spam filtering method
CN107193804A (en) * 2017-06-02 2017-09-22 河海大学 A kind of refuse messages text feature selection method towards word and portmanteau word
CN111079427A (en) * 2019-12-20 2020-04-28 北京金睛云华科技有限公司 Junk mail identification method and system
CN111259652A (en) * 2020-02-10 2020-06-09 腾讯科技(深圳)有限公司 Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment
CN113901797A (en) * 2021-10-18 2022-01-07 广东博智林机器人有限公司 Text error correction method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王军;史科;王辉;: "垃圾邮件过滤中特征选择方法研究", 合肥工业大学学报(自然科学版), no. 12, 28 December 2009 (2009-12-28), pages 1863 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination