CN110610213A - Mail classification method, device, equipment and computer readable storage medium - Google Patents

Mail classification method, device, equipment and computer readable storage medium

Info

Publication number
CN110610213A
Authority
CN
China
Prior art keywords
mail
data
discrimination
function
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910893789.0A
Other languages
Chinese (zh)
Inventor
张莉
郑晓晗
周伟达
王邦军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201910893789.0A
Publication of CN110610213A
Priority to PCT/CN2020/079825 (WO2021051764A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a mail classification method, comprising: receiving mail data; processing the mail data with a predetermined linear discriminant function to obtain a discriminant function value, wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm; and classifying the mail data by using a classification rule and the discriminant function value. Because the discrimination parameters are obtained with the L1-norm-based twin support vector machine, they reduce the influence of features with a small contribution on the classification result, which improves classification efficiency and generalization performance and thereby the accuracy of spam filtering. The invention also discloses a mail classification device, equipment and a computer-readable storage medium that achieve the same technical effects.

Description

Mail classification method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for classifying mails.
Background
Spam is highly harmful: it occupies network bandwidth and reduces the operating efficiency of the whole network; it is easily exploited by hackers, causing network congestion or even paralysis; and it is easily used by lawbreakers to spread harmful information. To maintain the healthy and safe development of the Internet, a safer and more effective spam filtering technology is urgently needed.
At present, spam filtering can be handled with the Twin Support Vector Machine (TSVM) proposed by Jayadeva et al. For a binary classification problem, the TSVM seeks two non-parallel planes such that each plane is as close as possible to the samples of one class and as far as possible from the samples of the other class. However, the model constructed by this algorithm is not necessarily sparse; that is, when mail is classified with the model, unimportant features of the mail samples are also taken into account, which reduces the generalization performance of the classifier and the accuracy of spam filtering. Therefore, how to improve the accuracy of spam filtering is a problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a mail classification method, a mail classification device, mail classification equipment and a computer readable storage medium, so as to realize accurate recognition of junk mails.
In order to achieve the above object, the present invention provides a mail classification method, comprising:
receiving mail data to be classified;
processing the mail data by using a predetermined linear discriminant function to obtain a discriminant function value; wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm; the training set comprises mail training data of different categories;
and classifying the mail data by using a preset classification rule and the discrimination function value.
Optionally, the method for generating the discriminant parameter in the linear discriminant function includes:
acquiring a training set; determining a discrimination parameter in the linear discrimination function by using the training set and a preset condition;
the preset conditions include:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $w_1$ is the first weight vector in the discrimination parameters, $w_2$ is the second weight vector in the discrimination parameters, $b_1$ is the first function deviation coefficient in the discrimination parameters, $b_2$ is the second function deviation coefficient in the discrimination parameters, $\xi_1$ is the first slack variable, $\xi_2$ is the second slack variable, $X_1$ is the feature matrix of the non-spam data in the training set, $X_2$ is the feature matrix of the spam data in the training set, $e_1$ is a first all-ones vector, $e_2$ is a second all-ones vector, $\|\cdot\|_1$ is the L1 norm, and $C_1$, $C_2$, $C_3$ and $C_4$ are the predetermined first, second, third and fourth auxiliary variables, respectively.
Optionally, the processing the mail data by using a predetermined linear discriminant function to obtain a discriminant function value includes:
obtaining a first discrimination function value $f_1(x)$ by using the first linear discriminant function and the mail data $x$;
obtaining a second discrimination function value $f_2(x)$ by using the second linear discriminant function and the mail data $x$;
wherein the first linear discriminant function is $f_1(x) = x^T w_1 + b_1$ and the second linear discriminant function is $f_2(x) = x^T w_2 + b_2$.
Optionally, the classifying the mail data by using a preset classification rule and the discrimination function value includes:
obtaining a classification result of the mail data by using the preset classification rule, the first discrimination function value $f_1(x)$ and the second discrimination function value $f_2(x)$;
the classification rule is as follows:
wherein, if the classification result is 1, the mail is judged to be non-spam, and if the classification result is -1, the mail is judged to be spam.
To achieve the above object, the present invention further provides a mail sorting apparatus comprising:
the data receiving module is used for receiving the mail data to be classified;
the data processing module is used for processing the mail data by using a predetermined linear discriminant function to obtain a discriminant function value; wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm; the training set comprises mail training data of different categories;
and the data classification device is used for classifying the mail data by using a preset classification rule and the discriminant function value.
Optionally, the apparatus further includes a discrimination parameter generation module; wherein the discrimination parameter generation module comprises:
a training set acquisition unit for acquiring a training set;
a discrimination parameter determining unit, configured to determine a discrimination parameter in the linear discrimination function by using the training set and a preset condition; the preset conditions include:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $w_1$ is the first weight vector in the discrimination parameters, $w_2$ is the second weight vector in the discrimination parameters, $b_1$ is the first function deviation coefficient in the discrimination parameters, $b_2$ is the second function deviation coefficient in the discrimination parameters, $\xi_1$ is the first slack variable, $\xi_2$ is the second slack variable, $X_1$ is the feature matrix of the non-spam data in the training set, $X_2$ is the feature matrix of the spam data in the training set, $e_1$ is a first all-ones vector, $e_2$ is a second all-ones vector, $\|\cdot\|_1$ is the L1 norm, and $C_1$, $C_2$, $C_3$ and $C_4$ are the predetermined first, second, third and fourth auxiliary variables, respectively.
Optionally, the data processing module includes:
a first processing unit for obtaining a first discrimination function value $f_1(x)$ by using the first linear discriminant function and the mail data $x$;
a second processing unit for obtaining a second discrimination function value $f_2(x)$ by using the second linear discriminant function and the mail data $x$; wherein the first linear discriminant function is $f_1(x) = x^T w_1 + b_1$ and the second linear discriminant function is $f_2(x) = x^T w_2 + b_2$.
Optionally, the data classification device is specifically configured to: obtain a classification result of the mail data by using the preset classification rule, the first discrimination function value $f_1(x)$ and the second discrimination function value $f_2(x)$;
the classification rule is as follows:
wherein, if the classification result is 1, the mail is judged to be non-spam, and if the classification result is -1, the mail is judged to be spam.
To achieve the above object, the present invention further provides mail classification equipment comprising:
a memory for storing a computer program;
a processor for implementing the steps of the above mail classification method when executing the computer program.
To achieve the above object, the present invention further provides a computer-readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the above mail sorting method.
According to the above scheme, the mail classification method provided by the embodiment of the present invention comprises: receiving mail data to be classified; processing the mail data by using a predetermined linear discriminant function to obtain a discriminant function value, wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm, and the training set comprises mail training data of different categories; and classifying the mail data by using a preset classification rule and the discriminant function value.
Because the discrimination parameters are obtained with the L1-norm-based twin support vector machine, they reduce the influence of features with a small contribution on the classification result, which improves classification efficiency and generalization performance and thereby the accuracy of spam filtering. The invention also discloses a mail classification device, equipment and a computer-readable storage medium that achieve the same technical effects.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a mail classification method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an apparatus for classifying mails according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of mail classification equipment according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a mail classification method, a device, equipment and a computer readable storage medium, which are used for realizing accurate identification of junk mails.
Referring to fig. 1, a mail classification method provided in an embodiment of the present invention includes:
s101, receiving mail data to be classified;
In this embodiment, the input mail data x to be classified first needs to be normalized so that each of its features lies in the interval [0,1]. Mail data is classified here into spam and non-spam, so mail classification in the present application can also be understood as spam recognition.
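The normalization step can be sketched as follows. The text only states that features are mapped into [0,1]; min-max scaling with bounds taken from the training data, and clipping of out-of-range values, are assumptions of this sketch, and the names `fit_min_max` and `normalize` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the per-feature minima and maxima observed in the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(x: Sequence[float],
              feature_min: Sequence[float],
              feature_max: Sequence[float]) -> List[float]:
    """Scale one feature vector into [0, 1] (assumed min-max scaling)."""
    scaled = []
    for value, low, high in zip(x, feature_min, feature_max):
        if high == low:
            # Constant feature: map to 0 to avoid division by zero.
            scaled.append(0.0)
        else:
            # Clip so that unseen values outside the fitted range stay in [0, 1].
            scaled.append(min(1.0, max(0.0, (value - low) / (high - low))))
    return scaled


# Toy training data with two features (illustrative values only).
train_rows = [[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
feature_min, feature_max = fit_min_max(train_rows)
print(normalize([1.0, 25.0], feature_min, feature_max))
```

The same fitted bounds would be reused for every mail to be predicted, so that training and prediction data share one scale.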
S102, processing the mail data by using a predetermined linear discriminant function to obtain a discriminant function value; wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm; the training set comprises mail training data of different categories;
the method for generating the discriminant parameters in the linear discriminant function comprises the following steps:
acquiring a training set; determining a discrimination parameter in the linear discrimination function by using the training set and a preset condition;
the preset conditions include:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $w_1$ is the first weight vector in the discrimination parameters, $w_2$ is the second weight vector in the discrimination parameters, $b_1$ is the first function deviation coefficient in the discrimination parameters, $b_2$ is the second function deviation coefficient in the discrimination parameters, $\xi_1$ is the first slack variable, $\xi_2$ is the second slack variable, $X_1$ is the feature matrix of the non-spam data in the training set, $X_2$ is the feature matrix of the spam data in the training set, $e_1$ is a first all-ones vector, $e_2$ is a second all-ones vector, $\|\cdot\|_1$ is the L1 norm, and $C_1$, $C_2$, $C_3$ and $C_4$ are the predetermined first, second, third and fourth auxiliary variables, respectively.
In this embodiment, the samples are normalized to obtain a training set, the model is trained on the training set, and the trained model is then used to predict the test set to obtain the final prediction result. Specifically, the collected spam-related data is first compiled as the training set of the system, $D = X_1 \cup X_2$, where $X_1 = \{x_{1i} \mid x_{1i} \in \mathbb{R}^m,\ y_{1i} = 1,\ i = 1, \dots, n_1\}$ is the non-spam data set and $X_2 = \{x_{2i} \mid x_{2i} \in \mathbb{R}^m,\ y_{2i} = -1,\ i = 1, \dots, n_2\}$ is the spam data set. Each sample has $m$ features, $n_1$ is the number of non-spam samples, $n_2$ is the number of spam samples, $n = n_1 + n_2$ is the total number of training samples, and $\mathbb{R}^m$ is the set of real vectors with $m$ features. $X_1$ and $X_2$ also denote the feature matrices of the non-spam data and the spam data, respectively. $x_{1i}$ is the $i$-th non-spam sample and $y_{1i}$ is its classification result; since $x_{1i}$ is non-spam, $y_{1i} = 1$. $x_{2i}$ is the $i$-th spam sample and $y_{2i}$ is its classification result; since $x_{2i}$ is spam, $y_{2i} = -1$.
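As a minimal sketch of how the training set D can be partitioned into X1 (label 1, non-spam) and X2 (label -1, spam), the following uses plain Python lists; the function name `split_by_label` and the toy samples are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_by_label(samples: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> Tuple[List[List[float]], List[List[float]]]:
    """Partition D into X1 (label 1, non-spam) and X2 (label -1, spam)."""
    X1 = [list(x) for x, y in zip(samples, labels) if y == 1]
    X2 = [list(x) for x, y in zip(samples, labels) if y == -1]
    return X1, X2


# Three toy samples with m = 2 normalized features each.
samples = [[0.1, 0.0], [0.9, 0.7], [0.2, 0.1]]
labels = [1, -1, 1]
X1, X2 = split_by_label(samples, labels)
n1, n2 = len(X1), len(X2)
n = n1 + n2  # total number of training samples
print(n1, n2, n)
```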
In this embodiment, the classification result is obtained mainly by the following two linear discriminant functions:
$f_1(x) = x^T w_1 + b_1$
$f_2(x) = x^T w_2 + b_2$
wherein $w_1$ and $w_2$ are the first weight vector and the second weight vector of the two functions, respectively, and $b_1$ and $b_2$ are the first function deviation coefficient and the second function deviation coefficient of the two functions, respectively. To obtain the weight vectors and deviation coefficients, the following two optimization problems need to be solved respectively:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $C_1$, $C_2$, $C_3$ and $C_4$ are four auxiliary variables that need to be determined in advance; $X_1$ and $X_2$ are the feature matrices of the non-spam data and the spam data, respectively; $\xi_1$ and $\xi_2$ are slack variables; $e_1$ and $e_2$ are all-ones vectors; and $\|\cdot\|_1$ is the L1 norm.
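The constraints above bound the slack variables from below: the first constraint gives, row by row, ξ2 ≥ 1 + (x·w1 + b1) together with ξ2 ≥ 0 for each spam sample x, and the second gives ξ1 ≥ 1 - (x·w2 + b2) together with ξ1 ≥ 0 for each non-spam sample x. The sketch below computes the smallest slack values that satisfy these constraints for given parameters. It illustrates the constraints only and is not a solver for the optimization problems; the function names and numeric values are hypothetical.

```python
from typing import List, Sequence


def linear_value(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Compute x^T w + b for one sample."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def min_slack_first_problem(X2: Sequence[Sequence[float]],
                            w1: Sequence[float], b1: float) -> List[float]:
    """Smallest xi_2 with -(x.w1 + b1) + xi_2 >= 1 and xi_2 >= 0, per spam sample."""
    return [max(0.0, 1.0 + linear_value(x, w1, b1)) for x in X2]


def min_slack_second_problem(X1: Sequence[Sequence[float]],
                             w2: Sequence[float], b2: float) -> List[float]:
    """Smallest xi_1 with (x.w2 + b2) + xi_1 >= 1 and xi_1 >= 0, per non-spam sample."""
    return [max(0.0, 1.0 - linear_value(x, w2, b2)) for x in X1]


# Illustrative parameters and samples (not taken from the source).
X2 = [[0.0, 0.5], [1.0, 0.5]]
w1, b1 = [1.5, 0.0], -2.0
slack_2 = min_slack_first_problem(X2, w1, b1)

X1 = [[1.0, 0.0], [0.25, 0.0]]
w2, b2 = [2.0, 0.0], 0.0
slack_1 = min_slack_second_problem(X1, w2, b2)
print(slack_2, slack_1)
```

A slack of zero means the sample already lies at least one unit on the required side of the other class's plane; a positive slack measures the violation.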
After the two optimization problems are solved, $w_1$, $w_2$, $b_1$ and $b_2$ are obtained, which determines the two linear discriminant functions. In addition, the smaller the value of an element of $w_1$ or $w_2$, the smaller the contribution of the corresponding feature to model training. Removing the features that correspond to the small-valued elements of $w_1$ and $w_2$ improves the classification efficiency and generalization performance of the model, and thereby the accuracy of spam filtering. Therefore, in the present application, after the discrimination parameters of the linear discriminant functions are obtained, it may be determined whether the first weight vector and the second weight vector contain element values smaller than a predetermined threshold; if so, those element values are set to zero, which improves the classification effect and generalization capability of the model.
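The zeroing of small weights can be sketched as below. Comparing the absolute value of each element with the threshold is an assumption of this sketch, and the threshold value and the function names are hypothetical.

```python
from typing import List, Sequence


def zero_small_weights(w: Sequence[float], threshold: float) -> List[float]:
    """Set every element whose magnitude is below the threshold to zero."""
    return [0.0 if abs(value) < threshold else value for value in w]


def retained_features(w: Sequence[float]) -> List[int]:
    """Indices of features that still contribute after zeroing."""
    return [i for i, value in enumerate(w) if value != 0.0]


# Illustrative weight vector; 1e-4 is an assumed threshold, not a value from the text.
w1 = [0.8, -0.00003, 0.0, -0.4]
w1_sparse = zero_small_weights(w1, threshold=1e-4)
print(w1_sparse)                      # [0.8, 0.0, 0.0, -0.4]
print(retained_features(w1_sparse))   # [0, 3]
```

Features whose weight is zero in both vectors no longer affect either discriminant function, so they need not be computed for new mail.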
It can be understood that, after the discrimination parameters $w_1$, $w_2$, $b_1$ and $b_2$ of the linear discriminant functions are obtained in the above manner, the mail data can be processed by using the predetermined linear discriminant functions to obtain the discriminant function values. The process specifically includes: obtaining a first discrimination function value $f_1(x)$ by using the first linear discriminant function and the mail data $x$; and obtaining a second discrimination function value $f_2(x)$ by using the second linear discriminant function and the mail data $x$; wherein the first linear discriminant function is $f_1(x) = x^T w_1 + b_1$ and the second linear discriminant function is $f_2(x) = x^T w_2 + b_2$.
That is, after the input mail data $x$ to be predicted is acquired, it is normalized so that its features lie in the interval [0,1]; the values of the two discriminant functions are then calculated to obtain the first discrimination function value $f_1(x)$ and the second discrimination function value $f_2(x)$, and the mail type is determined from these two values.
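A minimal sketch of evaluating the two linear discriminant functions f1(x) = x^T w1 + b1 and f2(x) = x^T w2 + b2 on a normalized mail vector follows; the parameter values are illustrative and the function name is hypothetical.

```python
from typing import Sequence


def linear_discriminant(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Evaluate f(x) = x^T w + b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same number of features")
    return sum(xi * wi for xi, wi in zip(x, w)) + b


# Illustrative parameters; in practice they come from solving the two optimization problems.
w1, b1 = [1.0, -1.0], 0.25
w2, b2 = [2.0, 0.0], -0.5

x = [0.5, 1.0]  # normalized mail features
f1 = linear_discriminant(x, w1, b1)
f2 = linear_discriminant(x, w2, b2)
print(f1, f2)  # -0.25 0.5
```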
S103, mail data is classified by using preset classification rules and discrimination function values.
Classifying the mail data by using a preset classification rule and the discrimination function value includes:
obtaining a classification result of the mail data by using the preset classification rule, the first discrimination function value $f_1(x)$ and the second discrimination function value $f_2(x)$;
the classification rule is as follows:
wherein, if the classification result is 1, the mail is judged to be non-spam, and if the classification result is -1, the mail is judged to be spam.
It can be seen that after two discrimination function values are obtained, the type of the mail data can be discriminated according to the predetermined classification rule, that is: and judging whether the mail data is a junk mail.
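The formula of the classification rule is not reproduced in this text. Purely as an illustration, the sketch below assumes the nearest-plane rule commonly used with twin support vector machines: the mail is assigned to the class whose plane gives the smaller |f(x)|, with ties going to non-spam. Both the rule and the tie-breaking are assumptions of this sketch, not the rule stated by the invention; some TSVM variants additionally divide |f_k(x)| by a norm of w_k.

```python
from typing import Sequence


def linear_discriminant(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Evaluate f(x) = x^T w + b."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def classify(x: Sequence[float],
             w1: Sequence[float], b1: float,
             w2: Sequence[float], b2: float) -> int:
    """Return 1 (non-spam) or -1 (spam) under the ASSUMED nearest-plane rule."""
    f1 = linear_discriminant(x, w1, b1)  # plane fitted to non-spam (X1)
    f2 = linear_discriminant(x, w2, b2)  # plane fitted to spam (X2)
    # Assumption: the smaller |f_k(x)| indicates the closer plane.
    return 1 if abs(f1) <= abs(f2) else -1


# Illustrative parameters (not taken from the source).
w1, b1 = [1.0, -1.0], 0.25
w2, b2 = [2.0, 0.0], -0.5

label_a = classify([0.5, 1.0], w1, b1, w2, b2)   # |f1| = 0.25, |f2| = 0.5
label_b = classify([0.25, 0.0], w1, b1, w2, b2)  # |f1| = 0.5,  |f2| = 0.0
print(label_a, label_b)  # 1 -1
```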
The present invention is described in detail below with reference to a specific example, which is implemented on the premise of the technical solution of the present invention, and detailed embodiments and procedures are given, but the application scope of the present invention is not limited to the following example.
In this embodiment, testing is performed on the Spambase dataset from the UCI repository, which classifies mail according to whether it is spam. The dataset contains 4601 samples, each with 57 features, most of which indicate how frequently a particular word or character occurs in the mail, as shown in Table 1. A feature of type "word_freq_WORD" represents the percentage of words in the e-mail that match WORD, namely 100 × (number of times WORD appears in the e-mail) / (total number of words in the e-mail);
"WORD" here may be any string of alphanumeric characters;
a feature of type "char_freq_CHAR" represents the percentage of characters in the e-mail that match CHAR, namely 100 × (number of occurrences of CHAR) / (total number of characters in the e-mail);
"capital_run_length_average" represents the average length of uninterrupted sequences of capital letters;
"capital_run_length_longest" represents the length of the longest uninterrupted sequence of capital letters;
"capital_run_length_total" represents the total number of capital letters in the e-mail.
Of these samples, 1813 are non-spam and are labeled +1; 2788 are spam and are labeled -1.
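The feature definitions above can be sketched as follows. The tokenization (runs of alphanumeric characters), case-insensitive word matching, and counting every character including spaces in the denominator are assumptions of this sketch; the function names and the sample text are hypothetical.

```python
import re
from typing import Tuple


def word_freq(text: str, word: str) -> float:
    """Percentage of words in the text that match `word` (case-insensitive, assumed)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)


def char_freq(text: str, char: str) -> float:
    """Percentage of characters in the text that match `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_stats(text: str) -> Tuple[float, int, int]:
    """Average, longest and total length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)


mail_text = "WIN money NOW! Call for free money today"
print(word_freq(mail_text, "money"))   # 25.0  (2 of 8 words)
print(char_freq(mail_text, "!"))       # 2.5   (1 of 40 characters)
print(capital_run_stats(mail_text))    # runs "WIN", "NOW", "C": longest 3, total 7
```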
TABLE 1 characterization of the Spambase dataset
The specific implementation steps are as follows:
First, data preprocessing module
(1) Collect spam-related data to serve as the training set of the system. The Spambase dataset is used in this example.
(2) Input the training set $D = X_1 \cup X_2$, where $X_1 = \{x_{1i} \mid x_{1i} \in \mathbb{R}^m,\ y_{1i} = 1,\ i = 1, \dots, n_1\}$ is the non-spam data set and $X_2 = \{x_{2i} \mid x_{2i} \in \mathbb{R}^m,\ y_{2i} = -1,\ i = 1, \dots, n_2\}$ is the spam data set. Each sample has $m$ features, $n_1$ is the number of non-spam samples, $n_2$ is the number of spam samples, and $n = n_1 + n_2$ is the total number of samples. In this example, the number of features is $m = 57$ and the total number of samples is $n = 4601$. 3680 samples are randomly selected from the sample set as the training set, and the remaining 921 samples are used as the test set.
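The random split into 3680 training samples and 921 test samples can be sketched as follows. The fixed seed is an assumption added for reproducibility, and the function name is hypothetical.

```python
import random
from typing import List, Tuple


def train_test_indices(n_samples: int, n_train: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split them into training and test index lists."""
    if not 0 <= n_train <= n_samples:
        raise ValueError("n_train must be between 0 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG; seed value is an assumption
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_indices(n_samples=4601, n_train=3680)
print(len(train_idx), len(test_idx))  # 3680 921
```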
Second, data training module
Two linear discriminant functions were determined using the present invention:
$f_1(x) = x^T w_1 + b_1$
$f_2(x) = x^T w_2 + b_2$
where $w_1$ and $w_2$ are the weight vectors of the functions and $b_1$ and $b_2$ are the deviations of the functions. To obtain the weight vectors and deviations, the following two optimization problems are solved respectively:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $C_1$, $C_2$, $C_3$ and $C_4$ are auxiliary variables that need to be determined in advance; $X_1$ and $X_2$ are the feature matrices of the non-spam data and the spam data, respectively; $\xi_1$ and $\xi_2$ are slack variables; and $e_1$ and $e_2$ are all-ones vectors.
After the two optimization problems are solved, $w_1$, $w_2$, $b_1$ and $b_2$ are obtained, which determines the two linear discriminant functions. The smaller the value of an element of $w_1$ or $w_2$, the smaller the contribution of the corresponding feature to model training. Removing the features that correspond to the small-valued elements of $w_1$ and $w_2$ improves the classification efficiency and generalization performance of the model, and thereby the accuracy of spam filtering.
Table 2 shows the values of $w_1$ and $w_2$ in this example and their corresponding features.
TABLE 2 Values of $w_1$ and $w_2$ trained on the Spambase dataset and their corresponding features
As can be seen from Table 2, features such as strings of digits and symbols such as "(" and "[" contribute little to model training, whereas the feature words corresponding to the larger element values of $w_1$ and $w_2$, such as "meeting", "business" and "edu", contribute more to the model. In this example, the elements of $w_1$ and $w_2$ whose values lie in [-e-4, e-4] (i.e., the bolded data in the table) are set to 0.
Third, data prediction module
Input the mail data $x$ to be predicted and calculate the values of the discriminant functions:
$f_1(x) = x^T w_1 + b_1$
$f_2(x) = x^T w_2 + b_2$
Then, the mail category is determined according to the following rule:
if the classification result is 1, the mail is non-spam; otherwise, the mail is spam.
The TSVM and the present invention are compared. The present invention is tested in two variants: in one, the smaller values in $w_1$ and $w_2$ are directly set to zero; in the other, all values of $w_1$ and $w_2$ are retained. The experimental results are shown in Table 3. The method reduces the influence of features with a lower contribution on the classification result, improves the generalization performance of classification, and thereby improves the accuracy of mail filtering.
TABLE 3 Comparison of test accuracy on the Spambase dataset
Method                                          Accuracy
Present invention (small weights set to zero)   94.14%
Present invention (all weights)                 94.03%
TSVM                                            92.31%
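Accuracy here is the percentage of test mails whose predicted label equals the true label. A minimal sketch of that computation follows; the labels shown are toy values, not the experimental data, and the function name is hypothetical.

```python
from typing import Sequence


def accuracy_percent(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Percentage of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Toy labels: 1 = non-spam, -1 = spam.
y_true = [1, -1, 1, -1]
y_pred = [1, -1, -1, -1]
print(accuracy_percent(y_true, y_pred))  # 75.0
```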
It can be seen that, when mail data is classified through the linear discriminant functions, the discrimination parameters are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm. These discrimination parameters reduce the influence of features with a small contribution on the classification result, which improves classification efficiency and generalization performance. Furthermore, by directly setting the smaller values in $w_1$ and $w_2$ to zero, the scheme can directly remove the influence of features with a small contribution on the classification result, which further improves the accuracy of spam filtering.
In the following, the mail sorting apparatus provided in the embodiment of the present invention is introduced, and the mail sorting apparatus described below and the mail sorting method described above may be referred to each other.
Referring to fig. 2, an email sorting apparatus provided in an embodiment of the present invention includes:
a data receiving module 100, configured to receive mail data to be classified;
a data processing module 200, configured to process the mail data by using a predetermined linear discriminant function to obtain a discriminant function value; wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm; the training set comprises mail training data of different categories;
a data classification device 300, configured to classify the mail data by using a preset classification rule and the discrimination function value.
The device also comprises a discrimination parameter generation module; wherein, the discrimination parameter generation module comprises:
a training set acquisition unit for acquiring a training set;
a discrimination parameter determining unit, configured to determine a discrimination parameter in the linear discrimination function by using the training set and a preset condition; the preset conditions include:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $w_1$ is the first weight vector in the discrimination parameters, $w_2$ is the second weight vector in the discrimination parameters, $b_1$ is the first function deviation coefficient in the discrimination parameters, $b_2$ is the second function deviation coefficient in the discrimination parameters, $\xi_1$ is the first slack variable, $\xi_2$ is the second slack variable, $X_1$ is the feature matrix of the non-spam data in the training set, $X_2$ is the feature matrix of the spam data in the training set, $e_1$ is a first all-ones vector, $e_2$ is a second all-ones vector, $\|\cdot\|_1$ is the L1 norm, and $C_1$, $C_2$, $C_3$ and $C_4$ are the predetermined first, second, third and fourth auxiliary variables, respectively.
Wherein the data processing module comprises:
a first processing unit for obtaining a first discrimination function value $f_1(x)$ by using the first linear discriminant function and the mail data $x$;
a second processing unit for obtaining a second discrimination function value $f_2(x)$ by using the second linear discriminant function and the mail data $x$; wherein the first linear discriminant function is $f_1(x) = x^T w_1 + b_1$ and the second linear discriminant function is $f_2(x) = x^T w_2 + b_2$.
Wherein the data classification device is specifically configured to: obtain a classification result of the mail data by using the preset classification rule, the first discrimination function value $f_1(x)$ and the second discrimination function value $f_2(x)$;
the classification rule is as follows:
wherein, if the classification result is 1, the mail is judged to be non-spam, and if the classification result is -1, the mail is judged to be spam.
Referring to fig. 3, a schematic structural diagram of an email sorting apparatus is also disclosed for the embodiment of the present invention; the apparatus may include:
a memory 11 for storing a computer program;
a processor 12 for implementing the steps of the mail sorting method according to any of the above-described method embodiments when executing said computer program.
In the present embodiment, the device 1 may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet Computer, a palmtop Computer, or a portable Computer.
The device 1 may include a memory 11, a processor 12 and a bus 13.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the device 1, for example a hard disk of the device 1. The memory 11 may also be an external storage device of the device 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the device 1. Further, the memory 11 may also comprise both internal memory units of the device 1 and external memory devices. The memory 11 can be used not only for storing application software installed in the apparatus 1 and various types of data such as codes for executing a mail classification method, etc., but also for temporarily storing data that has been output or is to be output.
The processor 12 may, in some embodiments, be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to run program code stored in the memory 11 or to process data, for example to execute the code of the mail classification method.
The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 3, but this does not mean only one bus or one type of bus.
Further, the device may further comprise a network interface 14, and the network interface 14 may optionally comprise a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the device 1 and for displaying a visual user interface.
Fig. 3 only shows the device 1 with the components 11-14; it will be understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the device 1, which may comprise fewer or more components than shown, a combination of certain components, or a different arrangement of components.
The embodiment of the invention also discloses a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and when being executed by a processor, the computer program realizes the steps of the mail classification method according to any method embodiment.
Wherein the storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of classifying mail, comprising:
receiving mail data to be classified;
processing the mail data by using a predetermined linear discriminant function to obtain a discrimination function value; wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm; the training set comprises mail training data of different categories;
and classifying the mail data by using a preset classification rule and the discrimination function value.
2. The mail classification method according to claim 1, wherein the method for generating the discriminant parameters in the linear discriminant function comprises:
acquiring a training set; determining a discrimination parameter in the linear discrimination function by using the training set and a preset condition;
the preset conditions include:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $w_1$ is the first weight vector in the discrimination parameters, $w_2$ is the second weight vector in the discrimination parameters, $b_1$ is the first function bias coefficient in the discrimination parameters, $b_2$ is the second function bias coefficient in the discrimination parameters, $\xi_1$ is the first slack variable, $\xi_2$ is the second slack variable, $X_1$ is the feature matrix of the non-spam data in the training set, $X_2$ is the feature matrix of the spam data in the training set, $e_1$ is a first all-ones vector, $e_2$ is a second all-ones vector, $\|\cdot\|_1$ is the L1 norm, and $C_1$, $C_2$, $C_3$ and $C_4$ are the predetermined first, second, third and fourth auxiliary variables, respectively.
3. The mail classification method according to claim 2, wherein the processing the mail data by using a predetermined linear discriminant function to obtain a discrimination function value comprises:
obtaining a first discrimination function value $f_1(x)$ by using a first linear discriminant function and the mail data $x$;
obtaining a second discrimination function value $f_2(x)$ by using a second linear discriminant function and the mail data $x$;
wherein the first linear discriminant function is $f_1(x) = x^T w_1 + b_1$ and the second linear discriminant function is $f_2(x) = x^T w_2 + b_2$.
4. The mail classification method according to claim 3, wherein said classifying the mail data by using a predetermined classification rule and the discrimination function value includes:
obtaining a classification result of the mail data by using a predetermined classification rule, the first discrimination function value $f_1(x)$ and the second discrimination function value $f_2(x)$;
the classification rule is as follows:
wherein, if the classification result is 1, the mail is judged to be non-junk mail, and if the classification result is -1, the mail is judged to be junk mail.
5. A mail sorting apparatus, comprising:
the data receiving module is used for receiving the mail data to be classified;
the data processing module is used for processing the mail data by using a predetermined linear discriminant function to obtain a discrimination function value; wherein the discrimination parameters in the linear discriminant function are obtained in advance by analyzing a training set with a twin support vector machine classification algorithm based on the L1 norm; the training set comprises mail training data of different categories;
and the data classification device is used for classifying the mail data by using a preset classification rule and the discrimination function value.
6. The mail sorting apparatus of claim 5, further comprising a discrimination parameter generation module; wherein, the discrimination parameter generation module comprises:
a training set acquisition unit for acquiring a training set;
a discrimination parameter determining unit, configured to determine a discrimination parameter in the linear discrimination function by using the training set and a preset condition; the preset conditions include:
s.t. $-(X_2 w_1 + e_2 b_1) + \xi_2 \ge e_2,\ \xi_2 \ge 0$
s.t. $(X_1 w_2 + e_1 b_2) + \xi_1 \ge e_1,\ \xi_1 \ge 0$
wherein $w_1$ is the first weight vector in the discrimination parameters, $w_2$ is the second weight vector in the discrimination parameters, $b_1$ is the first function bias coefficient in the discrimination parameters, $b_2$ is the second function bias coefficient in the discrimination parameters, $\xi_1$ is the first slack variable, $\xi_2$ is the second slack variable, $X_1$ is the feature matrix of the non-spam data in the training set, $X_2$ is the feature matrix of the spam data in the training set, $e_1$ is a first all-ones vector, $e_2$ is a second all-ones vector, $\|\cdot\|_1$ is the L1 norm, and $C_1$, $C_2$, $C_3$ and $C_4$ are the predetermined first, second, third and fourth auxiliary variables, respectively.
7. The mail sorting device of claim 6, wherein said data processing module comprises:
a first processing unit for obtaining a first discrimination function value $f_1(x)$ by using a first linear discriminant function and the mail data $x$;
a second processing unit for obtaining a second discrimination function value $f_2(x)$ by using a second linear discriminant function and the mail data $x$; wherein the first linear discriminant function is $f_1(x) = x^T w_1 + b_1$ and the second linear discriminant function is $f_2(x) = x^T w_2 + b_2$.
8. The mail sorting device of claim 7, wherein the data classification device is specifically configured to: obtain a classification result of the mail data by using a predetermined classification rule, the first discrimination function value $f_1(x)$ and the second discrimination function value $f_2(x)$;
the classification rule is as follows:
wherein, if the classification result is 1, the mail is judged to be non-junk mail, and if the classification result is -1, the mail is judged to be junk mail.
9. A mail sorting apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the mail classification method according to any one of claims 1 to 4 when executing said computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the mail sorting method according to any one of claims 1 to 4.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910893789.0A CN110610213A (en) 2019-09-20 2019-09-20 Mail classification method, device, equipment and computer readable storage medium
PCT/CN2020/079825 WO2021051764A1 (en) 2019-09-20 2020-03-18 Email classification method and apparatus, device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN110610213A true CN110610213A (en) 2019-12-24

Family

ID=68891665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910893789.0A Pending CN110610213A (en) 2019-09-20 2019-09-20 Mail classification method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110610213A (en)
WO (1) WO2021051764A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051764A1 (en) * 2019-09-20 2021-03-25 苏州大学 Email classification method and apparatus, device, and computer-readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079851A (en) * 2007-07-09 2007-11-28 华为技术有限公司 Email type judgement method and device and establishment device of system and behavior model
CN102984176A (en) * 2012-12-24 2013-03-20 重庆大学 Identification method and system for junk mail
CN103020645A (en) * 2013-01-06 2013-04-03 深圳市彩讯科技有限公司 System and method for junk picture recognition
CN103186845A (en) * 2011-12-29 2013-07-03 盈世信息科技(北京)有限公司 Junk mail filtering method
CN104573630A (en) * 2014-12-05 2015-04-29 杭州电子科技大学 Multiclass brain electrical mode online identification method based on probability output of twin support vector machine
CN104967558A (en) * 2015-06-10 2015-10-07 东软集团股份有限公司 Method and device for detecting junk mail
CN106779755A (en) * 2016-12-31 2017-05-31 湖南文沥征信数据服务有限公司 A kind of network electric business borrows or lends money methods of risk assessment and model
CN107844801A (en) * 2017-10-19 2018-03-27 苏翀 A kind of sorting technique of spam
CN108876001A (en) * 2018-05-03 2018-11-23 东北大学 A kind of Short-Term Load Forecasting Method based on twin support vector machines
CN110048936A (en) * 2019-04-18 2019-07-23 合肥天毅网络传媒有限公司 A kind of method that semantic association word judges spam

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8885928B2 (en) * 2006-10-25 2014-11-11 Hewlett-Packard Development Company, L.P. Automated machine-learning classification using feature scaling
CN109919202A (en) * 2019-02-18 2019-06-21 新华三技术有限公司合肥分公司 Disaggregated model training method and device
CN110505144A (en) * 2019-08-09 2019-11-26 世纪龙信息网络有限责任公司 Process for sorting mailings, device, equipment and storage medium
CN110610213A (en) * 2019-09-20 2019-12-24 苏州大学 Mail classification method, device, equipment and computer readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SEYED HAMID REZA MOHAMMADI et al.: "Web Spam Detection Using Multiple Kernels in Twin Support Vector Machine", 《ARXIV:1605.02917V1 [CS.IR]》 *
XINJUN PENG et al.: "L1-norm loss based twin support vector machine for data recognition", 《INFORMATION SCIENCES》 *

Also Published As

Publication number Publication date
WO2021051764A1 (en) 2021-03-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191224