CN111382273A - Text classification method based on feature selection of attraction factors - Google Patents

Text classification method based on feature selection of attraction factors

Info

Publication number
CN111382273A
CN111382273A (application number CN202010158078.1A)
Authority
CN
China
Prior art keywords
attraction
texts
class
entry
factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010158078.1A
Other languages
Chinese (zh)
Other versions
CN111382273B (en)
Inventor
周红芳
韩霜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Zhiying Wanshi Market Management Co ltd
Xi'an Huaqi Zhongxin Technology Development Co ltd
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202010158078.1A
Publication of CN111382273A
Application granted
Publication of CN111382273B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method based on feature selection with attraction factors. A data set is preprocessed for a naive Bayes classifier and a support vector machine classifier: the data set is acquired, entries that occur in more than 25% of the documents and entries that occur in fewer than 3 documents are eliminated, and a test set and a training set are divided by cross-validation. The number of feature words in the test set and training set is then set with the attraction-factor-based feature selection method to generate an optimal feature subset. The naive Bayes and support vector machine classifiers are trained in turn on the optimal feature subset of the training set to produce classifier models, and the optimal feature subset of the test set is input into the classifier models to obtain a classification result. The classification results are evaluated with two evaluation indexes, micro-average F1 and macro-average F1, to verify the performance of the method.

Description

Text classification method based on feature selection of attraction factors
Technical Field
The invention belongs to the technical field of data mining methods, and relates to a text classification method based on feature selection of attraction factors.
Background
Text classification is the task of assigning predefined categories to documents. It was traditionally performed manually by domain experts, but with the dramatic increase in the number of digital documents available on the Internet it is no longer feasible to process such a large amount of information by hand, and classification algorithms have evolved alongside IT technology. Text classification, studied in both information science and computer science, has found applications in many fields, such as information retrieval, genre classification, spam filtering and language identification. Text classification is a basic function of text information mining and a core technology for processing and organizing text data. It can effectively help people organize and sort information, largely solves the problem of information disorder, and has strong practical significance for the efficient management and effective utilization of information; text classification has therefore become one of the important research directions in the field of data mining.
Text classification is a complex system engineering task, and feature selection is one of its key technologies. Feature selection is an important problem in text classification: it can reduce the size of the feature space without sacrificing classification performance, while also avoiding overfitting. Its essence is to delete, according to some rule, the feature words that contribute little to text classification from the original high-dimensional feature space, and to select the most effective and most representative feature words to form a new feature subset. Through feature selection, feature words irrelevant to the task can be removed, greatly reducing the dimension of the text feature space and improving the efficiency and accuracy of text classification.
A distinguishing property of text classification is that even for medium-sized data sets the number of features can easily reach tens of thousands, so two problems arise in the high-dimensional case:
first, some complex algorithms cannot be used optimally in text classification; second, when most algorithms are trained on the training set, overfitting is hard to avoid, resulting in low classification accuracy. Dimension reduction has therefore long been a major research area. Meanwhile, the rapid development of text classification technology brings difficulties and challenges not met before, and substantial room for development remains for research on text classification in both theory and practice.
Disclosure of Invention
The invention aims to provide a text classification method based on feature selection of attraction factors, and solves the problem of low classification accuracy in the prior art.
The technical scheme adopted by the invention is a text classification method based on feature selection with attraction factors, which specifically comprises the following steps:
step 1: preprocessing a data set for the naive Bayes classifier NB and the support vector machine classifier SVM: acquiring a plurality of data sets that have undergone stemming and stop-word removal, eliminating entries that occur in more than 25% of the documents in the data sets as well as entries that occur in fewer than 3 documents, and dividing the data into a test set and a training set by cross-validation;
step 2: setting the number of feature words for the test set and training set obtained in step 1 using the attraction-factor-based feature selection method, to generate an optimal feature subset;
step 3: training and classifying with the naive Bayes classifier NB and the support vector machine classifier SVM in turn on the optimal feature subset of the training set obtained in step 2 to train classifier models, and inputting the optimal feature subset of the test set obtained in step 2 into the classifier models to obtain a classification result;
step 4: evaluating the classification result obtained in step 3 with the two evaluation indexes micro-average F1 and macro-average F1, to verify the performance of the attraction-factor-based feature selection method.
The invention is also characterized in that:
the data sets in step 1 are four data sets of 20Newsgroups, WebKB, K1a and K1 b.
The step 2 comprises the following specific steps:
step 2.1: calculating an attraction factor T(t_i), where the attraction factor represents the average frequency of occurrence of the term in each text of the class;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate and the false positive rate to balance the true relevance of the terms;
step 2.3: calculating the normalized difference measure factor NDM from the true positive rate tpr and the false positive rate fpr obtained in step 2.2;
step 2.4: calculating the weight value MTFS(t_i) of each entry according to the following formula, sorting the entries, and selecting the optimal feature subset according to the number of entries,
MTFS(t_i) = MT · T(t_i) · NDM
where MT is the maximum term positive rate from step 2.2, T(t_i) is the attraction factor from step 2.1, and NDM is the normalized difference measure factor from step 2.3.
The specific steps of step 2 are as follows:
step 2.1: calculating the attraction factor T(t_i), which represents the average frequency of occurrence of the term in each text of the class:
$$T(t_i) = \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (1)$$
where tf_{ij} is the number of occurrences of the term in text d_j of class C_i, and N is the total number of texts in class C_i;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate tpr and the false positive rate fpr to balance the true relevance of the terms;
the true positive rate tpr and the false positive rate fpr are calculated as follows:
$$tpr = \frac{tp}{tp + fn} \qquad (2)$$
$$fpr = \frac{fp}{fp + tn} \qquad (3)$$
$$MT = \max(tpr, fpr)$$
where tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k; tn denotes the number of texts that neither contain entry t_i nor belong to class C_k;
step 2.3: from the true positive rate tpr and the false positive rate fpr calculated by formulas (2) and (3) in step 2.2, the normalized difference measure factor is calculated according to the following formula,
$$NDM = \frac{|tpr - fpr|}{\min(tpr, fpr)} \qquad (4)$$
step 2.4: the weight value MTFS(t_i) of each entry is calculated according to the following formula, the entries are sorted, and the optimal feature subset is selected according to the number of entries,
$$MTFS(t_i) = MT \cdot T(t_i) \cdot NDM \qquad (5)$$
where MT is the maximum term positive rate obtained in step 2.2, T(t_i) is the attraction factor obtained in step 2.1, and NDM is the normalized difference measure factor obtained in step 2.3.
The calculation formula of the micro-average F1 in step 4 is as follows:
$$\text{micro-}F1 = \frac{2\,\bar{p}\,\bar{r}}{\bar{p} + \bar{r}} \qquad (6)$$
where the average precision $\bar{p}$ and the average recall $\bar{r}$ are the precision
$$p = \frac{tp}{tp + fp}$$
and the recall
$$r = \frac{tp}{tp + fn}$$
averaged over the test classes; tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k;
the macroaverage-F1 calculation formula is as follows:
Figure BDA0002404802440000055
in the formula F1(k) The value of-F1 for the micro-average of the kth test class is indicated, and K indicates the total number of test classes.
The invention has the beneficial effects that:
1. The invention comprehensively considers the contribution to classification of document frequency and of the distribution of terms within and between classes. Compared with the traditional CHI, GINI, NDM and OR algorithms, it therefore has a clear advantage in classification accuracy on the data sets 20Newsgroups, WebKB, K1a and K1b, and experiments prove that the attraction-factor-based feature selection method improves classification accuracy when applied to text classification and is an effective feature selection algorithm.
2. Matched with different classifiers, the feature subsets selected by the invention and by the traditional CHI, GINI, NDM and OR algorithms were each run on the NB and SVM classifiers; the final results show that the invention performs well and achieves high classification accuracy.
Drawings
FIG. 1 is a flow chart of the text classification method based on feature selection with attraction factors of the present invention;
FIG. 2 is a line chart comparing the micro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 3 is a line chart comparing the macro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 4 is a line chart comparing the micro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions;
FIG. 5 is a line chart comparing the macro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions;
FIG. 6 is a histogram comparing the micro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 7 is a histogram comparing the macro-average F1 values of the present method and the prior art when a naive Bayes classifier is used on different data sets and under different entry dimensions;
FIG. 8 is a histogram comparing the micro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions;
FIG. 9 is a histogram comparing the macro-average F1 values of the present method and the prior art when a support vector machine classifier is used on different data sets and under different entry dimensions.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to a text classification method based on feature selection with attraction factors, which, as shown in FIG. 1, specifically comprises the following steps:
step 1: preprocessing a data set for the naive Bayes classifier NB and the support vector machine classifier SVM: acquiring a plurality of data sets that have undergone stemming and stop-word removal, eliminating entries that occur in more than 25% of the documents in the data sets as well as entries that occur in fewer than 3 documents, and dividing the data into a test set and a training set by cross-validation;
step 2: setting the number of feature words for the test set and training set obtained in step 1 using the attraction-factor-based feature selection method, to generate an optimal feature subset;
step 3: training and classifying with the naive Bayes classifier NB and the support vector machine classifier SVM in turn on the optimal feature subset of the training set obtained in step 2 to train classifier models, and inputting the optimal feature subset of the test set obtained in step 2 into the classifier models to obtain a classification result;
step 4: evaluating the classification result obtained in step 3 with the two evaluation indexes micro-average F1 and macro-average F1, to verify the performance of the attraction-factor-based feature selection method.
The invention uses the naive Bayes (NB) and support vector machine (SVM) classification algorithms for classification. The naive Bayes algorithm is a probability-based algorithm widely applied in the field of machine learning; it mainly focuses on the probability that a text belongs to a certain category and shows good efficiency and robustness in practical applications. The support vector machine algorithm works well at mining the internal features of data and has higher accuracy than other classification algorithms; through its kernel function, classification in a high-dimensional vector space can be reduced to low-dimensional operations.
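For illustration, a minimal sketch of step 3 assuming scikit-learn stand-ins for the two classifiers (the patent does not name an implementation; function and variable names here are assumptions):

```python
# Minimal sketch of step 3 under the assumptions stated above.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def train_and_classify(X_train, y_train, X_test):
    """Train each classifier model on the training feature subset, then classify the test set."""
    predictions = {}
    for name, clf in (("NB", MultinomialNB()), ("SVM", LinearSVC())):
        clf.fit(X_train, y_train)                # train the classifier model
        predictions[name] = clf.predict(X_test)  # classify the test feature subset
    return predictions
```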
The data sets in step 1 are the four data sets 20Newsgroups, WebKB, K1a and K1b.
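A sketch of the step-1 preprocessing on 20Newsgroups (the only one of the four data sets bundled with scikit-learn); the 25% and 3-document thresholds follow the text, while the loader and parameter names are implementation assumptions, and stemming is assumed to have been applied beforehand:

```python
# Sketch of step 1 under the assumptions stated above: entries occurring in
# more than 25% of documents (max_df=0.25) or in fewer than 3 documents
# (min_df=3) are eliminated, then cross-validation splits the data.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

corpus = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
vectorizer = CountVectorizer(stop_words="english", max_df=0.25, min_df=3)
X = vectorizer.fit_transform(corpus.data)  # (texts x entries) count matrix
y = corpus.target

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(corpus.data):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```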
Step 2 comprises the following specific steps:
step 2.1: calculating an attraction factor T(t_i), where the attraction factor represents the average frequency of occurrence of the term in each text of the class; the larger the attraction factor, the more representative the term;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate and the false positive rate to balance the true relevance of the terms;
step 2.3: calculating the normalized difference measure factor NDM from the true positive rate tpr and the false positive rate fpr obtained in step 2.2;
step 2.4: calculating the weight value MTFS(t_i) of each entry according to the following formula, sorting the entries, and selecting the optimal feature subset according to the number of entries,
MTFS(t_i) = MT · T(t_i) · NDM
where MT is the maximum term positive rate from step 2.2, T(t_i) is the attraction factor from step 2.1, and NDM is the normalized difference measure factor from step 2.3.
The specific steps of step 2 are as follows:
step 2.1: calculating the attraction factor T(t_i), which represents the average frequency of occurrence of the term in each text of the class:
$$T(t_i) = \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (1)$$
where tf_{ij} is the number of occurrences of the term in text d_j of class C_i, and N is the total number of texts in class C_i;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate tpr and the false positive rate fpr to balance the true relevance of the terms;
the true positive rate tpr and the false positive rate fpr are calculated as follows:
$$tpr = \frac{tp}{tp + fn} \qquad (2)$$
$$fpr = \frac{fp}{fp + tn} \qquad (3)$$
$$MT = \max(tpr, fpr)$$
where tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k; tn denotes the number of texts that neither contain entry t_i nor belong to class C_k;
step 2.3: from the true positive rate tpr and the false positive rate fpr calculated by formulas (2) and (3) in step 2.2, the normalized difference measure factor is calculated according to the following formula,
$$NDM = \frac{|tpr - fpr|}{\min(tpr, fpr)} \qquad (4)$$
step 2.4: the weight value MTFS(t_i) of each entry is calculated according to the following formula, the entries are sorted, and the optimal feature subset is selected according to the number of entries,
$$MTFS(t_i) = MT \cdot T(t_i) \cdot NDM \qquad (5)$$
where MT is the maximum term positive rate obtained in step 2.2, T(t_i) is the attraction factor obtained in step 2.1, and NDM is the normalized difference measure factor obtained in step 2.3.
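As an illustration only (the counts below are assumed for the example, not taken from the patent), consider a term with tp = 8, fn = 2, fp = 1, tn = 9 and an average in-class frequency T(t_i) = 2.5:

$$tpr = \frac{8}{8+2} = 0.8, \qquad fpr = \frac{1}{1+9} = 0.1$$
$$MT = \max(0.8,\, 0.1) = 0.8, \qquad NDM = \frac{|0.8 - 0.1|}{\min(0.8,\, 0.1)} = 7$$
$$MTFS(t_i) = 0.8 \times 2.5 \times 7 = 14$$

A term that is frequent inside one class and rare outside it therefore receives a large weight, which is precisely the attraction the method rewards.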
The calculation formula of the micro-average F1 in step 4 is as follows:
$$\text{micro-}F1 = \frac{2\,\bar{p}\,\bar{r}}{\bar{p} + \bar{r}} \qquad (6)$$
where the average precision $\bar{p}$ and the average recall $\bar{r}$ are the precision
$$p = \frac{tp}{tp + fp}$$
and the recall
$$r = \frac{tp}{tp + fn}$$
averaged over the test classes; tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k;
the macroaverage-F1 calculation formula is as follows:
Figure BDA0002404802440000101
in the formula F1(k) The value of-F1 for the micro-average of the kth test class is indicated, and K indicates the total number of test classes.
The higher and more stable the micro-F1 and macro-F1 values obtained in the experiments, the better the classification effect and the higher the precision.
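A short sketch of the step-4 evaluation, assuming scikit-learn's f1_score as a stand-in for equations (6) and (7):

```python
# Sketch of step 4: micro-average F1 pools tp/fp/fn over all test classes
# before computing F1 (eq. 6), while macro-average F1 averages the per-class
# F1 values (eq. 7).
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred):
    return (f1_score(y_true, y_pred, average="micro"),
            f1_score(y_true, y_pred, average="macro"))

# Toy usage: three of four test texts classified correctly.
micro, macro = evaluate([0, 1, 2, 1], [0, 1, 2, 2])
print(micro, macro)
```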
In analyzing a data set in the experiments, the invention considers the distribution of terms within classes and also how to handle terms that are highly sparse across classes. The aim is to select, from the original feature space, the feature items with stronger class-distinguishing ability, and to reduce the dimension of the full feature set according to one or more evaluation criteria, generating a feature subset of lower dimension.
To validate the ability of the attraction-factor-based feature selection method, it was compared with the well-known chi-square test (CHI), the Gini coefficient (GINI) method, the normalized difference measure (NDM) and the odds ratio (OR). As can be seen from FIGS. 2, 3, 6 and 7, in the experiments with the naive Bayes classifier the method of the invention attains higher F1 values and is more stable and best on average compared with the existing methods. As can be seen from FIGS. 4, 5, 8 and 9, in the experiments with the support vector machine classifier the invention shows good results on most data sets. The experiments prove that the invention is an effective feature selection algorithm.
The pseudo code of the algorithm of the present invention appears in the original filing only as an image and is not reproduced here.
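In its place, a sketch reconstructed from the formulas of step 2 is given below; variable names and the rule for combining per-class scores (taking each entry's maximum over classes) are assumptions, not the patent's own pseudo code:

```python
# Reconstruction of the MTFS feature selection from equations (1)-(5); the
# per-class score combination by maximum is an assumption.
import numpy as np

def mtfs_scores(X, y, eps=1e-12):
    """Score every entry; X is a dense (texts x entries) count matrix, y the class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    present = X > 0
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = y == c
        T = X[in_c].sum(axis=0) / in_c.sum()       # attraction factor, eq. (1)
        tp = present[in_c].sum(axis=0)             # texts of class c containing the entry
        fn = in_c.sum() - tp
        fp = present[~in_c].sum(axis=0)            # texts outside class c containing it
        tn = (~in_c).sum() - fp
        tpr = tp / (tp + fn + eps)                 # eq. (2)
        fpr = fp / (fp + tn + eps)                 # eq. (3)
        MT = np.maximum(tpr, fpr)                  # maximum term positive rate
        NDM = np.abs(tpr - fpr) / (np.minimum(tpr, fpr) + eps)  # eq. (4)
        scores = np.maximum(scores, MT * T * NDM)  # eq. (5), best score over classes
    return scores

def select_features(X, y, n_entries):
    """Indices of the n_entries top-weighted entries, i.e. the optimal feature subset."""
    return np.argsort(mtfs_scores(X, y))[::-1][:n_entries]
```

With the sparse matrix from the preprocessing sketch above, select_features(X_train.toarray(), y_train, 500) would, under these assumptions, return the indices of the 500 top-weighted entries.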
The text classification method based on feature selection with attraction factors disclosed by the invention has the following beneficial effects: the invention comprehensively considers the contribution to classification of document frequency and of the distribution of terms within and between classes; compared with the traditional CHI, GINI, NDM and OR algorithms it therefore has a clear advantage in classification accuracy on the data sets 20Newsgroups, WebKB, K1a and K1b, and experiments prove that the attraction-factor-based feature selection method improves classification accuracy when applied to text classification and is an effective feature selection algorithm.

Claims (5)

1. A text classification method based on feature selection of attraction factors is characterized by comprising the following steps:
step 1: preprocessing a data set for the naive Bayes classifier NB and the support vector machine classifier SVM: acquiring a plurality of data sets that have undergone stemming and stop-word removal, eliminating entries that occur in more than 25% of the documents in the data sets as well as entries that occur in fewer than 3 documents, and dividing the data into a test set and a training set by cross-validation;
step 2: setting the number of feature words for the test set and training set obtained in step 1 using the attraction-factor-based feature selection method, to generate an optimal feature subset;
step 3: training and classifying with the naive Bayes classifier NB and the support vector machine classifier SVM in turn on the optimal feature subset of the training set obtained in step 2 to train classifier models, and inputting the optimal feature subset of the test set obtained in step 2 into the classifier models to obtain a classification result;
step 4: evaluating the classification result obtained in step 3 with the two evaluation indexes micro-average F1 and macro-average F1, to verify the performance of the attraction-factor-based feature selection method.
2. The method of claim 1, wherein the data sets in step 1 are the four data sets 20Newsgroups, WebKB, K1a and K1b.
3. The method for classifying texts based on feature selection of attraction factors according to claim 1, wherein the step 2 comprises the following specific steps:
step 2.1: calculating an attraction factor T(t_i), where the attraction factor represents the average frequency of occurrence of the term in each text of the class;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate and the false positive rate to balance the true relevance of the terms;
step 2.3: calculating the normalized difference measure factor NDM from the true positive rate tpr and the false positive rate fpr obtained in step 2.2;
step 2.4: calculating the weight value MTFS(t_i) of each entry according to the following formula, sorting the entries, and selecting the optimal feature subset according to the number of entries,
MTFS(t_i) = MT · T(t_i) · NDM
wherein MT is the maximum term positive rate in said step 2.2, T(t_i) is the attraction factor in said step 2.1, and NDM is the normalized difference measure factor in said step 2.3.
4. The method for classifying texts based on feature selection of attraction factors according to claim 1, wherein the specific steps of the step 2 are as follows:
step 2.1: calculating the attraction factor T(t_i), which represents the average frequency of occurrence of the term in each text of the class:
$$T(t_i) = \frac{1}{N}\sum_{j=1}^{N} tf_{ij} \qquad (1)$$
where tf_{ij} is the number of occurrences of the term in text d_j of class C_i, and N is the total number of texts in class C_i;
step 2.2: calculating the maximum term positive rate MT, i.e. taking the maximum of the true positive rate tpr and the false positive rate fpr to balance the true relevance of the terms;
the true positive rate tpr and the false positive rate fpr are calculated as follows:
$$tpr = \frac{tp}{tp + fn} \qquad (2)$$
$$fpr = \frac{fp}{fp + tn} \qquad (3)$$
$$MT = \max(tpr, fpr)$$
where tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k; tn denotes the number of texts that neither contain entry t_i nor belong to class C_k;
step 2.3: calculating the normalized difference measure factor from the true positive rate tpr and the false positive rate fpr calculated by formulas (2) and (3) in step 2.2, according to the following formula,
$$NDM = \frac{|tpr - fpr|}{\min(tpr, fpr)} \qquad (4)$$
step 2.4: the weight value MTFS(t_i) of each entry is calculated according to the following formula, the entries are sorted, and the optimal feature subset is selected according to the number of entries,
$$MTFS(t_i) = MT \cdot T(t_i) \cdot NDM \qquad (5)$$
wherein MT is the maximum term positive rate obtained in said step 2.2, T(t_i) is the attraction factor obtained in said step 2.1, and NDM is the normalized difference measure factor obtained in said step 2.3.
5. The method for classifying texts based on feature selection of attraction factors according to claim 1, wherein the micro-average F1 in step 4 is calculated as follows:
$$\text{micro-}F1 = \frac{2\,\bar{p}\,\bar{r}}{\bar{p} + \bar{r}} \qquad (6)$$
where the average precision $\bar{p}$ and the average recall $\bar{r}$ are the precision
$$p = \frac{tp}{tp + fp}$$
and the recall
$$r = \frac{tp}{tp + fn}$$
averaged over the test classes; tp denotes the number of texts that contain entry t_i and belong to class C_k; fn denotes the number of texts that do not contain entry t_i but belong to class C_k; fp denotes the number of texts that contain entry t_i but do not belong to class C_k;
the macro-average F1 is calculated as follows:
$$\text{macro-}F1 = \frac{1}{K}\sum_{k=1}^{K} F1(k) \qquad (7)$$
where F1(k) denotes the micro-average F1 value of the k-th test class and K denotes the total number of test classes.
CN202010158078.1A 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors Active CN111382273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010158078.1A CN111382273B (en) 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010158078.1A CN111382273B (en) 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors

Publications (2)

Publication Number Publication Date
CN111382273A (en) 2020-07-07
CN111382273B (en) 2023-04-14

Family

ID=71217271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010158078.1A Active CN111382273B (en) 2020-03-09 2020-03-09 Text classification method based on feature selection of attraction factors

Country Status (1)

Country Link
CN (1) CN111382273B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657106A (en) * 2021-07-05 2021-11-16 西安理工大学 Feature selection method based on normalized word frequency weight


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810264A (en) * 2014-01-27 2014-05-21 西安理工大学 Webpage text classification method based on feature selection
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification
CN107273387A (en) * 2016-04-08 2017-10-20 上海市玻森数据科技有限公司 Towards higher-dimension and unbalanced data classify it is integrated
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
WO2018218706A1 (en) * 2017-05-27 2018-12-06 中国矿业大学 Method and system for extracting news event based on neural network
CN109376235A (en) * 2018-07-24 2019-02-22 西安理工大学 The feature selection approach to be reordered based on document level word frequency

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yao Lifang et al., "Feature selection algorithm for hierarchical text classification using Kullback-Leibler divergence", IEEE International Conference on Cloud Computing and Big Data Analysis *
如先姑力·阿布都热西提 et al., "Text filtering method for Uyghur-language forums based on term selection and Rocchio classifier", Wanfang Data Knowledge Service Platform *


Also Published As

Publication number Publication date
CN111382273B (en) 2023-04-14

Similar Documents

Publication Publication Date Title
Georgakopoulos et al. Convolutional neural networks for toxic comment classification
CN109101477B (en) Enterprise field classification and enterprise keyword screening method
Huang et al. An improved knn based on class contribution and feature weighting
CN111709439B (en) Feature selection method based on word frequency deviation rate factor
CN109376235B (en) Feature selection method based on document layer word frequency reordering
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
CN104881399B (en) Event recognition method and system based on probability soft logic PSL
CN112800249A (en) Fine-grained cross-media retrieval method based on generation of countermeasure network
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
CN110781333A (en) Method for processing unstructured monitoring data of cable-stayed bridge based on machine learning
CN111460161A (en) Unsupervised text theme related gene extraction method for unbalanced big data set
CN113836896A (en) Patent text abstract generation method and device based on deep learning
Zhang et al. Compact representation of high-dimensional feature vectors for large-scale image recognition and retrieval
Pristyanto et al. The effect of feature selection on classification algorithms in credit approval
CN113626604B (en) Web page text classification system based on maximum interval criterion
CN112579783B (en) Short text clustering method based on Laplace atlas
CN111382273B (en) Text classification method based on feature selection of attraction factors
Mandal et al. Unsupervised non-redundant feature selection: a graph-theoretic approach
CN110348497B (en) Text representation method constructed based on WT-GloVe word vector
CN105760471B (en) Based on the two class text classification methods for combining convex linear perceptron
Fursov et al. Sequence embeddings help to identify fraudulent cases in healthcare insurance
Wang et al. Learning based neural similarity metrics for multimedia data mining
CN115186138A (en) Comparison method and terminal for power distribution network data
CN114610884A (en) Classification method based on PCA combined feature extraction and approximate support vector machine
CN113657106A (en) Feature selection method based on normalized word frequency weight

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230313

Address after: Room 501, No. 18, Haizhou Road, Haizhu District, Guangzhou City, Guangdong Province, 510000 (Location: Self made 01) (Office only)

Applicant after: Guangzhou Zhiying Wanshi Market Management Co.,Ltd.

Address before: 710000 No. B49, Xinda Zhongchuang space, 26th Street, block C, No. 2 Trading Plaza, South China City, international port district, Xi'an, Shaanxi Province

Applicant before: Xi'an Huaqi Zhongxin Technology Development Co.,Ltd.

Effective date of registration: 20230313

Address after: 710000 No. B49, Xinda Zhongchuang space, 26th Street, block C, No. 2 Trading Plaza, South China City, international port district, Xi'an, Shaanxi Province

Applicant after: Xi'an Huaqi Zhongxin Technology Development Co.,Ltd.

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 5

Applicant before: XI'AN University OF TECHNOLOGY

GR01 Patent grant
GR01 Patent grant