CN109885682B - Self-defined feature dimension text feature selection algorithm based on FCBF

Self-defined feature dimension text feature selection algorithm based on FCBF

Info

Publication number
CN109885682B
CN109885682B (application CN201910071963.3A)
Authority
CN
China
Prior art keywords
feature
word
text
dimension
algorithm
Prior art date
Legal status
Active
Application number
CN201910071963.3A
Other languages
Chinese (zh)
Other versions
CN109885682A (en)
Inventor
于舒娟
张昀
徐前川
何伟
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910071963.3A priority Critical patent/CN109885682B/en
Publication of CN109885682A publication Critical patent/CN109885682A/en
Application granted granted Critical
Publication of CN109885682B publication Critical patent/CN109885682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a self-defined feature dimension text feature selection algorithm based on FCBF, comprising the following steps: step one, initialization; step two, further screening the feature words in the feature word set with the FCBF algorithm to obtain an initial feature word set; step three, if the dimension of the initial feature word set is smaller than the set dimension, selecting the features whose feature-word-to-category correlation values rank highest to complement the initial feature word set until its dimension equals the set feature dimension; if the dimension of the initial feature word set is already greater than or equal to the set feature dimension, the feature words of the user-defined feature dimension can be obtained directly from the initial feature word set. The invention improves the correlation calculation formula of the original FCBF algorithm so that text features can be selected more accurately, and the improved algorithm can deliver a self-defined feature dimensionality.

Description

Self-defined feature dimension text feature selection algorithm based on FCBF
Technical Field
The invention relates to the technical field of natural language processing, in particular to a self-defined feature dimension text feature selection algorithm based on FCBF.
Background
With the continued development of the internet, the volume and diversity of text information keep increasing, so the text classification task has drawn more and more attention from the research community. As the number of texts grows, the number of features in the texts also grows, even reaching tens of thousands; not all features are helpful for text classification, and some redundant features may greatly reduce classification accuracy, so feature selection is particularly important in text classification.
In practice, text data that a computer cannot process directly is first converted into structured data that it can process, and the text is generally represented with the VSM (vector space model) and the word frequency method, as described in the literature: [Salton G, Wong A, Yang C S. A vector space model for automatic indexing [J]. Communications of the ACM, 1975, 18(11): 613-620].
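For illustration only (this sketch is not part of the patent), the word-frequency VSM representation described above can be produced as follows; the use of scikit-learn's CountVectorizer and the toy documents are assumptions of this example.

# Illustrative sketch (not from the patent): building a word-frequency VSM representation.
# Assumes scikit-learn is available; the toy documents are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",        # hypothetical training texts
    "dogs and cats are animals",
    "stock markets fell sharply",
]

vectorizer = CountVectorizer()        # word-frequency (raw term count) weighting
X = vectorizer.fit_transform(docs)    # vectorized text matrix X: one row per document, one column per feature word
feature_words = vectorizer.get_feature_names_out()  # the initial feature word set T

print(X.toarray())
print(feature_words)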
There are two main feature selection approaches in machine learning: the filter method and the wrapper method. The filter method selects a subset of features as a preprocessing step that works independently of the classification algorithm. In contrast, the wrapper method uses the accuracy of the classifier as the basis for feature selection. The wrapper method tends to work better because it can select a feature subset tailored to a predefined algorithm, but it has higher complexity and needs more time to select features, which is clearly undesirable for the text classification task. The focus is therefore placed on the filter method. Researchers have proposed many feature filtering methods for text classification, among which the document frequency method (DF) and the information gain method (IG) are of interest. However, document frequency feature selection does not achieve good results, and although information gain can perform feature selection well, Shang et al. found that it has a shortcoming: IG (the text feature selection method based on information gain) screens features only according to a specific IG value and does not consider redundancy between features: [Shang C, Li M, Feng S, et al. Feature selection via maximizing global information gain for text classification [J]. Knowledge-Based Systems, 2013, 54: 298-309]. In order to eliminate redundancy among features effectively, Peng et al. proposed MRMR (the maximum relevance-minimum redundancy feature selection method), which is difficult to apply to text classification because of its high time complexity: [Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(8): 1226-1238]. Lee et al. proposed an improved information gain feature selection algorithm: [Lee C, Lee G G. Information gain and divergence-based feature selection [M]. Pergamon Press, Inc., 2006, 42(1): 155-165]. Uysal et al. proposed a feature selection method based on feature probability, the distinctive feature selection algorithm (DFS): [Uysal A K, Gunal S. A novel probabilistic feature selection method for text classification [J]. Knowledge-Based Systems, 2012, 36: 226-235]. Although these algorithms can effectively remove redundancy, they have high complexity and cannot perform feature selection quickly. In order to extract features more quickly, the invention focuses on the Fast Correlation-Based Filter (FCBF). Aiming at the characteristics of text features, the correlation calculation formula of the original FCBF algorithm is improved, and an FCBF-based user-defined feature dimension feature selection algorithm, IFSC-FCBF, is proposed.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a self-defined feature dimension text feature selection algorithm based on FCBF, which solves the problem that, when VSM and the word frequency method are used to represent texts, a large number of training texts makes the number of features increase sharply, which affects classification efficiency and even reduces classification accuracy.
In order to achieve the above purpose, the invention adopts the following technical scheme: a self-defined feature dimension text feature selection algorithm based on FCBF is characterized in that: the method comprises the following steps:
Step one: let the vectorized text matrix be X and the text category matrix be C = {C_1, C_2, ..., C_V}, where C_j is the category of training text D_j, j = 1,2,...,V, V is the total number of text categories, and t_m is the m-th feature word; initialize the set of all feature words T = {t_1, t_2, ..., t_m} from the text matrix X, and assign initial values to S_list, the set of feature words whose correlation with the categories meets the requirement, and to S_best, the set of feature words selected by the algorithm;
Step two: use the FCBF algorithm to further screen the feature words in the feature word set S_list to obtain an initial feature word set S_best;
Step three: if the dimension of the initial feature word set S_best is smaller than the set dimension, select the features whose feature-word-to-category correlation values rank highest to complement the initial feature word set S_best until its dimension equals the set feature dimension; if the dimension of the initial feature word set S_best is already greater than or equal to the set feature dimension, the feature words of the custom feature dimension can be obtained directly from the initial feature word set S_best.
The FCBF-based custom feature dimension text feature selection algorithm is characterized in that: in the first step, the method specifically comprises the following steps:
Assign S_list: for each t_k ∈ T, calculate the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C; when Corr(t_k, C) ≥ thresh, add t_k into S_list, where T is the set of all feature words, thresh is a threshold, k = 1,...,m, and m is the total number of feature words;
Sort S_list in descending order of the correlation value Corr(t_k, C) between feature word t_k and the text categories; let t_p = getFirst(S_list) and S_best = {t_p}, where the feature word t_p is the first element after sorting and t_p is assigned to S_best.
The FCBF-based custom feature dimension text feature selection algorithm is characterized in that: the second step comprises the following specific steps:
1) Use a feature word variable t_q to read the elements of S_list in sequence; if t_q is not null, calculate the correlation Corr(t_p, t_q) between t_q and t_p; if t_q is null, the loop ends;
2) Compare the correlation Corr(t_p, t_q) with Corr(t_q, C); if the former is greater than or equal to the latter, delete t_q from S_list, otherwise add it to S_best;
3) If the length of S_best is greater than or equal to the size of the custom feature dimension, the algorithm ends; otherwise the variable t_p reads the next element of S_list in sequence, and steps 1)-3) continue (see the sketch below).
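The following Python sketch illustrates the screening loop of step two under the interpretation given above; it is not taken from the patent, and the function names, the duplicate check and the exact loop control are assumptions of this example.

def ifsc_fcbf_screen(s_list, corr_with_class, corr_between, size):
    # Sketch of step two (interpretation, not the patent's code).
    # s_list:          feature words sorted by decreasing Corr(t, C)
    # corr_with_class: dict mapping feature word t -> Corr(t, C)
    # corr_between:    function (t_p, t_q) -> Corr(t_p, t_q)
    # size:            user-defined feature dimension
    if not s_list:
        return []
    s_list = list(s_list)
    s_best = [s_list[0]]                   # S_best starts from the top-ranked word
    p = 0
    while p < len(s_list) and len(s_best) < size:
        t_p = s_list[p]
        q = p + 1
        while q < len(s_list):
            t_q = s_list[q]
            if corr_between(t_p, t_q) >= corr_with_class[t_q]:
                s_list.pop(q)              # t_q is redundant with respect to t_p: remove it
            else:
                if t_q not in s_best:
                    s_best.append(t_q)     # t_q survives: keep it as a selected feature
                q += 1
            if len(s_best) >= size:        # custom dimension reached: stop early
                return s_best
        p += 1                             # move the pivot t_p to the next surviving word
    return s_best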
The FCBF-based user-defined feature dimension text feature selection algorithm is characterized in that: the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C is calculated as follows: let D_i be any one training document:
[Equation images not reproduced: the definitions of the text category information entropy H(C), the category probability P(C_j), the feature word information entropy H(t_k) and the feature word probability P(t_k).]
where H(C) denotes the information entropy of the text categories, P(C_j) denotes the probability of text category C_j, δ() denotes a binary function, and L is a smoothing factor added to prevent the probability P(C_j) from being 0; N denotes the total number of training documents, i = 1,2,...,N, and C_i denotes the category of training document D_i; H(t_k) denotes the information entropy of feature word t_k, P(t_k) denotes the probability that feature word t_k occurs, and tf(t_k,i) denotes the frequency with which feature word t_k occurs in training document D_i;
For a single feature, the conditional information entropy when the distribution of class C_j is known is:
[Equation images not reproduced: the definitions of the conditional information entropy H(t_k|C) and the conditional probability P(t_k|C_j).]
where H(t_k|C) denotes the information entropy of feature word t_k when the text category distribution is known, P(t_k|C_j) denotes the probability that feature word t_k occurs when the distribution of class C_j is known, and tf(t_k|C_j) denotes the frequency of feature word t_k in class C_j;
The change in the information entropy of feature word t_k when the distribution of class C is known is the feature information gain, calculated as:
IG(t_k|C) = H(t_k) - H(t_k|C)
where IG(t_k|C) denotes the change in the information entropy of feature word t_k when the class distribution is known, i.e. the feature information gain;
Thus, the correlation between text feature word t_k and text category C is:
[Equation image not reproduced: the definition of Corr(t_k, C).]
the FCBF-based custom feature dimension text feature selection algorithm is characterized in that: the characteristic word t p And the feature word t q The correlation of (d) is calculated as:
for t p In other words, at known t q Conditional information entropy in the case of distribution in each class H (t) p |t q ) Comprises the following steps:
[Equation images not reproduced: the definitions of H(t_p|t_q) and P(t_p, t_q).]
where P(t_p, t_q) denotes the probability that feature word t_p also occurs when feature word t_q occurs, df(t_p, t_q|C_j) denotes the number of documents in category C_j in which feature words t_q and t_p occur together, and df(t_q|C_j) denotes the number of documents in category C_j in which feature word t_q occurs;
The change in the information entropy of feature word t_p over the category matrix C when t_q is present, compared with when it is absent, is:
IG(t_p|t_q) = H(t_p|C) - H(t_p|t_q)
where H(t_p|C) denotes the information entropy of feature word t_p when the text category C is known;
The correlation between feature words t_p and t_q can be calculated as:
[Equation image not reproduced: the definition of Corr(t_p, t_q).]
where H(t_p) denotes the information entropy of feature word t_p and H(t_q) denotes the information entropy of feature word t_q.
The FCBF-based user-defined feature dimension text feature selection algorithm is characterized in that: the binary function has the following formula:
δ(x, y) = 1 when x = y, and δ(x, y) = 0 otherwise
and x and y are binary function variables.
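Because the correlation formulas above appear only as images in the source, the sketch below does not reproduce the patent's improved formula; it merely illustrates the entropy, conditional entropy and information gain quantities named in the text, combined in the standard FCBF symmetric-uncertainty form Corr(a, b) = 2·IG(a|b)/(H(a)+H(b)), which is an assumption of this example, as are the toy data.

import math
from collections import Counter

def entropy(values):
    # Shannon entropy H of a discrete variable, estimated from its observed values.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, cond):
    # H(X | COND): entropy of x within each group defined by cond, weighted by group size.
    n = len(x)
    groups = {}
    for xi, ci in zip(x, cond):
        groups.setdefault(ci, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def correlation(x, y):
    # Assumed symmetric-uncertainty correlation: 2 * IG(X|Y) / (H(X) + H(Y)),
    # with IG(X|Y) = H(X) - H(X|Y); not the patent's exact (improved) formula.
    ig = entropy(x) - conditional_entropy(x, y)
    denom = entropy(x) + entropy(y)
    return 2.0 * ig / denom if denom > 0 else 0.0

# Toy usage: per-document presence (0/1) of one feature word versus the document categories.
word_presence = [1, 0, 1, 1, 0, 0]
categories = ["sport", "sport", "sport", "economy", "economy", "economy"]
print(correlation(word_presence, categories))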
The invention achieves the following beneficial effects: aiming at the characteristics of text features, the correlation calculation formula of the original FCBF algorithm is improved, so that text features can be selected more accurately, and the improved algorithm can deliver a customized feature dimension; the feature selection algorithm is verified in combination with a naive Bayes classification algorithm and compared with other feature selection algorithms on English-corpus and Chinese-corpus data sets, and the results show that, at the same feature dimension, the algorithm of the invention achieves higher accuracy and lower running time and removes redundant features more effectively;
In terms of complexity, the invention only adds two judgments to the FCBF algorithm, so its complexity is the same as that of the FCBF algorithm.
Drawings
FIG. 1 is a comparison of algorithm performance on the 20newsgroup data set when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm;
FIG. 2 is a comparison of algorithm performance on the Reuters-21578 data set when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm;
FIG. 3 is a comparison of algorithm performance on the Fudan University corpus when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm;
FIG. 4 is a comparison of algorithm performance on the Sogou corpus when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
A self-defined feature dimension text feature selection algorithm based on FCBF includes the following steps:
Step one: the vectorized text matrix is X and the text category matrix is C; initialize the set of all feature words T = {t_1, t_2, ..., t_m} from the text matrix X, and set S_list = {} and S_best = {};
Assign S_list: S_list is initially an empty set; for each t_k ∈ T, calculate the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C; when Corr(t_k, C) ≥ thresh, add t_k into S_list, where T is the set of all feature words, S_list is the set of feature words whose correlation with the categories meets the requirement, thresh is a decimal (a number between 0 and 1), k = 1,...,m, and m is the total number of feature words;
Sort S_list in descending order of the correlation value Corr(t_k, C) between feature word t_k and the text categories; let t_p = getFirst(S_list) and S_best = {t_p}, where feature word t_p is the first element of S_list, t_p is assigned to S_best, and S_best is the set of feature words selected by the algorithm;
Step two: the feature words in the feature word set S_list are further screened with the FCBF algorithm; the method comprises the following steps:
1) Use a feature word variable t_q to read the elements of S_list in sequence; if t_q is not null, calculate the correlation Corr(t_p, t_q) between t_q and t_p; if t_q is null, the loop ends;
2) Compare the correlation Corr(t_p, t_q) with Corr(t_q, C); if the former is greater than or equal to the latter, delete t_q from S_list, otherwise add it to S_best;
3) If the length of S_best is greater than or equal to size (an integer, set by the user), the loop ends; otherwise the variable t_p reads the next element of S_list in sequence, and steps 1)-3) continue;
the method for calculating the relevance between the text characteristic words and the text categories comprises the following steps:
Let the text category matrix be C = {C_1, C_2, ..., C_V}, j = 1,2,...,V, where C_j is the category of text D_j and V is the total number of text categories; let D_i be any training text;
Entropy is a physical quantity that reflects the degree of uncertainty of a variable; in text classification it reflects how uniformly a variable is distributed over the corpus. For the text categories and the feature words, the entropy can be defined as:
[Equation images not reproduced: the definitions of the text category information entropy H(C), the category probability P(C_j), the feature word information entropy H(t_k) and the feature word probability P(t_k).]
where H(C) denotes the information entropy of the text categories, P(C_j) denotes the probability of text category C_j, and δ() denotes a binary function with the following formula:
δ(x, y) = 1 when x = y, and δ(x, y) = 0 otherwise
x and y are binary function variables;
L is a smoothing factor added to prevent the probability P(C_j) from being 0; in this embodiment, L = 0.001; N denotes the total number of training documents, i = 1,2,...,N, and C_i denotes the category of training document D_i; H(t_k) denotes the information entropy of feature word t_k, P(t_k) denotes the probability that feature word t_k occurs, and tf(t_k,i) denotes the frequency with which feature word t_k occurs in training document D_i;
For a single feature, the conditional information entropy when the distribution of class C_j is known is:
[Equation images not reproduced: the definitions of the conditional information entropy H(t_k|C) and the conditional probability P(t_k|C_j).]
where H(t_k|C) denotes the information entropy of feature word t_k when the text category distribution is known, P(t_k|C_j) denotes the probability that feature word t_k occurs when the distribution of class C_j is known, and tf(t_k|C_j) denotes the frequency of feature word t_k in class C_j;
The change in the information entropy of feature word t_k when the distribution of class C is known is the feature information gain, calculated as:
IG(t_k|C) = H(t_k) - H(t_k|C)
where IG(t_k|C) denotes the change in the information entropy of feature word t_k when the class distribution is known, i.e. the feature information gain;
From this, the correlation between text feature word t_k and text category C is:
[Equation image not reproduced: the definition of Corr(t_k, C).]
The correlation between feature word t_p and feature word t_q is calculated as follows:
For t_p, the conditional information entropy H(t_p|t_q) when the distribution of t_q in each class is known is:
[Equation images not reproduced: the definitions of H(t_p|t_q) and P(t_p, t_q).]
The change in the information entropy of feature word t_p over the category matrix C when t_q is present, compared with when it is absent, is:
IG(t_p|t_q) = H(t_p|C) - H(t_p|t_q)
where H(t_p|C) denotes the information entropy of feature word t_p when the text category C is known;
The correlation between feature words t_p and t_q can be calculated as:
[Equation image not reproduced: the definition of Corr(t_p, t_q).]
where H(t_p) denotes the information entropy of feature word t_p and H(t_q) denotes the information entropy of feature word t_q.
Step three: to ensure that the IFSC-FCBF feature selection algorithm of the invention can obtain features of the user-defined dimension, the flow of the algorithm is also improved, and two conditions need to be judged. When all feature words have been screened and the dimension of the finally output feature list S_best is smaller than the set dimension size, the correlation Corr(t_k, C) between feature word and category is treated as the dominant criterion, that is, more weight is given to the Corr(t_k, C) value: features with large Corr(t_k, C) values are selected to complement the feature word list S_best until its dimension equals the set feature dimension size. When the dimension of the finally output feature word list is greater than or equal to the set feature dimension size, the feature words of the set dimension are obtained directly from the finally output feature word list S_best (see the sketch below);
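A minimal sketch of the step-three completion logic described above (an interpretation, not the patent's code; the argument names are assumptions of this example):

def complement_to_dimension(s_best, s_list_sorted, size):
    # s_best:        feature words surviving the step-two screening
    # s_list_sorted: all candidate feature words, sorted by decreasing Corr(t, C)
    # size:          the user-defined feature dimension
    selected = list(s_best)
    if len(selected) >= size:
        return selected[:size]        # already at or above the custom dimension
    for t in s_list_sorted:
        if t not in selected:
            selected.append(t)        # complement with the next most class-relevant word
        if len(selected) == size:
            break
    return selected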
the experimental indices used the commonly used precision (P), recall (R), F1 values and Macro F1 values (Macro _ F1) as references, and were calculated as follows:
Precision: P = TP / (TP + FP)
Macro precision (Macro_P): the average of the per-category precision values over the V categories
Recall: R = TP / (TP + FN)
Macro recall (Macro_R): the average of the per-category recall values over the V categories
F1 value: F1 = 2 · P · R / (P + R)
Macro F1 value (Macro_F1): computed from the per-category results over the V categories (equation image not reproduced)
where V denotes the number of categories; TP: predicted positive and actually positive; TN: predicted negative and actually negative; FP: predicted positive but actually negative; FN: predicted negative but actually positive.
The F1 value (F1 score) is a statistical measure of the accuracy of a binary classification model. It considers both the precision and the recall of the classification model and can be viewed as a weighted average of the two, with a maximum of 1 and a minimum of 0. The macro F1 value, averaged over categories, is the more informative overall measure. Therefore, at the same algorithm complexity, a higher macro F1 value means higher precision and recall and a better algorithm.
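As an illustration of how the macro F1 value can be computed (a sketch assuming the common per-class averaging convention; the patent's exact formula appears only as an image above, and the toy labels are hypothetical):

def macro_f1(y_true, y_pred, classes):
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)   # average of the per-class F1 values

# Toy usage with hypothetical labels
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], classes=["a", "b"]))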
FIG. 1 compares the performance of the four algorithms on the 20newsgroup data set and FIG. 2 compares them on the Reuters-21578 data set. As can be seen from FIG. 1 and FIG. 2, on the two English data sets the macro F1 value increases slightly as the number of feature words increases, and it tends to stabilize when the number of feature words reaches 300. The common point of the two figures is that the IFSC-FCBF feature selection algorithm selects features more effectively and obtains the largest macro F1 value. On the one hand, the original FCBF algorithm does not give good results in text classification because it eliminates too many features, and its performance is the worst of the four algorithms. On the other hand, the DFS algorithm is superior to the information gain algorithm IG, especially on the 20newsgroup data set: when the number of features is 300, the F1 value of DFS is about 3% higher than that of IG. Although the results of the IFSC-FCBF algorithm are close to, and sometimes fall behind, DFS on the 20newsgroup data set, IFSC-FCBF outperforms DFS on the Reuters-21578 data set. In general, IFSC-FCBF performs better than the other algorithms on the English data sets, with a macro F1 value on average 1% higher than the DFS algorithm and 2% to 3% higher than the IG algorithm; it has higher precision and recall, and the algorithm is superior.
As can be seen from FIG. 3 and FIG. 4, the macro F1 value tends to level off as the number of features increases to 300, similar to the results on the English data sets. This means that once the feature dimension reaches 300, the number of features is no longer the factor limiting the performance of the algorithms. The FCBF algorithm does not perform well on either the English data sets or the Chinese corpora. The macro F1 value of the DFS algorithm is on average 1.4% higher than that of the IG algorithm on the Fudan corpus and 0.8% higher on the Sogou corpus, while the IFSC-FCBF algorithm is on average 1.3% and 1.5% higher than DFS respectively; it has higher precision and recall, and the algorithm is superior.
In order to see the classification effect of each feature selection algorithm on each category, statistics are made on the classification effect of each category when the feature dimension is 300:
TABLE 1. Comparison of the classification effect on each class of the 20newsgroup data set
[Table image not reproduced.]
TABLE 2. Comparison of the classification effect on each class of the Reuters-21578 data set
[Table image not reproduced.]
TABLE 3. Comparison of the classification effect on each class of the Fudan University corpus
[Table image not reproduced.]
TABLE 4. Comparison of the classification effect on each class of the Sogou Lab corpus
[Table image not reproduced.]
Table 1 compares the classification effect of each feature selection algorithm on each class of the 20newsgroup data set; IFSC-FCBF has the highest F1 value in most classes (shown in bold in the table), followed by the DFS algorithm. Because each feature selection algorithm pays attention to different aspects of the features, the selected features differ, so each algorithm has classes on which it performs well; this is one reason the IFSC-FCBF algorithm cannot guarantee that its F1 value is higher than the other algorithms' for every class. However, the average values of P, R and F1 are still best for the IFSC-FCBF algorithm.
As can be seen from Table 2, the FCBF algorithm selects feature words more favourable for the crude category, for the reasons described above, so its F1 value on the crude category is better than the other algorithms'; but its classification effect on the other four classes is not ideal. On the money-fx category, the precision, recall and F1 values of IFSC-FCBF are higher than those of DFS and about 1% higher than those of IG. In general, on the two English data sets, the proposed algorithm is also optimal for the classification of individual classes while achieving a high macro F1 value.
As can be seen from Table 3, on the Fudan University corpus it is interesting that the IFSC-FCBF and DFS feature selection algorithms perform almost the same, with the IG algorithm second. In Table 4, the IFSC-FCBF algorithm selects more effective features for the three categories health, education and tourism, while DFS is more effective for the sport category; IG and DFS have almost the same average value, but as the comparison of macro F1 values in FIG. 4 shows, the IG algorithm is slightly better than DFS. In general, the proposed algorithm also selects features more effectively on the Chinese data sets.
The invention provides an improved fast correlation-based filter algorithm with a self-defined feature dimension. Because most feature selection algorithms rarely consider redundancy between features, some noise features may be selected during feature selection and the classification accuracy may be reduced; moreover, the original FCBF algorithm easily removes too many features when the correlation between features is strong. To solve these problems, IFSC-FCBF is proposed. The experimental results show that the IFSC-FCBF algorithm selects more effective features while keeping the running time low, and brings an obvious improvement on text classification tasks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (5)

1. A self-defined feature dimension text feature selection algorithm based on FCBF is characterized in that: the method comprises the following steps:
Step one: let the vectorized text matrix be X and the text category matrix be C = {C_1, C_2, ..., C_V}, where C_j is the category of training text D_j, j = 1,2,...,V, V is the total number of text categories, and t_m is the m-th feature word; initialize the set of all feature words T = {t_1, t_2, ..., t_m} from the text matrix X, and assign initial values to S_list, the set of feature words whose correlation with the categories meets the requirement, and to S_best, the set of feature words selected by the algorithm;
Step two: use the FCBF algorithm to screen the feature words in the feature word set S_list to obtain an initial feature word set S_best;
Step three: if the dimension of the initial feature word set S_best is smaller than the set dimension, select the features whose feature-word-to-category correlation values rank highest to complement the initial feature word set S_best until its dimension equals the set feature dimension; if the dimension of the initial feature word set S_best is greater than or equal to the set feature dimension, obtain the feature words of the user-defined feature dimension from the initial feature word set S_best;
in the first step, the method specifically comprises the following steps:
Assign S_list: for each t_k ∈ T, calculate the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C; when Corr(t_k, C) ≥ thresh, add t_k into S_list, where T is the set of all feature words, thresh is a threshold, k = 1,...,m, and m is the total number of feature words;
Sort S_list in descending order of the correlation value Corr(t_k, C) between feature word t_k and the text categories; let t_p = getFirst(S_list) and S_best = {t_p}, where the feature word t_p is the first element after sorting and t_p is assigned to S_best.
2. The FCBF-based custom feature dimension textual feature selection algorithm as defined in claim 1, wherein: the second step comprises the following specific steps:
1) Use a feature word variable t_q to read the elements of S_list in sequence; if t_q is not null, calculate the correlation Corr(t_p, t_q) between t_q and t_p; if t_q is null, the loop ends;
2) Compare the correlation Corr(t_p, t_q) with Corr(t_q, C); if the former is greater than or equal to the latter, delete t_q from S_list, otherwise add it to S_best;
3) If the length of S_best is greater than or equal to the size of the custom feature dimension, the algorithm ends; otherwise the variable t_p reads the next element of S_list in sequence, and steps 1)-3) continue.
3. The FCBF-based custom feature dimension text feature selection algorithm of claim 1, wherein: the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C is calculated as follows: let D_i be any one training document:
[Equation images not reproduced: the definitions of the text category information entropy H(C), the category probability P(C_j), the feature word information entropy H(t_k) and the feature word probability P(t_k).]
where H(C) denotes the information entropy of the text categories, P(C_j) denotes the probability of text category C_j, δ() denotes a binary function, and L is a smoothing factor added to prevent the probability P(C_j) from being 0; N denotes the total number of training documents, i = 1,2,...,N, and C_i denotes the category of training document D_i; H(t_k) denotes the information entropy of feature word t_k, P(t_k) denotes the probability that feature word t_k occurs, and tf(t_k,i) denotes the frequency with which feature word t_k occurs in training document D_i;
For a single feature, the conditional information entropy when the distribution of class C_j is known is:
[Equation images not reproduced: the definitions of the conditional information entropy H(t_k|C) and the conditional probability P(t_k|C_j).]
where H(t_k|C) denotes the information entropy of feature word t_k when the text category distribution is known, P(t_k|C_j) denotes the probability that feature word t_k occurs when the distribution of class C_j is known, and tf(t_k|C_j) denotes the frequency of feature word t_k in class C_j;
The change in the information entropy of feature word t_k when the distribution of class C is known is the feature information gain, calculated as:
IG(t_k|C) = H(t_k) - H(t_k|C)
where IG(t_k|C) denotes the change in the information entropy of feature word t_k when the class distribution is known, i.e. the feature information gain;
Thus, the correlation between text feature word t_k and text category C is:
[Equation image not reproduced: the definition of Corr(t_k, C).]
4. The FCBF-based custom feature dimension text feature selection algorithm as defined in claim 2, wherein: the correlation between feature word t_p and feature word t_q is calculated as follows:
For t_p, the conditional information entropy H(t_p|t_q) when the distribution of t_q in each class is known is:
[Equation images not reproduced: the definitions of H(t_p|t_q) and P(t_p, t_q).]
where P(t_p, t_q) denotes the probability that feature word t_p also occurs when feature word t_q occurs, df(t_p, t_q|C_j) denotes the number of documents in category C_j in which feature words t_q and t_p occur together, and df(t_q|C_j) denotes the number of documents in category C_j in which feature word t_q occurs;
The change in the information entropy of feature word t_p over the category matrix C when t_q is present, compared with when it is absent, is:
IG(t_p|t_q) = H(t_p|C) - H(t_p|t_q)
where H(t_p|C) denotes the information entropy of feature word t_p when the text category C is known;
The correlation between feature words t_p and t_q is calculated as:
[Equation image not reproduced: the definition of Corr(t_p, t_q).]
where H(t_p) denotes the information entropy of feature word t_p and H(t_q) denotes the information entropy of feature word t_q.
5. The FCBF-based custom feature dimension text feature selection algorithm as defined in claim 3, wherein: the binary function has the following formula:
δ(x, y) = 1 when x = y, and δ(x, y) = 0 otherwise
and x and y are binary function variables.
CN201910071963.3A 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF Active CN109885682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910071963.3A CN109885682B (en) 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910071963.3A CN109885682B (en) 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF

Publications (2)

Publication Number Publication Date
CN109885682A CN109885682A (en) 2019-06-14
CN109885682B true CN109885682B (en) 2022-08-16

Family

ID=66926831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910071963.3A Active CN109885682B (en) 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF

Country Status (1)

Country Link
CN (1) CN109885682B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220346A (en) * 2017-05-27 2017-09-29 荣科科技股份有限公司 A kind of higher-dimension deficiency of data feature selection approach
CN108647259A (en) * 2018-04-26 2018-10-12 南京邮电大学 Based on the naive Bayesian file classification method for improving depth characteristic weighting

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220346A (en) * 2017-05-27 2017-09-29 荣科科技股份有限公司 A kind of higher-dimension deficiency of data feature selection approach
CN108647259A (en) * 2018-04-26 2018-10-12 南京邮电大学 Based on the naive Bayesian file classification method for improving depth characteristic weighting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于归一化互信息的FCBF特征选择算法 (FCBF feature selection algorithm based on normalized mutual information); 段宏湘 et al.; 《华中科技大学学报(自然科学版)》 (Journal of Huazhong University of Science and Technology, Natural Science Edition); 2017-01-31; Vol. 45, No. 1; pp. 52-56 *

Also Published As

Publication number Publication date
CN109885682A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
US6253169B1 (en) Method for improvement accuracy of decision tree based text categorization
KR100756921B1 (en) Method of classifying documents, computer readable record medium on which program for executing the method is recorded
US8024331B2 (en) Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US5819258A (en) Method and apparatus for automatically generating hierarchical categories from large document collections
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN108228541B (en) Method and device for generating document abstract
Yi et al. A hidden Markov model-based text classification of medical documents
CN101404015A (en) Automatically generating a hierarchy of terms
CN106557777B (en) One kind being based on the improved Kmeans document clustering method of SimHash
CN110909116B (en) Entity set expansion method and system for social media
CN115686432B (en) Document evaluation method for retrieval sorting, storage medium and terminal
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
JP4967705B2 (en) Cluster generation apparatus and cluster generation program
Yoshioka et al. The classification of the documents based on Word2Vec and 2-layer self organizing maps
CN109885682B (en) Self-defined feature dimension text feature selection algorithm based on FCBF
CN112115256A (en) Method and device for generating news text abstract integrated with Chinese stroke information
CN112463894B (en) Multi-label feature selection method based on conditional mutual information and interactive information
CN111899832B (en) Medical theme management system and method based on context semantic analysis
Triwijoyo et al. Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms
CN114443820A (en) Text aggregation method and text recommendation method
CN111881678A (en) Domain word discovery method based on unsupervised learning
Sheng et al. An information retrieval system based on automatic query expansion and hopfield network
CN113392124B (en) Structured language-based data query method and device
Karakos et al. Cross-instance tuning of unsupervised document clustering algorithms
CN111159393B (en) Text generation method for abstract extraction based on LDA and D2V

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant