CN109885682B - Self-defined feature dimension text feature selection algorithm based on FCBF

Self-defined feature dimension text feature selection algorithm based on FCBF

Info

Publication number
CN109885682B
CN109885682B (application CN201910071963.3A)
Authority
CN
China
Prior art keywords
feature
word
text
dimension
algorithm
Prior art date
Legal status
Active
Application number
CN201910071963.3A
Other languages
Chinese (zh)
Other versions
CN109885682A (en)
Inventor
于舒娟
张昀
徐前川
何伟
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201910071963.3A priority Critical patent/CN109885682B/en
Publication of CN109885682A publication Critical patent/CN109885682A/en
Application granted granted Critical
Publication of CN109885682B publication Critical patent/CN109885682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a self-defined feature dimension text feature selection algorithm based on FCBF, comprising the following steps: step one, initialization; step two, further screening the feature words in the feature word set with the FCBF algorithm to obtain an initial feature word set; step three, if the dimension of the initial feature word set is smaller than the set dimension, selecting the features whose feature-word-to-category correlation values rank highest to complement the initial feature word set until its dimension equals the set feature dimension; if the dimension of the initial feature word set is already greater than or equal to the set feature dimension, the feature words of the user-defined feature dimension can be obtained directly from the initial feature word set. The invention improves the correlation calculation formula of the original FCBF algorithm so that text features can be selected more accurately, and the improved algorithm can deliver a self-defined feature dimensionality.

Description

Self-defined feature dimension text feature selection algorithm based on FCBF
Technical Field
The invention relates to the technical field of natural language processing, in particular to a self-defined feature dimension text feature selection algorithm based on FCBF.
Background
With the continued development of the internet, the volume and diversity of text information keep increasing, so the text classification task has drawn more and more attention from the research community. As the number of texts grows, the number of features in the texts also grows, even reaching tens of thousands; not all features are helpful for text classification, and some redundant features may greatly reduce classification accuracy, so feature selection is particularly important in text classification.
In practice, text data that a computer cannot process directly is first converted into structured data that it can process, and the text is generally represented with the VSM (vector space model) and the word frequency method, as described in the literature: [Salton G, Wong A, Yang C S. A vector space model for automatic indexing [J]. Communications of the ACM, 1975, 18(11): 613-620].
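For illustration only (this sketch is not part of the patent), the word-frequency VSM representation described above can be produced as follows; the use of scikit-learn's CountVectorizer and the toy documents are assumptions of this example.

# Illustrative sketch (not from the patent): building a word-frequency VSM representation.
# Assumes scikit-learn is available; the toy documents are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",        # hypothetical training texts
    "dogs and cats are animals",
    "stock markets fell sharply",
]

vectorizer = CountVectorizer()        # word-frequency (raw term count) weighting
X = vectorizer.fit_transform(docs)    # vectorized text matrix X: one row per document, one column per feature word
feature_words = vectorizer.get_feature_names_out()  # the initial feature word set T

print(X.toarray())
print(feature_words)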
There are two main feature selection approaches in machine learning: the filter method and the wrapper method. The filter method selects a subset of features as a preprocessing step that works independently of the classification algorithm. In contrast, the wrapper method uses the accuracy of the classifier as the basis for feature selection. The wrapper method tends to work better because it can select a feature subset tailored to a predefined algorithm, but it has higher complexity and needs more time to select features, which is clearly undesirable for the text classification task. The focus is therefore placed on the filter method. Researchers have proposed many feature filtering methods for text classification, among which the document frequency method (DF) and the information gain method (IG) are of interest. However, document frequency feature selection does not achieve good results, and although information gain can perform feature selection well, Shang et al. found that it has a shortcoming: IG (the text feature selection method based on information gain) screens features only according to a specific IG value and does not consider redundancy between features: [Shang C, Li M, Feng S, et al. Feature selection via maximizing global information gain for text classification [J]. Knowledge-Based Systems, 2013, 54: 298-309]. In order to eliminate redundancy among features effectively, Peng et al. proposed MRMR (the maximum relevance-minimum redundancy feature selection method), which is difficult to apply to text classification because of its high time complexity: [Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(8): 1226-1238]. Lee et al. proposed an improved information gain feature selection algorithm: [Lee C, Lee G G. Information gain and divergence-based feature selection [M]. Pergamon Press, Inc., 2006, 42(1): 155-165]. Uysal et al. proposed a feature selection method based on feature probability, the distinctive feature selection algorithm (DFS): [Uysal A K, Gunal S. A novel probabilistic feature selection method for text classification [J]. Knowledge-Based Systems, 2012, 36: 226-235]. Although these algorithms can effectively remove redundancy, they have high complexity and cannot perform feature selection quickly. In order to extract features more quickly, the invention focuses on the Fast Correlation-Based Filter (FCBF). Aiming at the characteristics of text features, the correlation calculation formula of the original FCBF algorithm is improved, and an FCBF-based user-defined feature dimension feature selection algorithm, IFSC-FCBF, is proposed.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a self-defined feature dimension text feature selection algorithm based on FCBF, which solves the problem that, when VSM and the word frequency method are used to represent texts, a large number of training texts makes the number of features increase sharply, which affects classification efficiency and even reduces classification accuracy.
In order to achieve the above purpose, the invention adopts the following technical scheme: a self-defined feature dimension text feature selection algorithm based on FCBF is characterized in that: the method comprises the following steps:
Step one: let the vectorized text matrix be X and the text category matrix be C = {C_1, C_2, ..., C_V}, where C_j is the category of training text D_j, j = 1,2,...,V, V is the total number of text categories, and t_m is the m-th feature word; initialize the set of all feature words T = {t_1, t_2, ..., t_m} from the text matrix X, and assign initial values to S_list, the set of feature words whose correlation with the categories meets the requirement, and to S_best, the set of feature words selected by the algorithm;
Step two: use the FCBF algorithm to further screen the feature words in the feature word set S_list to obtain an initial feature word set S_best;
Step three: if the dimension of the initial feature word set S_best is smaller than the set dimension, select the features whose feature-word-to-category correlation values rank highest to complement the initial feature word set S_best until its dimension equals the set feature dimension; if the dimension of the initial feature word set S_best is already greater than or equal to the set feature dimension, the feature words of the custom feature dimension can be obtained directly from the initial feature word set S_best.
The FCBF-based custom feature dimension text feature selection algorithm is characterized in that: in the first step, the method specifically comprises the following steps:
Assign S_list: for each t_k ∈ T, calculate the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C; when Corr(t_k, C) ≥ thresh, add t_k into S_list, where T is the set of all feature words, thresh is a threshold, k = 1,...,m, and m is the total number of feature words;
Sort S_list in descending order of the correlation value Corr(t_k, C) between feature word t_k and the text categories; let t_p = getFirst(S_list) and S_best = {t_p}, where the feature word t_p is the first element after sorting and t_p is assigned to S_best.
The FCBF-based custom feature dimension text feature selection algorithm is characterized in that: the second step comprises the following specific steps:
1) Use a feature word variable t_q to read the elements of S_list in sequence; if t_q is not null, calculate the correlation Corr(t_p, t_q) between t_q and t_p; if t_q is null, the loop ends;
2) Compare the correlation Corr(t_p, t_q) with Corr(t_q, C); if the former is greater than or equal to the latter, delete t_q from S_list, otherwise add it to S_best;
3) If the length of S_best is greater than or equal to the size of the custom feature dimension, the algorithm ends; otherwise the variable t_p reads the next element of S_list in sequence, and steps 1)-3) continue (see the sketch below).
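The following Python sketch illustrates the screening loop of step two under the interpretation given above; it is not taken from the patent, and the function names, the duplicate check and the exact loop control are assumptions of this example.

def ifsc_fcbf_screen(s_list, corr_with_class, corr_between, size):
    # Sketch of step two (interpretation, not the patent's code).
    # s_list:          feature words sorted by decreasing Corr(t, C)
    # corr_with_class: dict mapping feature word t -> Corr(t, C)
    # corr_between:    function (t_p, t_q) -> Corr(t_p, t_q)
    # size:            user-defined feature dimension
    if not s_list:
        return []
    s_list = list(s_list)
    s_best = [s_list[0]]                   # S_best starts from the top-ranked word
    p = 0
    while p < len(s_list) and len(s_best) < size:
        t_p = s_list[p]
        q = p + 1
        while q < len(s_list):
            t_q = s_list[q]
            if corr_between(t_p, t_q) >= corr_with_class[t_q]:
                s_list.pop(q)              # t_q is redundant with respect to t_p: remove it
            else:
                if t_q not in s_best:
                    s_best.append(t_q)     # t_q survives: keep it as a selected feature
                q += 1
            if len(s_best) >= size:        # custom dimension reached: stop early
                return s_best
        p += 1                             # move the pivot t_p to the next surviving word
    return s_best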
The FCBF-based user-defined feature dimension text feature selection algorithm is characterized in that: the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C is calculated as follows: let D_i be any one training document:
[Equation images not reproduced: the definitions of the text category information entropy H(C), the category probability P(C_j), the feature word information entropy H(t_k) and the feature word probability P(t_k).]
where H(C) denotes the information entropy of the text categories, P(C_j) denotes the probability of text category C_j, δ() denotes a binary function, and L is a smoothing factor added to prevent the probability P(C_j) from being 0; N denotes the total number of training documents, i = 1,2,...,N, and C_i denotes the category of training document D_i; H(t_k) denotes the information entropy of feature word t_k, P(t_k) denotes the probability that feature word t_k occurs, and tf(t_k,i) denotes the frequency with which feature word t_k occurs in training document D_i;
For a single feature, the conditional information entropy when the distribution of class C_j is known is:
[Equation images not reproduced: the definitions of the conditional information entropy H(t_k|C) and the conditional probability P(t_k|C_j).]
where H(t_k|C) denotes the information entropy of feature word t_k when the text category distribution is known, P(t_k|C_j) denotes the probability that feature word t_k occurs when the distribution of class C_j is known, and tf(t_k|C_j) denotes the frequency of feature word t_k in class C_j;
The change in the information entropy of feature word t_k when the distribution of class C is known is the feature information gain, calculated as:
IG(t_k|C) = H(t_k) - H(t_k|C)
where IG(t_k|C) denotes the change in the information entropy of feature word t_k when the class distribution is known, i.e. the feature information gain;
Thus, the correlation between text feature word t_k and text category C is:
[Equation image not reproduced: the definition of Corr(t_k, C).]
the FCBF-based custom feature dimension text feature selection algorithm is characterized in that: the characteristic word t p And the feature word t q The correlation of (d) is calculated as:
for t p In other words, at known t q Conditional information entropy in the case of distribution in each class H (t) p |t q ) Comprises the following steps:
[Equation images not reproduced: the definitions of H(t_p|t_q) and P(t_p, t_q).]
where P(t_p, t_q) denotes the probability that feature word t_p also occurs when feature word t_q occurs, df(t_p, t_q|C_j) denotes the number of documents in category C_j in which feature words t_q and t_p occur together, and df(t_q|C_j) denotes the number of documents in category C_j in which feature word t_q occurs;
The change in the information entropy of feature word t_p over the category matrix C when t_q is present, compared with when it is absent, is:
IG(t_p|t_q) = H(t_p|C) - H(t_p|t_q)
where H(t_p|C) denotes the information entropy of feature word t_p when the text category C is known;
The correlation between feature words t_p and t_q can be calculated as:
[Equation image not reproduced: the definition of Corr(t_p, t_q).]
where H(t_p) denotes the information entropy of feature word t_p and H(t_q) denotes the information entropy of feature word t_q.
The FCBF-based user-defined feature dimension text feature selection algorithm is characterized in that: the binary function has the following formula:
δ(x, y) = 1 when x = y, and δ(x, y) = 0 otherwise
and x and y are binary function variables.
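Because the correlation formulas above appear only as images in the source, the sketch below does not reproduce the patent's improved formula; it merely illustrates the entropy, conditional entropy and information gain quantities named in the text, combined in the standard FCBF symmetric-uncertainty form Corr(a, b) = 2·IG(a|b)/(H(a)+H(b)), which is an assumption of this example, as are the toy data.

import math
from collections import Counter

def entropy(values):
    # Shannon entropy H of a discrete variable, estimated from its observed values.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, cond):
    # H(X | COND): entropy of x within each group defined by cond, weighted by group size.
    n = len(x)
    groups = {}
    for xi, ci in zip(x, cond):
        groups.setdefault(ci, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def correlation(x, y):
    # Assumed symmetric-uncertainty correlation: 2 * IG(X|Y) / (H(X) + H(Y)),
    # with IG(X|Y) = H(X) - H(X|Y); not the patent's exact (improved) formula.
    ig = entropy(x) - conditional_entropy(x, y)
    denom = entropy(x) + entropy(y)
    return 2.0 * ig / denom if denom > 0 else 0.0

# Toy usage: per-document presence (0/1) of one feature word versus the document categories.
word_presence = [1, 0, 1, 1, 0, 0]
categories = ["sport", "sport", "sport", "economy", "economy", "economy"]
print(correlation(word_presence, categories))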
The invention achieves the following beneficial effects: aiming at the characteristics of text features, the correlation calculation formula of the original FCBF algorithm is improved, so that text features can be selected more accurately, and the improved algorithm can deliver a customized feature dimension; the feature selection algorithm is verified in combination with a naive Bayes classification algorithm and compared with other feature selection algorithms on English-corpus and Chinese-corpus data sets, and the results show that, at the same feature dimension, the algorithm of the invention achieves higher accuracy and lower running time and removes redundant features more effectively;
In terms of complexity, the invention only adds two judgments to the FCBF algorithm, so its complexity is the same as that of the FCBF algorithm.
Drawings
FIG. 1 is a comparison of algorithm performance on the 20newsgroup data set when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm;
FIG. 2 is a comparison of algorithm performance on the Reuters-21578 data set when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm;
FIG. 3 is a comparison of algorithm performance on the Fudan University corpus when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm;
FIG. 4 is a comparison of algorithm performance on the Sogou corpus when features are extracted with the FCBF, IG, DFS and IFSC-FCBF algorithms respectively and then combined with the naive Bayes classification algorithm.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
A self-defined feature dimension text feature selection algorithm based on FCBF includes the following steps:
Step one: the vectorized text matrix is X and the text category matrix is C; initialize the set of all feature words T = {t_1, t_2, ..., t_m} from the text matrix X, and set S_list = {} and S_best = {};
Assign S_list: S_list is initially an empty set; for each t_k ∈ T, calculate the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C; when Corr(t_k, C) ≥ thresh, add t_k into S_list, where T is the set of all feature words, S_list is the set of feature words whose correlation with the categories meets the requirement, thresh is a decimal (a number between 0 and 1), k = 1,...,m, and m is the total number of feature words;
Sort S_list in descending order of the correlation value Corr(t_k, C) between feature word t_k and the text categories; let t_p = getFirst(S_list) and S_best = {t_p}, where feature word t_p is the first element of S_list, t_p is assigned to S_best, and S_best is the set of feature words selected by the algorithm;
Step two: the feature words in the feature word set S_list are further screened with the FCBF algorithm; the method comprises the following steps:
1) Use a feature word variable t_q to read the elements of S_list in sequence; if t_q is not null, calculate the correlation Corr(t_p, t_q) between t_q and t_p; if t_q is null, the loop ends;
2) Compare the correlation Corr(t_p, t_q) with Corr(t_q, C); if the former is greater than or equal to the latter, delete t_q from S_list, otherwise add it to S_best;
3) If the length of S_best is greater than or equal to size (an integer, set by the user), the loop ends; otherwise the variable t_p reads the next element of S_list in sequence, and steps 1)-3) continue;
the method for calculating the relevance between the text characteristic words and the text categories comprises the following steps:
Let the text category matrix be C = {C_1, C_2, ..., C_V}, j = 1,2,...,V, where C_j is the category of text D_j and V is the total number of text categories; let D_i be any training text;
Entropy is a physical quantity that reflects the degree of uncertainty of a variable; in text classification it reflects how uniformly a variable is distributed over the corpus. For the text categories and the feature words, the entropy can be defined as:
[Equation images not reproduced: the definitions of the text category information entropy H(C), the category probability P(C_j), the feature word information entropy H(t_k) and the feature word probability P(t_k).]
where H(C) denotes the information entropy of the text categories, P(C_j) denotes the probability of text category C_j, and δ() denotes a binary function with the following formula:
δ(x, y) = 1 when x = y, and δ(x, y) = 0 otherwise
x and y are binary function variables;
L is a smoothing factor added to prevent the probability P(C_j) from being 0; in this embodiment, L = 0.001; N denotes the total number of training documents, i = 1,2,...,N, and C_i denotes the category of training document D_i; H(t_k) denotes the information entropy of feature word t_k, P(t_k) denotes the probability that feature word t_k occurs, and tf(t_k,i) denotes the frequency with which feature word t_k occurs in training document D_i;
For a single feature, the conditional information entropy when the distribution of class C_j is known is:
[Equation images not reproduced: the definitions of the conditional information entropy H(t_k|C) and the conditional probability P(t_k|C_j).]
where H(t_k|C) denotes the information entropy of feature word t_k when the text category distribution is known, P(t_k|C_j) denotes the probability that feature word t_k occurs when the distribution of class C_j is known, and tf(t_k|C_j) denotes the frequency of feature word t_k in class C_j;
The change in the information entropy of feature word t_k when the distribution of class C is known is the feature information gain, calculated as:
IG(t_k|C) = H(t_k) - H(t_k|C)
where IG(t_k|C) denotes the change in the information entropy of feature word t_k when the class distribution is known, i.e. the feature information gain;
From this, the correlation between text feature word t_k and text category C is:
[Equation image not reproduced: the definition of Corr(t_k, C).]
The correlation between feature word t_p and feature word t_q is calculated as follows:
For t_p, the conditional information entropy H(t_p|t_q) when the distribution of t_q in each class is known is:
[Equation images not reproduced: the definitions of H(t_p|t_q) and P(t_p, t_q).]
The change in the information entropy of feature word t_p over the category matrix C when t_q is present, compared with when it is absent, is:
IG(t_p|t_q) = H(t_p|C) - H(t_p|t_q)
where H(t_p|C) denotes the information entropy of feature word t_p when the text category C is known;
The correlation between feature words t_p and t_q can be calculated as:
[Equation image not reproduced: the definition of Corr(t_p, t_q).]
where H(t_p) denotes the information entropy of feature word t_p and H(t_q) denotes the information entropy of feature word t_q.
Step three: to ensure that the IFSC-FCBF feature selection algorithm of the invention can obtain features of the user-defined dimension, the flow of the algorithm is also improved, and two conditions need to be judged. When all feature words have been screened and the dimension of the finally output feature list S_best is smaller than the set dimension size, the correlation Corr(t_k, C) between feature word and category is treated as the dominant criterion, that is, more weight is given to the Corr(t_k, C) value: features with large Corr(t_k, C) values are selected to complement the feature word list S_best until its dimension equals the set feature dimension size. When the dimension of the finally output feature word list is greater than or equal to the set feature dimension size, the feature words of the set dimension are obtained directly from the finally output feature word list S_best (see the sketch below);
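A minimal sketch of the step-three completion logic described above (an interpretation, not the patent's code; the argument names are assumptions of this example):

def complement_to_dimension(s_best, s_list_sorted, size):
    # s_best:        feature words surviving the step-two screening
    # s_list_sorted: all candidate feature words, sorted by decreasing Corr(t, C)
    # size:          the user-defined feature dimension
    selected = list(s_best)
    if len(selected) >= size:
        return selected[:size]        # already at or above the custom dimension
    for t in s_list_sorted:
        if t not in selected:
            selected.append(t)        # complement with the next most class-relevant word
        if len(selected) == size:
            break
    return selected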
the experimental indices used the commonly used precision (P), recall (R), F1 values and Macro F1 values (Macro _ F1) as references, and were calculated as follows:
Precision: P = TP / (TP + FP)
Macro precision (Macro_P): the average of the per-category precision values over the V categories
Recall: R = TP / (TP + FN)
Macro recall (Macro_R): the average of the per-category recall values over the V categories
F1 value: F1 = 2 · P · R / (P + R)
Macro F1 value (Macro_F1): computed from the per-category results over the V categories (equation image not reproduced)
where V denotes the number of categories; TP: predicted positive and actually positive; TN: predicted negative and actually negative; FP: predicted positive but actually negative; FN: predicted negative but actually positive.
The F1 value (F1 score) is a statistical measure of the accuracy of a binary classification model. It considers both the precision and the recall of the classification model and can be viewed as a weighted average of the two, with a maximum of 1 and a minimum of 0. The macro F1 value, averaged over categories, is the more informative overall measure. Therefore, at the same algorithm complexity, a higher macro F1 value means higher precision and recall and a better algorithm.
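As an illustration of how the macro F1 value can be computed (a sketch assuming the common per-class averaging convention; the patent's exact formula appears only as an image above, and the toy labels are hypothetical):

def macro_f1(y_true, y_pred, classes):
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)   # average of the per-class F1 values

# Toy usage with hypothetical labels
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], classes=["a", "b"]))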
FIG. 1 compares the performance of the four algorithms on the 20newsgroup data set and FIG. 2 compares them on the Reuters-21578 data set. As can be seen from FIG. 1 and FIG. 2, on the two English data sets the macro F1 value increases slightly as the number of feature words increases, and it tends to stabilize when the number of feature words reaches 300. The common point of the two figures is that the IFSC-FCBF feature selection algorithm selects features more effectively and obtains the largest macro F1 value. On the one hand, the original FCBF algorithm does not give good results in text classification because it eliminates too many features, and its performance is the worst of the four algorithms. On the other hand, the DFS algorithm is superior to the information gain algorithm IG, especially on the 20newsgroup data set: when the number of features is 300, the F1 value of DFS is about 3% higher than that of IG. Although the results of the IFSC-FCBF algorithm are close to, and sometimes fall behind, DFS on the 20newsgroup data set, IFSC-FCBF outperforms DFS on the Reuters-21578 data set. In general, IFSC-FCBF performs better than the other algorithms on the English data sets, with a macro F1 value on average 1% higher than the DFS algorithm and 2% to 3% higher than the IG algorithm; it has higher precision and recall, and the algorithm is superior.
As can be seen from FIG. 3 and FIG. 4, the macro F1 value tends to level off as the number of features increases to 300, similar to the results on the English data sets. This means that once the feature dimension reaches 300, the number of features is no longer the factor limiting the performance of the algorithms. The FCBF algorithm does not perform well on either the English data sets or the Chinese corpora. The macro F1 value of the DFS algorithm is on average 1.4% higher than that of the IG algorithm on the Fudan corpus and 0.8% higher on the Sogou corpus, while the IFSC-FCBF algorithm is on average 1.3% and 1.5% higher than DFS respectively; it has higher precision and recall, and the algorithm is superior.
In order to see the classification effect of each feature selection algorithm on each category, statistics are made on the classification effect of each category when the feature dimension is 300:
TABLE 1. Comparison of the classification effect on each class of the 20newsgroup data set
[Table image not reproduced.]
TABLE 2. Comparison of the classification effect on each class of the Reuters-21578 data set
[Table image not reproduced.]
TABLE 3. Comparison of the classification effect on each class of the Fudan University corpus
[Table image not reproduced.]
TABLE 4. Comparison of the classification effect on each class of the Sogou Lab corpus
[Table image not reproduced.]
Table 1 compares the classification effect of each feature selection algorithm on each class of the 20newsgroup data set; IFSC-FCBF has the highest F1 value in most classes (shown in bold in the table), followed by the DFS algorithm. Because each feature selection algorithm pays attention to different aspects of the features, the selected features differ, so each algorithm has classes on which it performs well; this is one reason the IFSC-FCBF algorithm cannot guarantee that its F1 value is higher than the other algorithms' for every class. However, the average values of P, R and F1 are still best for the IFSC-FCBF algorithm.
As can be seen from Table 2, the FCBF algorithm selects feature words more favourable for the crude category, for the reasons described above, so its F1 value on the crude category is better than the other algorithms'; but its classification effect on the other four classes is not ideal. On the money-fx category, the precision, recall and F1 values of IFSC-FCBF are higher than those of DFS and about 1% higher than those of IG. In general, on the two English data sets, the proposed algorithm is also optimal for the classification of individual classes while achieving a high macro F1 value.
As can be seen from Table 3, on the Fudan University corpus it is interesting that the IFSC-FCBF and DFS feature selection algorithms perform almost the same, with the IG algorithm second. In Table 4, the IFSC-FCBF algorithm selects more effective features for the three categories health, education and tourism, while DFS is more effective for the sport category; IG and DFS have almost the same average value, but as the comparison of macro F1 values in FIG. 4 shows, the IG algorithm is slightly better than DFS. In general, the proposed algorithm also selects features more effectively on the Chinese data sets.
The invention provides an improved fast correlation-based filter algorithm with a self-defined feature dimension. Because most feature selection algorithms rarely consider redundancy between features, some noise features may be selected during feature selection and the classification accuracy may be reduced; moreover, the original FCBF algorithm easily removes too many features when the correlation between features is strong. To solve these problems, IFSC-FCBF is proposed. The experimental results show that the IFSC-FCBF algorithm selects more effective features while keeping the running time low, and brings an obvious improvement on text classification tasks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (5)

1. A self-defined feature dimension text feature selection algorithm based on FCBF is characterized in that: the method comprises the following steps:
Step one: let the vectorized text matrix be X and the text category matrix be C = {C_1, C_2, ..., C_V}, where C_j is the category of training text D_j, j = 1,2,...,V, V is the total number of text categories, and t_m is the m-th feature word; initialize the set of all feature words T = {t_1, t_2, ..., t_m} from the text matrix X, and assign initial values to S_list, the set of feature words whose correlation with the categories meets the requirement, and to S_best, the set of feature words selected by the algorithm;
Step two: use the FCBF algorithm to screen the feature words in the feature word set S_list to obtain an initial feature word set S_best;
Step three: if the dimension of the initial feature word set S_best is smaller than the set dimension, select the features whose feature-word-to-category correlation values rank highest to complement the initial feature word set S_best until its dimension equals the set feature dimension; if the dimension of the initial feature word set S_best is greater than or equal to the set feature dimension, obtain the feature words of the user-defined feature dimension from the initial feature word set S_best;
in the first step, the method specifically comprises the following steps:
Assign S_list: for each t_k ∈ T, calculate the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C; when Corr(t_k, C) ≥ thresh, add t_k into S_list, where T is the set of all feature words, thresh is a threshold, k = 1,...,m, and m is the total number of feature words;
Sort S_list in descending order of the correlation value Corr(t_k, C) between feature word t_k and the text categories; let t_p = getFirst(S_list) and S_best = {t_p}, where the feature word t_p is the first element after sorting and t_p is assigned to S_best.
2. The FCBF-based custom feature dimension textual feature selection algorithm as defined in claim 1, wherein: the second step comprises the following specific steps:
1) Use a feature word variable t_q to read the elements of S_list in sequence; if t_q is not null, calculate the correlation Corr(t_p, t_q) between t_q and t_p; if t_q is null, the loop ends;
2) Compare the correlation Corr(t_p, t_q) with Corr(t_q, C); if the former is greater than or equal to the latter, delete t_q from S_list, otherwise add it to S_best;
3) If the length of S_best is greater than or equal to the size of the custom feature dimension, the algorithm ends; otherwise the variable t_p reads the next element of S_list in sequence, and steps 1)-3) continue.
3. The FCBF-based custom feature dimension text feature selection algorithm of claim 1, wherein: the correlation Corr(t_k, C) between the k-th feature word t_k of the text and the text category C is calculated as follows: let D_i be any one training document:
[Equation images not reproduced: the definitions of the text category information entropy H(C), the category probability P(C_j), the feature word information entropy H(t_k) and the feature word probability P(t_k).]
where H(C) denotes the information entropy of the text categories, P(C_j) denotes the probability of text category C_j, δ() denotes a binary function, and L is a smoothing factor added to prevent the probability P(C_j) from being 0; N denotes the total number of training documents, i = 1,2,...,N, and C_i denotes the category of training document D_i; H(t_k) denotes the information entropy of feature word t_k, P(t_k) denotes the probability that feature word t_k occurs, and tf(t_k,i) denotes the frequency with which feature word t_k occurs in training document D_i;
For a single feature, the conditional information entropy when the distribution of class C_j is known is:
[Equation images not reproduced: the definitions of the conditional information entropy H(t_k|C) and the conditional probability P(t_k|C_j).]
where H(t_k|C) denotes the information entropy of feature word t_k when the text category distribution is known, P(t_k|C_j) denotes the probability that feature word t_k occurs when the distribution of class C_j is known, and tf(t_k|C_j) denotes the frequency of feature word t_k in class C_j;
The change in the information entropy of feature word t_k when the distribution of class C is known is the feature information gain, calculated as:
IG(t_k|C) = H(t_k) - H(t_k|C)
where IG(t_k|C) denotes the change in the information entropy of feature word t_k when the class distribution is known, i.e. the feature information gain;
Thus, the correlation between text feature word t_k and text category C is:
[Equation image not reproduced: the definition of Corr(t_k, C).]
4. The FCBF-based custom feature dimension text feature selection algorithm as defined in claim 2, wherein: the correlation between feature word t_p and feature word t_q is calculated as follows:
For t_p, the conditional information entropy H(t_p|t_q) when the distribution of t_q in each class is known is:
[Equation images not reproduced: the definitions of H(t_p|t_q) and P(t_p, t_q).]
where P(t_p, t_q) denotes the probability that feature word t_p also occurs when feature word t_q occurs, df(t_p, t_q|C_j) denotes the number of documents in category C_j in which feature words t_q and t_p occur together, and df(t_q|C_j) denotes the number of documents in category C_j in which feature word t_q occurs;
The change in the information entropy of feature word t_p over the category matrix C when t_q is present, compared with when it is absent, is:
IG(t_p|t_q) = H(t_p|C) - H(t_p|t_q)
where H(t_p|C) denotes the information entropy of feature word t_p when the text category C is known;
The correlation between feature words t_p and t_q is calculated as:
[Equation image not reproduced: the definition of Corr(t_p, t_q).]
where H(t_p) denotes the information entropy of feature word t_p and H(t_q) denotes the information entropy of feature word t_q.
5. The FCBF-based custom feature dimension text feature selection algorithm as defined in claim 3, wherein: the binary function has the following formula:
δ(x, y) = 1 when x = y, and δ(x, y) = 0 otherwise
and x and y are binary function variables.
CN201910071963.3A 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF Active CN109885682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910071963.3A CN109885682B (en) 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910071963.3A CN109885682B (en) 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF

Publications (2)

Publication Number Publication Date
CN109885682A CN109885682A (en) 2019-06-14
CN109885682B true CN109885682B (en) 2022-08-16

Family

ID=66926831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910071963.3A Active CN109885682B (en) 2019-01-25 2019-01-25 Self-defined feature dimension text feature selection algorithm based on FCBF

Country Status (1)

Country Link
CN (1) CN109885682B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220346A (en) * 2017-05-27 2017-09-29 荣科科技股份有限公司 A kind of higher-dimension deficiency of data feature selection approach
CN108647259A (en) * 2018-04-26 2018-10-12 南京邮电大学 Based on the naive Bayesian file classification method for improving depth characteristic weighting

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220346A (en) * 2017-05-27 2017-09-29 荣科科技股份有限公司 A kind of higher-dimension deficiency of data feature selection approach
CN108647259A (en) * 2018-04-26 2018-10-12 南京邮电大学 Based on the naive Bayesian file classification method for improving depth characteristic weighting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于归一化互信息的FCBF特征选择算法 (FCBF feature selection algorithm based on normalized mutual information); 段宏湘 et al.; 《华中科技大学学报(自然科学版)》 (Journal of Huazhong University of Science and Technology, Natural Science Edition); 2017-01-31; Vol. 45, No. 1; pp. 52-56 *

Also Published As

Publication number Publication date
CN109885682A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
US6253169B1 (en) Method for improvement accuracy of decision tree based text categorization
KR100756921B1 (en) Method of classifying documents, computer readable record medium on which program for executing the method is recorded
US8024331B2 (en) Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US5819258A (en) Method and apparatus for automatically generating hierarchical categories from large document collections
CN108132927B (en) Keyword extraction method for combining graph structure and node association
CN108228541B (en) Method and device for generating document abstract
Yi et al. A hidden Markov model-based text classification of medical documents
CN101404015A (en) Automatically generating a hierarchy of terms
CN106557777B (en) One kind being based on the improved Kmeans document clustering method of SimHash
CN110909116B (en) Entity set expansion method and system for social media
CN115686432B (en) Document evaluation method for retrieval sorting, storage medium and terminal
CN115098690B (en) Multi-data document classification method and system based on cluster analysis
JP4967705B2 (en) Cluster generation apparatus and cluster generation program
Yoshioka et al. The classification of the documents based on Word2Vec and 2-layer self organizing maps
CN109885682B (en) Self-defined feature dimension text feature selection algorithm based on FCBF
CN112115256A (en) Method and device for generating news text abstract integrated with Chinese stroke information
CN112463894B (en) Multi-label feature selection method based on conditional mutual information and interactive information
CN111899832B (en) Medical theme management system and method based on context semantic analysis
Triwijoyo et al. Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms
CN114443820A (en) Text aggregation method and text recommendation method
CN111881678A (en) Domain word discovery method based on unsupervised learning
Sheng et al. An information retrieval system based on automatic query expansion and hopfield network
CN113392124B (en) Structured language-based data query method and device
Karakos et al. Cross-instance tuning of unsupervised document clustering algorithms
CN111159393B (en) Text generation method for abstract extraction based on LDA and D2V

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant