CN107016073A - Feature selection method for text classification - Google Patents

Feature selection method for text classification

Info

Publication number
CN107016073A
CN107016073A (application CN201710181572.8A)
Authority
CN
China
Prior art keywords: feature, classification, sel, degree, represent
Prior art date: 2017-03-24
Legal status: Granted
Application number
CN201710181572.8A
Other languages
Chinese (zh)
Other versions
CN107016073B (en)
Inventor
张晓彤
余伟伟
刘喆
王璇
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date: 2017-03-24
Filing date: 2017-03-24
Publication date: 2017-08-04
Application filed by University of Science and Technology Beijing USTB
Priority to CN201710181572.8A
Publication of CN107016073A
Application granted
Publication of CN107016073B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a feature selection method for text classification that can reduce feature dimensionality and classification complexity and improve classification accuracy. The method includes: obtaining a feature set S and a target category C, computing the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting S in descending order of R_c(x^(i)); computing the redundancy degree R_x and the synergy degree S_x between every pair of features in S, computing the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th and, with reference to the descending ordering of S, partitioning S into a candidate set S_sel and an excluded set S_exc according to th; computing the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th. The present invention is applicable to the field of machine-learning text classification.

Description

Feature selection method for text classification
Technical field
The present invention relates to the field of machine-learning text classification, and in particular to a feature selection method for text classification.
Background art
With the continuous expansion of the Internet, the information resources gathered on it keep increasing. To manage and conveniently use these information resources effectively, content-based information retrieval and data mining have long received wide attention. Text classification technology is an important foundation of information retrieval and text data mining; its main task is to assign words and documents of unknown category, according to their content, to one or more previously given categories. However, two salient characteristics, the large number of training samples and the high dimensionality of the vectors, make text classification a machine-learning problem with very high time and space complexity. Feature selection is therefore needed to reduce the feature dimensionality while preserving classification performance as far as possible.
Feature selection is an important step in data preprocessing. Among conventional feature selection methods for text classification, the chi-square (Chi-Square) test establishes the null hypothesis that a word is uncorrelated with the target category and selects as features the words that deviate most from this hypothesis; however, it only counts whether a word appears in a document, regardless of how many times it occurs, which biases it toward low-frequency words. The mutual information (Mutual Information) method selects features by measuring the amount of information that the presence of a word brings about the target category, but it considers only the association between a word and the target category and ignores possible dependencies between words. The TF-IDF (Term Frequency-Inverse Document Frequency) method assesses the importance of a word from its frequency within a file and its distribution across all files, and screens features accordingly; but it simply assumes that words with low text frequency are more important and words with high text frequency are less useful, so its precision is not very high. There are also feature selection methods such as information gain, odds ratio, weight of text evidence, and expected cross entropy; most of them consider only the correlation between words and the target category, or only the correlation between words, and thus tend to suffer from insufficient dimensionality reduction or low classification precision.
Summary of the invention
The technical problem to be solved by the present invention is to provide a feature selection method for text classification, so as to address the high feature dimensionality or low classification precision of the prior art.
To solve the above technical problem, an embodiment of the present invention provides a feature selection method for text classification, including:
Step 1: obtain the feature set S and the target category C, compute the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sort S in descending order of R_c;
Step 2: compute the redundancy degree R_x and the synergy degree S_x between every pair of features in the feature set S, compute the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, compare it with a preset threshold th and, with reference to the descending ordering of S, partition S into a candidate set S_sel and an excluded set S_exc according to th;
Step 3: compute the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
Further, step 1 includes:
Step 11: for each feature x^(i) in the feature set S, compute the association degree R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C;
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
Further, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and category c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) appears within category c_k, and p(x^(i)) denotes the probability that feature x^(i) appears in the feature set S.
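Transcribed directly into Python, the formula reads as below. This is a sketch that assumes the per-category probability estimates are supplied by the caller; count-based estimators are described in the embodiment further on.

```python
import numpy as np

def mutual_information(p_joint, p_cond, p_x):
    """I(x^(i); C) = sum_k p(x^(i), c_k) * log( p(x^(i)|c_k) / p(x^(i)) ).
    p_joint and p_cond are per-category arrays; p_x is a scalar."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_cond = np.asarray(p_cond, dtype=float)
    mask = p_joint > 0  # categories with no co-occurrence contribute 0
    return float(np.sum(p_joint[mask] * np.log(p_cond[mask] / p_x)))
```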
Further, the redundancy degree R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy degree between features x^(i) and x^(j), whose value is the smaller of 0 and the correlation gain.
Further, the synergy degree S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the synergy degree between features x^(i) and x^(j), whose value is the larger of 0 and the correlation gain.
Further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target category C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
Further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i) and x^(j) and category c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that x^(i) and x^(j) occur together within category c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur together in the feature set S.
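In code, the correlation gain and its split into the redundancy and synergy degrees are one-liners. A minimal sketch, assuming the three mutual-information values are computed as above:

```python
def correlation_gain(mi_joint, mi_i, mi_j):
    """IG(x^(i); x^(j); C) = I((x^(i),x^(j)); C) - I(x^(i); C) - I(x^(j); C).
    Negative IG: the pair shares (redundant) information about C;
    positive IG: the pair is jointly more informative than separately."""
    return mi_joint - mi_i - mi_j

def redundancy_and_synergy(ig):
    """R_x = min(0, IG) and S_x = max(0, IG)."""
    return min(0.0, ig), max(0.0, ig)
```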
Further, step 2 includes:
Step 21: add the first feature of the sorted feature set S to the candidate set S_sel and set the excluded set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature has the largest association degree R_c(x^(i));
Step 22: starting from the second feature of S, let x^(i) denote the current feature; compute the redundancy degree R_x and synergy degree S_x between x^(i) and every feature in the candidate set S_sel, and combine them with the association degree R_c(x^(i)) between the feature and the target category to compute the sensitivity Sen(x^(i)) of x^(i);
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to the candidate set S_sel, otherwise add x^(i) to the excluded set S_exc;
Step 24: if x^(i) is the last feature in S, the partition ends; otherwise let x^(i) be the next feature in S and return to step 22 (a code sketch of these steps follows the sensitivity formula below).
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy degree R_x and the synergy degree S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy degree between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy degree between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target category C, and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
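The sensitivity and the partition loop of steps 21-24 can be sketched together in Python. Here r_c and ig stand for caller-supplied estimators of the association degree and the correlation gain (count-based versions appear in the embodiment below); the defaults α = β = 0.5 and th = 0.01 are the ones named in the embodiment, and all names are illustrative.

```python
def sensitivity(i, others, r_c, ig, alpha=0.5, beta=0.5):
    """Sen(x^(i)) = R_c(x^(i)) + alpha * min_j R_x(x^(i); x^(j))
                                + beta  * max_j S_x(x^(i); x^(j)),  j != i."""
    gains = [ig(i, j) for j in others if j != i]
    sen = r_c(i)
    if gains:
        sen += alpha * min(0.0, min(gains))  # min_j R_x = min(0, min_j IG)
        sen += beta * max(0.0, max(gains))   # max_j S_x = max(0, max_j IG)
    return sen

def partition(order, r_c, ig, th=0.01):
    """Steps 21-24: order is the feature index list sorted by descending R_c;
    returns the candidate set S_sel and the excluded set S_exc."""
    s_sel, s_exc = [order[0]], []            # the top-ranked feature seeds S_sel
    for i in order[1:]:
        if sensitivity(i, s_sel, r_c, ig) > th:
            s_sel.append(i)
        else:
            s_exc.append(i)
    return s_sel, s_exc
```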
Further, step 3 includes:
Step 31: set the pending set S_tbd to be empty, i.e. S_tbd = {}; let x^(k) be the first feature in the excluded set S_exc, and let x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the excluded set S_exc, compute the maximum synergy degree between the candidate feature x^(m) and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature attaining the maximum synergy degree with x^(m) is x^(k), add x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, let x^(m) be the next feature in S_sel and return to step 32;
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and …, then remove feature x^(k) from the excluded set S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if x^(j) is the last element of S_tbd, go directly to step 36; otherwise let x^(j) be the next element of S_tbd and return to step 35;
Step 36: if x^(k) is the last element of S_exc, return the current candidate set S_sel and excluded set S_exc as the result of the final feature selection; otherwise let x^(k) be the next element of S_exc and return to step 31.
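The adjustment of steps 31-36 can be sketched as follows, reusing the sensitivity helper above. The extra re-inclusion condition in step 35 is illegible in the source, so this sketch substitutes an assumption: x^(k) is re-admitted as soon as one pending candidate's sensitivity falls below th once x^(k) is ignored.

```python
def adjust(s_sel, s_exc, r_c, ig, th=0.01):
    """Steps 31-36: move an excluded feature x^(k) back into S_sel when a
    candidate that has x^(k) as its strongest synergy partner cannot stay
    above the threshold without it."""
    feats = s_sel + s_exc                        # the full feature set S
    syn = lambda a, b: max(0.0, ig(a, b))        # S_x = max(0, IG)
    for k in list(s_exc):
        # Steps 32-33: candidates whose maximum-synergy partner is x^(k).
        s_tbd = [m for m in s_sel
                 if max((i for i in feats if i != m),
                        key=lambda i: syn(m, i)) == k]
        # Step 35: update Sen(x^(j)) over all features except x^(j) and x^(k).
        for j in s_tbd:
            others = [n for n in feats if n != j and n != k]
            if sensitivity(j, others, r_c, ig) < th:  # assumed re-inclusion test
                s_exc.remove(k)
                s_sel.append(k)
                break
    return s_sel, s_exc
```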
The above technical solution of the present invention has the following beneficial effects:
In the above solution, from the feature set S and the target category C, the association degree R_c(x^(i)) between each feature and the target category and the redundancy degree R_x and synergy degree S_x between features are computed, from which the sensitivity Sen of each feature is obtained; features are screened against the preset threshold th, the feature set is partitioned into a candidate set and an excluded set, and the two sets are further adjusted and optimized in the subsequent process. In this way, the relationships both between features and the target category and between features themselves are taken into account: features are selected through the association degree, the redundancy degree, and the synergy degree, and the features that play a key role in classification are retained. This helps reduce feature dimensionality and classification complexity, and can improve classification accuracy.
Brief description of the drawings
Fig. 1 is a flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 2 is a detailed flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of partitioning the candidate set and the excluded set in the feature selection method provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of adjusting the candidate set and the excluded set in the feature selection method provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments.
Aiming at the problem of high feature dimensionality or low classification precision in the prior art, the present invention provides a feature selection method for text classification.
As shown in Fig. 1, the text classification feature selection method provided by an embodiment of the present invention includes:
Step 1: obtain the feature set S and the target category C, compute the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sort S in descending order of R_c(x^(i));
Step 2: compute the redundancy degree R_x and the synergy degree S_x between every pair of features in the feature set S, compute the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, compare it with the preset threshold th and, with reference to the descending ordering of S, partition S into a candidate set S_sel and an excluded set S_exc according to th;
Step 3: compute the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
In the text classification feature selection method described in this embodiment of the present invention, from the feature set S and the target category C, the association degree R_c(x^(i)) between each feature and the target category and the redundancy degree R_x and synergy degree S_x between features are computed, from which the sensitivity Sen of each feature is obtained; features are screened against the preset threshold th, the feature set is partitioned into a candidate set and an excluded set, and the two sets are further adjusted and optimized in the subsequent process. In this way, the relationships both between features and the target category and between features themselves are taken into account; features are selected through the association degree, the redundancy degree, and the synergy degree, the features that play a key role in classification are retained, feature dimensionality and classification complexity are reduced, and classification accuracy can be improved.
In the present embodiment, as shown in Fig. 2, to obtain the feature set S and the target category C, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target category C must first be given as input.
In the present embodiment, the feature set S denotes the set of all features (each denoted x^(i), i.e. word vectors) used in text classification, i.e. S = (x^(1), x^(2), ..., x^(n)), where n denotes the number of features in S. A feature x^(i) is the column vector formed by the number of times the word corresponding to the feature occurs in each text, i.e. x^(i) = (x_1^(i), x_2^(i), ..., x_M^(i))^T, where M is the number of texts. The target category C denotes the column vector formed by the category corresponding to each text; the target category C is the category set.
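For concreteness, a toy instantiation of these inputs (the data and names are purely illustrative): each column of the matrix is one feature x^(i), and C holds the category of each text.

```python
import numpy as np

# X[d, i] = number of times word i occurs in text d, so X[:, i] is x^(i).
X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0],
              [0, 2, 2]])
# C[d] = category of text d.
C = np.array([0, 1, 0, 1])
```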
In the present embodiment, the association degree R_c(x^(i)) between feature x^(i) and the target category C is the mutual information between x^(i) and the target category C.
In the present embodiment, as an optional embodiment, computing the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C and sorting S in descending order of R_c(x^(i)) (step 1) includes:
Step 11: for each feature x^(i) in the feature set S, compute the association degree R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C;
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
In the present embodiment, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, c_k denotes the k-th category of the target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and category c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) appears within category c_k, and p(x^(i)) denotes the probability that feature x^(i) appears in the feature set S.
In the present embodiment, preferably, the probability p(x^(i), c_k) that feature x^(i) and category c_k occur together is approximated by the frequency with which the word corresponding to x^(i) in the c_k-category files occurs relative to all files;
where x_j^(i) denotes the j-th element of x^(i) (i.e. the number of times the word corresponding to x^(i) occurs in the j-th file), and x_m^(i) denotes the m-th element of x^(i) whose target category is c_k (i.e. the number of times the word corresponding to x^(i) occurs in the m-th c_k-category file).
In the present embodiment, preferably, the probability p(x^(i) | c_k) that feature x^(i) appears within category c_k is approximated by the frequency with which the word corresponding to x^(i) occurs in the c_k-category files.
In the present embodiment, preferably, the probability p(x^(i)) that feature x^(i) appears in the feature set S is approximated by the frequency with which the word corresponding to x^(i) occurs in all files.
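The frequency approximations above leave the normalisations implicit; the sketch below is one plausible, self-consistent reading (an assumption of this sketch), estimating all three quantities from per-class word counts and feeding them to the mutual_information sketch given earlier.

```python
import numpy as np

def estimate_probs(X, C, i):
    """Count-based estimates for term i of a document-term matrix X:
    returns (p(x^(i), c_k) per category, p(x^(i) | c_k) per category, p(x^(i)))."""
    n_all = X.sum()                                    # all words, all files
    cats = np.unique(C)
    n_i_ck = np.array([X[C == ck, i].sum() for ck in cats], dtype=float)
    n_ck = np.array([X[C == ck].sum() for ck in cats], dtype=float)
    p_joint = n_i_ck / n_all                           # ~ p(x^(i), c_k)
    p_cond = n_i_ck / np.maximum(n_ck, 1.0)            # ~ p(x^(i) | c_k), guard empty class
    p_x = X[:, i].sum() / n_all                        # ~ p(x^(i))
    return p_joint, p_cond, p_x

# Association degree of term i, using the mutual_information sketch above:
# r_c_i = mutual_information(*estimate_probs(X, C, i))
```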
In the present embodiment, as a further optional embodiment, as shown in Fig. 3, computing the redundancy degree R_x and synergy degree S_x between every pair of features in the feature set S, computing the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, comparing it with the preset threshold th, and partitioning S into the candidate set S_sel and the excluded set S_exc according to th (step 2) includes:
Step 21: add the first feature of the sorted feature set S to the candidate set S_sel and set the excluded set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature has the largest association degree R_c(x^(i));
Step 22: starting from the second feature of S, let x^(i) denote the current feature; compute the redundancy degree R_x and synergy degree S_x between x^(i) and every feature in the candidate set S_sel, and combine them with the association degree R_c(x^(i)) between the feature and the target category to compute the sensitivity Sen(x^(i)) of x^(i);
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to the candidate set S_sel, otherwise add x^(i) to the excluded set S_exc;
Step 24: if x^(i) is the last feature in S, the partition ends; otherwise let x^(i) be the next feature in S and return to step 22.
In the embodiment of the above text classification feature selection method, further, the redundancy degree R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy degree between x^(i) and x^(j), whose value is the smaller of 0 and the correlation gain.
In the embodiment of the above text classification feature selection method, further, the synergy degree S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the synergy degree between x^(i) and x^(j), whose value is the larger of 0 and the correlation gain.
In the embodiment of the above text classification feature selection method, further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) and I(x^(j); C) are computed with the same mutual-information formula given above for a feature and the target category C: I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target category C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
In the embodiment of the above text classification feature selection method, further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i) and x^(j) and category c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that x^(i) and x^(j) occur together within category c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur together in the feature set S.
In the present embodiment, preferably, the probability p(x^(i), x^(j), c_k) that features x^(i), x^(j) and category c_k occur together is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur together in the c_k-category files relative to all files;
where the number of times the two words occur together in the m-th c_k-category file is taken to be min(x_m^(i), x_m^(j)), the smaller of the m-th elements of x^(i) and x^(j) whose target category is c_k.
In the present embodiment, preferably, the probability p((x^(i), x^(j)) | c_k) that x^(i) and x^(j) occur together within category c_k is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur together in the c_k-category files.
In the present embodiment, preferably, the probability p(x^(i), x^(j)) that x^(i) and x^(j) occur together in the feature set S is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur together in all files.
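The pairwise counterparts follow the min-of-counts approximation just described; a sketch under the same assumptions as estimate_probs above:

```python
import numpy as np

def joint_estimate(X, C, i, j):
    """Pairwise counterparts of estimate_probs: joint occurrences of words i
    and j in a file are approximated by the element-wise minimum of their
    counts, as described above. Returns (p(x^(i),x^(j),c_k) per category,
    p((x^(i),x^(j))|c_k) per category, p(x^(i),x^(j)))."""
    xij = np.minimum(X[:, i], X[:, j]).astype(float)   # ~ joint counts per file
    n_all = X.sum()
    cats = np.unique(C)
    n_ij_ck = np.array([xij[C == ck].sum() for ck in cats])
    n_ck = np.array([X[C == ck].sum() for ck in cats], dtype=float)
    return (n_ij_ck / n_all,                           # ~ p(x^(i), x^(j), c_k)
            n_ij_ck / np.maximum(n_ck, 1.0),           # ~ p((x^(i), x^(j)) | c_k)
            xij.sum() / n_all)                         # ~ p(x^(i), x^(j))

# Correlation gain of the pair (i, j), reusing the earlier sketches:
# ig_ij = (mutual_information(*joint_estimate(X, C, i, j))
#          - mutual_information(*estimate_probs(X, C, i))
#          - mutual_information(*estimate_probs(X, C, j)))
```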
In the embodiment of the above text classification feature selection method, further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy degree R_x and the synergy degree S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy degree between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy degree between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target category C, and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
In the present embodiment, as shown in Fig. 4, as an optional embodiment, computing the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th (step 3) includes:
Step 31: set the pending set S_tbd to be empty, i.e. S_tbd = {}; let x^(k) be the first feature in the excluded set S_exc, and let x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the excluded set S_exc, compute the maximum synergy degree between the candidate feature x^(m) and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature attaining the maximum synergy degree with x^(m) is x^(k), add x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, let x^(m) be the next feature in S_sel and return to step 32;
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and …, then remove feature x^(k) from the excluded set S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if x^(j) is the last element of S_tbd, go directly to step 36; otherwise let x^(j) be the next element of S_tbd and return to step 35;
Step 36: if x^(k) is the last element of S_exc, return the current candidate set S_sel and excluded set S_exc as the result of the final feature selection; otherwise let x^(k) be the next element of S_exc and return to step 31.
In the present embodiment, according to steps 31-36, the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc is computed and compared with the preset threshold th, and S_sel and S_exc are adjusted according to th, yielding the new candidate set S_sel and excluded set S_exc; this reduces the effect that removing or adding a single feature has on the classification results.
In the present embodiment, the weight α of the redundancy degree R_x may default to 0.5, the weight β of the synergy degree S_x may default to 0.5, and the preset threshold th defaults to 0.01. The weights α and β and the preset threshold th are optimized and updated by a genetic algorithm during subsequent training and testing.
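Wiring the sketches above together on the toy inputs gives the whole pipeline under the default th = 0.01; the genetic-algorithm tuning of α, β, and th is not sketched here, and all names remain illustrative.

```python
# Estimators built from the earlier sketches (mutual_information,
# estimate_probs, joint_estimate) and the toy X, C defined above.
r_c = lambda i: mutual_information(*estimate_probs(X, C, i))
ig = lambda i, j: (mutual_information(*joint_estimate(X, C, i, j))
                   - r_c(i) - r_c(j))

order = sorted(range(X.shape[1]), key=r_c, reverse=True)  # step 1: sort by R_c
s_sel, s_exc = partition(order, r_c, ig, th=0.01)         # step 2: initial split
s_sel, s_exc = adjust(s_sel, s_exc, r_c, ig, th=0.01)     # step 3: adjustment
print("candidate features:", s_sel, "excluded features:", s_exc)
```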
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A feature selection method for text classification, characterized by comprising:
Step 1: obtaining a feature set S and a target category C, computing the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting S in descending order of R_c(x^(i));
Step 2: computing the redundancy degree R_x and the synergy degree S_x between every pair of features in the feature set S, computing the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th and, with reference to the descending ordering of S, partitioning S into a candidate set S_sel and an excluded set S_exc according to th;
Step 3: computing the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th.
2. The feature selection method for text classification according to claim 1, characterized in that step 1 includes:
Step 11: for each feature x^(i) in the feature set S, computing the association degree R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C;
Step 12: sorting the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
3. The feature selection method for text classification according to claim 2, characterized in that I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and category c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) appears within category c_k, and p(x^(i)) denotes the probability that feature x^(i) appears in the feature set S.
4. The feature selection method for text classification according to claim 1, characterized in that the redundancy degree R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy degree between features x^(i) and x^(j), whose value is the smaller of 0 and the correlation gain.
5. The feature selection method for text classification according to claim 1, characterized in that the synergy degree S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the synergy degree between features x^(i) and x^(j), whose value is the larger of 0 and the correlation gain.
6. The feature selection method for text classification according to claim 4 or 5, characterized in that IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target category C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
7. The feature selection method for text classification according to claim 6, characterized in that I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i) and x^(j) and category c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that x^(i) and x^(j) occur together within category c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur together in the feature set S.
8. The feature selection method for text classification according to claim 1, characterized in that step 2 includes:
Step 21: adding the first feature of the sorted feature set S to the candidate set S_sel and setting the excluded set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature having the largest association degree R_c(x^(i));
Step 22: starting from the second feature of S, letting x^(i) denote the current feature, computing the redundancy degree R_x and synergy degree S_x between x^(i) and every feature in the candidate set S_sel, and combining them with the association degree R_c(x^(i)) between the feature and the target category to compute the sensitivity Sen(x^(i)) of x^(i);
Step 23: comparing Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding x^(i) to the candidate set S_sel, otherwise adding x^(i) to the excluded set S_exc;
Step 24: if x^(i) is the last feature in S, ending the partition; otherwise letting x^(i) be the next feature in S and returning to step 22.
9. The feature selection method for text classification according to claim 8, characterized in that the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy degree R_x and the synergy degree S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy degree between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy degree between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target category C, and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
10. The feature selection method for text classification according to claim 1, characterized in that step 3 includes:
Step 31: setting the pending set S_tbd to be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the excluded set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the excluded set S_exc, computing the maximum synergy degree between the candidate feature x^(m) and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature attaining the maximum synergy degree with x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, going to step 36; if S_tbd is not empty, letting x^(j) be the first feature in S_tbd and going to step 35; if x^(m) is not the last feature in S_sel, letting x^(m) be the next feature in S_sel and returning to step 32;
Step 35: for the feature x^(j) in S_tbd, updating its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and …, removing feature x^(k) from the excluded set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if x^(j) is the last element of S_tbd, going directly to step 36; otherwise letting x^(j) be the next element of S_tbd and returning to step 35;
Step 36: if x^(k) is the last element of S_exc, returning the current candidate set S_sel and excluded set S_exc as the result of the final feature selection; otherwise letting x^(k) be the next element of S_exc and returning to step 31.
CN201710181572.8A 2017-03-24 2017-03-24 Feature selection method for text classification Active CN107016073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) 2017-03-24 2017-03-24 Feature selection method for text classification


Publications (2)

Publication Number Publication Date
CN107016073A 2017-08-04
CN107016073B 2019-06-28

Family

ID=59445053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181572.8A Feature selection method for text classification 2017-03-24 2017-03-24 (Active; granted as CN107016073B)

Country Status (1)

Country Link
CN: CN107016073B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140278409A1 (en) * 2004-07-30 2014-09-18 At&T Intellectual Property Ii, L.P. Preserving privacy in natural langauge databases
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周茜 et al., "中文文本分类中的特征选择研究" (A study of feature selection for Chinese text categorization), 《中文信息学报》 (Journal of Chinese Information Processing) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934251A (en) * 2018-12-27 2019-06-25 国家计算机网络与信息安全管理中心广东分中心 A kind of method, identifying system and storage medium for rare foreign languages text identification
CN109934251B (en) * 2018-12-27 2021-08-06 国家计算机网络与信息安全管理中心广东分中心 Method, system and storage medium for recognizing text in Chinese language
CN111612385A (en) * 2019-02-22 2020-09-01 北京京东尚科信息技术有限公司 Method and device for clustering to-be-delivered articles
CN111612385B (en) * 2019-02-22 2024-04-16 北京京东振世信息技术有限公司 Method and device for clustering articles to be distributed

Also Published As

Publication number Publication date
CN107016073B (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN104750844B Method and apparatus for generating text feature vectors based on TF-IGM, and text classification method and device
US20200293924A1 (en) Gbdt model feature interpretation method and apparatus
CN110555717A Method for mining potential purchased goods and categories of users based on user behavior characteristics
Al Qadi et al. Arabic text classification of news articles using classical supervised classifiers
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN103617429A (en) Sorting method and system for active learning
US10387805B2 (en) System and method for ranking news feeds
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN103617435A (en) Image sorting method and system for active learning
CN103838798A (en) Page classification system and method
CN109933619A Semi-supervised classification prediction method
US20230325632A1 (en) Automated anomaly detection using a hybrid machine learning system
CN105359172A (en) Calculating a probability of a business being delinquent
US20230138491A1 (en) Continuous learning for document processing and analysis
CN107016073A Feature selection method for text classification
CN109101574B (en) Task approval method and system of data leakage prevention system
CN110781297A (en) Classification method of multi-label scientific research papers based on hierarchical discriminant trees
CN111488400B (en) Data classification method, device and computer readable storage medium
US20230134218A1 (en) Continuous learning for document processing and analysis
CN113641823B (en) Text classification model training, text classification method, device, equipment and medium
CN113033170B (en) Form standardization processing method, device, equipment and storage medium
CN104778478A (en) Handwritten numeral identification method
CN103207893A (en) Classification method of two types of texts on basis of vector group mapping
CN111539576B (en) Risk identification model optimization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant