CN107016073A - Feature selection method for text classification - Google Patents

Feature selection method for text classification

Info

Publication number
CN107016073A
CN107016073A (application CN201710181572.8A)
Authority
CN
China
Prior art keywords: feature, classification, sel, degree, represent
Prior art date: 2017-03-24
Legal status: Granted
Application number
CN201710181572.8A
Other languages
Chinese (zh)
Other versions
CN107016073B (en)
Inventor
张晓彤
余伟伟
刘喆
王璇
Current Assignee
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date: 2017-03-24
Filing date: 2017-03-24
Publication date: 2017-08-04
Application filed by University of Science and Technology Beijing USTB
Priority to CN201710181572.8A
Publication of CN107016073A
Application granted
Publication of CN107016073B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a feature selection method for text classification that can reduce feature dimensionality and classification complexity and improve classification accuracy. The method includes: obtaining a feature set S and a target category C, computing the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting S in descending order of R_c(x^(i)); computing the redundancy degree R_x and the synergy degree S_x between every pair of features in S, computing the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th and, with reference to the descending ordering of S, partitioning S into a candidate set S_sel and an excluded set S_exc according to th; computing the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th. The present invention is applicable to the field of machine-learning text classification.

Description

Feature selection method for text classification
Technical field
The present invention relates to the field of machine-learning text classification, and in particular to a feature selection method for text classification.
Background art
With the continuous expansion of the Internet, the information resources gathered on it keep increasing. To manage and conveniently use these information resources effectively, content-based information retrieval and data mining have long received wide attention. Text classification technology is an important foundation of information retrieval and text data mining; its main task is to assign words and documents of unknown category, according to their content, to one or more previously given categories. However, two salient characteristics, the large number of training samples and the high dimensionality of the vectors, make text classification a machine-learning problem with very high time and space complexity. Feature selection is therefore needed to reduce the feature dimensionality while preserving classification performance as far as possible.
Feature selection is an important step in data preprocessing. Among conventional feature selection methods for text classification, the chi-square (Chi-Square) test establishes the null hypothesis that a word is uncorrelated with the target category and selects as features the words that deviate most from this hypothesis; however, it only counts whether a word appears in a document, regardless of how many times it occurs, which biases it toward low-frequency words. The mutual information (Mutual Information) method selects features by measuring the amount of information that the presence of a word brings about the target category, but it considers only the association between a word and the target category and ignores possible dependencies between words. The TF-IDF (Term Frequency-Inverse Document Frequency) method assesses the importance of a word from its frequency within a file and its distribution across all files, and screens features accordingly; but it simply assumes that words with low text frequency are more important and words with high text frequency are less useful, so its precision is not very high. There are also feature selection methods such as information gain, odds ratio, weight of text evidence, and expected cross entropy; most of them consider only the correlation between words and the target category, or only the correlation between words, and thus tend to suffer from insufficient dimensionality reduction or low classification precision.
Summary of the invention
The technical problem to be solved by the present invention is to provide a feature selection method for text classification, so as to address the high feature dimensionality or low classification precision of the prior art.
To solve the above technical problem, an embodiment of the present invention provides a feature selection method for text classification, including:
Step 1: obtain the feature set S and the target category C, compute the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sort S in descending order of R_c;
Step 2: compute the redundancy degree R_x and the synergy degree S_x between every pair of features in the feature set S, compute the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, compare it with a preset threshold th and, with reference to the descending ordering of S, partition S into a candidate set S_sel and an excluded set S_exc according to th;
Step 3: compute the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
Further, step 1 includes:
Step 11: for each feature x^(i) in the feature set S, compute the association degree R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C;
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
Further, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and category c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) appears within category c_k, and p(x^(i)) denotes the probability that feature x^(i) appears in the feature set S.
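Transcribed directly into Python, the formula reads as below. This is a sketch that assumes the per-category probability estimates are supplied by the caller; count-based estimators are described in the embodiment further on.

```python
import numpy as np

def mutual_information(p_joint, p_cond, p_x):
    """I(x^(i); C) = sum_k p(x^(i), c_k) * log( p(x^(i)|c_k) / p(x^(i)) ).
    p_joint and p_cond are per-category arrays; p_x is a scalar."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_cond = np.asarray(p_cond, dtype=float)
    mask = p_joint > 0  # categories with no co-occurrence contribute 0
    return float(np.sum(p_joint[mask] * np.log(p_cond[mask] / p_x)))
```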
Further, the redundancy degree R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy degree between features x^(i) and x^(j), whose value is the smaller of 0 and the correlation gain.
Further, the synergy degree S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the synergy degree between features x^(i) and x^(j), whose value is the larger of 0 and the correlation gain.
Further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target category C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
Further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i) and x^(j) and category c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that x^(i) and x^(j) occur together within category c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur together in the feature set S.
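In code, the correlation gain and its split into the redundancy and synergy degrees are one-liners. A minimal sketch, assuming the three mutual-information values are computed as above:

```python
def correlation_gain(mi_joint, mi_i, mi_j):
    """IG(x^(i); x^(j); C) = I((x^(i),x^(j)); C) - I(x^(i); C) - I(x^(j); C).
    Negative IG: the pair shares (redundant) information about C;
    positive IG: the pair is jointly more informative than separately."""
    return mi_joint - mi_i - mi_j

def redundancy_and_synergy(ig):
    """R_x = min(0, IG) and S_x = max(0, IG)."""
    return min(0.0, ig), max(0.0, ig)
```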
Further, step 2 includes:
Step 21: add the first feature of the sorted feature set S to the candidate set S_sel and set the excluded set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature has the largest association degree R_c(x^(i));
Step 22: starting from the second feature of S, let x^(i) denote the current feature; compute the redundancy degree R_x and synergy degree S_x between x^(i) and every feature in the candidate set S_sel, and combine them with the association degree R_c(x^(i)) between the feature and the target category to compute the sensitivity Sen(x^(i)) of x^(i);
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to the candidate set S_sel, otherwise add x^(i) to the excluded set S_exc;
Step 24: if x^(i) is the last feature in S, the partition ends; otherwise let x^(i) be the next feature in S and return to step 22 (a code sketch of these steps follows the sensitivity formula below).
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy degree R_x and the synergy degree S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy degree between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy degree between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target category C, and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
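The sensitivity and the partition loop of steps 21-24 can be sketched together in Python. Here r_c and ig stand for caller-supplied estimators of the association degree and the correlation gain (count-based versions appear in the embodiment below); the defaults α = β = 0.5 and th = 0.01 are the ones named in the embodiment, and all names are illustrative.

```python
def sensitivity(i, others, r_c, ig, alpha=0.5, beta=0.5):
    """Sen(x^(i)) = R_c(x^(i)) + alpha * min_j R_x(x^(i); x^(j))
                                + beta  * max_j S_x(x^(i); x^(j)),  j != i."""
    gains = [ig(i, j) for j in others if j != i]
    sen = r_c(i)
    if gains:
        sen += alpha * min(0.0, min(gains))  # min_j R_x = min(0, min_j IG)
        sen += beta * max(0.0, max(gains))   # max_j S_x = max(0, max_j IG)
    return sen

def partition(order, r_c, ig, th=0.01):
    """Steps 21-24: order is the feature index list sorted by descending R_c;
    returns the candidate set S_sel and the excluded set S_exc."""
    s_sel, s_exc = [order[0]], []            # the top-ranked feature seeds S_sel
    for i in order[1:]:
        if sensitivity(i, s_sel, r_c, ig) > th:
            s_sel.append(i)
        else:
            s_exc.append(i)
    return s_sel, s_exc
```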
Further, step 3 includes:
Step 31: set the pending set S_tbd to be empty, i.e. S_tbd = {}; let x^(k) be the first feature in the excluded set S_exc, and let x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the excluded set S_exc, compute the maximum synergy degree between the candidate feature x^(m) and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature attaining the maximum synergy degree with x^(m) is x^(k), add x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, let x^(m) be the next feature in S_sel and return to step 32;
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and …, then remove feature x^(k) from the excluded set S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if x^(j) is the last element of S_tbd, go directly to step 36; otherwise let x^(j) be the next element of S_tbd and return to step 35;
Step 36: if x^(k) is the last element of S_exc, return the current candidate set S_sel and excluded set S_exc as the result of the final feature selection; otherwise let x^(k) be the next element of S_exc and return to step 31.
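The adjustment of steps 31-36 can be sketched as follows, reusing the sensitivity helper above. The extra re-inclusion condition in step 35 is illegible in the source, so this sketch substitutes an assumption: x^(k) is re-admitted as soon as one pending candidate's sensitivity falls below th once x^(k) is ignored.

```python
def adjust(s_sel, s_exc, r_c, ig, th=0.01):
    """Steps 31-36: move an excluded feature x^(k) back into S_sel when a
    candidate that has x^(k) as its strongest synergy partner cannot stay
    above the threshold without it."""
    feats = s_sel + s_exc                        # the full feature set S
    syn = lambda a, b: max(0.0, ig(a, b))        # S_x = max(0, IG)
    for k in list(s_exc):
        # Steps 32-33: candidates whose maximum-synergy partner is x^(k).
        s_tbd = [m for m in s_sel
                 if max((i for i in feats if i != m),
                        key=lambda i: syn(m, i)) == k]
        # Step 35: update Sen(x^(j)) over all features except x^(j) and x^(k).
        for j in s_tbd:
            others = [n for n in feats if n != j and n != k]
            if sensitivity(j, others, r_c, ig) < th:  # assumed re-inclusion test
                s_exc.remove(k)
                s_sel.append(k)
                break
    return s_sel, s_exc
```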
The above technical solution of the present invention has the following beneficial effects:
In the above solution, from the feature set S and the target category C, the association degree R_c(x^(i)) between each feature and the target category and the redundancy degree R_x and synergy degree S_x between features are computed, from which the sensitivity Sen of each feature is obtained; features are screened against the preset threshold th, the feature set is partitioned into a candidate set and an excluded set, and the two sets are further adjusted and optimized in the subsequent process. In this way, the relationships both between features and the target category and between features themselves are taken into account: features are selected through the association degree, the redundancy degree, and the synergy degree, and the features that play a key role in classification are retained. This helps reduce feature dimensionality and classification complexity, and can improve classification accuracy.
Brief description of the drawings
Fig. 1 is a flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 2 is a detailed flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of partitioning the candidate set and the excluded set in the feature selection method provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of adjusting the candidate set and the excluded set in the feature selection method provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments.
Aiming at the problem of high feature dimensionality or low classification precision in the prior art, the present invention provides a feature selection method for text classification.
As shown in Fig. 1, the text classification feature selection method provided by an embodiment of the present invention includes:
Step 1: obtain the feature set S and the target category C, compute the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sort S in descending order of R_c(x^(i));
Step 2: compute the redundancy degree R_x and the synergy degree S_x between every pair of features in the feature set S, compute the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, compare it with the preset threshold th and, with reference to the descending ordering of S, partition S into a candidate set S_sel and an excluded set S_exc according to th;
Step 3: compute the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
In the text classification feature selection method described in this embodiment of the present invention, from the feature set S and the target category C, the association degree R_c(x^(i)) between each feature and the target category and the redundancy degree R_x and synergy degree S_x between features are computed, from which the sensitivity Sen of each feature is obtained; features are screened against the preset threshold th, the feature set is partitioned into a candidate set and an excluded set, and the two sets are further adjusted and optimized in the subsequent process. In this way, the relationships both between features and the target category and between features themselves are taken into account; features are selected through the association degree, the redundancy degree, and the synergy degree, the features that play a key role in classification are retained, feature dimensionality and classification complexity are reduced, and classification accuracy can be improved.
In the present embodiment, as shown in Fig. 2, to obtain the feature set S and the target category C, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target category C must first be given as input.
In the present embodiment, the feature set S denotes the set of all features (each denoted x^(i), i.e. word vectors) used in text classification, i.e. S = (x^(1), x^(2), ..., x^(n)), where n denotes the number of features in S. A feature x^(i) is the column vector formed by the number of times the word corresponding to the feature occurs in each text, i.e. x^(i) = (x_1^(i), x_2^(i), ..., x_M^(i))^T, where M is the number of texts. The target category C denotes the column vector formed by the category corresponding to each text; the target category C is the category set.
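For concreteness, a toy instantiation of these inputs (the data and names are purely illustrative): each column of the matrix is one feature x^(i), and C holds the category of each text.

```python
import numpy as np

# X[d, i] = number of times word i occurs in text d, so X[:, i] is x^(i).
X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0],
              [0, 2, 2]])
# C[d] = category of text d.
C = np.array([0, 1, 0, 1])
```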
In the present embodiment, the association degree R_c(x^(i)) between feature x^(i) and the target category C is the mutual information between x^(i) and the target category C.
In the present embodiment, as an optional embodiment, computing the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C and sorting S in descending order of R_c(x^(i)) (step 1) includes:
Step 11: for each feature x^(i) in the feature set S, compute the association degree R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C;
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
In the present embodiment, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, c_k denotes the k-th category of the target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and category c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) appears within category c_k, and p(x^(i)) denotes the probability that feature x^(i) appears in the feature set S.
In the present embodiment, preferably, the probability p(x^(i), c_k) that feature x^(i) and category c_k occur together is approximated by the frequency with which the word corresponding to x^(i) in the c_k-category files occurs relative to all files;
where x_j^(i) denotes the j-th element of x^(i) (i.e. the number of times the word corresponding to x^(i) occurs in the j-th file), and x_m^(i) denotes the m-th element of x^(i) whose target category is c_k (i.e. the number of times the word corresponding to x^(i) occurs in the m-th c_k-category file).
In the present embodiment, preferably, the probability p(x^(i) | c_k) that feature x^(i) appears within category c_k is approximated by the frequency with which the word corresponding to x^(i) occurs in the c_k-category files.
In the present embodiment, preferably, the probability p(x^(i)) that feature x^(i) appears in the feature set S is approximated by the frequency with which the word corresponding to x^(i) occurs in all files.
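The frequency approximations above leave the normalisations implicit; the sketch below is one plausible, self-consistent reading (an assumption of this sketch), estimating all three quantities from per-class word counts and feeding them to the mutual_information sketch given earlier.

```python
import numpy as np

def estimate_probs(X, C, i):
    """Count-based estimates for term i of a document-term matrix X:
    returns (p(x^(i), c_k) per category, p(x^(i) | c_k) per category, p(x^(i)))."""
    n_all = X.sum()                                    # all words, all files
    cats = np.unique(C)
    n_i_ck = np.array([X[C == ck, i].sum() for ck in cats], dtype=float)
    n_ck = np.array([X[C == ck].sum() for ck in cats], dtype=float)
    p_joint = n_i_ck / n_all                           # ~ p(x^(i), c_k)
    p_cond = n_i_ck / np.maximum(n_ck, 1.0)            # ~ p(x^(i) | c_k), guard empty class
    p_x = X[:, i].sum() / n_all                        # ~ p(x^(i))
    return p_joint, p_cond, p_x

# Association degree of term i, using the mutual_information sketch above:
# r_c_i = mutual_information(*estimate_probs(X, C, i))
```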
In the present embodiment, as a further optional embodiment, as shown in Fig. 3, computing the redundancy degree R_x and synergy degree S_x between every pair of features in the feature set S, computing the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, comparing it with the preset threshold th, and partitioning S into the candidate set S_sel and the excluded set S_exc according to th (step 2) includes:
Step 21: add the first feature of the sorted feature set S to the candidate set S_sel and set the excluded set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature has the largest association degree R_c(x^(i));
Step 22: starting from the second feature of S, let x^(i) denote the current feature; compute the redundancy degree R_x and synergy degree S_x between x^(i) and every feature in the candidate set S_sel, and combine them with the association degree R_c(x^(i)) between the feature and the target category to compute the sensitivity Sen(x^(i)) of x^(i);
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to the candidate set S_sel, otherwise add x^(i) to the excluded set S_exc;
Step 24: if x^(i) is the last feature in S, the partition ends; otherwise let x^(i) be the next feature in S and return to step 22.
In the embodiment of the above text classification feature selection method, further, the redundancy degree R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy degree between x^(i) and x^(j), whose value is the smaller of 0 and the correlation gain.
In the embodiment of the above text classification feature selection method, further, the synergy degree S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the synergy degree between x^(i) and x^(j), whose value is the larger of 0 and the correlation gain.
In the embodiment of the above text classification feature selection method, further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) and I(x^(j); C) are computed with the same mutual-information formula given above for a feature and the target category C: I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target category C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
In the embodiment of the above text classification feature selection method, further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i) and x^(j) and category c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that x^(i) and x^(j) occur together within category c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur together in the feature set S.
In the present embodiment, preferably, the probability p(x^(i), x^(j), c_k) that features x^(i), x^(j) and category c_k occur together is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur together in the c_k-category files relative to all files;
where the number of times the two words occur together in the m-th c_k-category file is taken to be min(x_m^(i), x_m^(j)), the smaller of the m-th elements of x^(i) and x^(j) whose target category is c_k.
In the present embodiment, preferably, the probability p((x^(i), x^(j)) | c_k) that x^(i) and x^(j) occur together within category c_k is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur together in the c_k-category files.
In the present embodiment, preferably, the probability p(x^(i), x^(j)) that x^(i) and x^(j) occur together in the feature set S is approximated by the frequency with which the words corresponding to x^(i) and x^(j) occur together in all files.
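The pairwise counterparts follow the min-of-counts approximation just described; a sketch under the same assumptions as estimate_probs above:

```python
import numpy as np

def joint_estimate(X, C, i, j):
    """Pairwise counterparts of estimate_probs: joint occurrences of words i
    and j in a file are approximated by the element-wise minimum of their
    counts, as described above. Returns (p(x^(i),x^(j),c_k) per category,
    p((x^(i),x^(j))|c_k) per category, p(x^(i),x^(j)))."""
    xij = np.minimum(X[:, i], X[:, j]).astype(float)   # ~ joint counts per file
    n_all = X.sum()
    cats = np.unique(C)
    n_ij_ck = np.array([xij[C == ck].sum() for ck in cats])
    n_ck = np.array([X[C == ck].sum() for ck in cats], dtype=float)
    return (n_ij_ck / n_all,                           # ~ p(x^(i), x^(j), c_k)
            n_ij_ck / np.maximum(n_ck, 1.0),           # ~ p((x^(i), x^(j)) | c_k)
            xij.sum() / n_all)                         # ~ p(x^(i), x^(j))

# Correlation gain of the pair (i, j), reusing the earlier sketches:
# ig_ij = (mutual_information(*joint_estimate(X, C, i, j))
#          - mutual_information(*estimate_probs(X, C, i))
#          - mutual_information(*estimate_probs(X, C, j)))
```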
In the embodiment of the above text classification feature selection method, further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy degree R_x and the synergy degree S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy degree between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy degree between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target category C, and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
In the present embodiment, as shown in Fig. 4, as an optional embodiment, computing the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th (step 3) includes:
Step 31: set the pending set S_tbd to be empty, i.e. S_tbd = {}; let x^(k) be the first feature in the excluded set S_exc, and let x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the excluded set S_exc, compute the maximum synergy degree between the candidate feature x^(m) and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature attaining the maximum synergy degree with x^(m) is x^(k), add x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, let x^(m) be the next feature in S_sel and return to step 32;
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and …, then remove feature x^(k) from the excluded set S_exc, add it to the candidate set S_sel, and go to step 36; otherwise, if x^(j) is the last element of S_tbd, go directly to step 36; otherwise let x^(j) be the next element of S_tbd and return to step 35;
Step 36: if x^(k) is the last element of S_exc, return the current candidate set S_sel and excluded set S_exc as the result of the final feature selection; otherwise let x^(k) be the next element of S_exc and return to step 31.
In the present embodiment, according to steps 31-36, the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc is computed and compared with the preset threshold th, and S_sel and S_exc are adjusted according to th, yielding the new candidate set S_sel and excluded set S_exc; this reduces the effect that removing or adding a single feature has on the classification results.
In the present embodiment, the weight α of the redundancy degree R_x may default to 0.5, the weight β of the synergy degree S_x may default to 0.5, and the preset threshold th defaults to 0.01. The weights α and β and the preset threshold th are optimized and updated by a genetic algorithm during subsequent training and testing.
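Wiring the sketches above together on the toy inputs gives the whole pipeline under the default th = 0.01; the genetic-algorithm tuning of α, β, and th is not sketched here, and all names remain illustrative.

```python
# Estimators built from the earlier sketches (mutual_information,
# estimate_probs, joint_estimate) and the toy X, C defined above.
r_c = lambda i: mutual_information(*estimate_probs(X, C, i))
ig = lambda i, j: (mutual_information(*joint_estimate(X, C, i, j))
                   - r_c(i) - r_c(j))

order = sorted(range(X.shape[1]), key=r_c, reverse=True)  # step 1: sort by R_c
s_sel, s_exc = partition(order, r_c, ig, th=0.01)         # step 2: initial split
s_sel, s_exc = adjust(s_sel, s_exc, r_c, ig, th=0.01)     # step 3: adjustment
print("candidate features:", s_sel, "excluded features:", s_exc)
```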
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A feature selection method for text classification, characterized by comprising:
Step 1: obtaining a feature set S and a target category C, computing the association degree R_c(x^(i)) between each feature x^(i) in the feature set S and the target category C, and sorting S in descending order of R_c(x^(i));
Step 2: computing the redundancy degree R_x and the synergy degree S_x between every pair of features in the feature set S, computing the sensitivity Sen of each feature by combining them with the association degree R_c(x^(i)) between the feature and the target category, comparing it with a preset threshold th and, with reference to the descending ordering of S, partitioning S into a candidate set S_sel and an excluded set S_exc according to th;
Step 3: computing the sensitivity Sen between the features in the candidate set S_sel and the excluded set S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th.
2. The feature selection method for text classification according to claim 1, characterized in that step 1 includes:
Step 11: for each feature x^(i) in the feature set S, computing the association degree R_c(x^(i)) between x^(i) and the target category C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C;
Step 12: sorting the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in the feature set S and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
3. The feature selection method for text classification according to claim 2, characterized in that I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), c_k) denotes the probability that feature x^(i) and category c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) appears within category c_k, and p(x^(i)) denotes the probability that feature x^(i) appears in the feature set S.
4. The feature selection method for text classification according to claim 1, characterized in that the redundancy degree R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and R_x(x^(i); x^(j)) denotes the redundancy degree between features x^(i) and x^(j), whose value is the smaller of 0 and the correlation gain.
5. The feature selection method for text classification according to claim 1, characterized in that the synergy degree S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in the feature set S, and S_x(x^(i); x^(j)) denotes the synergy degree between features x^(i) and x^(j), whose value is the larger of 0 and the correlation gain.
6. The feature selection method for text classification according to claim 4 or 5, characterized in that IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target category C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target category C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target category C.
7. The feature selection method for text classification according to claim 6, characterized in that I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th category of the target category C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i) and x^(j) and category c_k occur together, p((x^(i), x^(j)) | c_k) denotes the probability that x^(i) and x^(j) occur together within category c_k, and p(x^(i), x^(j)) denotes the probability that x^(i) and x^(j) occur together in the feature set S.
8. The feature selection method for text classification according to claim 1, characterized in that step 2 includes:
Step 21: adding the first feature of the sorted feature set S to the candidate set S_sel and setting the excluded set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature having the largest association degree R_c(x^(i));
Step 22: starting from the second feature of S, letting x^(i) denote the current feature, computing the redundancy degree R_x and synergy degree S_x between x^(i) and every feature in the candidate set S_sel, and combining them with the association degree R_c(x^(i)) between the feature and the target category to compute the sensitivity Sen(x^(i)) of x^(i);
Step 23: comparing Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding x^(i) to the candidate set S_sel, otherwise adding x^(i) to the excluded set S_exc;
Step 24: if x^(i) is the last feature in S, ending the partition; otherwise letting x^(i) be the next feature in S and returning to step 22.
9. The feature selection method for text classification according to claim 8, characterized in that the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy degree R_x and the synergy degree S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy degree between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy degree between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target category C, and R_c(x^(i)) denotes the association degree between x^(i) and the target category C.
10. The feature selection method for text classification according to claim 1, characterized in that step 3 includes:
Step 31: setting the pending set S_tbd to be empty, i.e. S_tbd = {}; letting x^(k) be the first feature in the excluded set S_exc and x^(m) be the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the excluded set S_exc, computing the maximum synergy degree between the candidate feature x^(m) and all features in the feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature attaining the maximum synergy degree with x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, going to step 36; if S_tbd is not empty, letting x^(j) be the first feature in S_tbd and going to step 35; if x^(m) is not the last feature in S_sel, letting x^(m) be the next feature in S_sel and returning to step 32;
Step 35: for the feature x^(j) in S_tbd, updating its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
comparing Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and …, removing feature x^(k) from the excluded set S_exc, adding it to the candidate set S_sel, and going to step 36; otherwise, if x^(j) is the last element of S_tbd, going directly to step 36; otherwise letting x^(j) be the next element of S_tbd and returning to step 35;
Step 36: if x^(k) is the last element of S_exc, returning the current candidate set S_sel and excluded set S_exc as the result of the final feature selection; otherwise letting x^(k) be the next element of S_exc and returning to step 31.
CN201710181572.8A 2017-03-24 2017-03-24 Feature selection method for text classification Active CN107016073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181572.8A CN107016073B (en) 2017-03-24 2017-03-24 Feature selection method for text classification


Publications (2)

Publication Number Publication Date
CN107016073A 2017-08-04
CN107016073B 2019-06-28

Family

ID=59445053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181572.8A Feature selection method for text classification 2017-03-24 2017-03-24 (Active; granted as CN107016073B)

Country Status (1)

Country Link
CN: CN107016073B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140278409A1 (en) * 2004-07-30 2014-09-18 At&T Intellectual Property Ii, L.P. Preserving privacy in natural langauge databases
CN105184323A (en) * 2015-09-15 2015-12-23 广州唯品会信息科技有限公司 Feature selection method and system
CN105260437A (en) * 2015-09-30 2016-01-20 陈一飞 Text classification feature selection method and application thereof to biomedical text classification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周茜 et al., "中文文本分类中的特征选择研究" (A study of feature selection for Chinese text categorization), 《中文信息学报》 (Journal of Chinese Information Processing) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934251A (en) * 2018-12-27 2019-06-25 国家计算机网络与信息安全管理中心广东分中心 A kind of method, identifying system and storage medium for rare foreign languages text identification
CN109934251B (en) * 2018-12-27 2021-08-06 国家计算机网络与信息安全管理中心广东分中心 Method, system and storage medium for recognizing text in Chinese language
CN111612385A (en) * 2019-02-22 2020-09-01 北京京东尚科信息技术有限公司 Method and device for clustering to-be-delivered articles
CN111612385B (en) * 2019-02-22 2024-04-16 北京京东振世信息技术有限公司 Method and device for clustering articles to be distributed

Also Published As

Publication number Publication date
CN107016073B (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN104750844B Method and apparatus for generating text feature vectors based on TF-IGM, and text classification method and device
US20200293924A1 (en) Gbdt model feature interpretation method and apparatus
CN110555717A Method for mining potential purchased goods and categories of users based on user behavior characteristics
Al Qadi et al. Arabic text classification of news articles using classical supervised classifiers
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN103617429A (en) Sorting method and system for active learning
US10387805B2 (en) System and method for ranking news feeds
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN103617435A (en) Image sorting method and system for active learning
CN103838798A (en) Page classification system and method
CN109933619A Semi-supervised classification prediction method
US20230325632A1 (en) Automated anomaly detection using a hybrid machine learning system
CN105359172A (en) Calculating a probability of a business being delinquent
US20230138491A1 (en) Continuous learning for document processing and analysis
CN107016073A Feature selection method for text classification
CN109101574B (en) Task approval method and system of data leakage prevention system
CN110781297A (en) Classification method of multi-label scientific research papers based on hierarchical discriminant trees
CN111488400B (en) Data classification method, device and computer readable storage medium
US20230134218A1 (en) Continuous learning for document processing and analysis
CN113641823B (en) Text classification model training, text classification method, device, equipment and medium
CN113033170B (en) Form standardization processing method, device, equipment and storage medium
CN104778478A (en) Handwritten numeral identification method
CN103207893A (en) Classification method of two types of texts on basis of vector group mapping
CN111539576B (en) Risk identification model optimization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant