CN107016073A - A text classification feature selection method - Google Patents
A text classification feature selection method
- Publication number
- CN107016073A (application CN201710181572.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Abstract
The present invention provides a text classification feature selection method that reduces feature dimensionality and classification complexity while improving classification accuracy. The method includes: obtaining a feature set S and a target class C; computing the association degree R_c(x^(i)) between each feature x^(i) in S and the target class C, and sorting S in descending order of R_c(x^(i)); computing the redundancy R_x and the synergy S_x between every two features in S, combining them with the association degree R_c(x^(i)) to compute each feature's sensitivity Sen, comparing Sen with a preset threshold th, and, using the descending sort of S, dividing S into a candidate set S_sel and an exclusion set S_exc according to th; computing the sensitivity Sen of the features in S_sel and S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th. The invention applies to feature selection for text classification in machine learning.
Description
Technical field
The present invention relates to the field of machine-learning text classification, and in particular to a text classification feature selection method.
Background technology
As the scale of the Internet keeps expanding, the information resources it aggregates keep increasing. To manage and exploit these resources effectively, content-based information retrieval and data mining have long attracted attention. Text classification is an important foundation of information retrieval and text data mining; its main task is to assign documents of unknown class, based on their words and content, to one or more previously given classes. However, the large number of training samples and the high dimensionality of the feature vectors make text classification a machine-learning problem of very high time and space complexity. Feature selection is therefore needed to reduce the feature dimensionality while preserving classification performance as far as possible.
Feature selection is an important step in data preprocessing. Among conventional feature selection methods for text classification, the chi-square (Chi-Square) test sets up a null hypothesis that a word is uncorrelated with the target class and selects as features the words that deviate most from this hypothesis. However, it only counts whether a word appears in a document, regardless of how many times, which biases it toward low-frequency words. The mutual information (Mutual Information) method selects features by measuring how much information a word's presence contributes about the target class, but it considers only the association between a word and the target class and ignores possible dependencies between words. The TF-IDF (Term Frequency-Inverse Document Frequency) method assesses a word's importance from its frequency within a file and its distribution across all files, and screens features accordingly; but it simply assumes that words with low document frequency are more important and words with high document frequency are less useful, so its precision is limited. Other feature selection methods, such as information gain, odds ratio, text weight of evidence, and expected cross entropy, mostly consider only the correlation between a word and the target class, or only the correlation between words, and so tend to suffer from insufficient dimensionality reduction or low classification precision.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text classification feature selection method that overcomes the high feature dimensionality and low classification precision of the prior art.
To solve the above technical problem, an embodiment of the present invention provides a text classification feature selection method, including:
Step 1: obtain the feature set S and the target class C; compute the association degree R_c(x^(i)) between each feature x^(i) in S and the target class C, and sort S in descending order of R_c(x^(i)).
Step 2: compute the redundancy R_x and the synergy S_x between every two features in S; combine them with the association degree R_c(x^(i)) between feature and target class to compute each feature's sensitivity Sen; compare Sen with a preset threshold th and, using the descending sort of S, divide S into a candidate set S_sel and an exclusion set S_exc according to th.
Step 3: compute the sensitivity Sen of the features in S_sel and S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
Further, step 1 includes:
Step 11: for each feature x^(i) in S, compute the association degree R_c(x^(i)) between x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between x^(i) and C.
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S.
Here x^(i) denotes the i-th feature in S, and R_c(x^(i)) denotes the association degree between x^(i) and C.
Further, I(x^(i); C) is expressed as:
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where c_k denotes the k-th class of the target class C, p(x^(i), c_k) the probability that feature x^(i) and class c_k occur together, p(x^(i) | c_k) the probability that x^(i) appears within class c_k, and p(x^(i)) the probability that x^(i) appears in the feature set S.
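As an illustrative sketch (not code from the patent), the mutual information above can be evaluated directly from the three probability estimates it names; the dict-based signature here is an assumption made for illustration:

```python
import math

def mutual_information(p_joint, p_cond, p_x):
    """I(x^(i); C) = sum over classes c_k of
    p(x^(i), c_k) * log( p(x^(i) | c_k) / p(x^(i)) ).

    p_joint[c] = p(x^(i), c_k), p_cond[c] = p(x^(i) | c_k), p_x = p(x^(i));
    classes with zero joint probability contribute nothing."""
    return sum(p_joint[c] * math.log(p_cond[c] / p_x)
               for c in p_joint if p_joint[c] > 0)
```

For instance, a word concentrated entirely in class 0, with p_joint = {0: 1.0, 1: 0.0}, p_cond = {0: 1.0, 1: 0.0}, and p_x = 0.5, yields I = log 2; a word spread evenly across classes yields 0.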
Further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and R_x(x^(i); x^(j)) denotes the redundancy between x^(i) and x^(j), taken as the smaller of 0 and the correlation gain.
Further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and S_x(x^(i); x^(j)) denotes the synergy between x^(i) and x^(j), taken as the larger of 0 and the correlation gain.
Further, IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target class C; I(x^(j); C) the mutual information between feature x^(j) and C; and I((x^(i), x^(j)); C) the mutual information between the feature pair (x^(i), x^(j)) and C.
Further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th class of C, p(x^(i), x^(j), c_k) the probability that features x^(i), x^(j) and class c_k occur together, p((x^(i), x^(j)) | c_k) the probability that x^(i) and x^(j) occur together within class c_k, and p(x^(i), x^(j)) the probability that x^(i) and x^(j) occur together in the feature set S.
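A hedged sketch of the correlation gain and its split into redundancy and synergy. The count-based probability approximations (word frequencies, with the joint occurrence in a file taken as the smaller of the two counts) follow the preferred embodiment described later in this document; the function names and the list-of-rows layout of `X` (X[j][i] = count of word i in file j) are illustrative assumptions:

```python
import math

def _mi(counts, totals, y):
    """Plug-in MI between an occurrence event and the class labels.
    counts[j]: occurrences of the event in file j; totals[j]: words in file j."""
    event_total = sum(counts)
    if event_total == 0:
        return 0.0
    p_x = event_total / sum(totals)                  # p(event) over all files
    mi = 0.0
    for c in set(y):
        n_c = sum(n for n, lab in zip(counts, y) if lab == c)
        if n_c == 0:
            continue
        t_c = sum(t for t, lab in zip(totals, y) if lab == c)
        p_joint = n_c / event_total                  # p(event, c_k) approximation
        p_cond = n_c / t_c                           # p(event | c_k) approximation
        mi += p_joint * math.log(p_cond / p_x)
    return mi

def interaction_gain(X, y, i, j):
    """IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) - I(x^(i); C) - I(x^(j); C)."""
    totals = [sum(row) for row in X]
    mi_i = _mi([row[i] for row in X], totals, y)
    mi_j = _mi([row[j] for row in X], totals, y)
    mi_ij = _mi([min(row[i], row[j]) for row in X], totals, y)
    return mi_ij - mi_i - mi_j

def redundancy(X, y, i, j):
    return min(0.0, interaction_gain(X, y, i, j))    # R_x: never positive

def synergy(X, y, i, j):
    return max(0.0, interaction_gain(X, y, i, j))    # S_x: never negative
```

By construction R_x + S_x = IG, with R_x ≤ 0 ≤ S_x, matching the min/max definitions above.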
Further, step 2 includes:
Step 21: add the first feature of S to the candidate set S_sel and set the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; after the sort of step 1, the first feature has the largest association degree R_c.
Step 22: starting from the second feature of S, with x^(i) denoting the current feature, compute the redundancy R_x and synergy S_x between x^(i) and every feature in S_sel, and combine them with the association degree R_c(x^(i)) between feature and target class to compute the sensitivity Sen(x^(i)).
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to S_sel; otherwise add x^(i) to S_exc.
Step 24: if x^(i) is the last feature of S, the division ends; otherwise set x^(i) to the next feature in S and return to step 22.
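Steps 21-24 amount to a single greedy pass over the relevance-sorted feature list. In this sketch `sen` is a hypothetical callable that evaluates Sen(x^(i)) against the current candidate set:

```python
def partition(sorted_features, sen, th):
    """Divide a relevance-sorted feature list into S_sel / S_exc (steps 21-24).

    sorted_features: feature indices in descending R_c order;
    sen(i, candidates): Sen(x^(i)) computed against the candidate set."""
    candidates = [sorted_features[0]]      # step 21: top feature seeds S_sel
    excluded = []                          # step 21: S_exc starts empty
    for i in sorted_features[1:]:          # steps 22-24: walk remaining features
        if sen(i, candidates) > th:        # step 23: sensitive enough -> candidate
            candidates.append(i)
        else:
            excluded.append(i)
    return candidates, excluded
```

With a stub sensitivity of 0.5 for feature 1 and 0.001 for feature 2 and th = 0.01, features [0, 1, 2] split into candidates [0, 1] and excluded [2].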
Further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of x^(i) to the target class C, and R_c(x^(i)) denotes the association degree between x^(i) and C.
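The sensitivity formula can be sketched directly. The argument layout (a precomputed R_c value plus lists of pairwise R_x and S_x values) and the 0.5 defaults — the values the detailed description later suggests — are illustrative assumptions:

```python
def sensitivity(rc_i, redundancies, synergies, alpha=0.5, beta=0.5):
    """Sen(x^(i)) = R_c(x^(i)) + alpha * min_j R_x(x^(i); x^(j))
                               + beta  * max_j S_x(x^(i); x^(j)),  j != i."""
    return rc_i + alpha * min(redundancies) + beta * max(synergies)
```

For example, R_c = 0.4 with pairwise redundancies [-0.2, -0.1] and synergies [0.0, 0.3] gives 0.4 + 0.5·(-0.2) + 0.5·0.3 = 0.45.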
Further, step 3 includes:
Step 31: set the pending set S_tbd to empty, i.e. S_tbd = {}; let x^(k) be the first feature in the exclusion set S_exc and x^(m) the first feature in the candidate set S_sel.
Step 32: for the feature x^(k) in S_exc, compute the maximum synergy between the candidate feature x^(m) and all features in S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m.
Step 33: if the feature achieving x^(m)'s maximum synergy is x^(k), add x^(m) to the pending set S_tbd.
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, set x^(m) to the next feature in S_sel and return to step 32.
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and [further condition illegible in source], remove x^(k) from S_exc, add it to S_sel, and go to step 36. Otherwise, if x^(j) is the last element of S_tbd, go directly to step 36; otherwise set x^(j) to the next element of S_tbd and return to step 35.
Step 36: if x^(k) is the last element of S_exc, return the current S_sel and S_exc as the final feature selection result; otherwise set x^(k) to the next element of S_exc and return to step 31.
The above technical solution of the present invention has the following beneficial effects:
In the above scheme, from the feature set S and the target class C, the association degree R_c(x^(i)) between each feature and the target class and the redundancy R_x and synergy S_x between features are computed, from which each feature's sensitivity Sen is obtained. Features are screened against the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and the two sets are further adjusted and optimized in a subsequent pass. In this way the relationships both between features and the target class and among the features themselves are taken into account; selecting features through the association degree, redundancy, and synergy retains the features that play a key role in classification, helps reduce feature dimensionality and classification complexity, and can improve classification accuracy.
Brief description of the drawings
Fig. 1 is a flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 2 is a detailed flow diagram of the text classification feature selection method provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of dividing the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of adjusting the candidate set and the exclusion set in the feature selection method provided by an embodiment of the present invention.
Embodiment
To make the technical problem to be solved, the technical scheme, and the advantages of the present invention clearer, a detailed description is given below in conjunction with the accompanying drawings and specific embodiments.
Aiming at the prior-art problems of high feature dimensionality or low classification precision, the present invention provides a text classification feature selection method.
As shown in Fig. 1, the text classification feature selection method provided by an embodiment of the present invention includes:
Step 1: obtain the feature set S and the target class C; compute the association degree R_c(x^(i)) between each feature x^(i) in S and the target class C, and sort S in descending order of R_c(x^(i)).
Step 2: compute the redundancy R_x and the synergy S_x between every two features in S; combine them with the association degree R_c(x^(i)) between feature and target class to compute each feature's sensitivity Sen; compare Sen with the preset threshold th and, using the descending sort of S, divide S into a candidate set S_sel and an exclusion set S_exc according to th.
Step 3: compute the sensitivity Sen of the features in S_sel and S_exc, compare it with the preset threshold th, and adjust S_sel and S_exc according to th.
In the text classification feature selection method of this embodiment of the present invention, the association degree R_c(x^(i)) between features and the target class and the redundancy R_x and synergy S_x between features are computed from the feature set S and the target class C, yielding each feature's sensitivity Sen. Features are screened against the preset threshold th, the feature set is divided into a candidate set and an exclusion set, and both are further adjusted and optimized in a subsequent pass. Relationships between features and the target class and among the features themselves are thus all considered; selecting features through the association degree, redundancy, and synergy retains the features that play a key role in classification, helps reduce feature dimensionality and classification complexity, and improves classification accuracy.
In this embodiment, as shown in Fig. 2, to obtain the feature set S and the target class C, the feature set S = (x^(1), x^(2), ..., x^(n)) and the target class C are input first.
In this embodiment, the feature set S denotes the set of all features used during text classification, a single feature x^(i) being a word vector: S = (x^(1), x^(2), ..., x^(n)), where n is the number of features in S. The feature x^(i) is the column vector formed by the number of times the word corresponding to feature i occurs in each text; the target class C is the column vector formed by the class label corresponding to each text, i.e. C is the set of class labels.
In this embodiment, the association degree R_c(x^(i)) between feature x^(i) and the target class C is the mutual information between x^(i) and C.
In this embodiment, as an alternative embodiment, computing the association degree R_c(x^(i)) between each feature x^(i) in S and the target class C and sorting S in descending order of R_c(x^(i)) (step 1) includes:
Step 11: for each feature x^(i) in S, compute R_c(x^(i)) according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between x^(i) and C.
Step 12: sort the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S.
Here x^(i) denotes the i-th feature in S, and R_c(x^(i)) denotes the association degree between x^(i) and C.
In this embodiment,
I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]
where I(x^(i); C) denotes the mutual information between feature x^(i) and the target class C, c_k the k-th class of C, p(x^(i), c_k) the probability that x^(i) and class c_k occur together, p(x^(i) | c_k) the probability that x^(i) appears within class c_k, and p(x^(i)) the probability that x^(i) appears in the feature set S.
In this embodiment, preferably, the probability p(x^(i), c_k) that feature x^(i) and class c_k occur together is approximated by the frequency, over all files, of the occurrences of x^(i)'s word within files of class c_k:
p(x^(i), c_k) ≈ Σ_m x_m^(i,k) / Σ_j x_j^(i)
where x_j^(i) denotes the j-th element of x^(i) (the number of times x^(i)'s word occurs in the j-th file), and x_m^(i,k) denotes the m-th element of x^(i) whose file belongs to class c_k (the number of times the word occurs in the m-th file of class c_k).
Preferably, the probability p(x^(i) | c_k) that x^(i) appears within class c_k is approximated by the frequency with which x^(i)'s word occurs within the files of class c_k.
Preferably, the probability p(x^(i)) that x^(i) appears in S is approximated by the frequency with which x^(i)'s word occurs over all files.
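A sketch of these count-based estimates, assuming X[j][i] holds the count of word i in file j and y[j] the class of file j. Normalizing p(x^(i) | c_k) by the total word count of the class-c_k files is my reading of "frequency within the files of class c_k", not something the text states explicitly:

```python
def probability_estimates(X, y, i, c):
    """Frequency approximations for feature i and class c:
    p(x, c): word-i count in class-c files over word-i count in all files;
    p(x | c): word-i count in class-c files over all words in class-c files;
    p(x):    word-i count in all files over all words in all files."""
    word_all = sum(row[i] for row in X)                        # word i, every file
    words_all = sum(sum(row) for row in X)                     # every word, every file
    word_c = sum(row[i] for row, lab in zip(X, y) if lab == c)
    words_c = sum(sum(row) for row, lab in zip(X, y) if lab == c)
    return word_c / word_all, word_c / words_c, word_all / words_all
```

For a word that occurs only in the single class-0 file of a two-file corpus, the estimates come out as p(x, c_0) = 1.0, p(x | c_0) = 1.0, p(x) = 0.5.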
In this embodiment, as a further alternative embodiment and as shown in Fig. 3, computing the redundancy R_x and synergy S_x between every two features in S, computing each feature's sensitivity Sen in combination with the association degree R_c(x^(i)) between feature and target class, comparing Sen with the preset threshold th, and dividing S into the candidate set S_sel and the exclusion set S_exc according to th (step 2) includes:
Step 21: add the first feature of S to the candidate set S_sel and set the exclusion set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}; the first feature has the largest association degree R_c.
Step 22: starting from the second feature of S, with x^(i) denoting the current feature, compute the redundancy R_x and synergy S_x between x^(i) and all features in S_sel, and combine them with the association degree R_c(x^(i)) to compute the sensitivity Sen(x^(i)).
Step 23: compare Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, add x^(i) to S_sel; otherwise add x^(i) to S_exc.
Step 24: if x^(i) is the last feature of S, the division ends; otherwise set x^(i) to the next feature in S and return to step 22.
In the above embodiment of the text classification feature selection method, further, the redundancy R_x is expressed as:
R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j
where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in S, and R_x(x^(i); x^(j)) denotes the redundancy between x^(i) and x^(j), taken as the smaller of 0 and the correlation gain.
In the above embodiment, further, the synergy S_x is expressed as:
S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j
where S_x(x^(i); x^(j)) denotes the synergy between x^(i) and x^(j), taken as the larger of 0 and the correlation gain.
In the above embodiment, further, the correlation gain IG(x^(i); x^(j); C) is expressed as:
IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)
where I(x^(i); C) and I(x^(j); C) are computed with the same mutual information formula given above for a feature and the target class C: I(x^(i); C) denotes the mutual information between x^(i) and C, I(x^(j); C) the mutual information between x^(j) and C, and I((x^(i), x^(j)); C) the mutual information between the feature pair (x^(i), x^(j)) and C.
In the above embodiment, further, I((x^(i), x^(j)); C) is expressed as:
I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p((x^(i), x^(j)) | c_k) / p(x^(i), x^(j)) ]
where c_k denotes the k-th class of C, p(x^(i), x^(j), c_k) the probability that x^(i), x^(j) and class c_k occur together, p((x^(i), x^(j)) | c_k) the probability that x^(i) and x^(j) occur together within class c_k, and p(x^(i), x^(j)) the probability that x^(i) and x^(j) occur together in S.
In this embodiment, preferably, the probability p(x^(i), x^(j), c_k) that x^(i), x^(j) and class c_k occur together is approximated by the frequency, over all files, with which the words of x^(i) and x^(j) occur together within files of class c_k:
p(x^(i), x^(j), c_k) ≈ Σ_m min(x_m^(i,k), x_m^(j,k)) / Σ_l min(x_l^(i), x_l^(j))
where min(x_m^(i,k), x_m^(j,k)) is the smaller of the m-th elements of x^(i) and x^(j) among files of class c_k (i.e. the smaller of the two words' occurrence counts in the m-th file of class c_k).
Preferably, the probability p((x^(i), x^(j)) | c_k) that x^(i) and x^(j) occur together within class c_k is approximated by the frequency with which their words occur together within the files of class c_k.
Preferably, the probability p(x^(i), x^(j)) that x^(i) and x^(j) occur together in S is approximated by the frequency with which their words occur together over all files.
In the above embodiment of the text classification feature selection method, further, the sensitivity Sen(x^(i)) is expressed as:
Sen(x^(i)) = R_c(x^(i)) + α·min(R_x(x^(i); x^(j))) + β·max(S_x(x^(i); x^(j))), j ≠ i
where α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between x^(i) and the remaining features, max(S_x(x^(i); x^(j))) the maximum synergy between x^(i) and the remaining features, Sen(x^(i)) the sensitivity of x^(i) to the target class C, and R_c(x^(i)) the association degree between x^(i) and C.
In this embodiment, as shown in Fig. 4 and as another alternative embodiment, computing the sensitivity Sen of the features in S_sel and S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th (step 3) includes:
Step 31: set the pending set S_tbd to empty, i.e. S_tbd = {}; let x^(k) be the first feature in the exclusion set S_exc and x^(m) the first feature in the candidate set S_sel.
Step 32: for the feature x^(k) in S_exc, compute the maximum synergy between the candidate feature x^(m) and all features in S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m.
Step 33: if the feature achieving x^(m)'s maximum synergy is x^(k), add x^(m) to S_tbd.
Step 34: if x^(m) is the last feature in S_sel and S_tbd is empty, go to step 36; if S_tbd is not empty, let x^(j) be the first feature in S_tbd and go to step 35; if x^(m) is not the last feature in S_sel, set x^(m) to the next feature in S_sel and return to step 32.
Step 35: for the feature x^(j) in S_tbd, update its sensitivity as follows:
Sen(x^(j)) = R_c(x^(j)) + α·min(R_x(x^(j); x^(n))) + β·max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k
Compare Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and [further condition illegible in source], remove x^(k) from S_exc, add it to S_sel, and go to step 36. Otherwise, if x^(j) is the last element of S_tbd, go directly to step 36; otherwise set x^(j) to the next element of S_tbd and return to step 35.
Step 36: if x^(k) is the last element of S_exc, return the current S_sel and S_exc as the final feature selection result; otherwise set x^(k) to the next element of S_exc and return to step 31.
In this embodiment, through steps 31-36 the sensitivity Sen of the features in S_sel and S_exc is computed and compared with the preset threshold th, and S_sel and S_exc are adjusted accordingly, yielding the new S_sel and S_exc; this mitigates the effect that removing or adding individual features has on the classification result.
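The core of steps 31-33 — collecting the candidates whose best synergy partner is the currently excluded feature — can be sketched as follows; `syn(m, i)` is a hypothetical callable returning S_x(x^(m); x^(i)):

```python
def pending_set(k, candidates, all_features, syn):
    """S_tbd for excluded feature k (steps 31-33): the candidates x^(m) whose
    maximum synergy max(S_x(x^(m); x^(i))), i != m, is achieved at x^(k)."""
    tbd = []
    for m in candidates:
        # partner achieving the maximum synergy with x^(m) over all other features
        partner = max((i for i in all_features if i != m),
                      key=lambda i: syn(m, i))
        if partner == k:
            tbd.append(m)
    return tbd
```

Each member of S_tbd then has its sensitivity recomputed with x^(k) excluded (step 35), which probes whether the candidate's sensitivity depended on its synergy with the excluded feature.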
In this embodiment, the weight α of the redundancy R_x may default to 0.5; the weight β of the synergy S_x may default to 0.5; and the preset threshold th defaults to 0.01. The weights α and β and the preset threshold th are optimized and updated by a genetic algorithm during subsequent training and testing.
The above are preferred embodiments of the present invention. It should be noted that those skilled in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A text classification feature selection method, characterized by including:
Step 1: obtaining the feature set S and the target class C; computing the association degree R_c(x^(i)) between each feature x^(i) in S and the target class C; and sorting S in descending order of R_c(x^(i));
Step 2: computing the redundancy R_x and the synergy S_x between every two features in S; combining them with the association degree R_c(x^(i)) between feature and target class to compute each feature's sensitivity Sen; comparing Sen with a preset threshold th; and, using the descending sort of S, dividing S into a candidate set S_sel and an exclusion set S_exc according to th;
Step 3: computing the sensitivity Sen of the features in S_sel and S_exc, comparing it with the preset threshold th, and adjusting S_sel and S_exc according to th.
2. The text classification feature selection method according to claim 1, characterized in that step 1 includes:
Step 11: for each feature x^(i) in S, computing the association degree R_c(x^(i)) between x^(i) and the target class C according to the formula R_c(x^(i)) = I(x^(i); C), where I(x^(i); C) denotes the mutual information between x^(i) and C;
Step 12: sorting the features in S from largest to smallest R_c(x^(i)), obtaining the sorted feature set S;
where x^(i) denotes the i-th feature in S, and R_c(x^(i)) denotes the association degree between x^(i) and C.
3. The text classification feature selection method according to claim 2, characterized in that I(x^(i); C) is expressed as:

I(x^(i); C) = Σ_k p(x^(i), c_k) · log[ p(x^(i) | c_k) / p(x^(i)) ]

where c_k denotes the k-th class of the target classification C, p(x^(i), c_k) denotes the probability that feature x^(i) and class c_k occur together, p(x^(i) | c_k) denotes the probability that feature x^(i) occurs within class c_k, and p(x^(i)) denotes the probability that feature x^(i) occurs in feature set S.
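Claims 2 and 3 together define the degree of association as the empirical mutual information between a term feature and the class labels, followed by a descending sort. A minimal Python sketch under stated assumptions: plug-in frequency estimates and log base 2 (the claims fix neither), and a `rank_features` helper name introduced here for illustration:

```python
import math

def mutual_information(values, labels):
    """Empirical I(x; C) = sum_k p(x, c_k) * log2(p(x | c_k) / p(x)),
    matching the claim-3 formula (base-2 logarithm assumed)."""
    n = len(labels)
    score = 0.0
    for v in set(values):
        p_v = values.count(v) / n
        for c in set(labels):
            # joint probability that the feature takes value v in class c
            p_joint = sum(1 for vi, ci in zip(values, labels)
                          if vi == v and ci == c) / n
            if p_joint > 0:
                p_v_given_c = p_joint / (labels.count(c) / n)
                score += p_joint * math.log2(p_v_given_c / p_v)
    return score

def rank_features(feature_columns, labels):
    """Steps 11-12: score every feature column, sort descending."""
    scores = {name: mutual_information(col, labels)
              for name, col in feature_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

For intuition: a binary feature that exactly tracks a balanced binary class scores 1 bit, while a feature independent of the class scores 0, so the sort puts the most class-informative terms first.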
4. The text classification feature selection method according to claim 1, characterized in that the redundancy R_x is expressed as:

R_x(x^(i); x^(j)) = min(0, IG(x^(i); x^(j); C)), i ≠ j

where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in feature set S, and R_x(x^(i); x^(j)) denotes the redundancy between features x^(i) and x^(j), whose value is the smaller of 0 and the correlation gain.
5. The text classification feature selection method according to claim 1, characterized in that the synergy S_x is expressed as:

S_x(x^(i); x^(j)) = max(0, IG(x^(i); x^(j); C)), i ≠ j

where IG(x^(i); x^(j); C) denotes the correlation gain between the i-th feature x^(i) and the j-th feature x^(j) in feature set S, and S_x(x^(i); x^(j)) denotes the synergy between features x^(i) and x^(j), whose value is the larger of 0 and the correlation gain.
6. The text classification feature selection method according to claim 4 or 5, characterized in that IG(x^(i); x^(j); C) is expressed as:

IG(x^(i); x^(j); C) = I((x^(i), x^(j)); C) − I(x^(i); C) − I(x^(j); C)

where I(x^(i); C) denotes the mutual information between feature x^(i) and the target classification C, I(x^(j); C) denotes the mutual information between feature x^(j) and the target classification C, and I((x^(i), x^(j)); C) denotes the mutual information between the feature pair (x^(i), x^(j)) and the target classification C.
7. The text classification feature selection method according to claim 6, characterized in that I((x^(i), x^(j)); C) is expressed as:

I((x^(i), x^(j)); C) = Σ_k p(x^(i), x^(j), c_k) · log[ p(x^(i), x^(j) | c_k) / p(x^(i), x^(j)) ]

where c_k denotes the k-th class of the target classification C, p(x^(i), x^(j), c_k) denotes the probability that features x^(i), x^(j) and class c_k occur together, p(x^(i), x^(j) | c_k) denotes the probability that features x^(i) and x^(j) occur together within class c_k, and p(x^(i), x^(j)) denotes the probability that features x^(i) and x^(j) occur together in feature set S.
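Claims 4 through 7 split the pairwise correlation gain IG into a redundancy (its negative part) and a synergy (its positive part), with the joint mutual information of claim 7 computed by treating a feature pair as one compound variable. A sketch under the same empirical-frequency and base-2 assumptions as above:

```python
import math

def mutual_information(values, labels):
    """Empirical I(v; C); `values` may hold scalars or (x_i, x_j) tuples."""
    n = len(labels)
    score = 0.0
    for v in set(values):
        p_v = values.count(v) / n
        for c in set(labels):
            p_joint = sum(1 for vi, ci in zip(values, labels)
                          if vi == v and ci == c) / n
            if p_joint > 0:
                # p(v | c) / p(v), as in the claim-3 and claim-7 formulas
                score += p_joint * math.log2(p_joint / (labels.count(c) / n) / p_v)
    return score

def correlation_gain(xi, xj, labels):
    """IG(x_i; x_j; C) = I((x_i, x_j); C) - I(x_i; C) - I(x_j; C) (claim 6)."""
    pair = list(zip(xi, xj))  # compound variable for the joint term
    return (mutual_information(pair, labels)
            - mutual_information(xi, labels)
            - mutual_information(xj, labels))

def redundancy(xi, xj, labels):
    """Claim 4: the non-positive part of the correlation gain."""
    return min(0.0, correlation_gain(xi, xj, labels))

def synergy(xi, xj, labels):
    """Claim 5: the non-negative part of the correlation gain."""
    return max(0.0, correlation_gain(xi, xj, labels))
```

Two identical copies of a perfectly predictive feature give IG = −1 bit (pure redundancy, zero synergy), while the two inputs of an XOR-labelled pair give IG = +1 bit (pure synergy, zero redundancy), which is exactly the distinction the min/max split encodes.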
8. The text classification feature selection method according to claim 1, characterized in that step 2 comprises:
Step 21: adding the first feature of feature set S to the candidate set S_sel and setting the excluded set S_exc to the empty set, i.e. S_sel = {x^(1)}, S_exc = {}, the first feature being the one with the largest degree of association R_c(x^(i));
Step 22: starting from the second feature of feature set S, denoting the current feature by x^(i), calculating the redundancy R_x and the synergy S_x between feature x^(i) and every feature in the candidate set S_sel, and combining them with the degree of association R_c(x^(i)) between the feature and the target classification to calculate the sensitivity Sen(x^(i)) of feature x^(i);
Step 23: comparing the sensitivity Sen(x^(i)) with the preset threshold th; if Sen(x^(i)) > th, adding feature x^(i) to the candidate set S_sel, otherwise adding feature x^(i) to the excluded set S_exc;
Step 24: if x^(i) is the last feature in feature set S, ending the division; otherwise, setting x^(i) to the next feature in feature set S and returning to step 22.
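Steps 21 to 24 amount to a single forward pass over the sorted feature list: the top-ranked feature seeds the candidate set, and each later feature joins only if its sensitivity against the current candidates clears the threshold. A sketch with the sensitivity computation abstracted behind a callable (the callable's signature is an assumption, not fixed by the claims):

```python
def partition(sorted_features, sensitivity, th):
    """Split a descending-sorted feature list into (S_sel, S_exc).

    `sensitivity(f, s_sel)` scores feature f against the current
    candidate set, per claim 9; `th` is the preset threshold.
    """
    s_sel = [sorted_features[0]]   # step 21: top-ranked feature seeds S_sel
    s_exc = []                     # step 21: S_exc starts empty
    for f in sorted_features[1:]:  # steps 22-24: one pass over the rest
        if sensitivity(f, s_sel) > th:
            s_sel.append(f)        # step 23: above threshold -> candidate
        else:
            s_exc.append(f)        # step 23: otherwise -> excluded
    return s_sel, s_exc
```

With a toy sensitivity that returns precomputed scores, say {'a': 0.9, 'c': 0.6, 'b': 0.2} and th = 0.5, `partition(['a', 'c', 'b'], ...)` yields (['a', 'c'], ['b']).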
9. The text classification feature selection method according to claim 8, characterized in that the sensitivity Sen(x^(i)) is expressed as:

Sen(x^(i)) = R_c(x^(i)) + α · min(R_x(x^(i); x^(j))) + β · max(S_x(x^(i); x^(j))), j ≠ i

where α and β are the weights of the redundancy R_x and the synergy S_x respectively, min(R_x(x^(i); x^(j))) denotes the minimum redundancy between feature x^(i) and the remaining features, max(S_x(x^(i); x^(j))) denotes the maximum synergy between feature x^(i) and the remaining features, Sen(x^(i)) denotes the sensitivity of feature x^(i) to the target classification C, and R_c(x^(i)) denotes the degree of association between feature x^(i) and the target classification C.
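The claim-9 sensitivity combines the three quantities defined earlier: class association, the worst-case redundancy penalty, and the best-case synergy bonus. A sketch assuming precomputed pairwise tables (dicts keyed by feature pairs) and default weights α = β = 1; both the table layout and the defaults are assumptions, since the claims leave the weights open:

```python
def sensitivity(i, assoc, red, syn, others, alpha=1.0, beta=1.0):
    """Sen(x_i) = R_c(x_i) + alpha * min_j R_x(x_i; x_j)
                           + beta  * max_j S_x(x_i; x_j),  j != i.

    assoc: R_c per feature; red/syn: pairwise tables keyed by (i, j);
    others: the features j over which the min/max is taken.
    """
    js = [j for j in others if j != i]  # enforce j != i
    return (assoc[i]
            + alpha * min(red[(i, j)] for j in js)
            + beta * max(syn[(i, j)] for j in js))
```

For example, with R_c = 0.8, redundancies {−0.3, 0.0} and synergies {0.0, 0.5} against two other features, the sensitivity is 0.8 − 0.3 + 0.5 = 1.0: redundancy can only lower the score (it is never positive) and synergy can only raise it.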
10. The text classification feature selection method according to claim 1, characterized in that step 3 comprises:
Step 31: setting the pending set S_tbd to the empty set, i.e. S_tbd = {}; letting x^(k) be the first feature in the excluded set S_exc and x^(m) the first feature in the candidate set S_sel;
Step 32: for the feature x^(k) in the excluded set S_exc, calculating the maximum synergy between the candidate feature x^(m) and all features in feature set S other than x^(m), i.e. max(S_x(x^(m); x^(i))), x^(i) ∈ S, i ≠ m;
Step 33: if the feature attaining the maximum synergy with x^(m) is x^(k), adding x^(m) to the pending set S_tbd;
Step 34: if feature x^(m) is the last feature in the candidate set S_sel and the pending set S_tbd is empty, proceeding to step 36; if S_tbd is not empty, letting x^(j) be the first feature in S_tbd and proceeding to step 35; if feature x^(m) is not the last feature in S_sel, setting x^(m) to the next feature in S_sel and returning to step 32;
Step 35: for the feature x^(j) in the pending set S_tbd, updating the sensitivity of feature x^(j) as follows:

Sen(x^(j)) = R_c(x^(j)) + α · min(R_x(x^(j); x^(n))) + β · max(S_x(x^(j); x^(n))), x^(n) ∈ S, n ≠ j, n ≠ k

comparing Sen(x^(j)) with the preset threshold th; if Sen(x^(j)) < th and a further condition (formula not reproduced in the source) holds, removing feature x^(k) from the excluded set S_exc, adding it to the candidate set S_sel, and proceeding to step 36; otherwise, if feature x^(j) is the last element of S_tbd, proceeding directly to step 36; otherwise, setting x^(j) to the next element of S_tbd and returning to step 35;
Step 36: if feature x^(k) is the last element of the excluded set S_exc, returning the current candidate set S_sel and excluded set S_exc as the result of the final feature selection; otherwise, setting x^(k) to the next element of S_exc and returning to step 31.
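Step 3 revisits each excluded feature x^(k): candidates whose strongest synergy partner is x^(k) form a pending set, their sensitivities are recomputed with x^(k) masked out, and x^(k) is readmitted if a dependent candidate then drops below the threshold. A sketch of that loop; the helper names are illustrative, and because the extra condition of step 35 is not reproduced in the source, only the legible Sen(x^(j)) < th test is used, which is an assumption:

```python
def adjust(s_sel, s_exc, best_synergy_partner, sen_without, th):
    """Readmission pass of claim 10 (steps 31-36), sketched.

    best_synergy_partner(m): the feature attaining max S_x(x_m; .) over S;
    sen_without(j, k): Sen(x_j) recomputed with feature k masked out
    (the n != j, n != k restriction of the step-35 formula).
    """
    s_sel, s_exc = list(s_sel), list(s_exc)
    for k in list(s_exc):                  # steps 31/36: each excluded feature
        pending = [m for m in s_sel
                   if best_synergy_partner(m) == k]   # steps 32-33
        for j in pending:                  # step 35
            if sen_without(j, k) < th:     # legible part of the condition only
                s_exc.remove(k)            # readmit x_k to the candidates
                s_sel.append(k)
                break                      # step 36: move on to next x_k
    return s_sel, s_exc
```

In words: an excluded feature earns its way back in when some kept feature relied on it for its synergy score, i.e. removing it would push that feature below the selection threshold.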
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710181572.8A CN107016073B (en) | 2017-03-24 | 2017-03-24 | A kind of text classification feature selection approach |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107016073A true CN107016073A (en) | 2017-08-04 |
CN107016073B CN107016073B (en) | 2019-06-28 |
Family
ID=59445053
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934251A (en) * | 2018-12-27 | 2019-06-25 | 国家计算机网络与信息安全管理中心广东分中心 | Method, recognition system and storage medium for text recognition in less-common languages |
CN111612385A (en) * | 2019-02-22 | 2020-09-01 | 北京京东尚科信息技术有限公司 | Method and device for clustering to-be-delivered articles |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140278409A1 (en) * | 2004-07-30 | 2014-09-18 | At&T Intellectual Property Ii, L.P. | Preserving privacy in natural language databases |
CN105184323A (en) * | 2015-09-15 | 2015-12-23 | 广州唯品会信息科技有限公司 | Feature selection method and system |
CN105260437A (en) * | 2015-09-30 | 2016-01-20 | 陈一飞 | Text classification feature selection method and application thereof to biomedical text classification |
Non-Patent Citations (1)
Title |
---|
ZHOU QIAN et al.: "Research on Feature Selection in Chinese Text Classification", Journal of Chinese Information Processing (《中文信息学报》) *
Also Published As
Publication number | Publication date |
---|---|
CN107016073B (en) | 2019-06-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||