CN105426426A - KNN text classification method based on improved K-Medoids - Google Patents

KNN text classification method based on improved K-Medoids

Info

Publication number
CN105426426A
CN105426426A (application CN201510740516.4A)
Authority
CN
China
Prior art keywords
text
cluster
classification
training
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510740516.4A
Other languages
Chinese (zh)
Other versions
CN105426426B (en)
Inventor
汪友生
樊存佳
王信
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510740516.4A
Publication of CN105426426A
Application granted
Publication of CN105426426B
Status: Expired - Fee Related

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a KNN (K-Nearest-Neighbor) text classification method based on an improved K-Medoids algorithm, and relates to the field of computer text data processing. The method comprises the following steps: preprocessing a training text set and a test text set, where preprocessing comprises word segmentation, stop-word removal, DF feature selection, and vector representation, so as to obtain a training text vector space and a test text vector space; pruning the training samples on the basis of an improved K-Medoids method, namely optimizing both the selection of initial center points and the search strategy for replacement center points, and applying this to training sample pruning so as to obtain a new training text space; and finally carrying out KNN classification, defining a representative degree function and introducing it into the category attribute function used by KNN, so as to obtain the final result. Experimental results show that, compared with the conventional KNN method and a KNN method based on standard K-Medoids, the proposed method achieves higher classification accuracy and classification efficiency.

Description

A KNN text classification method based on an improved K-Medoids algorithm
Technical field
The present invention relates to the field of computer text data processing, and in particular to a K-Nearest-Neighbor (KNN) text classification method based on an improved K-Medoids algorithm.
Background technology
With the development of the Internet, the Internet of Things, and cloud computing, data volumes are growing exponentially, leading us into the era of big data. The International Data Corporation (IDC) reports that data on the Internet grows by roughly 50% per year, and that more than 90% of the data in the world today was produced within the last few years. The global data volume has already reached the zettabyte (ZB) scale, and this mass of data carries enormous potential value.

In the big-data era, mining the potential value of data is of great importance, and data mining, as the technology for discovering that value, has attracted wide attention. Text makes up a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has gradually become a focus of research. It is widely applied in information filtering, information organization and management, information retrieval, digital libraries, spam filtering, and similar areas. Text Classification (TC) refers to the process of automatically assigning a text of unknown class to one or more classes according to its content, under a classification system given in advance. Common text classification methods include K-Nearest-Neighbor, Naive Bayes (NB), and the Support Vector Machine (SVM).

KNN, as one of the classic classification methods, is simple to implement and highly robust; however, it also has shortcomings that prevent its use in many practical applications. Its deficiencies mainly lie in two aspects. First, the huge amount of similarity computation during classification consumes substantial time, resulting in low classification efficiency. Second, classification performance is easily affected by the training samples; when the data are severely unevenly distributed, classifier performance can be seriously degraded. Regarding the large computational cost of KNN classification, the improvements proposed by researchers can be summarized in three directions: first, improving feature selection, so that feature words contributing little to classification are discarded and effective dimensionality reduction of the Vector Space Model (VSM) is achieved; second, selecting some representative texts from the original training set as a new training set, or deleting from the original training set those texts that contribute little to classification and keeping the rest as the new training set; third, designing fast search algorithms to accelerate the search for the K nearest neighbor texts of a test text. Considering that current improved KNN algorithms find it difficult to achieve both speed and accuracy, designing a KNN text classification method that is both accurate and fast has important academic significance and practical value.
Summary of the invention
The object of the present invention is to improve the KNN text classification algorithm in terms of both classification speed and classification accuracy. On the one hand, to improve classification speed, an improved K-Medoids clustering algorithm is adopted to prune the training samples that contribute little to KNN classification; on the other hand, to improve classification accuracy, a representative degree function is defined and introduced into the KNN algorithm, so that the K nearest neighbor texts of a test text are treated differentially.

The features of the present invention are as follows:
Step 1: download a published Chinese corpus from the Internet, consisting of a training text set and a test text set.

Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training text set and test text set.

Step 3: apply the Document Frequency (DF) feature selection method to the segmented training text set, obtaining the feature dictionary corresponding to this training set.

Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature term occurs in the document, and IDF (Inverse Document Frequency) is given by IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the term.
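As an illustration of the step-4 weighting, the following sketch builds TF-IDF vectors from already-segmented documents; the function name, data layout, and use of raw term counts for TF are our own assumptions, not part of the patent:

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Represent each segmented document as a TF-IDF vector over vocab.

    docs  : list of documents, each a list of feature words
    vocab : ordered feature words of the DF-selected dictionary
    Uses the smoothed IDF of the patent: log(M / n_k + 0.01).
    """
    M = len(docs)
    df = Counter()                      # n_k: number of docs containing each word
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # TF: occurrences of the term in this doc
        vectors.append([tf[w] * math.log(M / df[w] + 0.01) if df[w] else 0.0
                        for w in vocab])
    return vectors
```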
Step 5: training sample pruning based on the improved K-Medoids algorithm (let the training text set be S, containing the N classes C_1, C_2, ..., C_N and M texts in total).

Step 5.1: for the training text set S, specify that it is to be divided into m clusters, with m = 3 × N.

Step 5.2: randomly select a center point O_i for each cluster (0 < i ≤ m).
Step 5.3: compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster whose center is most similar. The cosine similarity is computed as:

$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}$$

where n is the feature vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n).
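A direct transcription of this formula and of the step-5.3 assignment, with helper names of our own choosing (reused by the later sketches):

```python
import math

def cosine_sim(x, y):
    """Cosine similarity Sim(d, O_i) of two equal-length weight vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

def assign_to_clusters(texts, centers):
    """Step 5.3: put each text vector into the cluster of its most similar center."""
    clusters = [[] for _ in centers]
    for t in texts:
        best = max(range(len(centers)), key=lambda i: cosine_sim(t, centers[i]))
        clusters[best].append(t)
    return clusters
```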
Step 5.4: optimization of initial center point selection. Within each cluster, take each point in turn as a candidate center and compute the sum of its similarities to the other texts in the cluster; the point with the largest similarity sum becomes the new center point O_i'.
Step 5.5: select a center point O_i' that has not been selected before; this is the j-th iteration (j running from 0 to m), for m iterations in total. The replacement candidate set U is no longer the set of all non-center points, but a neighborhood of O_i': the region formed by all non-center texts contained in the j clusters nearest to the center O_i'.

Step 5.6: select a non-center point Q in the candidate set U that has not yet been selected, compute the difference in squared error between Q and O_i', and record it in the set E, until every non-center point in U has been selected.

Step 5.7: if min(E) < 0 (the minimum value in E is less than 0), replace the original center with the non-center point corresponding to that minimum, obtaining a new set of m center points; assign the remaining objects to the cluster of the center with the largest similarity, and resume from Step 5.5.

Step 5.8: if min(E) ≥ 0, the replacement center search ends, finally yielding the m cluster center points O_i''.

Step 5.9: compute the similarity between the test text and the m cluster centers. If Sim(D, O_i'') < T_i (where T_i is the in-cluster threshold of the i-th cluster, i.e. the minimum similarity between a text of the cluster and the cluster center), the test text has quite low similarity to the texts in that cluster, so the texts contained in that cluster can be pruned; if Sim(D, O_i'') ≥ T_i, the texts contained in that cluster are added to the new training text set S_new.
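The sketch below condenses steps 5.1-5.9 into runnable form. It is a simplification under our own reading: the neighborhood-restricted swap search of steps 5.5-5.8 is replaced by repeatedly re-electing, inside each cluster, the member with the largest similarity sum as medoid until the medoids stabilize, which keeps the spirit (cheaper medoid updates) but not the letter of the patented search. It reuses cosine_sim and assign_to_clusters from the previous sketch.

```python
import random

def improved_kmedoids_prune(train_vecs, test_vec, n_classes):
    """Simplified sketch of steps 5.1-5.9: cluster, refine medoids, then
    prune whole clusters whose center is too dissimilar to the test text."""
    m = 3 * n_classes                                        # step 5.1
    centers = random.sample(train_vecs, m)                   # step 5.2
    for _ in range(100):                                     # guard against oscillation
        clusters = assign_to_clusters(train_vecs, centers)   # step 5.3
        # steps 5.4-5.8 (approximated): medoid = member with the largest
        # similarity sum to the other texts of its cluster
        new_centers = [max(c, key=lambda p: sum(cosine_sim(p, q) for q in c))
                       for c in clusters if c]
        if new_centers == centers:
            break
        centers = new_centers
    clusters = assign_to_clusters(train_vecs, centers)

    s_new = []                                               # step 5.9
    for center, cluster in zip(centers, clusters):
        if not cluster:
            continue
        t_i = min(cosine_sim(x, center) for x in cluster)    # in-cluster threshold T_i
        if cosine_sim(test_vec, center) >= t_i:              # keep this cluster
            s_new.extend(cluster)
    return s_new
```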
Step 6: perform KNN classification.

The training text set is S_new, the test text is d, n is the feature vector dimension, and K takes the values 5, 10, 15, 20, 25, 30.

Step 6.1: compute the similarity between test text d and every text in S_new using the cosine of the angle between their vectors.

Step 6.2: select the K texts with the largest similarities from Step 6.1 as the K nearest neighbor texts of test text d.

Step 6.3: compute the weight of test text d with respect to each class, and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the importance of d_i for class C_j is then defined by the representative degree function u(d_i, C_j):

$$u(d_i, C_j) = \frac{1}{\mathrm{Dist}(d_i, \bar{C}_j)} \times \mathrm{Sim}(d_i, \bar{C}_j)$$

where $\bar{C}_j$ denotes the center vector of class C_j, obtained by summing all text vectors of class C_j and averaging; Dist(d_i, $\bar{C}_j$) is the Euclidean distance from training text d_i to the center of its class C_j, and Sim(d_i, $\bar{C}_j$) is the cosine similarity between d_i and the center of its class C_j.
The weight is computed as:

$$W(d, C_j) = \sum_{i=1}^{K} \mathrm{Sim}(d, d_i)\, y(d_i, C_j)$$

where y(d_i, C_j) is the category attribute function into which the representative degree function is introduced, so that each of the K neighbors contributes according to how representative it is of its class rather than uniformly.
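Since the explicit formula of the modified category attribute function is not reproduced in the text above, the sketch below assumes the natural reading y(d_i, C_j) = u(d_i, C_j) when d_i is labeled C_j and 0 otherwise; this assumption, and all identifiers, are ours. It reuses cosine_sim from the earlier sketch.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def class_center(vectors):
    """Center vector of one class: the mean of all its text vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def representativeness(d_i, center):
    """u(d_i, C_j) = Sim(d_i, center) / Dist(d_i, center)."""
    dist = euclidean(d_i, center)
    return cosine_sim(d_i, center) / dist if dist else float("inf")

def knn_classify(d, train, labels, centers, k=10):
    """Steps 6.1-6.3 with the assumed category attribute function.

    centers : dict mapping class label -> class center vector
    y(d_i, C_j) = u(d_i, C_j) if label(d_i) == C_j else 0 (our assumption).
    """
    neighbors = sorted(zip(train, labels),
                       key=lambda tl: cosine_sim(d, tl[0]),
                       reverse=True)[:k]                     # steps 6.1-6.2
    weights = {}
    for d_i, lab in neighbors:                               # step 6.3
        w = cosine_sim(d, d_i) * representativeness(d_i, centers[lab])
        weights[lab] = weights.get(lab, 0.0) + w
    return max(weights, key=weights.get)
```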
Effects of the present invention:

The present invention provides a KNN text classification method based on an improved K-Medoids algorithm, achieving fast and accurate classification of test texts. The flowchart is shown in Fig. 1; the accuracy results are listed in Table 1 (the traditional KNN algorithm and the present algorithm achieve their best classification results at K = 5 and K = 10 respectively, and only the best results of the two methods are given), and the time results in Table 2. Compared with the traditional KNN method, the invention on the one hand defines a representative degree function and introduces it into the category attribute function of the classic method, so that the K nearest neighbor texts of a test text are treated differentially, improving classification accuracy; on the other hand, it adopts the improved K-Medoids clustering method to prune the original training sample set, improving classification efficiency. Compared with the KNN method based on standard K-Medoids, the invention optimizes both the initial center point selection and the replacement center search strategy, which first reduces the sensitivity of K-Medoids to initial center points, and second speeds up its replacement center search. As can be seen from Tables 1 and 2, compared with the traditional KNN method and the K-Medoids-based KNN method, the invention shows clear improvements in both classification accuracy and classification efficiency.
Brief description of the drawings

Fig. 1 is the flowchart of the method of the invention.
Embodiment
The present invention is implemented by the following technical means:

A KNN text classification method based on an improved K-Medoids algorithm. First, the training and test text sets are preprocessed: word segmentation and stop-word removal are carried out, DF feature selection is performed, and every training and test text is represented in vector form. Then the improved K-Medoids method is used to prune the training texts, yielding a new training text set S_new. Finally, a representative degree function is defined and introduced into the category attribute function of the original KNN algorithm for KNN classification.

The above improved KNN text classification method comprises the following steps:
Step 1: download a published Chinese corpus from the Internet, consisting of a training text set and a test text set.

Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training text set and test text set.

Step 3: apply the Document Frequency (DF) feature selection method to the segmented training text set, obtaining the feature dictionary corresponding to this training set.

Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary; the weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature term occurs in the document, and IDF (Inverse Document Frequency) is given by IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the term.
Step 5: training sample pruning based on the improved K-Medoids algorithm.

Let the training text set be S, containing the N classes C_1, C_2, ..., C_N and M texts in total. For S, specify that it is to be divided into m clusters, with m = 3 × N. Randomly select a center point O_i for each cluster (0 < i ≤ m). Compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster whose center is most similar; the cosine similarity is computed as:

$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}} \qquad (3)$$

where n is the feature vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n).
Optimization of initial center point selection: within each cluster, take each point in turn as a candidate center and compute the sum of its similarities to the other texts in the cluster; the point with the largest similarity sum becomes the new center point O_i'.

Select a center point O_i' that has not been selected before; this is the j-th iteration (j running from 0 to m), for m iterations in total. The replacement candidate set U is no longer the set of all non-center points, but a neighborhood of O_i': the region formed by all non-center texts contained in the j clusters nearest to the center O_i'. In the candidate set U, select a non-center point Q that has not yet been selected, compute the difference in squared error between Q and O_i', and record it in the set E, until every non-center point in U has been selected. If min(E) < 0 (the minimum value in E is less than 0), replace the original center with the non-center point corresponding to that minimum, obtaining a new set of m center points; assign the remaining objects to the cluster of the center with the largest similarity, and iterate from this step again. If min(E) ≥ 0, the replacement center search ends, finally yielding the m cluster center points O_i''.

Compute the similarity between the test text and the m cluster centers. If Sim(D, O_i'') < T_i (where T_i is the in-cluster threshold of the i-th cluster, i.e. the minimum similarity between a text of the cluster and the cluster center), the test text has quite low similarity to the texts in that cluster, so the texts contained in that cluster can be pruned; if Sim(D, O_i'') ≥ T_i, the texts contained in that cluster are added to the new training text set S_new.
Step 6: perform KNN classification.

The training text set is S_new, the test text is d, n is the feature vector dimension, and K takes the values 5, 10, 15, 20, 25, 30.

Compute the similarity between test text d and every text in S_new using the cosine of the angle between their vectors; select the K texts with the largest similarities as the K nearest neighbors of test text d; compute the weight of d with respect to each class, and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the importance of d_i for class C_j is then defined by the representative degree function u(d_i, C_j):

$$u(d_i, C_j) = \frac{1}{\mathrm{Dist}(d_i, \bar{C}_j)} \times \mathrm{Sim}(d_i, \bar{C}_j) \qquad (4)$$

where $\bar{C}_j$ denotes the center vector of class C_j, obtained by summing all text vectors of class C_j and averaging; Dist(d_i, $\bar{C}_j$) is the Euclidean distance from training text d_i to the center of its class C_j, and Sim(d_i, $\bar{C}_j$) is the cosine similarity between d_i and the center of its class C_j. The weight is computed as:
$$W(d, C_j) = \sum_{i=1}^{K} \mathrm{Sim}(d, d_i)\, y(d_i, C_j) \qquad (5)$$

where y(d_i, C_j) is the category attribute function into which the representative degree function is introduced.
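For illustration, the sketches given earlier can be chained into a toy end-to-end run of the embodiment; the corpus, labels, and the fallback used when the pruning step removes everything are all our own invented scaffolding:

```python
# Toy corpus: N = 2 classes, so the pruning step uses m = 6 clusters.
train_docs = [["stock", "market", "fund"], ["market", "trade", "stock", "bond"],
              ["fund", "bond", "trade"], ["stock", "bond", "market"],
              ["trade", "fund", "market"], ["bond", "stock", "fund"],
              ["goal", "match", "team"], ["team", "league", "goal"],
              ["match", "score", "team"], ["league", "score", "match"],
              ["goal", "team", "score"], ["match", "team", "league"]]
labels = ["finance"] * 6 + ["sport"] * 6
test_doc = ["stock", "fund", "trade"]

vocab = sorted({w for doc in train_docs for w in doc})
vecs = tfidf_vectors(train_docs + [test_doc], vocab)  # test shares the statistics
train_vecs, test_vec = vecs[:-1], vecs[-1]

centers = {lab: class_center([v for v, l in zip(train_vecs, labels) if l == lab])
           for lab in set(labels)}
s_new = improved_kmedoids_prune(train_vecs, test_vec, n_classes=2)
if not s_new:                        # tiny corpora may prune every cluster
    s_new = train_vecs
kept_labels = [labels[train_vecs.index(v)] for v in s_new]
print(knn_classify(test_vec, s_new, kept_labels, centers, k=5))  # usually 'finance'
```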
Table 2. Experimental results of the three algorithms (table body not reproduced)

Table 3. Time performance (table body not reproduced)

Claims (1)

1. A KNN text classification method based on an improved K-Medoids algorithm, characterized in that it comprises the following steps:
Step 1: download a published Chinese corpus from the Internet, consisting of a training text set and a test text set;

Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training text set and test text set;

Step 3: apply the Document Frequency (DF) feature selection method to the segmented training text set, obtaining the feature dictionary corresponding to this training set;

Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary, the weight of each dimension being computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature term occurs in the document and IDF (Inverse Document Frequency) is given by IDF = log(M/n_k + 0.01), M being the number of texts in the document collection and n_k the number of documents containing the term;
Step 5: training sample pruning based on the improved K-Medoids algorithm (let the training text set be S, containing the N classes C_1, C_2, ..., C_N and M texts in total);

Step 5.1: for the training text set S, specify that it is to be divided into m clusters, with m = 3 × N;

Step 5.2: randomly select a center point O_i for each cluster (0 < i ≤ m);

Step 5.3: compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster whose center is most similar, the cosine similarity being computed as:
$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}$$

where n is the feature vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n);
Step 5.4: optimization of initial center point selection: within each cluster, take each point in turn as a candidate center and compute the sum of its similarities to the other texts in the cluster; the point with the largest similarity sum becomes the new center point O_i';

Step 5.5: select a center point O_i' that has not been selected before, this being the j-th iteration (j running from 0 to m), for m iterations in total; the replacement candidate set U is no longer the set of all non-center points but a neighborhood of O_i', namely the region formed by all non-center texts contained in the j clusters nearest to the center O_i';

Step 5.6: select a non-center point Q in the candidate set U that has not yet been selected, compute the difference in squared error between Q and O_i', and record it in the set E, until every non-center point in U has been selected;

Step 5.7: if min(E) < 0 (the minimum value in E is less than 0), replace the original center with the non-center point corresponding to that minimum, obtaining a new set of m center points; assign the remaining objects to the cluster of the center with the largest similarity, and resume from Step 5.5;

Step 5.8: if min(E) ≥ 0, the replacement center search ends, finally yielding the m cluster center points O_i'';

Step 5.9: compute the similarity between the test text and the m cluster centers; if Sim(D, O_i'') < T_i (where T_i is the in-cluster threshold of the i-th cluster, i.e. the minimum similarity between a text of the cluster and the cluster center), the test text has quite low similarity to the texts in that cluster, so the texts contained in that cluster can be pruned; if Sim(D, O_i'') ≥ T_i, the texts contained in that cluster are added to the new training text set S_new;
Step 6: perform KNN classification;

the training text set is S_new, the test text is d, n is the feature vector dimension, and K takes the values 5, 10, 15, 20, 25, 30;

Step 6.1: compute the similarity between test text d and every text in S_new using the cosine of the angle between their vectors;

Step 6.2: select the K texts with the largest similarities from Step 6.1 as the K nearest neighbor texts of test text d;

Step 6.3: compute the weight of test text d with respect to each class, and assign d to the class with the largest weight;

let the known class of training text d_i be C_j; the importance of d_i for class C_j is then defined by the representative degree function u(d_i, C_j):
$$u(d_i, C_j) = \frac{1}{\mathrm{Dist}(d_i, \bar{C}_j)} \times \mathrm{Sim}(d_i, \bar{C}_j)$$

where $\bar{C}_j$ denotes the center vector of class C_j, obtained by summing all text vectors of class C_j and averaging; Dist(d_i, $\bar{C}_j$) is the Euclidean distance from training text d_i to the center of its class C_j, and Sim(d_i, $\bar{C}_j$) is the cosine similarity between d_i and the center of its class C_j;

the weight is computed as:

$$W(d, C_j) = \sum_{i=1}^{K} \mathrm{Sim}(d, d_i)\, y(d_i, C_j)$$

where y(d_i, C_j) is the category attribute function into which the representative degree function is introduced.
CN201510740516.4A 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids Expired - Fee Related CN105426426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510740516.4A CN105426426B (en) 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510740516.4A CN105426426B (en) 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids

Publications (2)

Publication Number Publication Date
CN105426426A true CN105426426A (en) 2016-03-23
CN105426426B CN105426426B (en) 2018-11-02

Family

ID=55504638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510740516.4A Expired - Fee Related CN105426426B (en) 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids

Country Status (1)

Country Link
CN (1) CN105426426B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020042793A1 (en) * 2000-08-23 2002-04-11 Jun-Hyeog Choi Method of order-ranking document clusters using entropy data and bayesian self-organizing feature maps
CN102033949A (en) * 2010-12-23 2011-04-27 南京财经大学 Correction-based K nearest neighbor text classification method
CN103092931A (en) * 2012-12-31 2013-05-08 武汉传神信息技术有限公司 Multi-strategy combined document automatic classification method
CN103345528A (en) * 2013-07-24 2013-10-09 南京邮电大学 Text classification method based on correlation analysis and KNN
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI B ET AL.: "An improved k-nearest-neighbor algorithm for text categorization", Expert Systems with Applications *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021578B * 2016-06-01 2019-07-23 南京邮电大学 Improved text classification algorithm based on the fusion of clustering and membership degree
CN106021578A * 2016-06-01 2016-10-12 南京邮电大学 Improved text classification algorithm based on the fusion of clustering and membership degree
CN106971005A * 2017-04-27 2017-07-21 杭州杨帆科技有限公司 Distributed parallel text clustering method based on MapReduce in a cloud computing environment
CN107273416B * 2017-05-05 2021-05-04 深信服科技股份有限公司 Webpage hidden link detection method and device, and computer-readable storage medium
CN107273416A * 2017-05-05 2017-10-20 深信服科技股份有限公司 Webpage hidden link detection method and device, and computer-readable storage medium
CN107463705A * 2017-08-17 2017-12-12 陕西优百信息技术有限公司 Data cleaning method
CN107562853A * 2017-08-28 2018-01-09 武汉烽火普天信息技术有限公司 Method for streaming clustering and presentation of massive Internet text data
CN107832456A * 2017-11-24 2018-03-23 云南大学 Parallel KNN text classification method based on critical value data division
CN107832456B * 2017-11-24 2021-11-26 云南大学 Parallel KNN text classification method based on critical value data division
CN108154178A * 2017-12-25 2018-06-12 北京工业大学 Semi-supervised shilling attack detection method based on improved SVM-KNN algorithm
CN108959453A * 2018-06-14 2018-12-07 中南民族大学 Information extraction method and device based on text clustering, and readable storage medium
CN108959453B * 2018-06-14 2021-08-27 中南民族大学 Information extraction method and device based on text clustering, and readable storage medium
CN110969172A * 2018-09-28 2020-04-07 武汉斗鱼网络科技有限公司 Text classification method and related equipment
CN109543739A * 2018-11-15 2019-03-29 杭州安恒信息技术股份有限公司 Log classification method, device, equipment and readable storage medium
CN109766437A * 2018-12-07 2019-05-17 中科恒运股份有限公司 Text clustering method, text clustering device and terminal device
CN109960799B * 2019-03-12 2021-07-27 中南大学 Optimized classification method for short texts
CN110287328A * 2019-07-03 2019-09-27 广东工业大学 Text classification method, device, equipment and computer-readable storage medium
CN111104510A * 2019-11-15 2020-05-05 南京中新赛克科技有限责任公司 Text classification training sample expansion method based on word embedding
CN111104510B * 2019-11-15 2023-05-09 南京中新赛克科技有限责任公司 Text classification training sample expansion method based on word embedding
CN113806732A * 2020-06-16 2021-12-17 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
CN113806732B * 2020-06-16 2023-11-03 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
CN112381181A * 2020-12-11 2021-02-19 桂林电子科技大学 Dynamic detection method for abnormal building energy consumption
CN113553430A * 2021-07-20 2021-10-26 中国工商银行股份有限公司 Data classification method, device and equipment

Also Published As

Publication number Publication date
CN105426426B (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN105426426A (en) KNN text classification method based on improved K-Medoids
US10346257B2 (en) Method and device for deduplicating web page
Huang et al. An improved knn based on class contribution and feature weighting
CN105512311A Adaptive feature selection method based on chi-square statistics
CN107844559A Text classification method and device, and electronic equipment
Fan et al. Research on text classification based on improved tf-idf algorithm
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN105760889A (en) Efficient imbalanced data set classification method
CN107832456B (en) Parallel KNN text classification method based on critical value data division
CN104391835A (en) Method and device for selecting feature words in texts
CN101021838A (en) Text handling method and system
CN105956031A (en) Text classification method and apparatus
CN102955857A (en) Class center compression transformation-based text clustering method in search engine
CN110543595A (en) in-station search system and method
Fitriyani et al. The K-means with mini batch algorithm for topics detection on online news
CN105893380A (en) Improved text classification characteristic selection method
CN108427686A (en) Text data querying method and device
Kristiyanti et al. E-Wallet Sentiment Analysis Using Naïve Bayes and Support Vector Machine Algorithm
CN109800790B (en) Feature selection method for high-dimensional data
CN102929977A (en) Event tracing method aiming at news website
Ah-Pine et al. Similarity based hierarchical clustering with an application to text collections
Chen et al. Parallel mining frequent patterns over big transactional data in extended mapreduce
CN112417082A (en) Scientific research achievement data disambiguation filing storage method
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20181102
Termination date: 20211104