CN105426426A - KNN text classification method based on improved K-Medoids - Google Patents
- Publication number
- CN105426426A (application CN201510740516.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- cluster
- classification
- training
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a KNN (K-Nearest-Neighbor) text classification method based on an improved K-Medoids algorithm, and relates to the field of computer text data processing. The method comprises the following steps: preprocessing a training text set and a test text set, where preprocessing comprises word segmentation, stop-word removal, DF feature selection, and vector representation, to obtain training and test text vector spaces; pruning the training samples with an improved K-Medoids method, which optimizes both the selection of initial center points and the search strategy for replacement center points, to obtain a new training text space; and finally performing KNN classification, in which a representativeness function is defined and incorporated into the class-attribute function, to obtain the final result. Experimental results show that, compared with the conventional KNN method and a KNN method based on standard K-Medoids, the proposed method achieves higher classification accuracy and classification efficiency.
Description
Technical field
The present invention relates to the field of computer text data processing, and in particular to a K-Nearest-Neighbor (KNN) text classification method based on an improved K-Medoids algorithm.
Background technology
With the development of the Internet, the Internet of Things, and cloud computing, data grows exponentially, leading us into the era of big data. The International Data Corporation (IDC) points out that data on the Internet grows by 50% every year, and more than 90% of the world's data was produced in recent years. The global data volume has already reached the zettabyte level, and this mass of data contains great potential value.
In the big-data era, mining the potential value of data is of great importance, and data mining, as the technology for discovering that value, has attracted wide attention. Text accounts for a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has gradually become a focus of attention. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, spam filtering, and so on. Text classification (TC) is the process of automatically assigning a text of unknown class to one or more classes of a predefined category system according to its content. Common text classification methods include K-nearest neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM).
KNN, one of the classic classification methods, is simple to implement and highly robust, but it also has shortcomings that limit its use in many practical applications. Its deficiencies are mainly twofold. First, the huge amount of similarity computation in the classification process consumes considerable time, resulting in low classification efficiency. Second, classification performance is easily affected by the training samples; when the data is severely unevenly distributed, classifier performance may degrade badly. For the problem of the large computational cost of KNN classification, existing improvements fall into three categories: first, improving feature selection, discarding feature words that contribute little to classification and thereby effectively reducing the dimensionality of the Vector Space Model (VSM); second, reducing the training set, either by selecting representative texts from the original training set as a new training set, or by deleting texts that contribute little to classification and keeping the remainder; third, designing fast search algorithms to accelerate the search for the K nearest neighbors of a test text. Since current KNN variants find it difficult to balance speed and accuracy, designing a KNN text classification method that is both accurate and fast has important academic significance and practical value.
Summary of the invention
The object of the invention is to improve the KNN text classification algorithm in both classification speed and classification accuracy. On the one hand, to improve speed, an improved K-Medoids clustering algorithm is adopted to prune training samples that contribute little to KNN classification; on the other hand, to improve accuracy, a representativeness function is defined and introduced into the KNN algorithm, so that the K nearest neighbors of a test text are treated differentially.
The method of the present invention comprises the following steps:
Step 1: download a published Chinese corpus, consisting of a training text set and a test text set, from the Internet;
Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training and test text sets;
Step 3: apply document frequency (DF) feature selection to the segmented training text set, obtaining the feature dictionary corresponding to this training set;
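As an illustration of the DF feature selection in Step 3, the sketch below keeps feature words whose document frequency is neither too low nor too high. The thresholds `min_df` and `max_df_ratio` are assumptions for illustration; the patent does not specify its DF cutoffs.

```python
from collections import Counter

def df_feature_select(docs, min_df=2, max_df_ratio=0.9):
    """Document-frequency (DF) feature selection: keep terms that occur in
    at least min_df documents and in at most max_df_ratio of all documents.
    Each doc is a list of tokens (already segmented, stop words removed)."""
    m = len(docs)
    df = Counter()
    for doc in docs:
        for term in set(doc):          # count each term once per document
            df[term] += 1
    return sorted(t for t, n in df.items() if min_df <= n <= max_df_ratio * m)

docs = [["cluster", "text"], ["cluster", "knn"],
        ["cluster", "text", "knn"], ["rare"]]
vocab = df_feature_select(docs)
print(vocab)  # ['cluster', 'knn', 'text'] -- 'rare' appears in too few docs
```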
Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF. TF (term frequency) is the number of times the feature word occurs in the document; IDF (inverse document frequency) is computed as IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the word.
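The Step 4 weighting can be sketched as follows. Only the formula TFIDF = TF × log(M/n_k + 0.01) comes from the text; the vocabulary, document frequencies, and token lists are illustrative toy data.

```python
import math

def tfidf_vector(doc_tokens, vocab, doc_freq, m):
    """TF-IDF weight per Step 4: w = TF * log(M/n_k + 0.01), where TF is the
    raw count of the term in the document, M the number of documents in the
    collection, and n_k the number of documents containing the term."""
    vec = []
    for term in vocab:
        tf = doc_tokens.count(term)
        idf = math.log(m / doc_freq[term] + 0.01)
        vec.append(tf * idf)
    return vec

vocab = ["cluster", "knn", "text"]
doc_freq = {"cluster": 3, "knn": 2, "text": 2}  # n_k per term, M = 4 docs
v = tfidf_vector(["cluster", "cluster", "text"], vocab, doc_freq, m=4)
```

The 0.01 smoothing term keeps the IDF strictly positive even for a word that appears in every document.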
Step 5: prune the training samples based on the improved K-Medoids algorithm (the training text set is denoted S; it comprises N classes C_1, C_2, …, C_N and M texts in total).
Step 5.1: for the training text set S, partition it into m clusters, with m = 3 × N;
Step 5.2: randomly select a center point O_i (0 < i ≤ m) for each cluster;
Step 5.3: compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster with the largest similarity. The cosine similarity is computed as:

$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j\, x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^{2}}\;\sqrt{\sum_{j=1}^{n} x_{ij}^{2}}}$$

where n is the feature-vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_ij is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n).
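A minimal sketch of the Step 5.3 assignment, using the cosine formula above; the vectors and centers are toy examples.

```python
import math

def cosine_sim(x, y):
    """Cosine of the angle between two n-dimensional weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def assign_to_clusters(texts, centers):
    """Step 5.3: assign each non-center text to the cluster whose center
    it is most similar to; returns one cluster index per text."""
    return [max(range(len(centers)), key=lambda i: cosine_sim(d, centers[i]))
            for d in texts]

labels = assign_to_clusters([[1, 0], [0, 2], [3, 1]],
                            centers=[[1, 0], [0, 1]])
print(labels)  # [0, 1, 0]
```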
Step 5.4: optimize the initial center-point selection. Within each cluster, treat every point in turn as a candidate center and compute the sum of its similarities to the other texts in the cluster; the point with the largest similarity sum becomes the new center point O_i′;
Step 5.5: select a center point O_i′ that has not yet been selected; this is the j-th iteration (j runs from 0 to m), for m iterations in total. The replacement center-point set U is no longer the set of all non-center points, but the neighborhood of O_i′, namely the region formed by all non-center texts contained in the j clusters nearest to the center point O_i′;
Step 5.6: select a non-center point Q in the candidate set U that has not yet been selected, compute the difference in squared error between Q and O_i′, and record it in the set E; repeat until all non-center points in U have been selected;
Step 5.7: if min(E) < 0 (the minimum value in E is less than 0), replace the former center point with the non-center point corresponding to that minimum value, obtaining a new set of m center points after the replacement; assign the remaining objects to the cluster of the center point with the largest similarity, and resume from Step 5.5;
Step 5.8: if min(E) ≥ 0, the replacement center search ends, finally yielding m cluster center points O_i″;
Step 5.9: compute the similarity between the test text and the m cluster centers. If Sim(D, O_i″) < T_i (where T_i is the within-cluster threshold of the i-th cluster, namely the minimum similarity between a text of the cluster and its center), the test text is quite dissimilar to the texts in that cluster, so the texts the cluster contains can be pruned; if Sim(D, O_i″) ≥ T_i, the texts contained in the cluster are added to the new training text set S_new.
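Step 5.9 can be sketched as follows, taking T_i to be the minimum within-cluster similarity as described above. The clusters, centers, and test vector are toy data, not from the patent.

```python
import math

def cos(a, b):
    """Cosine similarity between two weight vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def prune_training_set(clusters, centers, test_vec):
    """Step 5.9 sketch: T_i is the minimum similarity between a cluster's
    texts and its center; a cluster whose center is less similar to the
    test text than T_i is dropped wholesale, the rest form S_new."""
    s_new = []
    for texts, center in zip(clusters, centers):
        t_i = min(cos(d, center) for d in texts)
        if cos(test_vec, center) >= t_i:
            s_new.extend(texts)
    return s_new

clusters = [[[1, 0], [3, 1]], [[0, 2], [0, 5]]]
centers = [[1, 0], [0, 1]]
s_new = prune_training_set(clusters, centers, test_vec=[1, 0.1])
print(len(s_new))  # 2 -- only the first cluster survives
```

Because whole clusters are dropped rather than individual texts, the subsequent KNN search runs over a much smaller S_new.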
Step 6: perform KNN classification.
The training text set is S_new, the test text is d, n is the feature-vector dimension, and K takes the values 5, 10, 15, 20, 25, 30.
Step 6.1: use the cosine of the vector angle to compute the similarity between the test text d and every text in S_new;
Step 6.2: select the K texts with the largest similarities from Step 6.1 as the K nearest neighbors of the test text d;
Step 6.3: compute the weight of the test text d for each class, and assign d to the class with the largest weight.
If the known class of training text d_i is C_j, the importance of d_i for class C_j is defined as the representativeness function u(d_i, C_j). The representativeness function is built from two quantities: the Euclidean distance from d_i to the center of its class C_j, and the cosine similarity between d_i and that class center; the center vector of class C_j is obtained by summing all text vectors of C_j and averaging. The weight of test text d for each class C_j is then computed over its K nearest neighbors, where y(d_i, C_j) is the class-attribute function into which the representativeness function is introduced.
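The exact formulas for u(d_i, C_j) and the class weight appear only as figures in the original and are not reproduced here, so the sketch below substitutes a plausible stand-in: each of the K nearest neighbors votes for its class with its cosine similarity to d, scaled by a representativeness term taken as the neighbor's cosine similarity to its class center. That stand-in is an assumption of this sketch, not the patent's formula.

```python
import math
from collections import defaultdict

def cos(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def class_centers(training):
    """Mean vector of each class (the 'class center' of Step 6)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in training:
        acc = sums.setdefault(label, [0.0] * len(vec))
        sums[label] = [s + x for s, x in zip(acc, vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def knn_classify(d, training, k=3):
    """Weighted KNN vote: similarity to d times a stand-in
    representativeness term (cosine to the neighbor's class center)."""
    centers = class_centers(training)
    neighbors = sorted(training, key=lambda t: cos(d, t[0]), reverse=True)[:k]
    weights = defaultdict(float)
    for vec, label in neighbors:
        u = cos(vec, centers[label])       # stand-in for u(d_i, C_j)
        weights[label] += cos(d, vec) * u  # scaled vote for the class
    return max(weights, key=weights.get)

training = [([1, 0], "A"), ([0.9, 0.1], "A"),
            ([0, 1], "B"), ([0.1, 0.9], "B")]
print(knn_classify([1, 0.05], training, k=3))  # A
```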
Effects of the invention:
The present invention proposes a KNN text classification method based on an improved K-Medoids algorithm, which classifies test texts quickly and accurately. The flowchart is shown in Fig. 1; accuracy results are given in Table 2 (the traditional KNN algorithm and the proposed algorithm perform best at K = 5 and K = 10 respectively, and only the best results of the two methods are given) and time results in Table 3. Compared with the traditional KNN method, the invention on the one hand defines a representativeness function and introduces it into the class-attribute function of the classic method, treating the K nearest neighbors of a test text differentially and improving classification accuracy; on the other hand, it uses the improved K-Medoids clustering method to prune the original training sample set, improving classification efficiency. Compared with the KNN method based on standard K-Medoids, the invention optimizes both initial center-point selection and the replacement center search strategy, which first reduces the sensitivity of K-Medoids to its initial center points and second accelerates the replacement center search. As can be seen from Tables 2 and 3, the invention improves significantly on both classification accuracy and classification efficiency over the traditional KNN method and the K-Medoids-based KNN method.
Brief description of the drawings
Fig. 1 is the flowchart of the method of the invention.
Embodiment
The present invention is realized by the following technical means:
A KNN text classification method based on an improved K-Medoids algorithm. First, the training and test text sets are preprocessed: word segmentation and stop-word removal are performed, DF feature selection is applied, and all training and test texts are represented as vectors. Then the improved K-Medoids method is used to prune the training texts, obtaining the new training text set S_new. Finally, a representativeness function is defined and introduced into the class-attribute function of the original KNN algorithm for KNN classification.
The improved KNN text classification method described above comprises the following steps:
Step 1: download a published Chinese corpus, consisting of a training text set and a test text set, from the Internet;
Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training and test text sets;
Step 3: apply document frequency (DF) feature selection to the segmented training text set, obtaining the feature dictionary corresponding to this training set;
Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF; TF (term frequency) is the number of times the feature word occurs in the document, and IDF (inverse document frequency) is IDF = log(M/n_k + 0.01), where M is the number of texts in the collection and n_k is the number of documents containing the word.
Step 5: prune the training samples based on the improved K-Medoids algorithm.
The training text set is denoted S; it comprises N classes C_1, C_2, …, C_N and M texts in total. Partition S into m clusters, with m = 3 × N, and randomly select a center point O_i (0 < i ≤ m) for each cluster. Compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster with the largest similarity; the cosine similarity is

$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j\, x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^{2}}\;\sqrt{\sum_{j=1}^{n} x_{ij}^{2}}}$$

where n is the feature-vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_ij is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n).
Optimize the initial center-point selection: within each cluster, treat every point in turn as a candidate center, compute the sum of its similarities to the other texts in the cluster, and take the point with the largest similarity sum as the new center point O_i′.
Select a center point O_i′ that has not yet been selected; this is the j-th iteration (j runs from 0 to m), for m iterations in total. The replacement center-point set U is no longer the set of all non-center points, but the neighborhood of O_i′, namely the region formed by all non-center texts contained in the j clusters nearest to O_i′. Select a non-center point Q in the candidate set U that has not yet been selected, compute the difference in squared error between Q and O_i′, and record it in the set E, until all non-center points in U have been selected. If min(E) < 0 (the minimum value in E is less than 0), replace the former center point with the non-center point corresponding to that minimum value, obtaining a new set of m center points; assign the remaining objects to the cluster of the center point with the largest similarity, and iterate again from this step. If min(E) ≥ 0, the replacement center search ends, finally yielding m cluster center points O_i″.
Compute the similarity between the test text and the m cluster centers. If Sim(D, O_i″) < T_i (where T_i is the within-cluster threshold of the i-th cluster, namely the minimum similarity between a text of the cluster and its center), the test text is quite dissimilar to the texts in that cluster, so the texts the cluster contains can be pruned; if Sim(D, O_i″) ≥ T_i, the texts contained in the cluster are added to the new training text set S_new.
Step 6: perform KNN classification.
The training text set is S_new, the test text is d, n is the feature-vector dimension, and K takes the values 5, 10, 15, 20, 25, 30. Use the cosine of the vector angle to compute the similarity between the test text d and every text in S_new; select the K texts with the largest similarities as the K nearest neighbors of d; compute the weight of d for each class, and assign d to the class with the largest weight.
If the known class of training text d_i is C_j, the importance of d_i for class C_j is defined as the representativeness function u(d_i, C_j). The representativeness function is built from two quantities: the Euclidean distance from d_i to the center of its class C_j, and the cosine similarity between d_i and that class center; the center vector of class C_j is obtained by summing all text vectors of C_j and averaging. The weight of test text d for each class C_j is then computed over its K nearest neighbors, where y(d_i, C_j) is the class-attribute function into which the representativeness function is introduced.
Table 2. Experimental results of the three algorithms
Table 3. Time performance
Claims (1)
1. A KNN text classification method based on improved K-Medoids, characterized in that it comprises the following steps:
Step 1: download a published Chinese corpus, consisting of a training text set and a test text set, from the Internet;
Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training and test text sets;
Step 3: apply document frequency (DF) feature selection to the segmented training text set, obtaining the feature dictionary corresponding to this training set;
Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary, where the weight of each dimension is computed as TFIDF = TF × IDF; TF (term frequency) is the number of times the feature word occurs in the document, and IDF (inverse document frequency) is IDF = log(M/n_k + 0.01), where M is the number of texts in the collection and n_k is the number of documents containing the word;
Step 5: prune the training samples based on the improved K-Medoids algorithm (the training text set is denoted S; it comprises N classes C_1, C_2, …, C_N and M texts in total);
Step 5.1: for the training text set S, partition it into m clusters, with m = 3 × N;
Step 5.2: randomly select a center point O_i (0 < i ≤ m) for each cluster;
Step 5.3: compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster with the largest similarity; the cosine similarity is

$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j\, x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^{2}}\;\sqrt{\sum_{j=1}^{n} x_{ij}^{2}}}$$

where n is the feature-vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_ij is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n);
Step 5.4: optimize the initial center-point selection; within each cluster, treat every point in turn as a candidate center, compute the sum of its similarities to the other texts in the cluster, and take the point with the largest similarity sum as the new center point O_i′;
Step 5.5: select a center point O_i′ that has not yet been selected; this is the j-th iteration (j runs from 0 to m), for m iterations in total; the replacement center-point set U is no longer the set of all non-center points, but the neighborhood of O_i′, namely the region formed by all non-center texts contained in the j clusters nearest to O_i′;
Step 5.6: select a non-center point Q in the candidate set U that has not yet been selected, compute the difference in squared error between Q and O_i′, and record it in the set E; repeat until all non-center points in U have been selected;
Step 5.7: if min(E) < 0 (the minimum value in E is less than 0), replace the former center point with the non-center point corresponding to that minimum value, obtaining a new set of m center points after the replacement; assign the remaining objects to the cluster of the center point with the largest similarity, and resume from Step 5.5;
Step 5.8: if min(E) ≥ 0, the replacement center search ends, finally yielding m cluster center points O_i″;
Step 5.9: compute the similarity between the test text and the m cluster centers; if Sim(D, O_i″) < T_i (where T_i is the within-cluster threshold of the i-th cluster, namely the minimum similarity between a text of the cluster and its center), the test text is quite dissimilar to the texts in that cluster, so the texts the cluster contains can be pruned; if Sim(D, O_i″) ≥ T_i, the texts contained in the cluster are added to the new training text set S_new;
Step 6: perform KNN classification;
The training text set is S_new, the test text is d, n is the feature-vector dimension, and K takes the values 5, 10, 15, 20, 25, 30;
Step 6.1: use the cosine of the vector angle to compute the similarity between the test text d and every text in S_new;
Step 6.2: select the K texts with the largest similarities from Step 6.1 as the K nearest neighbors of the test text d;
Step 6.3: compute the weight of the test text d for each class, and assign d to the class with the largest weight;
If the known class of training text d_i is C_j, the importance of d_i for class C_j is defined as the representativeness function u(d_i, C_j); the representativeness function is built from the Euclidean distance from d_i to the center of its class C_j and from the cosine similarity between d_i and that class center, where the center vector of class C_j is obtained by summing all text vectors of C_j and averaging; the weight of test text d for each class C_j is then computed over its K nearest neighbors, where y(d_i, C_j) is the class-attribute function into which the representativeness function is introduced.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510740516.4A CN105426426B (en) | 2015-11-04 | 2015-11-04 | A kind of KNN file classification methods based on improved K-Medoids |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105426426A true CN105426426A (en) | 2016-03-23 |
CN105426426B CN105426426B (en) | 2018-11-02 |
Family
ID=55504638
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510740516.4A Expired - Fee Related CN105426426B (en) | 2015-11-04 | 2015-11-04 | A kind of KNN file classification methods based on improved K-Medoids |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105426426B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020042793A1 (en) * | 2000-08-23 | 2002-04-11 | Jun-Hyeog Choi | Method of order-ranking document clusters using entropy data and bayesian self-organizing feature maps |
CN102033949A (en) * | 2010-12-23 | 2011-04-27 | 南京财经大学 | Correction-based K nearest neighbor text classification method |
CN103092931A (en) * | 2012-12-31 | 2013-05-08 | 武汉传神信息技术有限公司 | Multi-strategy combined document automatic classification method |
CN103345528A (en) * | 2013-07-24 | 2013-10-09 | 南京邮电大学 | Text classification method based on correlation analysis and KNN |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
Non-Patent Citations (1)
Title |
---|
LI B ET AL.: "An improved K-nearest-neighbor algorithm for text categorization", Expert Systems with Applications |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021578B (en) * | 2016-06-01 | 2019-07-23 | 南京邮电大学 | A kind of modified text classification algorithm based on cluster and degree of membership fusion |
CN106021578A (en) * | 2016-06-01 | 2016-10-12 | 南京邮电大学 | Improved text classification algorithm based on integration of cluster and membership degree |
CN106971005A (en) * | 2017-04-27 | 2017-07-21 | 杭州杨帆科技有限公司 | Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment |
CN107273416B (en) * | 2017-05-05 | 2021-05-04 | 深信服科技股份有限公司 | Webpage hidden link detection method and device and computer readable storage medium |
CN107273416A (en) * | 2017-05-05 | 2017-10-20 | 深信服科技股份有限公司 | The dark chain detection method of webpage, device and computer-readable recording medium |
CN107463705A (en) * | 2017-08-17 | 2017-12-12 | 陕西优百信息技术有限公司 | A kind of data cleaning method |
CN107562853A (en) * | 2017-08-28 | 2018-01-09 | 武汉烽火普天信息技术有限公司 | A kind of method that streaming towards magnanimity internet text notebook data is clustered and showed |
CN107832456A (en) * | 2017-11-24 | 2018-03-23 | 云南大学 | A kind of parallel KNN file classification methods based on the division of critical Value Data |
CN107832456B (en) * | 2017-11-24 | 2021-11-26 | 云南大学 | Parallel KNN text classification method based on critical value data division |
CN108154178A (en) * | 2017-12-25 | 2018-06-12 | 北京工业大学 | Semi-supervised support attack detection method based on improved SVM-KNN algorithms |
CN108959453A (en) * | 2018-06-14 | 2018-12-07 | 中南民族大学 | Information extracting method, device and readable storage medium storing program for executing based on text cluster |
CN108959453B (en) * | 2018-06-14 | 2021-08-27 | 中南民族大学 | Information extraction method and device based on text clustering and readable storage medium |
CN110969172A (en) * | 2018-09-28 | 2020-04-07 | 武汉斗鱼网络科技有限公司 | Text classification method and related equipment |
CN109543739A (en) * | 2018-11-15 | 2019-03-29 | 杭州安恒信息技术股份有限公司 | Log classification method, device, equipment and readable storage medium |
CN109766437A (en) * | 2018-12-07 | 2019-05-17 | 中科恒运股份有限公司 | Text clustering method, text clustering device and terminal device |
CN109960799B (en) * | 2019-03-12 | 2021-07-27 | 中南大学 | Optimized classification method for short texts |
CN110287328A (en) * | 2019-07-03 | 2019-09-27 | 广东工业大学 | Text classification method, device, equipment and computer-readable storage medium |
CN111104510A (en) * | 2019-11-15 | 2020-05-05 | 南京中新赛克科技有限责任公司 | Word embedding-based text classification training sample expansion method |
CN111104510B (en) * | 2019-11-15 | 2023-05-09 | 南京中新赛克科技有限责任公司 | Text classification training sample expansion method based on word embedding |
CN113806732A (en) * | 2020-06-16 | 2021-12-17 | 深信服科技股份有限公司 | Webpage tampering detection method, device, equipment and storage medium |
CN113806732B (en) * | 2020-06-16 | 2023-11-03 | 深信服科技股份有限公司 | Webpage tampering detection method, device, equipment and storage medium |
CN112381181A (en) * | 2020-12-11 | 2021-02-19 | 桂林电子科技大学 | Dynamic detection method for building energy consumption anomalies |
CN113553430A (en) * | 2021-07-20 | 2021-10-26 | 中国工商银行股份有限公司 | Data classification method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN105426426B (en) | 2018-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105426426A (en) | KNN text classification method based on improved K-Medoids | |
US10346257B2 (en) | Method and device for deduplicating web page | |
Huang et al. | An improved knn based on class contribution and feature weighting | |
CN105512311A (en) | Adaptive feature selection method based on chi-square statistics | |
CN107844559A (en) | Text classification method, device and electronic equipment | |
Fan et al. | Research on text classification based on improved tf-idf algorithm | |
CN108763348B (en) | Classification improvement method for feature vectors of extended short text words | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN105760889A (en) | Efficient imbalanced data set classification method | |
CN107832456B (en) | Parallel KNN text classification method based on critical value data division | |
CN104391835A (en) | Method and device for selecting feature words in texts | |
CN101021838A (en) | Text handling method and system | |
CN105956031A (en) | Text classification method and apparatus | |
CN102955857A (en) | Class center compression transformation-based text clustering method in search engine | |
CN110543595A (en) | in-station search system and method | |
Fitriyani et al. | The K-means with mini batch algorithm for topics detection on online news | |
CN105893380A (en) | Improved feature selection method for text classification | |
CN108427686A (en) | Text data querying method and device | |
Kristiyanti et al. | E-Wallet Sentiment Analysis Using Naïve Bayes and Support Vector Machine Algorithm | |
CN109800790B (en) | Feature selection method for high-dimensional data | |
CN102929977A (en) | Event tracking method for news websites | |
Ah-Pine et al. | Similarity based hierarchical clustering with an application to text collections | |
Chen et al. | Parallel mining frequent patterns over big transactional data in extended mapreduce | |
CN112417082A (en) | Disambiguation and archival storage method for scientific research achievement data | |
CN111625578A (en) | Feature extraction method for time-series data in the field of culture and technology fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20181102 Termination date: 20211104 |
|