CN105426426A - KNN text classification method based on improved K-Medoids - Google Patents

KNN text classification method based on improved K-Medoids

Info

Publication number
CN105426426A
CN105426426A (application CN201510740516.4A)
Authority
CN
China
Prior art keywords
text
cluster
classification
training
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510740516.4A
Other languages
Chinese (zh)
Other versions
CN105426426B (en)
Inventor
汪友生
樊存佳
王信
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201510740516.4A
Publication of CN105426426A
Application granted
Publication of CN105426426B
Status: Expired - Fee Related

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a KNN (K-Nearest-Neighbor) text classification method based on an improved K-Medoids algorithm, and relates to the field of computer text data processing. The method comprises the following steps: preprocessing a training text set and a test text set, where preprocessing comprises word segmentation, stop-word removal, DF feature selection, and vector representation, so as to obtain a training text vector space and a test text vector space; pruning the training samples on the basis of an improved K-Medoids method, namely optimizing both the selection of initial center points and the search strategy for replacement center points, and applying this to training sample pruning so as to obtain a new training text space; and finally carrying out KNN classification, defining a representative degree function and introducing it into the category attribute function used by KNN, so as to obtain the final result. Experimental results show that, compared with the conventional KNN method and a KNN method based on standard K-Medoids, the proposed method achieves higher classification accuracy and classification efficiency.

Description

A KNN text classification method based on an improved K-Medoids algorithm
Technical field
The present invention relates to the field of computer text data processing, and in particular to a K-Nearest-Neighbor (KNN) text classification method based on an improved K-Medoids algorithm.
Background technology
With the development of the Internet, the Internet of Things, and cloud computing, data volumes are growing exponentially, leading us into the era of big data. The International Data Corporation (IDC) reports that data on the Internet grows by roughly 50% per year, and that more than 90% of the data in the world today was produced within the last few years. The global data volume has already reached the zettabyte (ZB) scale, and this mass of data carries enormous potential value.

In the big-data era, mining the potential value of data is of great importance, and data mining, as the technology for discovering that value, has attracted wide attention. Text makes up a considerable proportion of big data, and text classification, as an effective data mining method for organizing and managing text data, has gradually become a focus of research. It is widely applied in information filtering, information organization and management, information retrieval, digital libraries, spam filtering, and similar areas. Text Classification (TC) refers to the process of automatically assigning a text of unknown class to one or more classes according to its content, under a classification system given in advance. Common text classification methods include K-Nearest-Neighbor, Naive Bayes (NB), and the Support Vector Machine (SVM).

KNN, as one of the classic classification methods, is simple to implement and highly robust; however, it also has shortcomings that prevent its use in many practical applications. Its deficiencies mainly lie in two aspects. First, the huge amount of similarity computation during classification consumes substantial time, resulting in low classification efficiency. Second, classification performance is easily affected by the training samples; when the data are severely unevenly distributed, classifier performance can be seriously degraded. Regarding the large computational cost of KNN classification, the improvements proposed by researchers can be summarized in three directions: first, improving feature selection, so that feature words contributing little to classification are discarded and effective dimensionality reduction of the Vector Space Model (VSM) is achieved; second, selecting some representative texts from the original training set as a new training set, or deleting from the original training set those texts that contribute little to classification and keeping the rest as the new training set; third, designing fast search algorithms to accelerate the search for the K nearest neighbor texts of a test text. Considering that current improved KNN algorithms find it difficult to achieve both speed and accuracy, designing a KNN text classification method that is both accurate and fast has important academic significance and practical value.
Summary of the invention
The object of the present invention is to improve the KNN text classification algorithm in terms of both classification speed and classification accuracy. On the one hand, to improve classification speed, an improved K-Medoids clustering algorithm is adopted to prune the training samples that contribute little to KNN classification; on the other hand, to improve classification accuracy, a representative degree function is defined and introduced into the KNN algorithm, so that the K nearest neighbor texts of a test text are treated differentially.

The features of the present invention are as follows:
Step 1: download a published Chinese corpus from the Internet, consisting of a training text set and a test text set.

Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training text set and test text set.

Step 3: apply the Document Frequency (DF) feature selection method to the segmented training text set, obtaining the feature dictionary corresponding to this training set.

Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary. The weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature term occurs in the document, and IDF (Inverse Document Frequency) is given by IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the term.
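As an illustration of the step-4 weighting, the following sketch builds TF-IDF vectors from already-segmented documents; the function name, data layout, and use of raw term counts for TF are our own assumptions, not part of the patent:

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Represent each segmented document as a TF-IDF vector over vocab.

    docs  : list of documents, each a list of feature words
    vocab : ordered feature words of the DF-selected dictionary
    Uses the smoothed IDF of the patent: log(M / n_k + 0.01).
    """
    M = len(docs)
    df = Counter()                      # n_k: number of docs containing each word
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # TF: occurrences of the term in this doc
        vectors.append([tf[w] * math.log(M / df[w] + 0.01) if df[w] else 0.0
                        for w in vocab])
    return vectors
```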
Step 5: training sample pruning based on the improved K-Medoids algorithm (let the training text set be S, containing the N classes C_1, C_2, ..., C_N and M texts in total).

Step 5.1: for the training text set S, specify that it is to be divided into m clusters, with m = 3 × N.

Step 5.2: randomly select a center point O_i for each cluster (0 < i ≤ m).
Step 5.3: compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster whose center is most similar. The cosine similarity is computed as:

$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}$$

where n is the feature vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n).
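A direct transcription of this formula and of the step-5.3 assignment, with helper names of our own choosing (reused by the later sketches):

```python
import math

def cosine_sim(x, y):
    """Cosine similarity Sim(d, O_i) of two equal-length weight vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

def assign_to_clusters(texts, centers):
    """Step 5.3: put each text vector into the cluster of its most similar center."""
    clusters = [[] for _ in centers]
    for t in texts:
        best = max(range(len(centers)), key=lambda i: cosine_sim(t, centers[i]))
        clusters[best].append(t)
    return clusters
```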
Step 5.4: optimization of initial center point selection. Within each cluster, take each point in turn as a candidate center and compute the sum of its similarities to the other texts in the cluster; the point with the largest similarity sum becomes the new center point O_i'.
Step 5.5: select a center point O_i' that has not been selected before; this is the j-th iteration (j running from 0 to m), for m iterations in total. The replacement candidate set U is no longer the set of all non-center points, but a neighborhood of O_i': the region formed by all non-center texts contained in the j clusters nearest to the center O_i'.

Step 5.6: select a non-center point Q in the candidate set U that has not yet been selected, compute the difference in squared error between Q and O_i', and record it in the set E, until every non-center point in U has been selected.

Step 5.7: if min(E) < 0 (the minimum value in E is less than 0), replace the original center with the non-center point corresponding to that minimum, obtaining a new set of m center points; assign the remaining objects to the cluster of the center with the largest similarity, and resume from Step 5.5.

Step 5.8: if min(E) ≥ 0, the replacement center search ends, finally yielding the m cluster center points O_i''.

Step 5.9: compute the similarity between the test text and the m cluster centers. If Sim(D, O_i'') < T_i (where T_i is the in-cluster threshold of the i-th cluster, i.e. the minimum similarity between a text of the cluster and the cluster center), the test text has quite low similarity to the texts in that cluster, so the texts contained in that cluster can be pruned; if Sim(D, O_i'') ≥ T_i, the texts contained in that cluster are added to the new training text set S_new.
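The sketch below condenses steps 5.1-5.9 into runnable form. It is a simplification under our own reading: the neighborhood-restricted swap search of steps 5.5-5.8 is replaced by repeatedly re-electing, inside each cluster, the member with the largest similarity sum as medoid until the medoids stabilize, which keeps the spirit (cheaper medoid updates) but not the letter of the patented search. It reuses cosine_sim and assign_to_clusters from the previous sketch.

```python
import random

def improved_kmedoids_prune(train_vecs, test_vec, n_classes):
    """Simplified sketch of steps 5.1-5.9: cluster, refine medoids, then
    prune whole clusters whose center is too dissimilar to the test text."""
    m = 3 * n_classes                                        # step 5.1
    centers = random.sample(train_vecs, m)                   # step 5.2
    for _ in range(100):                                     # guard against oscillation
        clusters = assign_to_clusters(train_vecs, centers)   # step 5.3
        # steps 5.4-5.8 (approximated): medoid = member with the largest
        # similarity sum to the other texts of its cluster
        new_centers = [max(c, key=lambda p: sum(cosine_sim(p, q) for q in c))
                       for c in clusters if c]
        if new_centers == centers:
            break
        centers = new_centers
    clusters = assign_to_clusters(train_vecs, centers)

    s_new = []                                               # step 5.9
    for center, cluster in zip(centers, clusters):
        if not cluster:
            continue
        t_i = min(cosine_sim(x, center) for x in cluster)    # in-cluster threshold T_i
        if cosine_sim(test_vec, center) >= t_i:              # keep this cluster
            s_new.extend(cluster)
    return s_new
```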
Step 6: perform KNN classification.

The training text set is S_new, the test text is d, n is the feature vector dimension, and K takes the values 5, 10, 15, 20, 25, 30.

Step 6.1: compute the similarity between test text d and every text in S_new using the cosine of the angle between their vectors.

Step 6.2: select the K texts with the largest similarities from Step 6.1 as the K nearest neighbor texts of test text d.

Step 6.3: compute the weight of test text d with respect to each class, and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the importance of d_i for class C_j is then defined by the representative degree function u(d_i, C_j):

$$u(d_i, C_j) = \frac{1}{\mathrm{Dist}(d_i, \bar{C}_j)} \times \mathrm{Sim}(d_i, \bar{C}_j)$$

where $\bar{C}_j$ denotes the center vector of class C_j, obtained by summing all text vectors of class C_j and averaging; Dist(d_i, $\bar{C}_j$) is the Euclidean distance from training text d_i to the center of its class C_j, and Sim(d_i, $\bar{C}_j$) is the cosine similarity between d_i and the center of its class C_j.
The weight is computed as:

$$W(d, C_j) = \sum_{i=1}^{K} \mathrm{Sim}(d, d_i)\, y(d_i, C_j)$$

where y(d_i, C_j) is the category attribute function into which the representative degree function is introduced, so that each of the K neighbors contributes according to how representative it is of its class rather than uniformly.
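Since the explicit formula of the modified category attribute function is not reproduced in the text above, the sketch below assumes the natural reading y(d_i, C_j) = u(d_i, C_j) when d_i is labeled C_j and 0 otherwise; this assumption, and all identifiers, are ours. It reuses cosine_sim from the earlier sketch.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def class_center(vectors):
    """Center vector of one class: the mean of all its text vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def representativeness(d_i, center):
    """u(d_i, C_j) = Sim(d_i, center) / Dist(d_i, center)."""
    dist = euclidean(d_i, center)
    return cosine_sim(d_i, center) / dist if dist else float("inf")

def knn_classify(d, train, labels, centers, k=10):
    """Steps 6.1-6.3 with the assumed category attribute function.

    centers : dict mapping class label -> class center vector
    y(d_i, C_j) = u(d_i, C_j) if label(d_i) == C_j else 0 (our assumption).
    """
    neighbors = sorted(zip(train, labels),
                       key=lambda tl: cosine_sim(d, tl[0]),
                       reverse=True)[:k]                     # steps 6.1-6.2
    weights = {}
    for d_i, lab in neighbors:                               # step 6.3
        w = cosine_sim(d, d_i) * representativeness(d_i, centers[lab])
        weights[lab] = weights.get(lab, 0.0) + w
    return max(weights, key=weights.get)
```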
Effects of the present invention:

The present invention provides a KNN text classification method based on an improved K-Medoids algorithm, achieving fast and accurate classification of test texts. The flowchart is shown in Fig. 1; the accuracy results are listed in Table 1 (the traditional KNN algorithm and the present algorithm achieve their best classification results at K = 5 and K = 10 respectively, and only the best results of the two methods are given), and the time results in Table 2. Compared with the traditional KNN method, the invention on the one hand defines a representative degree function and introduces it into the category attribute function of the classic method, so that the K nearest neighbor texts of a test text are treated differentially, improving classification accuracy; on the other hand, it adopts the improved K-Medoids clustering method to prune the original training sample set, improving classification efficiency. Compared with the KNN method based on standard K-Medoids, the invention optimizes both the initial center point selection and the replacement center search strategy, which first reduces the sensitivity of K-Medoids to initial center points, and second speeds up its replacement center search. As can be seen from Tables 1 and 2, compared with the traditional KNN method and the K-Medoids-based KNN method, the invention shows clear improvements in both classification accuracy and classification efficiency.
Brief description of the drawings

Fig. 1 is the flowchart of the method of the invention.
Embodiment
The present invention is implemented by the following technical means:

A KNN text classification method based on an improved K-Medoids algorithm. First, the training and test text sets are preprocessed: word segmentation and stop-word removal are carried out, DF feature selection is performed, and every training and test text is represented in vector form. Then the improved K-Medoids method is used to prune the training texts, yielding a new training text set S_new. Finally, a representative degree function is defined and introduced into the category attribute function of the original KNN algorithm for KNN classification.

The above improved KNN text classification method comprises the following steps:
Step 1: download a published Chinese corpus from the Internet, consisting of a training text set and a test text set.

Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training text set and test text set.

Step 3: apply the Document Frequency (DF) feature selection method to the segmented training text set, obtaining the feature dictionary corresponding to this training set.

Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary; the weight of each dimension is computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature term occurs in the document, and IDF (Inverse Document Frequency) is given by IDF = log(M/n_k + 0.01), where M is the number of texts in the document collection and n_k is the number of documents containing the term.
Step 5: training sample pruning based on the improved K-Medoids algorithm.

Let the training text set be S, containing the N classes C_1, C_2, ..., C_N and M texts in total. For S, specify that it is to be divided into m clusters, with m = 3 × N. Randomly select a center point O_i for each cluster (0 < i ≤ m). Compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster whose center is most similar; the cosine similarity is computed as:

$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}} \qquad (3)$$

where n is the feature vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n).
Optimization of initial center point selection: within each cluster, take each point in turn as a candidate center and compute the sum of its similarities to the other texts in the cluster; the point with the largest similarity sum becomes the new center point O_i'.

Select a center point O_i' that has not been selected before; this is the j-th iteration (j running from 0 to m), for m iterations in total. The replacement candidate set U is no longer the set of all non-center points, but a neighborhood of O_i': the region formed by all non-center texts contained in the j clusters nearest to the center O_i'. In the candidate set U, select a non-center point Q that has not yet been selected, compute the difference in squared error between Q and O_i', and record it in the set E, until every non-center point in U has been selected. If min(E) < 0 (the minimum value in E is less than 0), replace the original center with the non-center point corresponding to that minimum, obtaining a new set of m center points; assign the remaining objects to the cluster of the center with the largest similarity, and iterate from this step again. If min(E) ≥ 0, the replacement center search ends, finally yielding the m cluster center points O_i''.

Compute the similarity between the test text and the m cluster centers. If Sim(D, O_i'') < T_i (where T_i is the in-cluster threshold of the i-th cluster, i.e. the minimum similarity between a text of the cluster and the cluster center), the test text has quite low similarity to the texts in that cluster, so the texts contained in that cluster can be pruned; if Sim(D, O_i'') ≥ T_i, the texts contained in that cluster are added to the new training text set S_new.
Step 6: perform KNN classification.

The training text set is S_new, the test text is d, n is the feature vector dimension, and K takes the values 5, 10, 15, 20, 25, 30.

Compute the similarity between test text d and every text in S_new using the cosine of the angle between their vectors; select the K texts with the largest similarities as the K nearest neighbors of test text d; compute the weight of d with respect to each class, and assign d to the class with the largest weight.
Let the known class of training text d_i be C_j; the importance of d_i for class C_j is then defined by the representative degree function u(d_i, C_j):

$$u(d_i, C_j) = \frac{1}{\mathrm{Dist}(d_i, \bar{C}_j)} \times \mathrm{Sim}(d_i, \bar{C}_j) \qquad (4)$$

where $\bar{C}_j$ denotes the center vector of class C_j, obtained by summing all text vectors of class C_j and averaging; Dist(d_i, $\bar{C}_j$) is the Euclidean distance from training text d_i to the center of its class C_j, and Sim(d_i, $\bar{C}_j$) is the cosine similarity between d_i and the center of its class C_j. The weight is computed as:
$$W(d, C_j) = \sum_{i=1}^{K} \mathrm{Sim}(d, d_i)\, y(d_i, C_j) \qquad (5)$$

where y(d_i, C_j) is the category attribute function into which the representative degree function is introduced.
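For illustration, the sketches given earlier can be chained into a toy end-to-end run of the embodiment; the corpus, labels, and the fallback used when the pruning step removes everything are all our own invented scaffolding:

```python
# Toy corpus: N = 2 classes, so the pruning step uses m = 6 clusters.
train_docs = [["stock", "market", "fund"], ["market", "trade", "stock", "bond"],
              ["fund", "bond", "trade"], ["stock", "bond", "market"],
              ["trade", "fund", "market"], ["bond", "stock", "fund"],
              ["goal", "match", "team"], ["team", "league", "goal"],
              ["match", "score", "team"], ["league", "score", "match"],
              ["goal", "team", "score"], ["match", "team", "league"]]
labels = ["finance"] * 6 + ["sport"] * 6
test_doc = ["stock", "fund", "trade"]

vocab = sorted({w for doc in train_docs for w in doc})
vecs = tfidf_vectors(train_docs + [test_doc], vocab)  # test shares the statistics
train_vecs, test_vec = vecs[:-1], vecs[-1]

centers = {lab: class_center([v for v, l in zip(train_vecs, labels) if l == lab])
           for lab in set(labels)}
s_new = improved_kmedoids_prune(train_vecs, test_vec, n_classes=2)
if not s_new:                        # tiny corpora may prune every cluster
    s_new = train_vecs
kept_labels = [labels[train_vecs.index(v)] for v in s_new]
print(knn_classify(test_vec, s_new, kept_labels, centers, k=5))  # usually 'finance'
```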
Table 2. Experimental results of the three algorithms (table body not reproduced)

Table 3. Time performance (table body not reproduced)

Claims (1)

1. A KNN text classification method based on an improved K-Medoids algorithm, characterized in that it comprises the following steps:
Step 1: download a published Chinese corpus from the Internet, consisting of a training text set and a test text set;

Step 2: use the word segmentation software ICTCLAS to segment the training and test text sets and remove stop words, obtaining the segmented training text set and test text set;

Step 3: apply the Document Frequency (DF) feature selection method to the segmented training text set, obtaining the feature dictionary corresponding to this training set;

Step 4: represent each training text and each test text as a vector over the feature words of the feature dictionary, the weight of each dimension being computed as TFIDF = TF × IDF, where TF (Term Frequency) is the number of times the feature term occurs in the document and IDF (Inverse Document Frequency) is given by IDF = log(M/n_k + 0.01), M being the number of texts in the document collection and n_k the number of documents containing the term;
Step 5: training sample pruning based on the improved K-Medoids algorithm (let the training text set be S, containing the N classes C_1, C_2, ..., C_N and M texts in total);

Step 5.1: for the training text set S, specify that it is to be divided into m clusters, with m = 3 × N;

Step 5.2: randomly select a center point O_i for each cluster (0 < i ≤ m);

Step 5.3: compute the cosine similarity between each remaining non-center text in S and the m center points, and assign each text to the cluster whose center is most similar, the cosine similarity being computed as:
$$\mathrm{Sim}(d, O_i) = \frac{\sum_{j=1}^{n} X_j x_{ij}}{\sqrt{\sum_{j=1}^{n} X_j^2}\,\sqrt{\sum_{j=1}^{n} x_{ij}^2}}$$

where n is the feature vector dimension, X_j is the weight of the j-th dimension of a remaining non-center text d in S (0 < j ≤ n), and x_{ij} is the weight of the j-th dimension of center text O_i (0 < i ≤ m, 0 < j ≤ n);
Step 5.4: optimization of initial center point selection: within each cluster, take each point in turn as a candidate center and compute the sum of its similarities to the other texts in the cluster; the point with the largest similarity sum becomes the new center point O_i';

Step 5.5: select a center point O_i' that has not been selected before, this being the j-th iteration (j running from 0 to m), for m iterations in total; the replacement candidate set U is no longer the set of all non-center points but a neighborhood of O_i', namely the region formed by all non-center texts contained in the j clusters nearest to the center O_i';

Step 5.6: select a non-center point Q in the candidate set U that has not yet been selected, compute the difference in squared error between Q and O_i', and record it in the set E, until every non-center point in U has been selected;

Step 5.7: if min(E) < 0 (the minimum value in E is less than 0), replace the original center with the non-center point corresponding to that minimum, obtaining a new set of m center points; assign the remaining objects to the cluster of the center with the largest similarity, and resume from Step 5.5;

Step 5.8: if min(E) ≥ 0, the replacement center search ends, finally yielding the m cluster center points O_i'';

Step 5.9: compute the similarity between the test text and the m cluster centers; if Sim(D, O_i'') < T_i (where T_i is the in-cluster threshold of the i-th cluster, i.e. the minimum similarity between a text of the cluster and the cluster center), the test text has quite low similarity to the texts in that cluster, so the texts contained in that cluster can be pruned; if Sim(D, O_i'') ≥ T_i, the texts contained in that cluster are added to the new training text set S_new;
Step 6: perform KNN classification;

the training text set is S_new, the test text is d, n is the feature vector dimension, and K takes the values 5, 10, 15, 20, 25, 30;

Step 6.1: compute the similarity between test text d and every text in S_new using the cosine of the angle between their vectors;

Step 6.2: select the K texts with the largest similarities from Step 6.1 as the K nearest neighbor texts of test text d;

Step 6.3: compute the weight of test text d with respect to each class, and assign d to the class with the largest weight;

let the known class of training text d_i be C_j; the importance of d_i for class C_j is then defined by the representative degree function u(d_i, C_j):
$$u(d_i, C_j) = \frac{1}{\mathrm{Dist}(d_i, \bar{C}_j)} \times \mathrm{Sim}(d_i, \bar{C}_j)$$

where $\bar{C}_j$ denotes the center vector of class C_j, obtained by summing all text vectors of class C_j and averaging; Dist(d_i, $\bar{C}_j$) is the Euclidean distance from training text d_i to the center of its class C_j, and Sim(d_i, $\bar{C}_j$) is the cosine similarity between d_i and the center of its class C_j;

the weight is computed as:

$$W(d, C_j) = \sum_{i=1}^{K} \mathrm{Sim}(d, d_i)\, y(d_i, C_j)$$

where y(d_i, C_j) is the category attribute function into which the representative degree function is introduced.
CN201510740516.4A 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids Expired - Fee Related CN105426426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510740516.4A CN105426426B (en) 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510740516.4A CN105426426B (en) 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids

Publications (2)

Publication Number Publication Date
CN105426426A true CN105426426A (en) 2016-03-23
CN105426426B CN105426426B (en) 2018-11-02

Family

ID=55504638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510740516.4A Expired - Fee Related CN105426426B (en) 2015-11-04 2015-11-04 KNN text classification method based on improved K-Medoids

Country Status (1)

Country Link
CN (1) CN105426426B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020042793A1 (en) * 2000-08-23 2002-04-11 Jun-Hyeog Choi Method of order-ranking document clusters using entropy data and bayesian self-organizing feature maps
CN102033949A (en) * 2010-12-23 2011-04-27 南京财经大学 Correction-based K nearest neighbor text classification method
CN103092931A (en) * 2012-12-31 2013-05-08 武汉传神信息技术有限公司 Multi-strategy combined document automatic classification method
CN103345528A (en) * 2013-07-24 2013-10-09 南京邮电大学 Text classification method based on correlation analysis and KNN
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI B ET AL.: "An improved k-nearest-neighbor algorithm for text categorization", Expert Systems with Applications *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021578B * 2016-06-01 2019-07-23 南京邮电大学 Improved text classification algorithm based on the fusion of clustering and membership degree
CN106021578A * 2016-06-01 2016-10-12 南京邮电大学 Improved text classification algorithm based on the fusion of clustering and membership degree
CN106971005A * 2017-04-27 2017-07-21 杭州杨帆科技有限公司 Distributed parallel text clustering method based on MapReduce in a cloud computing environment
CN107273416B * 2017-05-05 2021-05-04 深信服科技股份有限公司 Webpage hidden link detection method and device, and computer-readable storage medium
CN107273416A * 2017-05-05 2017-10-20 深信服科技股份有限公司 Webpage hidden link detection method and device, and computer-readable storage medium
CN107463705A * 2017-08-17 2017-12-12 陕西优百信息技术有限公司 Data cleaning method
CN107562853A * 2017-08-28 2018-01-09 武汉烽火普天信息技术有限公司 Method for streaming clustering and presentation of massive Internet text data
CN107832456A * 2017-11-24 2018-03-23 云南大学 Parallel KNN text classification method based on critical value data division
CN107832456B * 2017-11-24 2021-11-26 云南大学 Parallel KNN text classification method based on critical value data division
CN108154178A * 2017-12-25 2018-06-12 北京工业大学 Semi-supervised shilling attack detection method based on improved SVM-KNN algorithm
CN108959453A * 2018-06-14 2018-12-07 中南民族大学 Information extraction method and device based on text clustering, and readable storage medium
CN108959453B * 2018-06-14 2021-08-27 中南民族大学 Information extraction method and device based on text clustering, and readable storage medium
CN110969172A * 2018-09-28 2020-04-07 武汉斗鱼网络科技有限公司 Text classification method and related equipment
CN109543739A * 2018-11-15 2019-03-29 杭州安恒信息技术股份有限公司 Log classification method, device, equipment and readable storage medium
CN109766437A * 2018-12-07 2019-05-17 中科恒运股份有限公司 Text clustering method, text clustering device and terminal device
CN109960799B * 2019-03-12 2021-07-27 中南大学 Optimized classification method for short texts
CN110287328A * 2019-07-03 2019-09-27 广东工业大学 Text classification method, device, equipment and computer-readable storage medium
CN111104510A * 2019-11-15 2020-05-05 南京中新赛克科技有限责任公司 Text classification training sample expansion method based on word embedding
CN111104510B * 2019-11-15 2023-05-09 南京中新赛克科技有限责任公司 Text classification training sample expansion method based on word embedding
CN113806732A * 2020-06-16 2021-12-17 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
CN113806732B * 2020-06-16 2023-11-03 深信服科技股份有限公司 Webpage tampering detection method, device, equipment and storage medium
CN112381181A * 2020-12-11 2021-02-19 桂林电子科技大学 Dynamic detection method for abnormal building energy consumption
CN113553430A * 2021-07-20 2021-10-26 中国工商银行股份有限公司 Data classification method, device and equipment

Also Published As

Publication number Publication date
CN105426426B (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN105426426A (en) KNN text classification method based on improved K-Medoids
US10346257B2 (en) Method and device for deduplicating web page
Huang et al. An improved knn based on class contribution and feature weighting
CN105512311A Adaptive feature selection method based on chi-square statistics
CN107844559A Text classification method and device, and electronic equipment
Fan et al. Research on text classification based on improved tf-idf algorithm
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN105760889A (en) Efficient imbalanced data set classification method
CN107832456B (en) Parallel KNN text classification method based on critical value data division
CN104391835A (en) Method and device for selecting feature words in texts
CN101021838A (en) Text handling method and system
CN105956031A (en) Text classification method and apparatus
CN102955857A (en) Class center compression transformation-based text clustering method in search engine
CN110543595A (en) in-station search system and method
Fitriyani et al. The K-means with mini batch algorithm for topics detection on online news
CN105893380A (en) Improved text classification characteristic selection method
CN108427686A (en) Text data querying method and device
Kristiyanti et al. E-Wallet Sentiment Analysis Using Naïve Bayes and Support Vector Machine Algorithm
CN109800790B (en) Feature selection method for high-dimensional data
CN102929977A (en) Event tracing method aiming at news website
Ah-Pine et al. Similarity based hierarchical clustering with an application to text collections
Chen et al. Parallel mining frequent patterns over big transactional data in extended mapreduce
CN112417082A (en) Scientific research achievement data disambiguation filing storage method
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20181102
Termination date: 20211104