CN103279556B - Iterative text clustering method based on adaptive subspace learning - Google Patents

Iterative text clustering method based on adaptive subspace learning


Publication number
CN103279556B
CN103279556B (application CN201310230981.4A)
Authority
CN
China
Prior art keywords
text
subspace
iteration
cluster
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310230981.4A
Other languages
Chinese (zh)
Other versions
CN103279556A (en)
Inventor
吴娴
杨兴锋
张东明
何崑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Southern Newspaper Media Group New Media Co., Ltd.
Original Assignee
NANFANG DAILY GROUP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANFANG DAILY GROUP
Priority to CN201310230981.4A
Publication of CN103279556A
Application granted
Publication of CN103279556B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an iterative text clustering method based on adaptive subspace learning, comprising the following steps. (1) Initialization: the text corpus is expressed as a text vector space; the affinity propagation clustering method (K-AP) produces K initial clusters, and the cluster assignments of all texts are recorded in an initial class-membership indicator matrix. (2) Iteration between subspace projection and clustering: taking the current class-membership indicator matrix as prior knowledge, the subspace projection matrix is solved with the objective of maximizing the average neighborhood margin; the text vector space is projected into the subspace, where the affinity propagation clustering method again produces K clusters, updating the class-membership indicator matrix. A convergence-function value is computed from the subspace projection matrix and the class-membership indicator matrix; when the function converges, the iteration exits and the text clustering is complete. The present invention places no restriction on the size or distribution of the text data; subspace solving and clustering are fused in a unified framework, and the iterative strategy yields a globally optimal clustering result.

Description

Iterative text clustering method based on adaptive subspace learning
Technical field
The present invention relates to the fields of machine learning and pattern recognition, and in particular to an iterative text clustering method based on adaptive subspace learning. The method builds on an adaptive subspace learning technique that maximizes the average neighborhood margin and applies it to the text clustering problem through an iterative strategy.
Background technology
With the spread and development of Internet and database technology, people can easily acquire and store large amounts of data, much of which exists in the form of text. Text clustering, as a means of organizing, summarizing, and navigating textual information, helps users accurately locate the information they need within vast text resources, and has therefore attracted wide attention.
In text clustering, documents are commonly represented with the vector space model (VSM), but this representation suffers from the high dimensionality and sparsity of the feature space. In this situation, the large amount of redundancy among uncorrelated dimensions weakens the discriminative power of the similarity measure used by the clustering algorithm, degrading the final clustering performance. The usual remedy is to project the text vector space into a subspace with a dimensionality-reduction technique and then partition the low-dimensional document representations into classes with a clustering algorithm. However, such conventional text clustering methods treat dimensionality reduction merely as a preprocessing step for the clustering algorithm, severing the latent connection between subspace projection and clustering; it is therefore difficult to guarantee that the projected subspace is optimal for clustering.
To overcome this limitation, C. Ding et al. published the article "Adaptive dimension reduction for clustering high dimensional data" at the 2002 International Conference on Data Mining (ICDM), proposing the concept of adaptive dimension reduction (ADR), which treats dimensionality reduction as a dynamic process integrated with the clustering process. Owing to its practicality and flexibility, ADR has since developed into several variants:
Li et al. published the article "Document clustering via adaptive subspace iteration" at the 2004 conference of the ACM Special Interest Group on Information Retrieval (ACM SIGIR), completing category partitioning and subspace identification as two simultaneous tasks. However, the subspace identification adopted in that work must model each cluster class separately, turning the solution into a combinatorial optimization problem; in addition, the interdependence of the initialization matrices easily affects the final clustering performance.
Ding and Li published the article "Adaptive dimension reduction using discriminant analysis and K-means clustering" at the 2007 International Conference on Machine Learning (ICML), integrating linear discriminant analysis (LDA) and K-means clustering into the LDA-Km framework for solving the text clustering problem. However, subspace projection based on LDA is prone to the small-sample-size problem and is only relatively effective for text data with a Gaussian distribution.
Ye et al. published the article "Adaptive distance metric learning for clustering" at the 2007 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), proposing the nonlinear adaptive metric learning (NAML) algorithm, which simultaneously casts kernel learning, dimensionality reduction, and clustering as a matrix-trace optimization problem and solves for the clustering result iteratively under the framework of expectation maximization (EM). The limitation of NAML is that its optimization process depends on several key parameters and easily leads to overfitting when the data are insufficient.
Although the idea of adaptive dimension reduction and its related methods can solve particular text clustering problems, the technical deficiencies pointed out above also exist, limiting their range of application and leaving room for improving text clustering algorithms. Studying a text clustering method with stronger generalization and adaptive ability has therefore become a topic of practical significance.
Summary of the invention
The primary objective of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an iterative text clustering method based on adaptive subspace learning. The method fuses subspace solving and clustering into a unified framework and uses an iterative optimization strategy to obtain a globally optimal result. It places no special constraints on the quantity or distribution of the text data, effectively avoids combinatorial optimization problems, and involves few tuning parameters; it therefore has strong generalization and adaptive ability and produces more reasonable partitions.
The objective of the present invention is achieved through the following technical solution. The iterative text clustering method based on adaptive subspace learning comprises the following steps:
(1) Initialization: the text corpus is expressed mathematically as a text vector space; the K-Affinity Propagation (K-AP) clustering method is applied in the text vector space to produce K initial clusters, yielding an initial class-membership indicator matrix that records the class of every document in the corpus;
(2) Iterative optimization between subspace projection and clustering, comprising the following steps:
(2-1) Taking the initial class-membership indicator matrix obtained in step (1) as prior knowledge, solve the subspace projection matrix with the subspace learning method based on Average Neighborhood Margin Maximization (ANMM), and compute the convergence-function value from the initial class-membership indicator matrix and the subspace projection matrix;
(2-2) If the convergence condition is not met, project the original text vector space into the subspace according to the current subspace projection matrix, run the K-AP algorithm in the subspace to produce the specified K clusters, and update the current class-membership indicator matrix;
(2-3) Taking the updated class-membership indicator matrix as prior knowledge, solve the subspace projection matrix again with the ANMM-based subspace learning method, and compute the convergence-function value from the updated class-membership indicator matrix and the subspace projection matrix;
(2-4) Repeat steps (2-2)-(2-3) until the convergence condition is met; stop the iteration, output the final class-membership indicator matrix from the iterative process, and obtain the final cluster assignment of all documents.
Specifically, the initialization in step (1) proceeds as follows: from the segmented representation of all documents in the text corpus, a set of representative terms is selected with the mutual-information method to form the term index; each segmented document is then represented as a text vector according to the term index, the dimension of the text vector equaling the size of the selected term index, with each vector element valued by its tfidf weight. Once every document is expressed as a text vector, all documents in the corpus together constitute the original text vector space. The K-AP algorithm is run in the original text vector space to produce the specified K initial clusters; each document receives its initial class, and the initial cluster labels of all documents are collected into the initial class-membership indicator matrix.
More specifically, in step (1) each vector element is valued by its tfidf weight, computed as follows: for a term t_i in the term index and a document x_j, the tfidf weight is expressed as:

tfidf_{i,j} = tf_{i,j} × idf_i = tf_{i,j} × log(|D| / df_i);

where tf_{i,j} is the frequency with which term t_i occurs in document x_j, |D| is the number of documents in the text corpus, and df_i is the number of documents in which t_i occurs at least once. Given the term index v = [t_1, t_2, ..., t_m], document x_j can be expressed as the m-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{m,j}]^T.
Specifically, in step (2-1) the subspace projection matrix is solved with the ANMM-based subspace learning method as follows:

For a data point x_i in the text vector space, compute the distances between x_i and the other data points of the text data set, and, according to those distances and the class information of data point x_i, divide the other points into two subsets: the homogeneous neighborhood N_i^o, containing the ξ nearest neighbors that belong to the same class as x_i, and the heterogeneous neighborhood N_i^e, containing the ζ nearest neighbors that belong to classes different from x_i;

Compute the average between-class distance and average within-class distance of each data point x_i;

Compute the between-class scatter P and within-class scatter Q over all data points in the text vector space;

For all data points, under the constraint W^T W = I, maximize the average neighborhood margin function, i.e. minimize the average within-class distance while maximizing the average between-class distance; this yields the subspace projection matrix W.
Further, to keep the samples balanced, the number ξ of same-class nearest neighbors of a data point equals the number ζ of different-class nearest neighbors; the neighborhood range is chosen according to the situation of the text corpus.
Specifically, the average between-class distance P_i and average within-class distance Q_i of data point x_i are expressed as:

P_i = (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||²;

Q_i = (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||²;

Summed over all data points, the corresponding between-class and within-class scatter matrices P and Q are:

P = Σ_i (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} (x_i − x_p)(x_i − x_p)^T;

Q = Σ_i (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} (x_i − x_q)(x_i − x_q)^T;

where |·| denotes the number of data points contained in a set; note that tr(P) = Σ_i P_i and tr(Q) = Σ_i Q_i.
Specifically, the average neighborhood margin function is:

γ(W) = tr(W^T (P − Q) W);

The subspace projection must be solved under the constraint W^T W = I, i.e. by maximizing the objective function:

W = arg max_{W^T W = I} tr(W^T (P − Q) W).
Further, if the initial document vectors have m dimensions and each document vector is represented in l dimensions after subspace projection, then the columns of the subspace projection matrix W are the eigenvectors corresponding to the l largest eigenvalues obtained from the eigendecomposition of (P − Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l]; the subspace is then obtained as Y = W^T X.
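The projection step can be sketched as follows. This is an illustrative reading of the method, not the patent's reference implementation: the scatter matrices P and Q are accumulated from each point's heterogeneous and homogeneous neighborhoods, and W is taken from the top eigenvectors of P − Q.

```python
import numpy as np

def anmm_projection(X, labels, xi=2, zeta=2, l=2):
    """Sketch of the ANMM subspace solver. X is m x n (one column per
    document), labels are the current cluster assignments. Returns W
    (m x l), whose columns are the top-l eigenvectors of P - Q."""
    m, n = X.shape
    P = np.zeros((m, m))
    Q = np.zeros((m, m))
    for i in range(n):
        d = np.linalg.norm(X - X[:, [i]], axis=0)
        order = np.argsort(d)
        same = [j for j in order if j != i and labels[j] == labels[i]][:xi]
        diff = [j for j in order if labels[j] != labels[i]][:zeta]
        for p in diff:   # heterogeneous neighborhood -> between-class scatter P
            v = (X[:, i] - X[:, p])[:, None]
            P += v @ v.T / len(diff)
        for q in same:   # homogeneous neighborhood -> within-class scatter Q
            v = (X[:, i] - X[:, q])[:, None]
            Q += v @ v.T / len(same)
    w, V = np.linalg.eigh(P - Q)          # P - Q is symmetric, so eigh suffices
    W = V[:, np.argsort(w)[::-1][:l]]     # eigenvectors of the l largest eigenvalues
    return W

# Two well-separated 3-D clusters; only coordinate 0 carries class information
rng = np.random.default_rng(0)
A = rng.normal(0, 0.1, (3, 10)); A[0] += 5
B = rng.normal(0, 0.1, (3, 10))
X = np.hstack([A, B])
labels = [0] * 10 + [1] * 10
W = anmm_projection(X, labels, l=1)
Y = W.T @ X   # projection into the subspace, as in Y = W^T X
```

On this toy set the leading eigenvector of P − Q aligns with the only discriminative coordinate, which is exactly the behavior the margin criterion is after.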
Specifically, in steps (1) and (2-2), for a subspace Y = {y_1, ..., y_n} the affinity propagation clustering method produces K cluster classes as follows: find K genuine sample exemplars E = {e_1, ..., e_K} to represent K different classes C = {c_1, ..., c_K}, by maximizing the objective function:

max F({c_j}_{j=1}^K) = Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j);

where the class whose exemplar is e_j is labelled c_j, class c_j contains all data points that take e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j;

To solve this maximization problem, the representation B = {b_{ij} ∈ {0,1}, i, j = 1, ..., n} is introduced, converting it into a 0-1 integer programming problem with the objective:

max F({b_{ij}}) = Σ_{i=1}^{n} Σ_{j=1}^{n} b_{ij} s(y_i, y_j);

subject to the following three constraints:

b_{ii} = 1 if b_{ji} = 1;

Σ_{j=1}^{n} b_{ij} = 1;

Σ_{i=1}^{n} b_{ii} = K.
Preferably, the optimum of the above parameter B is obtained with the classical belief-propagation method; the computation can follow the article "K-AP: Generating Specified K Clusters by Efficient Affinity Propagation" published by Zhang et al. at the 2010 International Conference on Data Mining. The present invention applies this state-of-the-art technique to text clustering and obtains more reasonable clusters than traditional methods such as K-means. In the binary matrix B obtained after K-AP clustering, b_{ii} = 1 indicates that y_i is itself an exemplar and belongs to class c_i; b_{ij} = 1 (i ≠ j) indicates that the exemplar of y_i is y_j, so y_i and y_j belong to class c_j. B thus encodes the homogeneous and heterogeneous relations among the subspace samples, and since the subspace samples correspond to the samples of the text vector space, the class-membership indicator matrix of all documents can be updated accordingly.
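To make the role of B concrete, here is a small hypothetical sketch (not from the patent) that checks the three constraints on an indicator matrix and reads off the cluster assignment:

```python
import numpy as np

def decode_kap(B):
    """Read cluster assignments out of a K-AP indicator matrix B, checking
    the constraints: each row sums to 1 (one exemplar per point), a chosen
    exemplar is itself an exemplar (b_jj = 1 whenever b_ij = 1), and the
    number of exemplars on the diagonal is K."""
    B = np.asarray(B)
    n = B.shape[0]
    assert all(B[i].sum() == 1 for i in range(n)), "each point has exactly one exemplar"
    exemplars = [j for j in range(n) if B[j, j] == 1]
    for i in range(n):
        j = int(np.argmax(B[i]))
        assert B[j, j] == 1, "chosen exemplar must itself be an exemplar"
    labels = [exemplars.index(int(np.argmax(B[i]))) for i in range(n)]
    return labels, len(exemplars)

# 4 points, K = 2: points 0 and 1 pick exemplar 0; points 2 and 3 pick exemplar 3
B = [[1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1]]
labels, K = decode_kap(B)
```

The decoded label vector is what the patent collects into the class-membership indicator matrix after each clustering round.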
Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. Most prior text clustering methods project the high-dimensional text vector space into a subspace and then cluster within it, but such methods can hardly guarantee that the obtained subspace is optimal for clustering. The present invention fuses subspace solving and clustering into a unified framework and uses an iterative optimization strategy to ensure a globally optimal result, making the partition more reasonable and accurate.

2. The present invention solves the subspace projection with the criterion of maximizing the average neighborhood margin, avoiding the small-sample-size problem of traditional methods; it imposes no special requirement on the data distribution, effectively avoids combinatorial optimization problems, involves few tuning parameters, and has strong generalization and adaptive ability.

3. The present invention uses fast affinity propagation clustering (K-AP) to divide the text set into the K cluster classes specified by the user; experiments verify that the K-AP algorithm produces more reasonable cluster partitions in text clustering than traditional methods.
Description of the drawings
Fig. 1 is a flow diagram of the present invention;
Fig. 2(a)-(e) show the convergence-function value curves computed by the iterative clustering of the inventive method on NG20 and its sub-databases;
Fig. 3 shows the convergence-function value curve computed by the iterative clustering of the inventive method on the Classic3 database;
Fig. 4 shows the convergence-function value curve computed by the iterative clustering of the inventive method on the K1b database;
Fig. 5(a)-(e) compare the accuracy of the present invention and the LDA-Km algorithm after each iteration on NG20 and its sub-databases;
Fig. 6 compares the accuracy of the present invention and the LDA-Km algorithm after each iteration on the Classic3 database;
Fig. 7 compares the accuracy of the present invention and the LDA-Km algorithm after each iteration on the K1b database.
Detailed description of the invention
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
As shown in Fig. 1, the iterative text clustering method based on adaptive subspace learning comprises the following steps:

(1) Clustering initialization in the text vector space: from the segmented representation of all documents in the corpus, the mutual-information method selects a set of representative terms to form the term index; each document is then represented as a text vector according to the term index, the dimension of the text vector equaling the size of the selected term index, with each vector element valued by its tfidf weight. All documents in the corpus together constitute the original text vector space. The affinity propagation clustering algorithm (K-AP) is run in the original text vector space to produce the specified K initial clusters; each document receives its initial class, and the cluster labels of all documents are collected into the initial class-membership indicator matrix.

(2) Iterative optimization between subspace projection and clustering, comprising the following steps:

(2-1) Taking the initial class-membership indicator matrix obtained in step (1) as prior knowledge, solve the subspace projection matrix with the ANMM-based subspace learning method, and compute the convergence-function value from the subspace projection matrix and the initial class-membership indicator matrix;

(2-2) If the convergence condition is not met, project the original text vector space into the subspace by the current subspace projection matrix, run the K-AP algorithm in the subspace to produce the specified K clusters, and update the current class-membership indicator matrix;

(2-3) Taking the updated class-membership indicator matrix as prior knowledge, solve the subspace projection matrix with the ANMM-based subspace learning method, and compute the convergence-function value from the subspace projection matrix and the updated class-membership indicator matrix;

(2-4) Repeat steps (2-2)-(2-3) until the convergence condition is met; stop the iteration, output the final class-membership indicator matrix from the iterative process, and obtain the final cluster assignment of all documents.
The text data in this embodiment come from the 20Newsgroups (NG20), Classic3, and K1b corpora, respectively. The attributes of the text corpora used are shown in Table 1; all documents in the corpora have already been segmented.

Table 1: attributes of the text corpora in the embodiment
In step (1), the mutual-information method extracts m representative terms from the segmented document representations, forming the term index v = [t_1, t_2, ..., t_m]; each document of the corpus can then be expressed as an m-dimensional vector. Let x_j be the j-th document of the corpus; for term t_i, the corresponding vector element is valued by the tfidf weight:

tfidf_{i,j} = tf_{i,j} × idf_i = tf_{i,j} × log(|D| / df_i)    [1]

where tf_{i,j} is the frequency with which term t_i occurs in document x_j; in the computation of idf_i, |D| is the number of documents in the corpus and df_i is the number of documents in which term t_i occurs at least once. Document x_j can thus be represented in order as the m-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{m,j}]^T. Stacking the vector representations of all documents yields the text vector space, expressed as a matrix of size m × n: X_0 = [x_1, x_2, ..., x_n]. In this embodiment 2000 representative terms are extracted, so m = 2000. Taking the Binary corpus as an example, which contains 500 samples, the text vector space is expressed as a matrix of size 2000 × 500.
The iterative optimization between subspace projection and clustering in step (2) uses the class-membership indicator matrix as a bridge; this iterative process is the core method of the present invention and may be called adaptive subspace learning (ASL). The concrete iterative process is as follows:

Initialization: the text corpus is expressed as the original text vector space X_0; K-AP clustering on X_0 yields the initial class-membership indicator matrix L_0. L_0, which carries the text class information, enters step (2-1), where ANMM subspace projection yields the subspace projection matrix W_1, and the convergence-function value Score_1 is computed from L_0 and W_1.

Iteration 1: the original text vector space X_0 is projected by subspace projection matrix W_1 into subspace Y_1; K-AP clustering on Y_1 yields the class-membership indicator matrix L_1. L_1, which carries the text class information, enters step (2-3), where ANMM subspace projection yields W_2, and the convergence-function value Score_2 is computed from L_1 and W_2.

Iteration t: X_0 is projected by W_t into subspace Y_t; K-AP clustering on Y_t yields L_t. L_t enters step (2-3), where ANMM subspace projection yields W_{t+1}, and the convergence-function value Score_{t+1} is computed from L_t and W_{t+1}.

After each iteration above, the convergence-function value must be computed, as follows:

Score_{t+1} = tr(W_{t+1}^T (P(L_t) − Q(L_t)) W_{t+1})    [3]

where P(L_t) and Q(L_t) are the between-class and within-class scatter computed on the text vector space X_0 according to the class-membership indicator matrix L_t. If the set convergence condition Score_{t+1} − Score_t ≤ ε is met, or the set maximum number of iterations T is reached, the iteration exits and the final class-membership indicator matrix is obtained, i.e. the class of every document in the corpus, completing the text clustering.
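The stopping rule Score_{t+1} − Score_t ≤ ε (or reaching T iterations) can be sketched generically. In the sketch below, which is illustrative only, `step` and `score` stand in for the K-AP update and the ANMM convergence score; a toy contraction replaces the real update so the loop is runnable:

```python
def iterate_until_converged(step, score, init, eps=0.5, T=10):
    """Shell of steps (2-2)-(2-4): `step` maps the current state (class
    indicator plus projection) to the next one, `score` evaluates the
    convergence function Score_t; iteration stops once the score gain
    is at most eps, or after T rounds."""
    state = init
    prev = score(state)
    t = 0
    for t in range(1, T + 1):
        state = step(state)
        cur = score(state)
        if cur - prev <= eps:
            break
        prev = cur
    return state, t

# Toy stand-in: each round halves the remaining gap to a score of 10,
# so with eps = 0.5 the gain first drops below eps at iteration 5.
state, t = iterate_until_converged(step=lambda s: (s + 10) / 2,
                                   score=lambda s: s, init=0.0)
```

The cap T plays the same role as the patent's maximum iteration count: it bounds the loop even when the score keeps improving by more than ε.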
In step (2), whether on the original text vector space X_0 or on the subspaces obtained by projection (denoted {Y_1, ..., Y_t} during iteration), affinity propagation clustering (K-AP) must be performed to produce K clusters. Taking K-AP in a subspace as an example (sample space Y = {y_1, ..., y_n}), the idea of K-AP is to find K genuine sample exemplars E = {e_1, ..., e_K} to represent K different classes C = {c_1, ..., c_K}, by maximizing the objective function:

max F({c_j}_{j=1}^K) = Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j)    [4]

where the class whose exemplar is e_j is labelled c_j, class c_j contains all data points that take e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j. To solve this maximization problem, B = {b_{ij} ∈ {0,1}, i, j = 1, ..., n} is introduced, converting it into a 0-1 integer programming problem whose objective becomes:

max F({b_{ij}}) = Σ_{i=1}^{n} Σ_{j=1}^{n} b_{ij} s(y_i, y_j)    [5]

subject to the following three constraints:

b_{ii} = 1 if b_{ji} = 1

Σ_{j=1}^{n} b_{ij} = 1    [6]

Σ_{i=1}^{n} b_{ii} = K

The first constraint states that if y_j elects y_i as its exemplar, y_i must itself be an exemplar; the second states that every data point y_i has exactly one exemplar; the third states that the number of exemplars must be K, guaranteeing that the K-AP method produces the K clusters specified by the user.

The maximization problem under the above constraints can be expressed with a factor graph, and the optimum of B can be obtained by belief-propagation inference; the computation can follow the article "K-AP: Generating Specified K Clusters by Efficient Affinity Propagation" published by Zhang et al. at the 2010 International Conference on Data Mining. The matrix B describes the homogeneous and heterogeneous relations among the samples, from which the cluster class of every document in the corpus is obtained.
In steps (1) and (2), every document obtains its class information through the K-AP clustering algorithm; this information is collected into the class-membership indicator matrix L (denoted {L_0, L_1, ..., L_t} during iteration). The matrix L has size n × K, where n is the number of documents in the corpus and K is the number of classes produced by clustering. If the j-th document belongs to the k-th class, then L_{jk} = 1; otherwise L_{jk} = 0.
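A one-hot indicator matrix of this shape can be built directly from the label vector; a minimal sketch (illustrative, not from the patent):

```python
import numpy as np

def class_indicator(labels, K):
    """Build the n x K class-membership indicator matrix L:
    L[j, k] = 1 iff document j belongs to cluster k, else 0."""
    n = len(labels)
    L = np.zeros((n, K), dtype=int)
    L[np.arange(n), labels] = 1
    return L

L = class_indicator([0, 2, 1, 0], K=3)
```

Each row sums to 1, reflecting that every document belongs to exactly one cluster at any point of the iteration.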
In step (2), taking the document class-membership indicator matrix L as prior knowledge, the subspace is solved with the subspace learning method based on average neighborhood margin maximization (ANMM), as follows:

First, the between-class and within-class scatter of all data points is computed. Let x_i be a data point of the original text vector space whose class information is read from the class-membership indicator matrix L. Compute the distances between x_i and the other data points and, according to those distances and the points' cluster classes, divide them into two subsets: the homogeneous neighborhood N_i^o, containing the ξ nearest neighbors of the same class as x_i, and the heterogeneous neighborhood N_i^e, containing the ζ nearest neighbors of classes different from x_i.

The average between-class distance and average within-class distance of data point x_i are then expressed as:

P_i = (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||²    [7]

Q_i = (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||²    [8]

Over all data points, the corresponding scatter matrices are:

P = Σ_i (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} (x_i − x_p)(x_i − x_p)^T    [9]

Q = Σ_i (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} (x_i − x_q)(x_i − x_q)^T    [10]

where |·| denotes the number of data points contained in a subset.

Next, when x_i is projected into the subspace, i.e. y_i = W^T x_i, the average within-class distance must be minimized while the average between-class distance is maximized; therefore, over all data points, the average neighborhood margin function must be maximized:

γ(W) = tr(W^T (P − Q) W)    [11]

The subspace projection must be solved under the constraint W^T W = I, i.e. by maximizing the objective function:

W = arg max_{W^T W = I} tr(W^T (P − Q) W)    [12]

where P and Q are computed by formulas [9] and [10]. Suppose the vector representation of each document in the corpus has m dimensions and each document vector has l dimensions after subspace projection; then the columns of the projection matrix W are the eigenvectors corresponding to the l largest eigenvalues obtained from the eigendecomposition of (P − Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l]. In this embodiment, the initial document vector dimension is m = 2000. The subspace dimension l can be set to a fixed constant or changed dynamically according to the eigenvalue distribution. This embodiment keeps the eigenvectors whose eigenvalues exceed 10^-5; taking the Binary corpus as an example, 1080 eigenvalues greater than 10^-5 are obtained after the eigendecomposition of (P − Q), so l = 1080, and each initial document vector x_i of size 2000 × 1 is projected into a low-dimensional vector y_i of size 1080 × 1.
In this example, let α_i be the correct class label of document d_i in the text corpus and β_i the class label obtained by text clustering for document d_i. For a text corpus of n samples, the measure of clustering accuracy is:

Accuracy = ( Σ_{i=1}^{n} δ(α_i, map(β_i)) ) / n    [13]

where δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise; the optimal map function can be found by the classical maximum-weight bipartite matching method.
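Formula [13] can be sketched as follows. Since K is small in the embodiment, an exact search over label permutations stands in here for the maximum-weight bipartite matching the text names (a simplification for brevity, not the patent's procedure):

```python
from itertools import permutations

def clustering_accuracy(true, pred, K):
    """Accuracy of formula [13]: find the label permutation map() that best
    aligns the cluster labels with the true classes, then count matches."""
    n = len(true)
    best = 0
    for perm in permutations(range(K)):
        hits = sum(1 for a, b in zip(true, pred) if a == perm[b])
        best = max(best, hits)
    return best / n

# Labels permuted but otherwise perfect -> accuracy 1.0
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], K=3)
# One document misclustered out of four -> accuracy 0.75
acc_partial = clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 0], K=2)
```

For large K the exhaustive search should be replaced by the Hungarian algorithm, which is what the bipartite-matching formulation in the text implies.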
The following parameters must be determined in the present invention: the number ξ of same-class nearest neighbors, the number ζ of different-class nearest neighbors, and the maximum number of iterations T of the algorithm. To keep the samples balanced, this embodiment generally takes ξ = ζ. The values of ξ and ζ are selected from the combinations commonly used in local feature extraction methods, {5, 10, 15, 20}; each combination is tested on the different text corpora, and the resulting clustering accuracy is shown in Table 2:
Table 2: clustering accuracy for different selections of parameters ξ and ζ
Binary Multi5 Multi10 NG10 NG20 Classic3 K1b
ξ=ζ=5 0.880 0.848 0.556 0.683 0.539 0.977 0.732
ξ=ζ=10 0.920 0.906 0.604 0.713 0.580 0.989 0.814
ξ=ζ=15 0.906 0.890 0.574 0.711 0.550 0.989 0.822
ξ=ζ=20 0.894 0.872 0.568 0.703 0.546 0.986 0.797
This embodiment uses the best values from the table above, ξ = ζ = 10. The selection of the maximum number of iterations T relates to the computation of the convergence-function value in formula [3]. Figs. 2, 3, and 4 show the convergence-function value curves computed after each iteration of the adaptive subspace learning method (ASL) of the present invention on NG20 and its sub-databases, on Classic3, and on K1b, respectively. It can be seen that from the 5th iteration onward the change in the convergence-function value is relatively small and the algorithm tends to converge; the maximum number of iterations of ASL is therefore set to T = 10, ensuring a sufficient number of iterations.
To test the effectiveness of the proposed algorithm, Table 3 compares the clustering performance of the ASL method of the present invention against other methods on the same text corpora.
Table 3: Comparison of clustering performance of different methods on the same text corpora

Corpus     NMF      LPI      ASI      LDA-Km   ASL
Binary     0.864    0.872    0.898    0.906    0.920
Multi5     0.818    0.830    0.870    0.882    0.906
Multi10    0.476    0.494    0.558    0.566    0.604
NG10       0.625    0.653    0.662    0.671    0.713
NG20       0.529    0.532    0.551    0.558    0.580
Classic3   0.963    0.972    0.980    0.984    0.989
Yahoo      0.722    0.760    0.802    0.805    0.814
In the table above, non-negative matrix factorization (NMF) and locality preserving indexing (LPI) are recently developed dimensionality-reduction techniques that project the original text vector space into a subspace and then cluster in the projection subspace; ASI, LDA-Km and ASL instead treat dimensionality reduction as a dynamic process integrated with clustering. The results show that dynamically obtaining a good low-dimensional subspace can effectively improve text clustering performance.
Since LDA-Km reduces to ASI when only the within-class data distribution is considered, ASI is a special case of LDA-Km. To illustrate which combination of subspace projection and clustering is optimal, this embodiment focuses on comparing the two methods LDA-Km and ASL:
Figs. 5, 6 and 7 compare the clustering accuracy of LDA-Km and ASL after each iteration on the NG20 corpus (and its sub-databases), Classic3 and K1b. The initialization is marked as t0; its purpose is to provide initial sample class information for ANMM or LDA in the subspace projection. The t1 iteration is equivalent to the conventional practice of clustering in a fixed subspace once that subspace is obtained. From t2 to t10, both LDA-Km and ASL improve clustering performance in an iterative manner, but at the same iteration count ASL reaches a relatively high clustering accuracy more easily and more stably. This indicates that, under identical conditions, ASL has a stronger adaptive subspace-learning ability than LDA-Km.
The above example demonstrates that, compared with traditional methods and schemes, the iterative method based on adaptive subspace learning solves the text clustering problem more effectively in terms of both performance and efficiency, reaching a practically usable level.
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent substitute and fall within the protection scope of the present invention.

Claims (8)

1. An iterative text clustering method based on adaptive subspace learning, characterized in that it comprises the following steps:
(1) Initialization: the text corpus is expressed in the mathematical form of a text vector space, the affinity propagation clustering method is applied in the text vector space to produce K initial clusters, and an initial class-membership indicator matrix representing the category of every document in the corpus is thereby obtained;
(2) Iterative optimization between subspace projection and clustering, comprising the following steps:
(2-1) taking the initial class-membership indicator matrix obtained in step (1) as prior knowledge, solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization, and computing the convergence function value from the initial class-membership indicator matrix and the subspace projection matrix;
(2-2) if the convergence condition is not met, projecting the original text vector space into the subspace according to the current subspace projection matrix, applying the affinity propagation clustering method in the subspace to produce the specified K clusters, and updating the current class-membership indicator matrix;
(2-3) taking the updated class-membership indicator matrix as prior knowledge, solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization, and computing the convergence function value from the updated class-membership indicator matrix and the subspace projection matrix;
(2-4) repeating steps (2-2)-(2-3) until the convergence condition is met, then stopping the iteration, outputting the final class-membership indicator matrix from the iterative process, and obtaining the final clustering result for all documents;
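The alternation of steps (2-1)-(2-4) can be sketched as the following skeleton, for illustration only: `project` and `cluster` are hypothetical stand-ins for the ANMM solver and the affinity propagation step of the claims, and the convergence value used here is an assumed surrogate objective, not the patented convergence function of formula [3]:

```python
import numpy as np

def asl_iterate(X, project, cluster, K, T=10, tol=1e-4):
    """Skeleton of step (2): alternate subspace projection and clustering.
    X has one document vector per row; `project(X, labels)` returns a
    projection matrix W, `cluster(Y, K)` returns K-cluster labels."""
    labels = cluster(X, K)                 # step (1): initial clusters
    prev = None
    for t in range(T):
        W = project(X, labels)             # (2-1)/(2-3): solve projection
        Y = X @ W                          # project into the subspace
        labels = cluster(Y, K)             # (2-2): re-cluster in subspace
        # Surrogate convergence value: projected variance (an assumption,
        # standing in for the convergence function of the description).
        value = float(np.trace(W.T @ np.cov(X.T) @ W))
        if prev is not None and abs(value - prev) < tol:
            break                          # (2-4): convergence reached
        prev = value
    return labels
```

Note that the document writes the projection as Y = W^T X with documents as columns; the sketch uses row vectors, so the same projection appears as X @ W.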
In said step (2-1), the steps of solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization are:
For a data point x_i in the text vector space, compute the distances between the other data points and x_i and, according to these distances and the class information of x_i, divide them into the following two subsets: N_i^o, the homogeneous neighborhood set, containing the ξ nearest neighbors belonging to the same class as x_i; and N_i^e, the heterogeneous neighborhood set, containing the ζ nearest neighbors belonging to a different class from x_i;
Compute the average between-class distance and the average within-class distance of data point x_i respectively;
Compute the average between-class distance P and the average within-class distance Q over all data points in the text vector space;
For all data points, maximize the average neighborhood margin function under the constraint W^T W = I, i.e. minimize the average within-class distance while maximizing the average between-class distance, thereby obtaining the subspace projection matrix W;
The average neighborhood margin function is:
γ = Σ_i ( Σ_{p: x_p ∈ N_i^e} ||y_i − y_p||² / |N_i^e| − Σ_{q: x_q ∈ N_i^o} ||y_i − y_q||² / |N_i^o| );
The subspace projection must be solved under the constraint W^T W = I, i.e. by maximizing the following objective function:
max_W Σ_i ( Σ_{p: x_p ∈ N_i^e} ||W^T x_i − W^T x_p||² / |N_i^e| − Σ_{q: x_q ∈ N_i^o} ||W^T x_i − W^T x_q||² / |N_i^o| );
where y_i = W^T x_i, i.e. the text vector obtained by projecting data point x_i into the subspace through the subspace projection matrix W, and y_p and y_q respectively denote the projected text vectors of data points from the heterogeneous and homogeneous neighborhood sets.
2. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the initialization process of step (1) is as follows: the mutual information method is used to select a set of representative terms from the word-segmented representations of all documents to constitute a term index; each document is then represented as a text vector according to the term index, the dimensionality of each text vector corresponding to the size of the selected term index and each element of the vector being represented by a tfidf weight; once every document is represented as a text vector, all documents in the corpus constitute a text vector space; the affinity propagation clustering algorithm is applied in the original text vector space to produce the specified K initial clusters, each document obtains its initial class, and the initial cluster categories of all documents are collected to form the initial class-membership indicator matrix.
3. The iterative text clustering method based on adaptive subspace learning according to claim 2, characterized in that in said step (1) each element of the vector is represented by a tfidf weight, computed as follows:
For a term t_i in the term index, its tfidf weight with respect to document x_j is expressed as:
tfidf_{i,j} = tf_{i,j} × idf_i = tf_{i,j} × log(|D| / df_i);
where tf_{i,j} denotes the frequency with which term t_i occurs in document x_j, |D| is the number of all documents in the corpus, and df_i is the number of documents in which term t_i occurs at least once; assuming the term index is {t_1, t_2, ..., t_M}, document x_j is expressed as the M-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{M,j}]^T.
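A minimal sketch of the tfidf representation above, assuming documents are already tokenized into term lists; the function and variable names are illustrative only:

```python
import math
from collections import Counter

def tfidf_vectors(docs, term_index):
    """Represent each tokenized document as an M-dimensional vector with
    elements tfidf_{i,j} = tf_{i,j} * log(|D| / df_i)."""
    D = len(docs)
    # df_i: number of documents containing term t_i at least once
    df = {t: sum(1 for d in docs if t in d) for t in term_index}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequencies tf_{i,j}
        vectors.append([tf[t] * math.log(D / df[t]) if df[t] else 0.0
                        for t in term_index])
    return vectors
```

In the claimed method the term index itself comes from mutual-information selection; here it is simply passed in.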
4. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the number ξ of same-class nearest neighbors of a data point is taken equal to the number ζ of nearest neighbors belonging to a different class from that data point, the selected neighborhood range depending on the situation of the text corpus.
5. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the average between-class distance and average within-class distance of said data point x_i are expressed as:
P_i = Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||² / |N_i^e| ;
Q_i = Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||² / |N_i^o| ;
The average between-class distance P and average within-class distance Q of all data points are expressed as:
P = Σ_i Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||² / |N_i^e| ;
Q = Σ_i Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||² / |N_i^o| ;
where | · | denotes the number of data points contained in a set.
6. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that, if the initial vector of a document is m-dimensional and each document vector is l-dimensional after subspace projection, then each column of the subspace projection matrix W is an eigenvector corresponding to one of the largest l eigenvalues obtained by singular value decomposition of (P − Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l]; the subspace is then obtained as Y = W^T X.
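Claim 6 decomposes (P − Q), which presupposes the scatter-matrix (outer-product) form of these quantities; the sketch below makes that assumption explicit, so the scalar average distances of claim 5 become tr(W^T P W) and tr(W^T Q W) after projection. This is an illustrative sketch under that assumption, not the patented implementation; the function name and toy neighbor counts are hypothetical:

```python
import numpy as np

def anmm_projection(X, labels, xi=2, zeta=2, l=2):
    """Sketch of the W of claim 6: accumulate heterogeneous (N_i^e -> P)
    and homogeneous (N_i^o -> Q) neighborhood scatter matrices, then take
    the top-l eigenvectors of (P - Q). X has one sample per row."""
    n, m = X.shape
    P = np.zeros((m, m))
    Q = np.zeros((m, m))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        order = np.argsort(d)[1:]                  # neighbors, excluding x_i itself
        same = [j for j in order if labels[j] == labels[i]][:xi]
        diff = [j for j in order if labels[j] != labels[i]][:zeta]
        for j in diff:                             # heterogeneous set N_i^e
            v = (X[i] - X[j])[:, None]
            P += v @ v.T / len(diff)
        for j in same:                             # homogeneous set N_i^o
            v = (X[i] - X[j])[:, None]
            Q += v @ v.T / len(same)
    eigvals, eigvecs = np.linalg.eigh(P - Q)       # symmetric, so eigh suffices
    W = eigvecs[:, np.argsort(eigvals)[::-1][:l]]  # largest l eigenvalues
    return W
```

The sketch assumes every point has at least one same-class and one different-class neighbor; a robust version would guard the divisions.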
7. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that in said steps (1) and (2-2), for the subspace Y = {y_1, ..., y_n}, the method by which the affinity propagation clustering algorithm produces the specified K clusters is as follows: find K representative exemplars E = {e_1, ..., e_K} to represent the K different classes C = {c_1, ..., c_K}, thereby maximizing the following objective function:
max F({c_j}_{j=1..K}) = Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j);
where the class with e_j as its exemplar is labeled c_j, class c_j contains all data points taking e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j;
In solving the above maximization problem, the representation B = {b_ij ∈ {0, 1}, i, j = 1, ..., n} is introduced, i.e. the problem is transformed into a 0-1 integer programming problem, and the above objective function becomes:
max F({b_ij}) = Σ_{i=1}^{n} Σ_{j=1}^{n} b_ij s(y_i, y_j);
This objective function is subject to the following three constraints:
b_ii = 1 if b_ji = 1;
Σ_{j=1}^{n} b_ij = 1;
Σ_{i=1}^{n} b_ii = K;
By solving the above programming problem and outputting the parameter B, the same-class and different-class relationships between samples can be derived: if y_i elects y_j as its exemplar, then b_ij = 1, otherwise b_ij = 0; if y_i is itself an exemplar, then b_ii = 1, otherwise b_ii = 0.
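On very small inputs, the 0-1 program above can be sketched by brute force over exemplar subsets; this stands in for the belief-propagation solver of claim 8 and is illustrative only (the function name is hypothetical):

```python
from itertools import combinations

def affinity_k_clusters(S, K):
    """Brute-force sketch of the 0-1 program: choose K exemplars, let every
    point elect its most similar exemplar (b_ij = 1), exemplars elect
    themselves (b_ii = 1), and keep the choice maximizing
    F = sum_ij b_ij * s(y_i, y_j). The three constraints hold by
    construction."""
    n = len(S)
    best_F, best_B = float("-inf"), None
    for exemplars in combinations(range(n), K):
        B = [[0] * n for _ in range(n)]
        F = 0.0
        for i in range(n):
            # b_ij = 1 for the most similar exemplar; exemplars pick themselves
            j = i if i in exemplars else max(exemplars, key=lambda e: S[i][e])
            B[i][j] = 1
            F += S[i][j]
        if F > best_F:
            best_F, best_B = F, B
    return best_B, best_F
```

Enumerating exemplar subsets is exponential in K; the belief-propagation method of claim 8 is what makes the problem tractable at realistic sizes.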
8. The iterative text clustering method based on adaptive subspace learning according to claim 7, characterized in that the optimal solution of the parameter B is obtained by the conventional belief propagation method.
CN201310230981.4A 2013-06-09 2013-06-09 Iteration Text Clustering Method based on self adaptation sub-space learning Active CN103279556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310230981.4A CN103279556B (en) 2013-06-09 2013-06-09 Iteration Text Clustering Method based on self adaptation sub-space learning


Publications (2)

Publication Number Publication Date
CN103279556A CN103279556A (en) 2013-09-04
CN103279556B true CN103279556B (en) 2016-08-24

Family

ID=49062075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310230981.4A Active CN103279556B (en) 2013-06-09 2013-06-09 Iteration Text Clustering Method based on self adaptation sub-space learning

Country Status (1)

Country Link
CN (1) CN103279556B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886072B (en) * 2014-03-24 2016-08-24 河南理工大学 Search result clustering system in the search engine of colliery
CN105095275B (en) * 2014-05-13 2019-04-05 中国科学院自动化研究所 The method and device of clustering documents
CN104573710B (en) * 2014-12-25 2018-11-13 北京交通大学 A kind of Subspace clustering method smoothly characterized certainly based on latent space
CN105139031A (en) * 2015-08-21 2015-12-09 天津中科智能识别产业技术研究院有限公司 Data processing method based on subspace clustering
CN106294733B (en) * 2016-08-10 2019-05-07 成都轻车快马网络科技有限公司 Page detection method based on text analyzing
CN107203625B (en) * 2017-05-26 2020-03-20 北京邮电大学 Palace clothing text clustering method and device
CN108536844B (en) * 2018-04-13 2021-09-03 吉林大学 Text-enhanced network representation learning method
CN110727769B (en) 2018-06-29 2024-04-19 阿里巴巴(中国)有限公司 Corpus generation method and device and man-machine interaction processing method and device
CN109145976A (en) * 2018-08-14 2019-01-04 聚时科技(上海)有限公司 A kind of multiple view cluster machine learning method based on optimal neighbours' core
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN110135499A (en) * 2019-05-16 2019-08-16 北京工业大学 Clustering method based on the study of manifold spatially adaptive Neighborhood Graph
CN111159337A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Chemical expression extraction method, device and equipment
CN111966579A (en) * 2020-07-24 2020-11-20 复旦大学 Self-adaptive text input generation method based on natural language processing and machine learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214181A (en) * 2010-04-12 2011-10-12 无锡科利德斯科技有限公司 Fuzzy evolution calculation-based text clustering method
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Document Clustering via Adaptive Subspace Iteration;Tao Li etc.;《SIGIR 2004》;20040729;218-225 *

Also Published As

Publication number Publication date
CN103279556A (en) 2013-09-04

Similar Documents

Publication Publication Date Title
CN103279556B (en) Iteration Text Clustering Method based on self adaptation sub-space learning
Jia et al. Bagging-based spectral clustering ensemble selection
Guan et al. Text clustering with seeds affinity propagation
Rubin et al. Statistical topic models for multi-label document classification
CN101201894B (en) Method for recognizing human face from commercial human face database based on gridding computing technology
Popat et al. Hierarchical document clustering based on cosine similarity measure
CN103488662A (en) Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN110674407A (en) Hybrid recommendation method based on graph convolution neural network
CN102929894A (en) Online clustering visualization method of text
CN104881689A (en) Method and system for multi-label active learning classification
CN109784405A (en) Cross-module state search method and system based on pseudo label study and semantic consistency
CN104699698A (en) Graph query processing method based on massive data
Sánchez et al. Efficient algorithms for a robust modularity-driven clustering of attributed graphs
CN110364264A (en) Medical data collection feature dimension reduction method based on sub-space learning
CN105335510A (en) Text data efficient searching method
CN109739984A (en) A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform
Chen et al. Clustering and ranking in heterogeneous information networks via gamma-poisson model
CN106971005A (en) Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment
CN103218419B (en) Web tab clustering method and system
CN105160046A (en) Text-based data retrieval method
Maudes et al. Random projections for linear SVM ensembles
Mei et al. Proximity-based k-partitions clustering with ranking for document categorization and analysis
Cobos et al. Clustering of web search results based on an Iterative Fuzzy C-means Algorithm and Bayesian Information Criterion
CN105787072A (en) Field knowledge extracting and pushing method oriented to progress

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180110

Address after: 510601, fifteenth floor, No. 289, Guangzhou Avenue, Yuexiu District, Guangzhou, Guangdong

Patentee after: Guangdong Southern Newspaper Media Group New Media Co., Ltd.

Address before: 510601 Guangzhou Avenue, Yuexiu District, Guangzhou, Guangdong Province, No. 289

Patentee before: Nanfang Daily Group

TR01 Transfer of patent right