CN103279556B - Iterative text clustering method based on adaptive subspace learning - Google Patents

Iterative text clustering method based on adaptive subspace learning


Publication number
CN103279556B
CN103279556B (application CN201310230981.4A)
Authority
CN
China
Prior art keywords
text
subspace
iteration
cluster
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310230981.4A
Other languages
Chinese (zh)
Other versions
CN103279556A (en)
Inventor
吴娴
杨兴锋
张东明
何崑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Southern Newspaper Media Group New Media Co., Ltd.
Original Assignee
NANFANG DAILY GROUP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANFANG DAILY GROUP
Priority to CN201310230981.4A
Publication of CN103279556A
Application granted
Publication of CN103279556B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an iterative text clustering method based on adaptive subspace learning, comprising the following steps. (1) Initialization: the text corpus is expressed as a text vector space; the affinity propagation clustering method (K-AP) produces K initial clusters, and the cluster assignments of all texts are recorded in an initial class-membership indicator matrix. (2) Iteration between subspace projection and clustering: taking the current class-membership indicator matrix as prior knowledge, the subspace projection matrix is solved with the objective of maximizing the average neighborhood margin; the text vector space is projected into the subspace, where the affinity propagation clustering method again produces K clusters, updating the class-membership indicator matrix. A convergence-function value is computed from the subspace projection matrix and the class-membership indicator matrix; when the function converges, the iteration exits and the text clustering is complete. The present invention places no restriction on the size or distribution of the text data; subspace solving and clustering are fused in a unified framework, and the iterative strategy yields a globally optimal clustering result.

Description

Iterative text clustering method based on adaptive subspace learning
Technical field
The present invention relates to the fields of machine learning and pattern recognition, and in particular to an iterative text clustering method based on adaptive subspace learning. The method builds on an adaptive subspace learning technique that maximizes the average neighborhood margin and applies it to the text clustering problem through an iterative strategy.
Background technology
With the spread and development of Internet and database technology, people can easily acquire and store large amounts of data, much of which exists in the form of text. Text clustering, as a means of organizing, summarizing, and navigating textual information, helps users accurately locate the information they need within vast text resources, and has therefore attracted wide attention.
In text clustering, documents are commonly represented with the vector space model (VSM), but this representation suffers from the high dimensionality and sparsity of the feature space. In this situation, the large amount of redundancy among uncorrelated dimensions weakens the discriminative power of the similarity measure used by the clustering algorithm, degrading the final clustering performance. The usual remedy is to project the text vector space into a subspace with a dimensionality-reduction technique and then partition the low-dimensional document representations into classes with a clustering algorithm. However, such conventional text clustering methods treat dimensionality reduction merely as a preprocessing step for the clustering algorithm, severing the latent connection between subspace projection and clustering; it is therefore difficult to guarantee that the projected subspace is optimal for clustering.
To overcome this limitation, C. Ding et al. published the article "Adaptive dimension reduction for clustering high dimensional data" at the 2002 International Conference on Data Mining (ICDM), proposing the concept of adaptive dimension reduction (ADR), which treats dimensionality reduction as a dynamic process integrated with the clustering process. Owing to its practicality and flexibility, ADR has since developed into several variants:
Li et al. published the article "Document clustering via adaptive subspace iteration" at the 2004 conference of the ACM Special Interest Group on Information Retrieval (ACM SIGIR), completing category partitioning and subspace identification as two simultaneous tasks. However, the subspace identification adopted in that work must model each cluster class separately, turning the solution into a combinatorial optimization problem; in addition, the interdependence of the initialization matrices easily affects the final clustering performance.
Ding and Li published the article "Adaptive dimension reduction using discriminant analysis and K-means clustering" at the 2007 International Conference on Machine Learning (ICML), integrating linear discriminant analysis (LDA) and K-means clustering into the LDA-Km framework for solving the text clustering problem. However, subspace projection based on LDA is prone to the small-sample-size problem and is only relatively effective for text data with a Gaussian distribution.
Ye et al. published the article "Adaptive distance metric learning for clustering" at the 2007 IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), proposing the nonlinear adaptive metric learning (NAML) algorithm, which simultaneously casts kernel learning, dimensionality reduction, and clustering as a matrix-trace optimization problem and solves for the clustering result iteratively under the framework of expectation maximization (EM). The limitation of NAML is that its optimization process depends on several key parameters and easily leads to overfitting when the data are insufficient.
Although the idea of adaptive dimension reduction and its related methods can solve particular text clustering problems, the technical deficiencies pointed out above also exist, limiting their range of application and leaving room for improving text clustering algorithms. Studying a text clustering method with stronger generalization and adaptive ability has therefore become a topic of practical significance.
Summary of the invention
The primary objective of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an iterative text clustering method based on adaptive subspace learning. The method fuses subspace solving and clustering into a unified framework and uses an iterative optimization strategy to obtain a globally optimal result. It places no special constraints on the quantity or distribution of the text data, effectively avoids combinatorial optimization problems, and involves few tuning parameters; it therefore has strong generalization and adaptive ability and produces more reasonable partitions.
The objective of the present invention is achieved through the following technical solution. The iterative text clustering method based on adaptive subspace learning comprises the following steps:
(1) Initialization: the text corpus is expressed mathematically as a text vector space; the K-Affinity Propagation (K-AP) clustering method is applied in the text vector space to produce K initial clusters, yielding an initial class-membership indicator matrix that records the class of every document in the corpus;
(2) Iterative optimization between subspace projection and clustering, comprising the following steps:
(2-1) Taking the initial class-membership indicator matrix obtained in step (1) as prior knowledge, solve the subspace projection matrix with the subspace learning method based on Average Neighborhood Margin Maximization (ANMM), and compute the convergence-function value from the initial class-membership indicator matrix and the subspace projection matrix;
(2-2) If the convergence condition is not met, project the original text vector space into the subspace according to the current subspace projection matrix, run the K-AP algorithm in the subspace to produce the specified K clusters, and update the current class-membership indicator matrix;
(2-3) Taking the updated class-membership indicator matrix as prior knowledge, solve the subspace projection matrix again with the ANMM-based subspace learning method, and compute the convergence-function value from the updated class-membership indicator matrix and the subspace projection matrix;
(2-4) Repeat steps (2-2)-(2-3) until the convergence condition is met; stop the iteration, output the final class-membership indicator matrix from the iterative process, and obtain the final cluster assignment of all documents.
Specifically, the initialization in step (1) proceeds as follows: from the segmented representation of all documents in the text corpus, a set of representative terms is selected with the mutual-information method to form the term index; each segmented document is then represented as a text vector according to the term index, the dimension of the text vector equaling the size of the selected term index, with each vector element valued by its tfidf weight. Once every document is expressed as a text vector, all documents in the corpus together constitute the original text vector space. The K-AP algorithm is run in the original text vector space to produce the specified K initial clusters; each document receives its initial class, and the initial cluster labels of all documents are collected into the initial class-membership indicator matrix.
More specifically, in step (1) each vector element is valued by its tfidf weight, computed as follows: for a term t_i in the term index and a document x_j, the tfidf weight is expressed as:

tfidf_{i,j} = tf_{i,j} × idf_i = tf_{i,j} × log(|D| / df_i);

where tf_{i,j} is the frequency with which term t_i occurs in document x_j, |D| is the number of documents in the text corpus, and df_i is the number of documents in which t_i occurs at least once. Given the term index v = [t_1, t_2, ..., t_m], document x_j can be expressed as the m-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{m,j}]^T.
Specifically, in step (2-1) the subspace projection matrix is solved with the ANMM-based subspace learning method as follows:

For a data point x_i in the text vector space, compute the distances between x_i and the other data points of the text data set, and, according to those distances and the class information of data point x_i, divide the other points into two subsets: the homogeneous neighborhood N_i^o, containing the ξ nearest neighbors that belong to the same class as x_i, and the heterogeneous neighborhood N_i^e, containing the ζ nearest neighbors that belong to classes different from x_i;

Compute the average between-class distance and average within-class distance of each data point x_i;

Compute the between-class scatter P and within-class scatter Q over all data points in the text vector space;

For all data points, under the constraint W^T W = I, maximize the average neighborhood margin function, i.e. minimize the average within-class distance while maximizing the average between-class distance; this yields the subspace projection matrix W.
Further, to keep the samples balanced, the number ξ of same-class nearest neighbors of a data point equals the number ζ of different-class nearest neighbors; the neighborhood range is chosen according to the situation of the text corpus.
Specifically, the average between-class distance P_i and average within-class distance Q_i of data point x_i are expressed as:

P_i = (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||²;

Q_i = (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||²;

Summed over all data points, the corresponding between-class and within-class scatter matrices P and Q are:

P = Σ_i (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} (x_i − x_p)(x_i − x_p)^T;

Q = Σ_i (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} (x_i − x_q)(x_i − x_q)^T;

where |·| denotes the number of data points contained in a set; note that tr(P) = Σ_i P_i and tr(Q) = Σ_i Q_i.
Specifically, the average neighborhood margin function is:

γ(W) = tr(W^T (P − Q) W);

The subspace projection must be solved under the constraint W^T W = I, i.e. by maximizing the objective function:

W = arg max_{W^T W = I} tr(W^T (P − Q) W).
Further, if the initial document vectors have m dimensions and each document vector is represented in l dimensions after subspace projection, then the columns of the subspace projection matrix W are the eigenvectors corresponding to the l largest eigenvalues obtained from the eigendecomposition of (P − Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l]; the subspace is then obtained as Y = W^T X.
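The projection step can be sketched as follows. This is an illustrative reading of the method, not the patent's reference implementation: the scatter matrices P and Q are accumulated from each point's heterogeneous and homogeneous neighborhoods, and W is taken from the top eigenvectors of P − Q.

```python
import numpy as np

def anmm_projection(X, labels, xi=2, zeta=2, l=2):
    """Sketch of the ANMM subspace solver. X is m x n (one column per
    document), labels are the current cluster assignments. Returns W
    (m x l), whose columns are the top-l eigenvectors of P - Q."""
    m, n = X.shape
    P = np.zeros((m, m))
    Q = np.zeros((m, m))
    for i in range(n):
        d = np.linalg.norm(X - X[:, [i]], axis=0)
        order = np.argsort(d)
        same = [j for j in order if j != i and labels[j] == labels[i]][:xi]
        diff = [j for j in order if labels[j] != labels[i]][:zeta]
        for p in diff:   # heterogeneous neighborhood -> between-class scatter P
            v = (X[:, i] - X[:, p])[:, None]
            P += v @ v.T / len(diff)
        for q in same:   # homogeneous neighborhood -> within-class scatter Q
            v = (X[:, i] - X[:, q])[:, None]
            Q += v @ v.T / len(same)
    w, V = np.linalg.eigh(P - Q)          # P - Q is symmetric, so eigh suffices
    W = V[:, np.argsort(w)[::-1][:l]]     # eigenvectors of the l largest eigenvalues
    return W

# Two well-separated 3-D clusters; only coordinate 0 carries class information
rng = np.random.default_rng(0)
A = rng.normal(0, 0.1, (3, 10)); A[0] += 5
B = rng.normal(0, 0.1, (3, 10))
X = np.hstack([A, B])
labels = [0] * 10 + [1] * 10
W = anmm_projection(X, labels, l=1)
Y = W.T @ X   # projection into the subspace, as in Y = W^T X
```

On this toy set the leading eigenvector of P − Q aligns with the only discriminative coordinate, which is exactly the behavior the margin criterion is after.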
Specifically, in steps (1) and (2-2), for a subspace Y = {y_1, ..., y_n} the affinity propagation clustering method produces K cluster classes as follows: find K genuine sample exemplars E = {e_1, ..., e_K} to represent K different classes C = {c_1, ..., c_K}, by maximizing the objective function:

max F({c_j}_{j=1}^K) = Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j);

where the class whose exemplar is e_j is labelled c_j, class c_j contains all data points that take e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j;

To solve this maximization problem, the representation B = {b_{ij} ∈ {0,1}, i, j = 1, ..., n} is introduced, converting it into a 0-1 integer programming problem with the objective:

max F({b_{ij}}) = Σ_{i=1}^{n} Σ_{j=1}^{n} b_{ij} s(y_i, y_j);

subject to the following three constraints:

b_{ii} = 1 if b_{ji} = 1;

Σ_{j=1}^{n} b_{ij} = 1;

Σ_{i=1}^{n} b_{ii} = K.
Preferably, the optimum of the above parameter B is obtained with the classical belief-propagation method; the computation can follow the article "K-AP: Generating Specified K Clusters by Efficient Affinity Propagation" published by Zhang et al. at the 2010 International Conference on Data Mining. The present invention applies this state-of-the-art technique to text clustering and obtains more reasonable clusters than traditional methods such as K-means. In the binary matrix B obtained after K-AP clustering, b_{ii} = 1 indicates that y_i is itself an exemplar and belongs to class c_i; b_{ij} = 1 (i ≠ j) indicates that the exemplar of y_i is y_j, so y_i and y_j belong to class c_j. B thus encodes the homogeneous and heterogeneous relations among the subspace samples, and since the subspace samples correspond to the samples of the text vector space, the class-membership indicator matrix of all documents can be updated accordingly.
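To make the role of B concrete, here is a small hypothetical sketch (not from the patent) that checks the three constraints on an indicator matrix and reads off the cluster assignment:

```python
import numpy as np

def decode_kap(B):
    """Read cluster assignments out of a K-AP indicator matrix B, checking
    the constraints: each row sums to 1 (one exemplar per point), a chosen
    exemplar is itself an exemplar (b_jj = 1 whenever b_ij = 1), and the
    number of exemplars on the diagonal is K."""
    B = np.asarray(B)
    n = B.shape[0]
    assert all(B[i].sum() == 1 for i in range(n)), "each point has exactly one exemplar"
    exemplars = [j for j in range(n) if B[j, j] == 1]
    for i in range(n):
        j = int(np.argmax(B[i]))
        assert B[j, j] == 1, "chosen exemplar must itself be an exemplar"
    labels = [exemplars.index(int(np.argmax(B[i]))) for i in range(n)]
    return labels, len(exemplars)

# 4 points, K = 2: points 0 and 1 pick exemplar 0; points 2 and 3 pick exemplar 3
B = [[1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1]]
labels, K = decode_kap(B)
```

The decoded label vector is what the patent collects into the class-membership indicator matrix after each clustering round.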
Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. Most prior text clustering methods project the high-dimensional text vector space into a subspace and then cluster within it, but such methods can hardly guarantee that the obtained subspace is optimal for clustering. The present invention fuses subspace solving and clustering into a unified framework and uses an iterative optimization strategy to ensure a globally optimal result, making the partition more reasonable and accurate.

2. The present invention solves the subspace projection with the criterion of maximizing the average neighborhood margin, avoiding the small-sample-size problem of traditional methods; it imposes no special requirement on the data distribution, effectively avoids combinatorial optimization problems, involves few tuning parameters, and has strong generalization and adaptive ability.

3. The present invention uses fast affinity propagation clustering (K-AP) to divide the text set into the K cluster classes specified by the user; experiments verify that the K-AP algorithm produces more reasonable cluster partitions in text clustering than traditional methods.
Description of the drawings
Fig. 1 is a flow diagram of the present invention;
Fig. 2(a)-(e) show the convergence-function value curves computed by the iterative clustering of the inventive method on NG20 and its sub-databases;
Fig. 3 shows the convergence-function value curve computed by the iterative clustering of the inventive method on the Classic3 database;
Fig. 4 shows the convergence-function value curve computed by the iterative clustering of the inventive method on the K1b database;
Fig. 5(a)-(e) compare the accuracy of the present invention and the LDA-Km algorithm after each iteration on NG20 and its sub-databases;
Fig. 6 compares the accuracy of the present invention and the LDA-Km algorithm after each iteration on the Classic3 database;
Fig. 7 compares the accuracy of the present invention and the LDA-Km algorithm after each iteration on the K1b database.
Detailed description of the invention
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
As shown in Fig. 1, the iterative text clustering method based on adaptive subspace learning comprises the following steps:

(1) Clustering initialization in the text vector space: from the segmented representation of all documents in the corpus, the mutual-information method selects a set of representative terms to form the term index; each document is then represented as a text vector according to the term index, the dimension of the text vector equaling the size of the selected term index, with each vector element valued by its tfidf weight. All documents in the corpus together constitute the original text vector space. The affinity propagation clustering algorithm (K-AP) is run in the original text vector space to produce the specified K initial clusters; each document receives its initial class, and the cluster labels of all documents are collected into the initial class-membership indicator matrix.

(2) Iterative optimization between subspace projection and clustering, comprising the following steps:

(2-1) Taking the initial class-membership indicator matrix obtained in step (1) as prior knowledge, solve the subspace projection matrix with the ANMM-based subspace learning method, and compute the convergence-function value from the subspace projection matrix and the initial class-membership indicator matrix;

(2-2) If the convergence condition is not met, project the original text vector space into the subspace by the current subspace projection matrix, run the K-AP algorithm in the subspace to produce the specified K clusters, and update the current class-membership indicator matrix;

(2-3) Taking the updated class-membership indicator matrix as prior knowledge, solve the subspace projection matrix with the ANMM-based subspace learning method, and compute the convergence-function value from the subspace projection matrix and the updated class-membership indicator matrix;

(2-4) Repeat steps (2-2)-(2-3) until the convergence condition is met; stop the iteration, output the final class-membership indicator matrix from the iterative process, and obtain the final cluster assignment of all documents.
The text data in this embodiment come from the 20Newsgroups (NG20), Classic3, and K1b corpora, respectively. The attributes of the text corpora used are shown in Table 1; all documents in the corpora have already been segmented.

Table 1: attributes of the text corpora in the embodiment
In step (1), the mutual-information method extracts m representative terms from the segmented document representations, forming the term index v = [t_1, t_2, ..., t_m]; each document of the corpus can then be expressed as an m-dimensional vector. Let x_j be the j-th document of the corpus; for term t_i, the corresponding vector element is valued by the tfidf weight:

tfidf_{i,j} = tf_{i,j} × idf_i = tf_{i,j} × log(|D| / df_i)    [1]

where tf_{i,j} is the frequency with which term t_i occurs in document x_j; in the computation of idf_i, |D| is the number of documents in the corpus and df_i is the number of documents in which term t_i occurs at least once. Document x_j can thus be represented in order as the m-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{m,j}]^T. Stacking the vector representations of all documents yields the text vector space, expressed as a matrix of size m × n: X_0 = [x_1, x_2, ..., x_n]. In this embodiment 2000 representative terms are extracted, so m = 2000. Taking the Binary corpus as an example, which contains 500 samples, the text vector space is expressed as a matrix of size 2000 × 500.
The iterative optimization between subspace projection and clustering in step (2) uses the class-membership indicator matrix as a bridge; this iterative process is the core method of the present invention and may be called adaptive subspace learning (ASL). The concrete iterative process is as follows:

Initialization: the text corpus is expressed as the original text vector space X_0; K-AP clustering on X_0 yields the initial class-membership indicator matrix L_0. L_0, which carries the text class information, enters step (2-1), where ANMM subspace projection yields the subspace projection matrix W_1, and the convergence-function value Score_1 is computed from L_0 and W_1.

Iteration 1: the original text vector space X_0 is projected by subspace projection matrix W_1 into subspace Y_1; K-AP clustering on Y_1 yields the class-membership indicator matrix L_1. L_1, which carries the text class information, enters step (2-3), where ANMM subspace projection yields W_2, and the convergence-function value Score_2 is computed from L_1 and W_2.

Iteration t: X_0 is projected by W_t into subspace Y_t; K-AP clustering on Y_t yields L_t. L_t enters step (2-3), where ANMM subspace projection yields W_{t+1}, and the convergence-function value Score_{t+1} is computed from L_t and W_{t+1}.

After each iteration above, the convergence-function value must be computed, as follows:

Score_{t+1} = tr(W_{t+1}^T (P(L_t) − Q(L_t)) W_{t+1})    [3]

where P(L_t) and Q(L_t) are the between-class and within-class scatter computed on the text vector space X_0 according to the class-membership indicator matrix L_t. If the set convergence condition Score_{t+1} − Score_t ≤ ε is met, or the set maximum number of iterations T is reached, the iteration exits and the final class-membership indicator matrix is obtained, i.e. the class of every document in the corpus, completing the text clustering.
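The stopping rule Score_{t+1} − Score_t ≤ ε (or reaching T iterations) can be sketched generically. In the sketch below, which is illustrative only, `step` and `score` stand in for the K-AP update and the ANMM convergence score; a toy contraction replaces the real update so the loop is runnable:

```python
def iterate_until_converged(step, score, init, eps=0.5, T=10):
    """Shell of steps (2-2)-(2-4): `step` maps the current state (class
    indicator plus projection) to the next one, `score` evaluates the
    convergence function Score_t; iteration stops once the score gain
    is at most eps, or after T rounds."""
    state = init
    prev = score(state)
    t = 0
    for t in range(1, T + 1):
        state = step(state)
        cur = score(state)
        if cur - prev <= eps:
            break
        prev = cur
    return state, t

# Toy stand-in: each round halves the remaining gap to a score of 10,
# so with eps = 0.5 the gain first drops below eps at iteration 5.
state, t = iterate_until_converged(step=lambda s: (s + 10) / 2,
                                   score=lambda s: s, init=0.0)
```

The cap T plays the same role as the patent's maximum iteration count: it bounds the loop even when the score keeps improving by more than ε.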
In step (2), whether on the original text vector space X_0 or on the subspaces obtained by projection (denoted {Y_1, ..., Y_t} during iteration), affinity propagation clustering (K-AP) must be performed to produce K clusters. Taking K-AP in a subspace as an example (sample space Y = {y_1, ..., y_n}), the idea of K-AP is to find K genuine sample exemplars E = {e_1, ..., e_K} to represent K different classes C = {c_1, ..., c_K}, by maximizing the objective function:

max F({c_j}_{j=1}^K) = Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j)    [4]

where the class whose exemplar is e_j is labelled c_j, class c_j contains all data points that take e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j. To solve this maximization problem, B = {b_{ij} ∈ {0,1}, i, j = 1, ..., n} is introduced, converting it into a 0-1 integer programming problem whose objective becomes:

max F({b_{ij}}) = Σ_{i=1}^{n} Σ_{j=1}^{n} b_{ij} s(y_i, y_j)    [5]

subject to the following three constraints:

b_{ii} = 1 if b_{ji} = 1

Σ_{j=1}^{n} b_{ij} = 1    [6]

Σ_{i=1}^{n} b_{ii} = K

The first constraint states that if y_j elects y_i as its exemplar, y_i must itself be an exemplar; the second states that every data point y_i has exactly one exemplar; the third states that the number of exemplars must be K, guaranteeing that the K-AP method produces the K clusters specified by the user.

The maximization problem under the above constraints can be expressed with a factor graph, and the optimum of B can be obtained by belief-propagation inference; the computation can follow the article "K-AP: Generating Specified K Clusters by Efficient Affinity Propagation" published by Zhang et al. at the 2010 International Conference on Data Mining. The matrix B describes the homogeneous and heterogeneous relations among the samples, from which the cluster class of every document in the corpus is obtained.
In steps (1) and (2), every document obtains its class information through the K-AP clustering algorithm; this information is collected into the class-membership indicator matrix L (denoted {L_0, L_1, ..., L_t} during iteration). The matrix L has size n × K, where n is the number of documents in the corpus and K is the number of classes produced by clustering. If the j-th document belongs to the k-th class, then L_{jk} = 1; otherwise L_{jk} = 0.
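A one-hot indicator matrix of this shape can be built directly from the label vector; a minimal sketch (illustrative, not from the patent):

```python
import numpy as np

def class_indicator(labels, K):
    """Build the n x K class-membership indicator matrix L:
    L[j, k] = 1 iff document j belongs to cluster k, else 0."""
    n = len(labels)
    L = np.zeros((n, K), dtype=int)
    L[np.arange(n), labels] = 1
    return L

L = class_indicator([0, 2, 1, 0], K=3)
```

Each row sums to 1, reflecting that every document belongs to exactly one cluster at any point of the iteration.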
In step (2), taking the document class-membership indicator matrix L as prior knowledge, the subspace is solved with the subspace learning method based on average neighborhood margin maximization (ANMM), as follows:

First, the between-class and within-class scatter of all data points is computed. Let x_i be a data point of the original text vector space whose class information is read from the class-membership indicator matrix L. Compute the distances between x_i and the other data points and, according to those distances and the points' cluster classes, divide them into two subsets: the homogeneous neighborhood N_i^o, containing the ξ nearest neighbors of the same class as x_i, and the heterogeneous neighborhood N_i^e, containing the ζ nearest neighbors of classes different from x_i.

The average between-class distance and average within-class distance of data point x_i are then expressed as:

P_i = (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||²    [7]

Q_i = (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||²    [8]

Over all data points, the corresponding scatter matrices are:

P = Σ_i (1 / |N_i^e|) Σ_{p: x_p ∈ N_i^e} (x_i − x_p)(x_i − x_p)^T    [9]

Q = Σ_i (1 / |N_i^o|) Σ_{q: x_q ∈ N_i^o} (x_i − x_q)(x_i − x_q)^T    [10]

where |·| denotes the number of data points contained in a subset.

Next, when x_i is projected into the subspace, i.e. y_i = W^T x_i, the average within-class distance must be minimized while the average between-class distance is maximized; therefore, over all data points, the average neighborhood margin function must be maximized:

γ(W) = tr(W^T (P − Q) W)    [11]

The subspace projection must be solved under the constraint W^T W = I, i.e. by maximizing the objective function:

W = arg max_{W^T W = I} tr(W^T (P − Q) W)    [12]

where P and Q are computed by formulas [9] and [10]. Suppose the vector representation of each document in the corpus has m dimensions and each document vector has l dimensions after subspace projection; then the columns of the projection matrix W are the eigenvectors corresponding to the l largest eigenvalues obtained from the eigendecomposition of (P − Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l]. In this embodiment, the initial document vector dimension is m = 2000. The subspace dimension l can be set to a fixed constant or changed dynamically according to the eigenvalue distribution. This embodiment keeps the eigenvectors whose eigenvalues exceed 10^-5; taking the Binary corpus as an example, 1080 eigenvalues greater than 10^-5 are obtained after the eigendecomposition of (P − Q), so l = 1080, and each initial document vector x_i of size 2000 × 1 is projected into a low-dimensional vector y_i of size 1080 × 1.
In this example, let α_i be the correct class label of document d_i in the text corpus and β_i the class label obtained by text clustering for document d_i. For a text corpus of n samples, the measure of clustering accuracy is:

Accuracy = ( Σ_{i=1}^{n} δ(α_i, map(β_i)) ) / n    [13]

where δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise; the optimal map function can be found by the classical maximum-weight bipartite matching method.
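Formula [13] can be sketched as follows. Since K is small in the embodiment, an exact search over label permutations stands in here for the maximum-weight bipartite matching the text names (a simplification for brevity, not the patent's procedure):

```python
from itertools import permutations

def clustering_accuracy(true, pred, K):
    """Accuracy of formula [13]: find the label permutation map() that best
    aligns the cluster labels with the true classes, then count matches."""
    n = len(true)
    best = 0
    for perm in permutations(range(K)):
        hits = sum(1 for a, b in zip(true, pred) if a == perm[b])
        best = max(best, hits)
    return best / n

# Labels permuted but otherwise perfect -> accuracy 1.0
acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], K=3)
# One document misclustered out of four -> accuracy 0.75
acc_partial = clustering_accuracy([0, 0, 1, 1], [0, 1, 0, 0], K=2)
```

For large K the exhaustive search should be replaced by the Hungarian algorithm, which is what the bipartite-matching formulation in the text implies.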
The following parameters must be determined in the present invention: the number ξ of same-class nearest neighbors, the number ζ of different-class nearest neighbors, and the maximum number of iterations T of the algorithm. To keep the samples balanced, this embodiment generally takes ξ = ζ. The values of ξ and ζ are selected from the combinations commonly used in local feature extraction methods, {5, 10, 15, 20}; each combination is tested on the different text corpora, and the resulting clustering accuracy is shown in Table 2:
Table 2: clustering accuracy for different selections of parameters ξ and ζ
Binary Multi5 Multi10 NG10 NG20 Classic3 K1b
ξ=ζ=5 0.880 0.848 0.556 0.683 0.539 0.977 0.732
ξ=ζ=10 0.920 0.906 0.604 0.713 0.580 0.989 0.814
ξ=ζ=15 0.906 0.890 0.574 0.711 0.550 0.989 0.822
ξ=ζ=20 0.894 0.872 0.568 0.703 0.546 0.986 0.797
This embodiment uses the best values from the table above, ξ = ζ = 10. The selection of the maximum number of iterations T relates to the computation of the convergence-function value in formula [3]. Figs. 2, 3, and 4 show the convergence-function value curves computed after each iteration of the adaptive subspace learning method (ASL) of the present invention on NG20 and its sub-databases, on Classic3, and on K1b, respectively. It can be seen that from the 5th iteration onward the change in the convergence-function value is relatively small and the algorithm tends to converge; the maximum number of iterations of ASL is therefore set to T = 10, ensuring a sufficient number of iterations.
To test the effectiveness of the proposed algorithm, Table 3 compares the clustering performance of the ASL method of the present invention against other methods on the same text corpora.
Table 3: Comparison of clustering performance of different methods on the same text corpora

Corpus     NMF      LPI      ASI      LDA-Km   ASL
Binary     0.864    0.872    0.898    0.906    0.920
Multi5     0.818    0.830    0.870    0.882    0.906
Multi10    0.476    0.494    0.558    0.566    0.604
NG10       0.625    0.653    0.662    0.671    0.713
NG20       0.529    0.532    0.551    0.558    0.580
Classic3   0.963    0.972    0.980    0.984    0.989
Yahoo      0.722    0.760    0.802    0.805    0.814
In the table above, non-negative matrix factorization (NMF) and locality preserving indexing (LPI) are recently developed dimensionality-reduction techniques that project the original text vector space into a subspace and then cluster in the projection subspace; ASI, LDA-Km and ASL instead treat dimensionality reduction as a dynamic process integrated with clustering. The results show that dynamically obtaining a good low-dimensional subspace can effectively improve text clustering performance.
Since LDA-Km reduces to ASI when only the within-class data distribution is considered, ASI is a special case of LDA-Km. To illustrate which combination of subspace projection and clustering is optimal, this embodiment focuses on comparing the two methods LDA-Km and ASL:
Figs. 5, 6 and 7 compare the clustering accuracy of LDA-Km and ASL after each iteration on the NG20 corpus (and its sub-databases), Classic3 and K1b. The initialization is marked as t0; its purpose is to provide initial sample class information for ANMM or LDA in the subspace projection. The t1 iteration is equivalent to the conventional practice of clustering in a fixed subspace once that subspace is obtained. From t2 to t10, both LDA-Km and ASL improve clustering performance in an iterative manner, but at the same iteration count ASL reaches a relatively high clustering accuracy more easily and more stably. This indicates that, under identical conditions, ASL has a stronger adaptive subspace-learning ability than LDA-Km.
The above example demonstrates that, compared with traditional methods and schemes, the iterative method based on adaptive subspace learning solves the text clustering problem more effectively in terms of both performance and efficiency, reaching a practically usable level.
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent substitute and fall within the protection scope of the present invention.

Claims (8)

1. An iterative text clustering method based on adaptive subspace learning, characterized in that it comprises the following steps:
(1) Initialization: the text corpus is expressed in the mathematical form of a text vector space, the affinity propagation clustering method is applied in the text vector space to produce K initial clusters, and an initial class-membership indicator matrix representing the category of every document in the corpus is thereby obtained;
(2) Iterative optimization between subspace projection and clustering, comprising the following steps:
(2-1) taking the initial class-membership indicator matrix obtained in step (1) as prior knowledge, solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization, and computing the convergence function value from the initial class-membership indicator matrix and the subspace projection matrix;
(2-2) if the convergence condition is not met, projecting the original text vector space into the subspace according to the current subspace projection matrix, applying the affinity propagation clustering method in the subspace to produce the specified K clusters, and updating the current class-membership indicator matrix;
(2-3) taking the updated class-membership indicator matrix as prior knowledge, solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization, and computing the convergence function value from the updated class-membership indicator matrix and the subspace projection matrix;
(2-4) repeating steps (2-2)-(2-3) until the convergence condition is met, then stopping the iteration, outputting the final class-membership indicator matrix from the iterative process, and obtaining the final clustering result for all documents;
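The alternation of steps (2-1)-(2-4) can be sketched as the following skeleton, for illustration only: `project` and `cluster` are hypothetical stand-ins for the ANMM solver and the affinity propagation step of the claims, and the convergence value used here is an assumed surrogate objective, not the patented convergence function of formula [3]:

```python
import numpy as np

def asl_iterate(X, project, cluster, K, T=10, tol=1e-4):
    """Skeleton of step (2): alternate subspace projection and clustering.
    X has one document vector per row; `project(X, labels)` returns a
    projection matrix W, `cluster(Y, K)` returns K-cluster labels."""
    labels = cluster(X, K)                 # step (1): initial clusters
    prev = None
    for t in range(T):
        W = project(X, labels)             # (2-1)/(2-3): solve projection
        Y = X @ W                          # project into the subspace
        labels = cluster(Y, K)             # (2-2): re-cluster in subspace
        # Surrogate convergence value: projected variance (an assumption,
        # standing in for the convergence function of the description).
        value = float(np.trace(W.T @ np.cov(X.T) @ W))
        if prev is not None and abs(value - prev) < tol:
            break                          # (2-4): convergence reached
        prev = value
    return labels
```

Note that the document writes the projection as Y = W^T X with documents as columns; the sketch uses row vectors, so the same projection appears as X @ W.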
In said step (2-1), the steps of solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization are:
For a data point x_i in the text vector space, compute the distances between the other data points and x_i and, according to these distances and the class information of x_i, divide them into the following two subsets: N_i^o, the homogeneous neighborhood set, containing the ξ nearest neighbors belonging to the same class as x_i; and N_i^e, the heterogeneous neighborhood set, containing the ζ nearest neighbors belonging to a different class from x_i;
Compute the average between-class distance and the average within-class distance of data point x_i respectively;
Compute the average between-class distance P and the average within-class distance Q over all data points in the text vector space;
For all data points, maximize the average neighborhood margin function under the constraint W^T W = I, i.e. minimize the average within-class distance while maximizing the average between-class distance, thereby obtaining the subspace projection matrix W;
The average neighborhood margin function is:
γ = Σ_i ( Σ_{p: x_p ∈ N_i^e} ||y_i − y_p||² / |N_i^e| − Σ_{q: x_q ∈ N_i^o} ||y_i − y_q||² / |N_i^o| );
The subspace projection must be solved under the constraint W^T W = I, i.e. by maximizing the following objective function:
max_W Σ_i ( Σ_{p: x_p ∈ N_i^e} ||W^T x_i − W^T x_p||² / |N_i^e| − Σ_{q: x_q ∈ N_i^o} ||W^T x_i − W^T x_q||² / |N_i^o| );
where y_i = W^T x_i, i.e. the text vector obtained by projecting data point x_i into the subspace through the subspace projection matrix W, and y_p and y_q respectively denote the projected text vectors of data points from the heterogeneous and homogeneous neighborhood sets.
2. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the initialization process of step (1) is as follows: the mutual information method is used to select a set of representative terms from the word-segmented representations of all documents to constitute a term index; each document is then represented as a text vector according to the term index, the dimensionality of each text vector corresponding to the size of the selected term index and each element of the vector being represented by a tfidf weight; once every document is represented as a text vector, all documents in the corpus constitute a text vector space; the affinity propagation clustering algorithm is applied in the original text vector space to produce the specified K initial clusters, each document obtains its initial class, and the initial cluster categories of all documents are collected to form the initial class-membership indicator matrix.
3. The iterative text clustering method based on adaptive subspace learning according to claim 2, characterized in that in said step (1) each element of the vector is represented by a tfidf weight, computed as follows:
For a term t_i in the term index, its tfidf weight with respect to document x_j is expressed as:
tfidf_{i,j} = tf_{i,j} × idf_i = tf_{i,j} × log(|D| / df_i);
where tf_{i,j} denotes the frequency with which term t_i occurs in document x_j, |D| is the number of all documents in the corpus, and df_i is the number of documents in which term t_i occurs at least once; assuming the term index is {t_1, t_2, ..., t_M}, document x_j is expressed as the M-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{M,j}]^T.
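A minimal sketch of the tfidf representation above, assuming documents are already tokenized into term lists; the function and variable names are illustrative only:

```python
import math
from collections import Counter

def tfidf_vectors(docs, term_index):
    """Represent each tokenized document as an M-dimensional vector with
    elements tfidf_{i,j} = tf_{i,j} * log(|D| / df_i)."""
    D = len(docs)
    # df_i: number of documents containing term t_i at least once
    df = {t: sum(1 for d in docs if t in d) for t in term_index}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequencies tf_{i,j}
        vectors.append([tf[t] * math.log(D / df[t]) if df[t] else 0.0
                        for t in term_index])
    return vectors
```

In the claimed method the term index itself comes from mutual-information selection; here it is simply passed in.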
4. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the number ξ of same-class nearest neighbors of a data point is taken equal to the number ζ of nearest neighbors belonging to a different class from that data point, the selected neighborhood range depending on the situation of the text corpus.
5. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the average between-class distance and average within-class distance of said data point x_i are expressed as:
P_i = Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||² / |N_i^e| ;
Q_i = Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||² / |N_i^o| ;
The average between-class distance P and average within-class distance Q of all data points are expressed as:
P = Σ_i Σ_{p: x_p ∈ N_i^e} ||x_i − x_p||² / |N_i^e| ;
Q = Σ_i Σ_{q: x_q ∈ N_i^o} ||x_i − x_q||² / |N_i^o| ;
where | · | denotes the number of data points contained in a set.
6. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that, if the initial vector of a document is m-dimensional and each document vector is l-dimensional after subspace projection, then each column of the subspace projection matrix W is an eigenvector corresponding to one of the largest l eigenvalues obtained by singular value decomposition of (P − Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l]; the subspace is then obtained as Y = W^T X.
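Claim 6 decomposes (P − Q), which presupposes the scatter-matrix (outer-product) form of these quantities; the sketch below makes that assumption explicit, so the scalar average distances of claim 5 become tr(W^T P W) and tr(W^T Q W) after projection. This is an illustrative sketch under that assumption, not the patented implementation; the function name and toy neighbor counts are hypothetical:

```python
import numpy as np

def anmm_projection(X, labels, xi=2, zeta=2, l=2):
    """Sketch of the W of claim 6: accumulate heterogeneous (N_i^e -> P)
    and homogeneous (N_i^o -> Q) neighborhood scatter matrices, then take
    the top-l eigenvectors of (P - Q). X has one sample per row."""
    n, m = X.shape
    P = np.zeros((m, m))
    Q = np.zeros((m, m))
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        order = np.argsort(d)[1:]                  # neighbors, excluding x_i itself
        same = [j for j in order if labels[j] == labels[i]][:xi]
        diff = [j for j in order if labels[j] != labels[i]][:zeta]
        for j in diff:                             # heterogeneous set N_i^e
            v = (X[i] - X[j])[:, None]
            P += v @ v.T / len(diff)
        for j in same:                             # homogeneous set N_i^o
            v = (X[i] - X[j])[:, None]
            Q += v @ v.T / len(same)
    eigvals, eigvecs = np.linalg.eigh(P - Q)       # symmetric, so eigh suffices
    W = eigvecs[:, np.argsort(eigvals)[::-1][:l]]  # largest l eigenvalues
    return W
```

The sketch assumes every point has at least one same-class and one different-class neighbor; a robust version would guard the divisions.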
7. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that in said steps (1) and (2-2), for the subspace Y = {y_1, ..., y_n}, the method by which the affinity propagation clustering algorithm produces the specified K clusters is as follows: find K representative exemplars E = {e_1, ..., e_K} to represent the K different classes C = {c_1, ..., c_K}, thereby maximizing the following objective function:
max F({c_j}_{j=1..K}) = Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j);
where the class with e_j as its exemplar is labeled c_j, class c_j contains all data points taking e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j;
In solving the above maximization problem, the representation B = {b_ij ∈ {0, 1}, i, j = 1, ..., n} is introduced, i.e. the problem is transformed into a 0-1 integer programming problem, and the above objective function becomes:
max F({b_ij}) = Σ_{i=1}^{n} Σ_{j=1}^{n} b_ij s(y_i, y_j);
This objective function is subject to the following three constraints:
b_ii = 1 if b_ji = 1;
Σ_{j=1}^{n} b_ij = 1;
Σ_{i=1}^{n} b_ii = K;
By solving the above programming problem and outputting the parameter B, the same-class and different-class relationships between samples can be derived: if y_i elects y_j as its exemplar, then b_ij = 1, otherwise b_ij = 0; if y_i is itself an exemplar, then b_ii = 1, otherwise b_ii = 0.
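On very small inputs, the 0-1 program above can be sketched by brute force over exemplar subsets; this stands in for the belief-propagation solver of claim 8 and is illustrative only (the function name is hypothetical):

```python
from itertools import combinations

def affinity_k_clusters(S, K):
    """Brute-force sketch of the 0-1 program: choose K exemplars, let every
    point elect its most similar exemplar (b_ij = 1), exemplars elect
    themselves (b_ii = 1), and keep the choice maximizing
    F = sum_ij b_ij * s(y_i, y_j). The three constraints hold by
    construction."""
    n = len(S)
    best_F, best_B = float("-inf"), None
    for exemplars in combinations(range(n), K):
        B = [[0] * n for _ in range(n)]
        F = 0.0
        for i in range(n):
            # b_ij = 1 for the most similar exemplar; exemplars pick themselves
            j = i if i in exemplars else max(exemplars, key=lambda e: S[i][e])
            B[i][j] = 1
            F += S[i][j]
        if F > best_F:
            best_F, best_B = F, B
    return best_B, best_F
```

Enumerating exemplar subsets is exponential in K; the belief-propagation method of claim 8 is what makes the problem tractable at realistic sizes.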
8. The iterative text clustering method based on adaptive subspace learning according to claim 7, characterized in that the optimal solution of the parameter B is obtained by the conventional belief propagation method.
CN201310230981.4A 2013-06-09 2013-06-09 Iteration Text Clustering Method based on self adaptation sub-space learning Active CN103279556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310230981.4A CN103279556B (en) 2013-06-09 2013-06-09 Iteration Text Clustering Method based on self adaptation sub-space learning


Publications (2)

Publication Number Publication Date
CN103279556A CN103279556A (en) 2013-09-04
CN103279556B true CN103279556B (en) 2016-08-24

Family

ID=49062075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310230981.4A Active CN103279556B (en) 2013-06-09 2013-06-09 Iteration Text Clustering Method based on self adaptation sub-space learning

Country Status (1)

Country Link
CN (1) CN103279556B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886072B (en) * 2014-03-24 2016-08-24 河南理工大学 Search result clustering system in the search engine of colliery
CN105095275B (en) * 2014-05-13 2019-04-05 中国科学院自动化研究所 The method and device of clustering documents
CN104573710B (en) * 2014-12-25 2018-11-13 北京交通大学 A kind of Subspace clustering method smoothly characterized certainly based on latent space
CN105139031A (en) * 2015-08-21 2015-12-09 天津中科智能识别产业技术研究院有限公司 Data processing method based on subspace clustering
CN106294733B (en) * 2016-08-10 2019-05-07 成都轻车快马网络科技有限公司 Page detection method based on text analyzing
CN107203625B (en) * 2017-05-26 2020-03-20 北京邮电大学 Palace clothing text clustering method and device
CN108536844B (en) * 2018-04-13 2021-09-03 吉林大学 Text-enhanced network representation learning method
CN110727769B (en) 2018-06-29 2024-04-19 阿里巴巴(中国)有限公司 Corpus generation method and device and man-machine interaction processing method and device
CN109145976A (en) * 2018-08-14 2019-01-04 聚时科技(上海)有限公司 A kind of multiple view cluster machine learning method based on optimal neighbours' core
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN110135499A (en) * 2019-05-16 2019-08-16 北京工业大学 Clustering method based on the study of manifold spatially adaptive Neighborhood Graph
CN111159337A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Chemical expression extraction method, device and equipment
CN111966579A (en) * 2020-07-24 2020-11-20 复旦大学 Self-adaptive text input generation method based on natural language processing and machine learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214181A (en) * 2010-04-12 2011-10-12 无锡科利德斯科技有限公司 Fuzzy evolution calculation-based text clustering method
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Document Clustering via Adaptive Subspace Iteration;Tao Li etc.;《SIGIR 2004》;20040729;218-225 *

Also Published As

Publication number Publication date
CN103279556A (en) 2013-09-04

Similar Documents

Publication Publication Date Title
CN103279556B (en) Iteration Text Clustering Method based on self adaptation sub-space learning
Jia et al. Bagging-based spectral clustering ensemble selection
Guan et al. Text clustering with seeds affinity propagation
Rubin et al. Statistical topic models for multi-label document classification
CN101201894B (en) Method for recognizing human face from commercial human face database based on gridding computing technology
Popat et al. Hierarchical document clustering based on cosine similarity measure
CN103488662A (en) Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
CN105930862A (en) Density peak clustering algorithm based on density adaptive distance
CN110674407A (en) Hybrid recommendation method based on graph convolution neural network
CN102929894A (en) Online clustering visualization method of text
CN104881689A (en) Method and system for multi-label active learning classification
CN109784405A (en) Cross-module state search method and system based on pseudo label study and semantic consistency
CN104699698A (en) Graph query processing method based on massive data
Sánchez et al. Efficient algorithms for a robust modularity-driven clustering of attributed graphs
CN110364264A (en) Medical data collection feature dimension reduction method based on sub-space learning
CN105335510A (en) Text data efficient searching method
CN109739984A (en) A kind of parallel KNN network public-opinion sorting algorithm of improvement based on Hadoop platform
Chen et al. Clustering and ranking in heterogeneous information networks via gamma-poisson model
CN106971005A (en) Distributed parallel Text Clustering Method based on MapReduce under a kind of cloud computing environment
CN103218419B (en) Web tab clustering method and system
CN105160046A (en) Text-based data retrieval method
Maudes et al. Random projections for linear SVM ensembles
Mei et al. Proximity-based k-partitions clustering with ranking for document categorization and analysis
Cobos et al. Clustering of web search results based on an Iterative Fuzzy C-means Algorithm and Bayesian Information Criterion
CN105787072A (en) Field knowledge extracting and pushing method oriented to progress

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180110

Address after: 510601, fifteenth floor, No. 289, Guangzhou Avenue, Yuexiu District, Guangzhou, Guangdong

Patentee after: Guangdong Southern Newspaper Media Group New Media Co., Ltd.

Address before: 510601 Guangzhou Avenue, Yuexiu District, Guangzhou, Guangdong Province, No. 289

Patentee before: Nanfang Daily Group

TR01 Transfer of patent right