CN103279556B - Iterative text clustering method based on adaptive subspace learning - Google Patents
Iterative text clustering method based on adaptive subspace learning
- Publication number
- CN103279556B (application CN201310230981.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- subspace
- iteration
- cluster
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an iterative text clustering method based on adaptive subspace learning, comprising the following steps: (1) Initialization: the text corpus is expressed as a text vector space, an affinity propagation clustering method is used to produce K initial clusters, and the cluster categories of all texts are collected into an initial class membership indicator matrix. (2) Iteration between subspace projection and clustering: the initial class membership indicator matrix is taken as prior knowledge, the subspace projection matrix is solved with the objective of maximizing the average neighborhood margin, the text vector space is projected into the subspace, and the affinity propagation clustering method is used to produce K clusters in the subspace, thereby updating the class membership indicator matrix; a convergence function is computed from the subspace projection matrix and the class membership indicator matrix, and once the function converges the iteration exits and the text clustering is complete. The present invention places no restriction on the size or distribution of the text data, fuses subspace solving and clustering under a unified framework, and obtains a globally optimal clustering result through an iterative strategy.
Description
Technical field
The present invention relates to the fields of machine learning and pattern recognition, and in particular to an iterative text clustering method based on adaptive subspace learning. The method builds on an adaptive subspace learning approach that maximizes the average neighborhood margin, and applies it to the text clustering problem with an iterative strategy.
Background art
With the popularization and development of Internet technology and database technology, people can easily obtain and store large amounts of data. Much of the data encountered in practice exists in the form of text. Text clustering, as a means of organizing, summarizing and navigating textual information, helps users obtain exactly the information they need from an immense body of text resources, and has therefore attracted extensive attention.
In text clustering, text is commonly represented with the vector space model (VSM), but this representation suffers from the high dimensionality and sparseness of the feature space. In this situation a large amount of redundancy exists among uncorrelated dimensions, so the similarity measure used by the clustering algorithm loses its discriminating power, which degrades the final clustering performance. The usual remedy is to project the text vector space into a subspace with a dimensionality reduction technique and then divide the low-dimensional document representations into different classes with a clustering algorithm. However, this conventional text clustering approach treats dimensionality reduction merely as a preprocessing step of the clustering algorithm and severs the potential connection between subspace projection and clustering; it is therefore difficult to guarantee that the projected subspace is optimal for clustering.
To overcome this limitation, C. Ding et al. published an article entitled "Adaptive dimension reduction for clustering high dimensional data" at the 2002 International Conference on Data Mining (ICDM), proposing the concept of adaptive dimension reduction (ADR), which treats dimensionality reduction as a dynamic process and integrates it with the clustering process. Owing to its practicality and flexibility, ADR has evolved into several variants in subsequent applications:
Li et al. published an article entitled "Document clustering via adaptive subspace iteration" at the 2004 ACM SIGIR conference (ACM Special Interest Group on Information Retrieval), performing category division and subspace identification simultaneously. However, the subspace identification adopted in that work must model each cluster category separately, which turns the solution into a combinatorial optimization problem, and the interdependence of the initialization matrices also easily affects the final clustering performance.
In 2007, Ding and Li published an article entitled "Adaptive dimension reduction using discriminative analysis and K-means clustering" at the International Conference on Machine Learning (ICML), integrating linear discriminant analysis (LDA) and K-means clustering into the LDA-Km structure to solve the text clustering problem. However, LDA-based subspace projection easily suffers from the small-sample-size problem and is only effective for text data with a Gaussian distribution.
In 2007, Ye et al. published an article entitled "Adaptive distance metric learning for clustering" at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), proposing the nonlinear adaptive metric learning (NAML) algorithm, which casts kernel learning, dimensionality reduction and clustering as a single trace optimization problem and solves for the clustering result iteratively under an Expectation Maximization (EM) framework. The limitation of NAML is that its optimization depends on several key parameters, which easily leads to overfitting when the data are insufficient.
Although the idea of adaptive dimension reduction and its associated methods can solve specific text clustering problems, the technical deficiencies pointed out above restrict their range of application and leave room for improving text clustering algorithms. Studying a text clustering method with stronger generalization and adaptive ability is therefore a topic of practical significance.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing an iterative text clustering method based on adaptive subspace learning. The method fuses subspace solving and clustering under a unified framework and obtains a globally optimal result through an iterative optimization strategy. It imposes no special constraint on the quantity or distribution of the text data, effectively avoids a multiple-optimization problem, and involves few tuning parameters; it therefore has strong generalization and adaptive ability and produces a more reasonable classification.
The object of the present invention is achieved through the following technical solution: an iterative text clustering method based on adaptive subspace learning comprises the following steps:
(1) Initialization: the text corpus is expressed in the mathematical form of a text vector space; the K-Affinity Propagation (K-AP) clustering method is applied in the text vector space to produce K initial clusters, yielding an initial class membership indicator matrix that records the category of every document in the corpus;
(2) Iterative optimization between subspace projection and clustering, comprising the following steps:
(2-1) Taking the initial class membership indicator matrix obtained in step (1) as prior knowledge, solve the subspace projection matrix with the subspace learning method based on Average Neighborhood Margin Maximization (ANMM), and compute the convergence function value from the initial class membership indicator matrix and the subspace projection matrix;
(2-2) If the convergence condition is not met, project the original text vector space into the subspace according to the current subspace projection matrix, run the K-AP algorithm in the subspace to produce the specified K clusters, and update the current class membership indicator matrix;
(2-3) Taking the updated class membership indicator matrix as prior knowledge, solve the subspace projection matrix again with the subspace learning method based on average neighborhood margin maximization, and compute the convergence function value from the updated class membership indicator matrix and the subspace projection matrix;
(2-4) Repeat steps (2-2)-(2-3) until the convergence condition is met, stop the iteration, output the final class membership indicator matrix of the iterative process, and obtain the final clustering result of all documents.
Specifically, the initialization in step (1) proceeds as follows: a set of representative terms is selected from the word-segmented representation of all documents in the corpus with the mutual information method, forming a term index; each segmented document is then represented as a text vector according to the term index, the dimension of the text vector being the size of the selected term index and each element of the vector being a tfidf weight; once each document is expressed as a text vector, all documents in the corpus constitute an original text vector space; the K-AP algorithm is run in the original text vector space to produce the specified K initial clusters, each document obtains its initial category, and the initial cluster categories of all documents are collected into the initial class membership indicator matrix.
More specifically, in step (1) each element of a vector is represented by a tfidf weight, computed as follows: for a term t_i in the term index, the tfidf weight of document x_j is expressed as
tfidf_{i,j} = tf_{i,j} * log( |D| / df_i ),
where tf_{i,j} is the frequency with which term t_i appears in document x_j, |D| is the number of documents in the corpus, and df_i is the number of documents in which term t_i occurs at least once. Assuming the term index is v = [t_1, t_2, ..., t_m], document x_j can then be expressed as the m-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{m,j}]^T.
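A minimal sketch of this weighting follows, assuming a toy corpus; the function name tfidf_vectors, the example documents and the use of the natural logarithm are illustrative assumptions, not part of the patent text.

```python
# Sketch of the tfidf vector construction described above (step (1)).
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """docs: list of token lists; vocab: the m selected index terms v = [t_1, ..., t_m]."""
    n_docs = len(docs)                                          # |D|
    df = {t: sum(1 for d in docs if t in d) for t in vocab}     # df_i per term
    vectors = []
    for d in docs:
        tf = Counter(d)                                         # tf_{i,j} per term
        vec = [tf[t] * math.log(n_docs / df[t]) if df[t] else 0.0 for t in vocab]
        vectors.append(vec)                                     # x_j as an m-dimensional vector
    return vectors

docs = [["text", "cluster", "subspace"], ["text", "vector", "space"]]
vocab = ["text", "cluster", "subspace", "vector", "space"]
X = tfidf_vectors(docs, vocab)
```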
Specifically, in step (2-1) the method of solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization is:
For each data point x_i in the text vector space, compute the distances between x_i and the other data points in the text data set, and according to these distances and the class information of x_i divide them into two subsets: the homogeneous neighborhood set N_i^o, containing the ξ nearest neighbors that belong to the same class as x_i, and the heterogeneous neighborhood set N_i^e, containing the ζ nearest neighbors that belong to classes different from x_i;
Compute the average between-class distance and the average within-class distance of each data point x_i;
Compute the average between-class distance P and the average within-class distance Q of all data points in the text vector space;
For all data points, under the constraint W^T W = I, maximize the average neighborhood margin function, i.e. minimize the average within-class distance while maximizing the average between-class distance; this yields the subspace projection matrix W.
Further, in order to keep the samples balanced, the number ξ of same-class nearest neighbors of a data point is set equal to the number ζ of different-class nearest neighbors, and the neighborhood range is chosen according to the text corpus.
Specifically, the average between-class distance and the average within-class distance of data point x_i are expressed as
P_i = (1/|N_i^e|) Σ_{x_k ∈ N_i^e} ||x_i - x_k||²,  Q_i = (1/|N_i^o|) Σ_{x_k ∈ N_i^o} ||x_i - x_k||²,
and the average between-class distance P and the average within-class distance Q of all data points are expressed as
P = Σ_i Σ_{x_k ∈ N_i^e} (x_i - x_k)(x_i - x_k)^T / |N_i^e|,  Q = Σ_i Σ_{x_k ∈ N_i^o} (x_i - x_k)(x_i - x_k)^T / |N_i^o|,
where |·| denotes the number of data points contained in a set.
Specifically, the average neighborhood margin function is
γ = Σ_i [ (1/|N_i^e|) Σ_{x_q ∈ N_i^e} ||y_i - y_q||² - (1/|N_i^o|) Σ_{x_p ∈ N_i^o} ||y_i - y_p||² ],
and the subspace projection must be solved under the constraint W^T W = I, i.e. by maximizing the following objective function:
max_W tr( W^T (P - Q) W ).
Further, if the initial vector representation of a document has m dimensions and each document vector is represented with l dimensions after subspace projection, then the columns of the subspace projection matrix W are the eigenvectors corresponding to the l largest eigenvalues obtained from the singular value decomposition of (P - Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l], and the subspace is obtained as Y = W^T X.
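The following numpy sketch illustrates one way this ANMM step could be realized: it accumulates the between-class scatter P from the ζ heterogeneous nearest neighbors and the within-class scatter Q from the ξ homogeneous nearest neighbors, then keeps the leading eigenvectors of (P - Q) as the columns of W. The function name, the use of numpy.linalg.eigh and the default thresholds are assumptions for illustration; the eigenvalue cut-off 1e-5 mirrors the choice made in embodiment 1.

```python
import numpy as np

def anmm_projection(X, labels, xi=10, zeta=10, eig_threshold=1e-5):
    """X: n x m document matrix (one row per document); labels: cluster label per row.
    Returns the projection matrix W (m x l) and the scatter difference P - Q."""
    n, m = X.shape
    P = np.zeros((m, m))                                   # heterogeneous (between-class) scatter
    Q = np.zeros((m, m))                                   # homogeneous (within-class) scatter
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for i in range(n):
        nbr_order = np.argsort(dist[i])
        same = [j for j in nbr_order if j != i and labels[j] == labels[i]][:xi]
        diff = [j for j in nbr_order if labels[j] != labels[i]][:zeta]
        for j in diff:
            d = (X[i] - X[j])[:, None]
            P += (d @ d.T) / len(diff)
        for j in same:
            d = (X[i] - X[j])[:, None]
            Q += (d @ d.T) / len(same)
    evals, evecs = np.linalg.eigh(P - Q)                   # P - Q is symmetric
    eig_order = np.argsort(evals)[::-1]
    keep = [k for k in eig_order if evals[k] > eig_threshold]
    W = evecs[:, keep]                                     # columns = leading eigenvectors
    return W, P - Q

# Usage: Y = X @ W projects each document (row of X) into the l-dimensional subspace;
# with column vectors this corresponds to the Y = W^T X of the description.
```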
Specifically, in steps (1) and (2-2), for the subspace Y = {y_1, ..., y_n}, K cluster categories are produced with the affinity propagation clustering method as follows: find K genuine sample exemplars E = {e_1, ..., e_K} to represent K different classes C = {c_1, ..., c_K}, so as to maximize the following objective function:
max Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j),
where the class whose exemplar is e_j is labeled c_j, class c_j contains all data points that take e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j;
To solve this maximization problem, the representation B = {b_ij ∈ {0,1}, i, j = 1, ..., n} is introduced, so that the problem can be transformed into a 0-1 integer programming problem and the objective function becomes
max Σ_{i=1}^{n} Σ_{j=1}^{n} b_ij s(y_i, y_j),
subject to the following three constraints:
b_ii = 1 if b_ji = 1;  Σ_{j=1}^{n} b_ij = 1 for every i;  Σ_{j=1}^{n} b_jj = K.
Preferably, the optimum of the parameter B is obtained with a conventional belief propagation method; the computation can follow the article by Zhang et al. entitled "K-AP: Generating Specified K Clusters by Efficient Affinity Propagation", published at the 2010 International Conference on Data Mining. The present invention applies this state-of-the-art technique to text clustering and can obtain more reasonable clusters than traditional methods such as K-means. In the binary matrix B obtained after K-AP clustering, if b_ii = 1 then y_i is itself an exemplar and belongs to class c_i; if b_ij = 1 then the exemplar of y_i is y_j and y_i and y_j belong to class c_j. The homogeneous and heterogeneous relations between the subspace samples are thereby indicated, and since the subspace samples correspond to the samples of the text vector space, the class membership indicator matrix of all documents can be updated accordingly.
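A hypothetical helper, sketched under the rules just stated (b_ii = 1 marks y_i as an exemplar, b_ij = 1 assigns y_i to the cluster of exemplar y_j), turns the binary K-AP output B into the class membership indicator matrix; the function name is an assumption for illustration.

```python
import numpy as np

def indicator_from_assignment(B):
    """B: n x n 0/1 matrix produced by K-AP; returns the n x K indicator matrix L."""
    n = B.shape[0]
    exemplars = [i for i in range(n) if B[i, i] == 1]        # one exemplar per cluster
    column_of = {e: k for k, e in enumerate(exemplars)}      # exemplar index -> cluster column
    L = np.zeros((n, len(exemplars)), dtype=int)
    for i in range(n):
        j = int(np.argmax(B[i]))                             # the exemplar elected by y_i
        L[i, column_of[j]] = 1                               # L_ik = 1 iff document i is in class k
    return L
```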
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. Most text clustering methods in the prior art project the high-dimensional text vector space into a subspace and then cluster in the subspace, but such methods can hardly guarantee that the obtained subspace is optimal for clustering. The present invention fuses subspace solving and clustering under a unified framework and uses an iterative optimization strategy to obtain a globally optimal result, so that the classification is more reasonable and accurate.
2. The present invention solves the subspace projection with the criterion of maximizing the average neighborhood margin, which avoids the small-sample-size problem of traditional methods, imposes no special requirement on the data distribution, effectively avoids a multiple-optimization problem, and involves few tuning parameters, giving the method strong generalization and adaptive ability.
3. The present invention uses fast affinity propagation clustering (K-AP) to divide the text set into the K cluster categories specified by the user, and it is verified experimentally that the K-AP algorithm produces a more reasonable cluster partition in text clustering than traditional methods.
Brief description of the drawings
Fig. 1 is a flow diagram of the present invention;
Fig. 2(a)-(e) are curves of the convergence function value computed after each clustering iteration of the method of the present invention on NG20 and its sub-databases;
Fig. 3 is the curve of the convergence function value computed after each clustering iteration of the method of the present invention on the Classic3 database;
Fig. 4 is the curve of the convergence function value computed after each clustering iteration of the method of the present invention on the K1b database;
Fig. 5(a)-(e) compare the accuracy of the present invention and of the LDA-Km algorithm after each iteration on NG20 and its sub-databases;
Fig. 6 compares the accuracy of the present invention and of the LDA-Km algorithm after each iteration on the Classic3 database;
Fig. 7 compares the accuracy of the present invention and of the LDA-Km algorithm after each iteration on the K1b database.
Detailed description of the invention
The present invention is described in further detail below with reference to embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
As shown in Fig. 1, the iterative text clustering method based on adaptive subspace learning comprises the following steps:
(1) Clustering initialization in the text vector space: a set of representative terms is selected from the word-segmented representation of all documents in the corpus with the mutual information method, forming a term index; each document is then represented as a text vector according to the term index, the dimension of a text vector being the size of the selected term index and each element of the vector being a tfidf weight; all documents in the corpus thus constitute an original text vector space; the affinity propagation clustering algorithm (K-AP) is run in the original text vector space to produce the K specified initial clusters, each document obtains its initial category, and the cluster categories of all documents are collected into the initial class membership indicator matrix.
(2) Iterative optimization between subspace projection and clustering, comprising the following steps:
(2-1) Taking the initial class membership indicator matrix obtained in step (1) as prior knowledge, solve the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization, and compute the convergence function value from the subspace projection matrix and the initial class membership indicator matrix;
(2-2) If the convergence condition is not met, project the original text vector space into the subspace according to the current subspace projection matrix, continue to run the K-AP algorithm in the subspace to produce the specified K clusters, and thereby update the current class membership indicator matrix;
(2-3) Taking the updated class membership indicator matrix as prior knowledge, solve the subspace projection matrix again with the subspace learning method based on average neighborhood margin maximization, and compute the convergence function value from the subspace projection matrix and the updated class membership indicator matrix;
(2-4) Repeat steps (2-2)-(2-3) until the convergence condition is met, stop the iteration, output the final class membership indicator matrix of the iterative process, and obtain the final clustering result of all documents.
The text data in this embodiment come from the 20Newsgroups (NG20), Classic3 and K1b corpora, respectively. The attributes of the text corpora used are listed in Table 1; all documents in the corpora have already been word-segmented.
Table 1. Attributes of the text corpora used in the embodiment
In step (1), m representative terms can be extracted from the word-segmented representation of the documents with the mutual information method, forming the term index v = [t_1, t_2, ..., t_m]; each document of the corpus can then be expressed as an m-dimensional vector. Assume x_j is the j-th document in the corpus; for term t_i, the corresponding element of the vector can be represented by the tfidf weight
tfidf_{i,j} = tf_{i,j} * log( |D| / df_i ),
where tf_{i,j} is the frequency with which term t_i appears in document x_j; in the computation of idf_i, |D| is the number of all documents in the corpus and df_i is the number of documents in the corpus in which term t_i occurs at least once. Document x_j can thus be written in order as the m-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{m,j}]^T. Stacking the vector representations of all documents yields the text vector space, expressed as a matrix of size m × n: X_0 = [x_1, x_2, ..., x_n]. In this embodiment 2000 representative terms are extracted, so m = 2000. Taking the Binary corpus as an example, which contains 500 samples, its text vector space can be expressed as a matrix of size 2000 × 500.
The iterative optimization between subspace projection and clustering in step (2) uses the class membership indicator matrix as a bridge. This iterative process is the core method of the present invention and can be described as the adaptive subspace learning method (ASL). The concrete iterative process is as follows:
Initialization: the text corpus is expressed as the original text vector space X_0; K-AP clustering on X_0 yields the initial class membership indicator matrix L_0; L_0, which contains the text category information, enters step (2-1), ANMM subspace projection is applied to obtain the subspace projection matrix W_1, and the convergence function value Score_1 is computed from L_0 and W_1;
1st iteration: the original text vector space X_0 is projected into the subspace Y_1 according to the subspace projection matrix W_1; K-AP clustering on Y_1 yields the class membership indicator matrix L_1; L_1, which contains the text category information, enters step (2-3), ANMM subspace projection is applied to obtain the subspace projection matrix W_2, and the convergence function value Score_2 is computed from L_1 and W_2;
t-th iteration: the original text vector space X_0 is projected into the subspace Y_t according to the subspace projection matrix W_t; K-AP clustering on Y_t yields the class membership indicator matrix L_t; L_t, which contains the text category information, enters step (2-3), ANMM subspace projection is applied to obtain the subspace projection matrix W_{t+1}, and the convergence function value Score_{t+1} is computed from L_t and W_{t+1};
After each of the above iterations the convergence function value is computed as
Score_{t+1} = tr( W_{t+1}^T ( P(L_t) - Q(L_t) ) W_{t+1} ),
where P(L_t) and Q(L_t) are the average between-class distance and the average within-class distance computed on the text vector space X_0 according to the class membership indicator matrix L_t. If the preset convergence condition Score_{t+1} - Score_t ≤ ε is satisfied, or the preset maximum number of iterations T is reached, the iteration exits and the final class membership indicator matrix is obtained, i.e. the category of every document in the corpus, which completes the text clustering.
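The iteration just described can be summarized in the following high-level sketch. It reuses the anmm_projection helper sketched earlier (assumed to return both W and P - Q), and it substitutes scikit-learn's KMeans for the K-AP clustering step purely so that the loop is runnable as written; the patent itself clusters with K-AP at every step, and the default ε value used here is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def asl_cluster(X, K, xi=10, zeta=10, eps=1e-4, T=10):
    """X: n x m tfidf document matrix; returns one cluster label per document."""
    cluster = lambda Z: KMeans(n_clusters=K, n_init=10).fit_predict(Z)
    labels = cluster(X)                                # initial clustering (K-AP in the patent)
    W, S = anmm_projection(X, labels, xi, zeta)        # ANMM with L_0 as prior
    prev = np.trace(W.T @ S @ W)                       # Score_1
    for _ in range(T):
        Y = X @ W                                      # project documents into the subspace
        labels = cluster(Y)                            # re-cluster in the subspace -> L_t
        W, S = anmm_projection(X, labels, xi, zeta)    # refit the projection -> W_{t+1}
        score = np.trace(W.T @ S @ W)                  # Score_{t+1}
        if score - prev <= eps:                        # convergence test Score_{t+1} - Score_t <= eps
            break
        prev = score
    return labels
```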
In step (2), whether on the original text vector space X_0 or on the subspaces obtained by projection ({Y_1, ..., Y_t} in the iterative process), affinity propagation clustering (K-AP) needs to be performed to produce K clusters. Taking the affinity propagation clustering (K-AP) in the subspace as an example, with sample space Y = {y_1, ..., y_n}, the idea of K-AP is to find K genuine sample exemplars E = {e_1, ..., e_K} to represent K different classes C = {c_1, ..., c_K}, so as to maximize the following objective function:
max Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j),
where the class whose exemplar is e_j is labeled c_j, class c_j contains all data points that take e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j. To solve this maximization problem, B = {b_ij ∈ {0,1}, i, j = 1, ..., n} is introduced, the problem is transformed into a 0-1 integer programming problem, and the objective function becomes
max Σ_{i=1}^{n} Σ_{j=1}^{n} b_ij s(y_i, y_j),
subject to the following three constraints:
b_ii = 1 if b_ji = 1;  Σ_{j=1}^{n} b_ij = 1 for every i;  Σ_{j=1}^{n} b_jj = K.
The first constraint states that if y_j elects y_i as its exemplar, then y_i must itself be an exemplar; the second constraint states that each data point y_i has one and only one exemplar; the third constraint states that the number of exemplars must be K, which guarantees that the K-AP method produces the K clusters specified by the user.
The maximization problem under the above constraints can be expressed with a factor graph, and the optimal solution for B can be obtained by belief propagation inference; the computation can follow the article by Zhang et al. entitled "K-AP: Generating Specified K Clusters by Efficient Affinity Propagation", published at the 2010 International Conference on Data Mining. The parameter B describes the homogeneous and heterogeneous relations between samples, from which the cluster category of every document in the corpus is obtained.
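As a small illustration of the 0-1 integer program, the sketch below only evaluates the objective for a candidate assignment matrix B and checks the three constraints; the actual optimization in the cited K-AP paper is carried out by belief propagation on the factor graph and is not reproduced here. The function names are illustrative assumptions.

```python
import numpy as np

def kap_objective(B, S):
    """Objective sum_{i,j} b_ij * s(y_i, y_j) for B, S of shape n x n."""
    return float(np.sum(B * S))

def kap_constraints_hold(B, K):
    n = B.shape[0]
    c1 = all(B[i, i] == 1                                  # anyone chosen as exemplar is an exemplar
             for i in range(n) if B[:, i].sum() > B[i, i])
    c2 = all(B[i, :].sum() == 1 for i in range(n))         # exactly one exemplar per data point
    c3 = int(np.trace(B)) == K                             # exactly K exemplars
    return c1 and c2 and c3
```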
In steps (1) and (2), every document obtains its category information from the K-AP clustering algorithm, and this information is collected into the class membership indicator matrix L (denoted {L_0, L_1, ..., L_t} in the iterative process). The matrix L has size n × K, where n is the number of documents in the corpus and K is the number of categories produced by clustering. If the j-th document belongs to the k-th class, then L_jk = 1; otherwise L_jk = 0.
In step (2), the class membership indicator matrix L of the documents is taken as prior knowledge and the subspace is solved with the subspace learning method based on average neighborhood margin maximization (ANMM), as follows:
First, the average between-class distance and the average within-class distance of all data points are computed. Assume x_i is a data point in the original text vector space; its class information can be obtained from the class membership indicator matrix L. The distances between x_i and the other data points are computed, and according to these distances and the cluster category information the other points are divided into two subsets: the homogeneous neighborhood set N_i^o, containing the ξ nearest neighbors that belong to the same class as x_i, and the heterogeneous neighborhood set N_i^e, containing the ζ nearest neighbors that belong to classes different from x_i.
The average between-class distance and the average within-class distance of data point x_i are then expressed as
P_i = (1/|N_i^e|) Σ_{x_k ∈ N_i^e} ||x_i - x_k||²,  Q_i = (1/|N_i^o|) Σ_{x_k ∈ N_i^o} ||x_i - x_k||²,
and for all data points
P = Σ_i Σ_{x_k ∈ N_i^e} (x_i - x_k)(x_i - x_k)^T / |N_i^e|,  Q = Σ_i Σ_{x_k ∈ N_i^o} (x_i - x_k)(x_i - x_k)^T / |N_i^o|,
where |·| denotes the number of data points contained in a subset.
Second, if x_i is projected into the subspace, i.e. y_i = W^T x_i, the average within-class distance must be minimized while the average between-class distance is maximized; therefore, for all data points, the average neighborhood margin function
γ = Σ_i [ (1/|N_i^e|) Σ_{x_q ∈ N_i^e} ||y_i - y_q||² - (1/|N_i^o|) Σ_{x_p ∈ N_i^o} ||y_i - y_p||² ]
must be maximized. The subspace projection is solved under the constraint W^T W = I, i.e. by maximizing the objective function
max_W tr( W^T (P - Q) W ),
where P and Q are computed with the formulas given above. Assume the vector representation of each document in the corpus has m dimensions and each document vector is represented with l dimensions after subspace projection; then the columns of the projection matrix W are the eigenvectors corresponding to the l largest eigenvalues obtained from the singular value decomposition of (P - Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l]. In this embodiment the initial dimension of the document vectors is m = 2000. The subspace dimension l can be set to a fixed constant, or it can change dynamically according to the eigenvalue distribution. This embodiment keeps the eigenvectors whose eigenvalues are greater than 10^-5; taking the Binary corpus as an example, (P - Q) has 1080 eigenvalues greater than 10^-5 after singular value decomposition, so l = 1080, and each initial document vector x_i of size 2000 × 1 is projected into a low-dimensional vector y_i of size 1080 × 1.
In this example, assume α_i is the correct category label of document d_i in the text corpus and β_i is the category label that document d_i obtains from text clustering. For a text corpus containing n samples, the measure of clustering accuracy is
accuracy = (1/n) Σ_{i=1}^{n} δ( α_i, map(β_i) ),
where δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise, and the optimal mapping function map(·) can be found with the classical maximum-weight matching method for bipartite graphs.
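A sketch of this accuracy measure, assuming scipy is available: predicted labels β_i are matched to true labels α_i by maximum-weight bipartite matching (the Hungarian method via linear_sum_assignment), and the fraction of correctly mapped documents is reported. The label encodings and the function name are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    true_ids = {t: i for i, t in enumerate(sorted(set(true_labels)))}
    pred_ids = {p: i for i, p in enumerate(sorted(set(pred_labels)))}
    weight = np.zeros((len(pred_ids), len(true_ids)))
    for a, b in zip(true_labels, pred_labels):
        weight[pred_ids[b], true_ids[a]] += 1              # co-occurrence counts
    rows, cols = linear_sum_assignment(-weight)            # maximise the matched weight
    return weight[rows, cols].sum() / len(true_labels)
```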
The following parameters need to be determined in the present invention: the number ξ of same-class nearest neighbors, the number ζ of different-class nearest neighbors, and the maximum number of iterations T of the algorithm. In order to keep the samples balanced, this embodiment takes ξ = ζ. The values of ξ and ζ are selected from the value combinations {5, 10, 15, 20} commonly used in local feature extraction methods; the different combinations are tested on the different text corpora, and the resulting clustering accuracies are shown in Table 2:
Table 2. Comparison of test results for the selection of parameters ξ and ζ
ξ=ζ | Binary | Multi5 | Multi10 | NG10 | NG20 | Classic3 | K1b |
ξ=ζ=5 | 0.880 | 0.848 | 0.556 | 0.683 | 0.539 | 0.977 | 0.732 |
ξ=ζ=10 | 0.920 | 0.906 | 0.604 | 0.713 | 0.580 | 0.989 | 0.814 |
ξ=ζ=15 | 0.906 | 0.890 | 0.574 | 0.711 | 0.550 | 0.989 | 0.822 |
ξ=ζ=20 | 0.894 | 0.872 | 0.568 | 0.703 | 0.546 | 0.986 | 0.797 |
This embodiment uses the best values from the table above, ξ = ζ = 10. The choice of the maximum number of iterations T is related to the computation of the convergence function value given above. Figs. 2, 3 and 4 show the curves of the convergence function value computed after each iteration of the adaptive subspace learning method (ASL) of the present invention on NG20 and its sub-databases, on Classic3 and on K1b. It can be seen that from the 5th iteration on, the change in the convergence function value is relatively small and the algorithm tends to converge; the maximum number of iterations of ASL is therefore set to T = 10, which guarantees a sufficient number of iterations.
To test the effectiveness of the proposed algorithm, Table 3 gives a comparison of the clustering performance of the ASL method of the present invention and of other methods on the same text corpora.
Table 3. Comparison of the clustering performance of different methods on the same text corpora
Corpus of text collection | NMF | LPI | ASI | LDA-Km | ASL |
Binary | 0.864 | 0.872 | 0.898 | 0.906 | 0.920 |
Multi5 | 0.818 | 0.830 | 0.870 | 0.882 | 0.906 |
Multi10 | 0.476 | 0.494 | 0.558 | 0.566 | 0.604 |
NG10 | 0.625 | 0.653 | 0.662 | 0.671 | 0.713 |
NG20 | 0.529 | 0.532 | 0.551 | 0.558 | 0.580 |
Classic3 | 0.963 | 0.972 | 0.980 | 0.984 | 0.989 |
Yahoo | 0.722 | 0.760 | 0.802 | 0.805 | 0.814 |
In the table above, non-negative matrix factorization (NMF) and locality preserving indexing (LPI) are recently developed dimensionality reduction techniques that project the original text vector space into a subspace and then cluster in the projected subspace, whereas ASI, LDA-Km and ASL treat dimensionality reduction as a dynamic process and integrate it with clustering. The results show that dynamically obtaining a good low-dimensional subspace can effectively improve text clustering performance.
Since LDA-Km reduces to ASI when only the within-class data distribution is considered, ASI is a special case of LDA-Km. To show which combination of subspace projection and clustering is optimal, this embodiment focuses on comparing the two methods LDA-Km and ASL:
Figs. 5, 6 and 7 compare the clustering accuracy of LDA-Km and ASL after each iteration on NG20 and its sub-databases, on Classic3 and on K1b. The initialization is marked t_0 and serves to provide the initial sample class information for ANMM or LDA in the subspace projection; iteration t_1 is equivalent to the conventional practice of clustering in a fixed subspace; from iterations t_2 to t_10, both LDA-Km and ASL improve their clustering performance in an iterative manner, but at the same number of iterations ASL reaches a relatively high clustering accuracy more easily and more stably. This indicates that, under the same implementation conditions, ASL has a stronger adaptive subspace learning ability than LDA-Km.
This example demonstrates that the iterative method based on adaptive subspace learning solves the text clustering problem more effectively than traditional methods and schemes in terms of both performance and efficiency, and reaches a level suitable for practical use.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by the above embodiment; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and shall fall within the protection scope of the present invention.
Claims (8)
1. An iterative text clustering method based on adaptive subspace learning, characterized by comprising the following steps:
(1) Initialization: the text corpus is expressed in the mathematical form of a text vector space; the affinity propagation clustering method is applied in the text vector space to produce K initial clusters, yielding an initial class membership indicator matrix that records the category of every document in the corpus;
(2) Iterative optimization between subspace projection and clustering, comprising the following steps:
(2-1) taking the initial class membership indicator matrix obtained in step (1) as prior knowledge, solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization, and computing the convergence function value from the initial class membership indicator matrix and the subspace projection matrix;
(2-2) if the convergence condition is not met, projecting the original text vector space into the subspace according to the current subspace projection matrix, continuing to run the affinity propagation clustering method in the subspace to produce the specified K clusters, and updating the current class membership indicator matrix;
(2-3) taking the updated class membership indicator matrix as prior knowledge, solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization, and computing the convergence function value from the updated class membership indicator matrix and the subspace projection matrix;
(2-4) repeating steps (2-2)-(2-3) until the convergence condition is met, stopping the iteration, outputting the final class membership indicator matrix of the iterative process, and obtaining the final clustering result of all documents;
in step (2-1), the steps of solving the subspace projection matrix with the subspace learning method based on average neighborhood margin maximization are:
for each data point x_i in the text vector space, computing the distances between x_i and the other data points, and according to these distances and the class information of x_i dividing them into two subsets: the homogeneous neighborhood set N_i^o, containing the ξ nearest neighbors that belong to the same class as x_i, and the heterogeneous neighborhood set N_i^e, containing the ζ nearest neighbors that belong to classes different from x_i;
computing the average between-class distance and the average within-class distance of each data point x_i;
computing the average between-class distance P and the average within-class distance Q of all data points in the text vector space;
for all data points, under the constraint W^T W = I, maximizing the average neighborhood margin function, i.e. minimizing the average within-class distance while maximizing the average between-class distance, thereby obtaining the subspace projection matrix W;
the average neighborhood margin function is
γ = Σ_i [ (1/|N_i^e|) Σ_{x_q ∈ N_i^e} ||y_i - y_q||² - (1/|N_i^o|) Σ_{x_p ∈ N_i^o} ||y_i - y_p||² ],
and the subspace projection is solved under the constraint W^T W = I, i.e. by maximizing the following objective function:
max_W tr( W^T (P - Q) W ),
wherein y_i = W^T x_i, i.e. the text vector obtained by projecting data point x_i into the subspace with the subspace projection matrix W, and y_p and y_q respectively denote the text vectors obtained after the data points of the homogeneous neighborhood set and of the heterogeneous neighborhood set are projected into the subspace.
2. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the initialization in step (1) proceeds as follows: a set of representative terms is selected from the word-segmented representation of all documents with the mutual information method, forming a term index; each document is then represented as a text vector according to the term index, the dimension of each text vector being the size of the selected term index and each element of the vector being a tfidf weight; once each document is represented as a text vector, all documents in the corpus constitute a text vector space; the affinity propagation clustering algorithm is run in the original text vector space to produce the specified K initial clusters, each document obtains its initial category, and the initial cluster categories of all documents are collected into the initial class membership indicator matrix.
3. The iterative text clustering method based on adaptive subspace learning according to claim 2, characterized in that in step (1) each element of a vector is represented by a tfidf weight, computed as follows:
for a term t_i in the term index, the tfidf weight of document x_j is expressed as
tfidf_{i,j} = tf_{i,j} * log( |D| / df_i ),
where tf_{i,j} is the frequency with which term t_i appears in document x_j, |D| is the number of all documents in the corpus, and df_i is the number of documents in which term t_i occurs at least once; assuming the term index is v = [t_1, t_2, ..., t_m], document x_j is expressed as the m-dimensional vector x_j = [tfidf_{1,j}, tfidf_{2,j}, ..., tfidf_{m,j}]^T.
4. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the number ξ of same-class nearest neighbors of a data point is equal to the number ζ of different-class nearest neighbors of the data point, and the neighborhood range is selected according to the text corpus.
5. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that the average between-class distance and the average within-class distance of data point x_i are expressed as
P_i = (1/|N_i^e|) Σ_{x_k ∈ N_i^e} ||x_i - x_k||²,  Q_i = (1/|N_i^o|) Σ_{x_k ∈ N_i^o} ||x_i - x_k||²,
and the average between-class distance P and the average within-class distance Q of all data points are expressed as
P = Σ_i Σ_{x_k ∈ N_i^e} (x_i - x_k)(x_i - x_k)^T / |N_i^e|,  Q = Σ_i Σ_{x_k ∈ N_i^o} (x_i - x_k)(x_i - x_k)^T / |N_i^o|,
where |·| denotes the number of data points contained in a set.
6. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that if the initial vector representation of a document has m dimensions and each document vector is represented with l dimensions after subspace projection, then the columns of the subspace projection matrix W are the eigenvectors corresponding to the l largest eigenvalues obtained from the singular value decomposition of (P - Q), with l ≤ m, i.e. W = [w_1, w_2, ..., w_l], and the subspace is obtained as Y = W^T X.
7. The iterative text clustering method based on adaptive subspace learning according to claim 1, characterized in that in steps (1) and (2-2), for the subspace Y = {y_1, ..., y_n}, the specified K cluster categories are produced with the affinity propagation clustering algorithm as follows: finding K genuine sample exemplars E = {e_1, ..., e_K} to represent K different classes C = {c_1, ..., c_K}, so as to maximize the following objective function:
max Σ_{j=1}^{K} Σ_{y_i ∈ c_j} s(y_i, e_j),
where the class whose exemplar is e_j is labeled c_j, class c_j contains all data points that take e_j as their exemplar, and s(y_i, e_j) denotes the similarity between data point y_i and exemplar e_j;
to solve this maximization problem, the representation B = {b_ij ∈ {0,1}, i, j = 1, ..., n} is introduced, the problem is transformed into a 0-1 integer programming problem, and the objective function becomes
max Σ_{i=1}^{n} Σ_{j=1}^{n} b_ij s(y_i, y_j),
subject to the following three constraints: b_ii = 1 if b_ji = 1; Σ_{j=1}^{n} b_ij = 1 for every i; Σ_{j=1}^{n} b_jj = K;
by solving the above programming problem and outputting the parameter B, the homogeneous and heterogeneous relations between the samples are obtained, i.e. if y_i elects y_j as its exemplar then b_ij = 1, otherwise b_ij = 0; if y_i is itself an exemplar then b_ii = 1, otherwise b_ii = 0.
8. The iterative text clustering method based on adaptive subspace learning according to claim 7, characterized in that the optimum of the parameter B is obtained by a conventional belief propagation method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310230981.4A CN103279556B (en) | 2013-06-09 | 2013-06-09 | Iteration Text Clustering Method based on self adaptation sub-space learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103279556A CN103279556A (en) | 2013-09-04 |
CN103279556B true CN103279556B (en) | 2016-08-24 |
Family
ID=49062075
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310230981.4A Active CN103279556B (en) | 2013-06-09 | 2013-06-09 | Iteration Text Clustering Method based on self adaptation sub-space learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103279556B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102214181A (en) * | 2010-04-12 | 2011-10-12 | 无锡科利德斯科技有限公司 | Fuzzy evolution calculation-based text clustering method |
CN102332012A (en) * | 2011-09-13 | 2012-01-25 | 南方报业传媒集团 | Chinese text sorting method based on correlation study between sorts |
Non-Patent Citations (1)
Title |
---|
Document Clustering via Adaptive Subspace Iteration; Tao Li et al.; SIGIR 2004; 2004-07-29; pp. 218-225 *
Also Published As
Publication number | Publication date |
---|---|
CN103279556A (en) | 2013-09-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 2018-01-10
Address after: 15th floor, No. 289 Guangzhou Avenue, Yuexiu District, Guangzhou, Guangdong 510601
Patentee after: Guangdong Southern Newspaper Media Group New Media Co., Ltd.
Address before: No. 289 Guangzhou Avenue, Yuexiu District, Guangzhou, Guangdong 510601
Patentee before: Nanfang Daily Group