CN109670037A - K-means Text Clustering Method based on topic model and rough set - Google Patents

K-means Text Clustering Method based on topic model and rough set

Info

Publication number
CN109670037A
CN109670037A CN201811324306.7A CN 109670037 A
Authority
CN
China
Prior art keywords
theme
text
reduction
topic model
clustering method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811324306.7A
Other languages
Chinese (zh)
Inventor
谢珺
段利国
郝晓燕
梁凤梅
续欣莹
靳红伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201811324306.7A priority Critical patent/CN109670037A/en
Publication of CN109670037A publication Critical patent/CN109670037A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a K-means text clustering method based on a topic model and rough sets. To address the shortcomings of the K-means algorithm, an optimization method for the initial center points is proposed. An LDA topic model exploits the co-occurrence of terms at the document level to efficiently extract the semantic information in the text while transforming the word space into a topic space, achieving topic dimensionality reduction. Rough-set knowledge reduction theory is then applied to delete redundant topic features, which improves the efficiency of topic feature extraction, optimizes the initial center points, and improves the K-means text clustering result.

Description

K-means Text Clustering Method based on topic model and rough set
Technical field
The present invention relates to the field of text clustering, and in particular to a K-means text clustering method based on a topic model and rough sets.
Background technique
With the development and application of network technology, information resources have grown explosively, and research on text mining, information filtering, and information retrieval faces unprecedented opportunities. Clustering is therefore becoming a core technology of text information mining. Text clustering is an important technique in text mining for discovering the distribution of data and its implicit patterns. Clustering partitions similar data into different groups so that the elements within each cluster share common traits, usually judged by a defined distance metric. K-means is a classic partition-based clustering algorithm that is widely used because its principle is simple, it is easy to implement, and it converges quickly. However, the algorithm produces different clustering results for different initial values, easily falls into local minima, and is sensitive to outliers. To address these shortcomings, an optimization method for the initial center points is proposed: an LDA topic model exploits the co-occurrence of terms at the document level to efficiently extract the semantic information in the text while transforming the word space into a topic space, achieving topic dimensionality reduction; rough-set knowledge reduction theory is then applied to delete redundant topic features, improving the efficiency of topic feature extraction, optimizing the selection of the initial center points, and improving the K-means text clustering result.
Summary of the invention
The object of the present invention is to avoid the deficiencies of the prior art and provide a K-means text clustering method based on a topic model and rough sets.
The object of the present invention is achieved by the following technical measures. A K-means text clustering method based on a topic model and rough sets is designed, comprising the steps of: choosing a text set and vectorizing it so that the text set is expressed as a text-term matrix; modeling the text-term matrix with an LDA topic model, estimating the model parameters to obtain a document-topic matrix while generating low-dimensional topic features, where a low-dimensional topic feature indicates the topic probability of each word appearing in the text set; converting the document-topic matrix into a topic-term decision system and reducing the topic features with a neighborhood rough set, obtaining the topic reduction set according to the importance of each topic; performing value reduction on the topic reduction set to obtain the complete topic reduction set, thereby optimizing the selection of the initial center points; and performing K-means text clustering on the complete topic reduction set.
Wherein, the step of modeling the text-term matrix with the LDA topic model comprises the steps of: randomly drawing a topic from the topic set corresponding to a document in the document set, randomly drawing a word from the word set corresponding to the drawn topic, and repeating these operations until all words in the document have been traversed; the document set is thus modeled with probabilistic statistics, yielding two matrices, a text-topic matrix and a topic-word matrix, from which the latent semantic information of the text is mined.
Wherein, before the step of vectorizing the text set, the method further includes preprocessing the text set; the preprocessing at least includes jieba word segmentation and stop-word removal.
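The preprocessing step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the patent uses jieba segmentation for Chinese text, so the tokenizer is left as a pluggable callable (`jieba.lcut` in practice; a dependency-free whitespace split is used here as a stand-in), and the stop-word set is a hypothetical example.

```python
# Minimal preprocessing sketch: tokenize each document, then drop stop words.
# In the patent's setting, tokenize would be jieba.lcut for Chinese text.
def preprocess(docs, tokenize=str.split, stopwords=frozenset()):
    """Return a token list per document, with stop words removed."""
    return [[w for w in tokenize(doc) if w not in stopwords] for doc in docs]

docs = ["the cat sat", "the dog ran"]
tokens = preprocess(docs, stopwords={"the"})
# tokens == [["cat", "sat"], ["dog", "ran"]]
```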
Wherein, in the step of performing K-means text clustering on the complete topic reduction set, the K-means algorithm proceeds as follows, assuming the text set is divided into c classes:
Randomly select the initial centers of the c classes;
In the k-th iteration, compute the distance from each sample to each of the c class centers, and assign the sample to the class with the nearest center;
Update each class center, e.g. as the mean of its members;
Repeat the above steps to update all c cluster centers; if the center values no longer change, i.e. the objective function has converged, stop iterating.
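The loop above can be sketched in a few lines of NumPy. This is a baseline sketch under assumed toy data, not the patent's method (the patent's contribution is a better initialization that this loop would consume): random initial centers, nearest-center assignment, mean update, and stopping when the centers no longer change.

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    """Plain K-means: random init, assign to nearest center, mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(max_iter):
        # distance of every sample to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(c)])
        if np.allclose(new, centers):  # centers unchanged: objective converged
            break
        centers = new
    return labels, centers

# two tight pairs of points: each pair should land in one cluster
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centers = kmeans(X, 2)
```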
Wherein, in the step of reducing the topic features with the neighborhood rough set, the reduction of topic features includes topic reduction and topic-value reduction.
Wherein, in the step of obtaining the topic reduction set according to topic importance, the reduction computation judges whether each topic's importance is greater than zero, and topics whose importance is greater than zero are put into the reduction set.
Wherein, the method of computing topic importance is the method of computing attribute dependency. The specific steps are: compute the number of samples in the positive region under each topic subset, compute the difference in attribute dependency contributed by each topic from the computed positive regions, and thereby obtain the importance of each topic.
Wherein, after the step of performing K-means text clustering on the complete reduction set, the method further includes a cluster evaluation step.
Different from the prior art, the K-means text clustering method based on a topic model and rough sets of the present invention addresses the shortcomings of the K-means algorithm by proposing an optimization method for the initial center points: an LDA topic model exploits the co-occurrence of terms at the document level to efficiently extract the semantic information in the text while transforming the word space into a topic space, achieving topic dimensionality reduction; rough-set knowledge reduction theory is then applied to delete redundant topics, optimizing the selection of the initial center points and improving the K-means text clustering result.
Detailed description of the invention
Fig. 1 is a flow diagram of a K-means text clustering method based on a topic model and rough sets provided by the present invention;
Fig. 2 is a logic diagram of a K-means text clustering method based on a topic model and rough sets provided by the present invention;
Fig. 3 is a structural diagram of the text-topic matrix model in a K-means text clustering method based on a topic model and rough sets provided by the present invention.
Specific embodiment
The technical solution of the present invention is described in further detail below with reference to embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative labor shall fall within the scope of protection of the present invention.
Referring to Fig. 1 and Fig. 2: Fig. 1 is a flow diagram of a K-means text clustering method based on a topic model and rough sets provided by the present invention, and Fig. 2 is a logic diagram of the same method. The method includes the following steps:
S110: choose a text set and vectorize it so that the text set is expressed as a text-term matrix. The structure of the text-topic matrix model is shown in Fig. 3. As can be seen from Fig. 3, the LDA topic model forcibly assigns a topic to every word in the document set, so inactive topics may be retained, which distorts the topic distribution and causes the problem of topics being too broad.
S120: model the text-term matrix with the LDA topic model and estimate the model parameters to obtain a document-topic matrix while generating low-dimensional topic features; a low-dimensional topic feature indicates the topic probability of each word appearing in the text set.
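Step S120 can be sketched with scikit-learn's `LatentDirichletAllocation` — an assumption, since the patent does not name a library. The toy corpus and the topic count of 2 are hypothetical; `fit_transform` returns the document-topic matrix, each row a probability distribution over topics.

```python
# Sketch of S120 (assumed scikit-learn implementation, toy corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stock market trade price", "match team score goal",
        "market price stock", "team goal match"]
X = CountVectorizer().fit_transform(docs)           # text-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                    # document-topic matrix
# each row of doc_topic is a distribution over the 2 topics, summing to 1
```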
S130: convert the document-topic matrix into a topic-term decision system and reduce the topic features with a neighborhood rough set, obtaining the topic reduction set according to the importance of each topic.
The document-topic matrix is converted into a topic-term decision system TDS = (TU, TC ∪ D, V, f), and the topic features are reduced with a neighborhood rough set, where TU is the M articles containing N topics, i.e. the text-topic matrix, which serves as the universe; TC is the K topics, i.e. the condition attribute set; D is the text category, i.e. the decision attribute; V is the topic values; and f is an information function that assigns topic values to terms. For the k-th topic, f_k: TC → V_k, where V_k is the value domain of the topic.
From the obtained document-topic matrix, the topic features are reduced with the neighborhood rough set, including topic reduction and topic-value reduction, so as to optimize the initial center points. The number of samples in the positive region under each topic subset is computed; from the computed positive region POS_k(D), the difference in dependency contributed by each topic is computed, giving the importance SIG of each topic. A lower bound on the importance is then entered manually; EFC is the control parameter of this lower bound and takes a value close to zero. In the algorithm the topic with the greatest importance is always retained, which ensures that the core is not reduced away. It follows that the neighborhood rough set can be used to evaluate the importance of data for classification.
S140: perform topic-value reduction on the topic reduction set RED to obtain the complete topic reduction set RED'.
The present invention introduces a neighborhood rough set model to reduce redundant topic features and thereby optimize the initial center points. Rough set theory describes the problem to be handled as an information system: an information system DT = (U, C ∪ D, V, f) is called a decision system, where U is the sample set, also called the universe {x_1, x_2, ..., x_n}; A = C ∪ D is the attribute set, in which C is the condition attribute set, also called the feature set {a_1, a_2, ..., a_m}, describing the feature information of each sample, and D is the decision attribute set; f is the information function of the decision system, with f_a the information function of attribute a, and V is the value domain of f. For numerical data, the similarity between samples, and hence the neighborhood relation, is judged by computing the distance between samples.
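The neighborhood rough set machinery described above (distance-based neighborhoods, positive region, dependency γ, attribute significance SIG) can be sketched numerically. The neighborhood radius δ and the toy decision system are assumptions for illustration, not values from the patent.

```python
import numpy as np

def neighborhood(X, i, delta):
    """Indices of samples within distance delta of sample i."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= delta)[0]

def positive_region(X, y, delta):
    """Samples whose entire neighborhood shares a single decision label."""
    return [i for i in range(len(X))
            if len({y[j] for j in neighborhood(X, i, delta)}) == 1]

def dependency(X, y, delta):
    """gamma_C(D) = |POS_C(D)| / |U|."""
    return len(positive_region(X, y, delta)) / len(X)

def significance(X, y, delta, a):
    """SIG(a): dependency with all attributes minus dependency without a."""
    rest = np.delete(X, a, axis=1)
    return dependency(X, y, delta) - dependency(rest, y, delta)

# toy decision system: attribute 0 separates the classes, attribute 1 is noise
X = np.array([[0.0, 0.5], [0.1, 0.4], [1.0, 0.5], [1.1, 0.4]])
y = np.array([0, 0, 1, 1])
# dropping the informative attribute lowers the dependency, so SIG(0) > SIG(1)
```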
S150: perform K-means text clustering on the complete topic reduction set.
Wherein, the step of modeling the text-term matrix with the LDA topic model comprises the steps of: randomly drawing a topic from the topic set corresponding to a document in the document set, randomly drawing a word from the word set corresponding to the drawn topic, and repeating these operations until all words in the document have been traversed; the document set is thus modeled with probabilistic statistics, yielding two matrices, a text-topic matrix and a topic-word matrix, from which the latent semantic information of the text is mined.
Wherein, before the step of vectorizing the text set, the method further includes preprocessing the text set; the preprocessing at least includes jieba word segmentation and stop-word removal.
Wherein, in the step of performing K-means text clustering on the complete reduction set, the K-means algorithm proceeds as follows, assuming the text set is divided into c classes:
Randomly select the initial centers of the c classes;
In the k-th iteration, compute the distance from each sample to each of the c class centers, and assign the sample to the class with the nearest center;
Update each class center, e.g. as the mean of its members;
Repeat the above steps to update all c cluster centers; if the center values no longer change, i.e. the objective function has converged, stop iterating.
Wherein, in the step of reducing the topic features with the neighborhood rough set, the reduction of topic features includes topic reduction and topic-value reduction.
Wherein, in the step of obtaining the topic reduction set according to topic importance, the reduction computation judges whether each topic's importance is greater than zero, and topics whose importance is greater than zero are put into the reduction set.
Wherein, the method of computing topic importance is the method of computing attribute dependency. The specific steps are: compute the number of samples in the positive region under each topic subset, compute the difference in attribute dependency contributed by each topic from the computed positive regions, and obtain the importance of each topic.
The method of computing importance used by the present invention is the method of computing attribute dependency; the dependency of the classification categories D on the text topics TC is
γ_TC(D) = |POS_TC(D)| / |TU|
After attribute reduction, the topic decision system yields a relational decision table of a relative reduct, RED(B) = (TU_B, T_B ∪ D, V, f). In RED(B) the redundant topics have been reduced away; each item of RED(B) is then regarded as a decision rule d_X, with X ∈ TU_B and X matching the rule. On the basis of this topic rule set, topic-value reduction is carried out.
Wherein, after the step of performing K-means text clustering on the complete topic reduction set, the method further includes a cluster evaluation step.
Specifically, the clustering result is evaluated with the F value, the harmonic mean of precision and recall. Given a predefined class i and a cluster class j, the formulas are as follows:
Precision: P(i, j) = N_ij / N_j
Recall: R(i, j) = N_ij / N_i
where N_ij is the number of texts of predefined class i contained in cluster class j, N_j is the actual number of texts in cluster class j, and N_i is the number of texts that should be in predefined class i.
The overall criterion for the clustering result is as follows:
F = Σ_i (n_i / n) · max_j { 2 · P(i, j) · R(i, j) / (P(i, j) + R(i, j)) }
where n is the number of test texts and n_i is the number of texts in predefined class i. It can be seen that the larger the F value, the better the clustering result.
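The F-value evaluation above can be sketched as follows; the counts N_ij, N_i, N_j are taken directly from the true class labels and the predicted cluster labels.

```python
# Sketch of the clustering F value: per predefined class, take the best
# F(i, j) over clusters, then weight by class size n_i / n.
from collections import Counter

def f_value(true_labels, cluster_labels):
    n = len(true_labels)
    N_i = Counter(true_labels)                    # texts per predefined class
    N_j = Counter(cluster_labels)                 # texts per cluster
    N_ij = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for i, n_i in N_i.items():
        best = 0.0
        for j, n_j in N_j.items():
            nij = N_ij.get((i, j), 0)
            if nij:
                p, r = nij / n_j, nij / n_i       # precision, recall
                best = max(best, 2 * p * r / (p + r))
        total += (n_i / n) * best
    return total

# a perfect clustering (up to renaming of cluster labels) scores F = 1.0
assert f_value([0, 0, 1, 1], ["a", "a", "b", "b"]) == 1.0
```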
To verify the validity of the algorithm herein, three improved k-means text clustering algorithms and several different models were selected for a comparative clustering experiment. The chosen data set is the Fudan University test corpus; texts of ten categories including art, economy, and sport were selected, 2000 articles in total, 200 articles per category, each text 500 to 8000 characters long. As can be seen, the algorithm herein outperforms the other three clustering algorithms, and the combined application of the LDA topic model and rough sets shows a clear advantage in text clustering, verifying that the model's clustering performance is good. The comparison results are shown in the table below.
Method F value (%)
Original k-means 73.67
Rough set 79.54
LDA topic model 84.19
Algorithm 1 87.31
Algorithm 2 78.68
Algorithm 3 85.32
The method of the present invention 92.03
Different from the prior art, the K-means text clustering method based on a topic model and rough sets of the present invention addresses the shortcomings of the K-means algorithm by proposing an optimization method for the initial center points: an LDA topic model exploits the co-occurrence of terms at the document level to efficiently extract the semantic information in the text while transforming the word space into a topic space, achieving topic dimensionality reduction; rough-set knowledge reduction theory is then applied to delete redundant topic features, improving the efficiency of topic feature extraction, optimizing the initial center points, and improving the K-means text clustering result.
The above are only embodiments of the present invention and do not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (8)

1. A K-means text clustering method based on a topic model and rough sets, characterized by comprising:
choosing a text set and vectorizing it so that the text set is expressed as a text-term matrix;
modeling the text-term matrix with an LDA topic model and estimating the model parameters to obtain a document-topic matrix while generating low-dimensional topic features, wherein a low-dimensional topic feature indicates the topic probability of each word appearing in the text set;
converting the document-topic matrix into a topic-term decision system and reducing the topic features with a neighborhood rough set, obtaining the topic reduction set according to the importance of each topic;
performing topic-value reduction on the topic reduction set to obtain the complete topic reduction set;
performing K-means text clustering on the complete reduction set.
2. The K-means text clustering method based on a topic model and rough sets according to claim 1, characterized in that the step of modeling the text-term matrix with the LDA topic model comprises the steps of:
modeling the document set with probabilistic statistics to obtain two matrices, a text-topic matrix and a topic-word matrix, from which the latent semantic information of the text is mined;
randomly drawing a topic from the topic set corresponding to a document in the document set, randomly drawing a word from the word set corresponding to the drawn topic, and repeating these operations until all words in the document have been traversed.
3. The K-means text clustering method based on a topic model and rough sets according to claim 1, characterized by further comprising preprocessing the text set before the step of vectorizing the text set; wherein the preprocessing at least includes jieba word segmentation and stop-word removal.
4. The K-means text clustering method based on a topic model and rough sets according to claim 1, characterized in that, in the step of performing K-means text clustering on the complete topic reduction set, the K-means algorithm proceeds as follows, assuming the text set is divided into c classes:
randomly select the initial centers of the c classes;
in the k-th iteration, compute the distance from each text to each of the c class centers, and assign the sample to the class with the nearest center;
update each class center, e.g. as the mean of its members;
repeat the above steps to update all c cluster centers; if the center values no longer change, i.e. the objective function has converged, stop iterating.
5. The K-means text clustering method based on a topic model and rough sets according to claim 1, characterized in that, in the step of reducing the topic features with the neighborhood rough set, the reduction of topic features includes topic reduction and topic-value reduction.
6. The K-means text clustering method based on a topic model and rough sets according to claim 1, characterized in that, in the step of obtaining the topic reduction set according to topic importance, the reduction computation judges whether each topic's importance is greater than zero, and topics whose importance is greater than zero are put into the reduction set.
7. The K-means text clustering method based on a topic model and rough sets according to claim 6, characterized in that the method of computing topic importance is the method of computing attribute dependency, with the specific steps of: computing the number of samples in the positive region under each topic subset, computing the difference in dependency contributed by each topic from the computed positive regions, and obtaining the importance of each topic.
8. The K-means text clustering method based on a topic model and rough sets according to claim 1, characterized in that, after the step of performing K-means text clustering on the complete reduction set, the method further includes a cluster evaluation step.
CN201811324306.7A 2018-11-08 2018-11-08 K-means Text Clustering Method based on topic model and rough set Pending CN109670037A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811324306.7A CN109670037A (en) 2018-11-08 2018-11-08 K-means Text Clustering Method based on topic model and rough set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811324306.7A CN109670037A (en) 2018-11-08 2018-11-08 K-means Text Clustering Method based on topic model and rough set

Publications (1)

Publication Number Publication Date
CN109670037A true CN109670037A (en) 2019-04-23

Family

ID=66142065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811324306.7A Pending CN109670037A (en) 2018-11-08 2018-11-08 K-means Text Clustering Method based on topic model and rough set

Country Status (1)

Country Link
CN (1) CN109670037A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN111078852A (en) * 2019-12-09 2020-04-28 武汉大学 College leading-edge scientific research team detection system based on machine learning
CN111259110A (en) * 2020-01-13 2020-06-09 武汉大学 College patent personalized recommendation system
CN112800253A (en) * 2021-04-09 2021-05-14 腾讯科技(深圳)有限公司 Data clustering method, related device and storage medium
CN117520529A (en) * 2023-12-04 2024-02-06 四川三江数智科技有限公司 Text subject mining method for power battery

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870751A (en) * 2012-12-18 2014-06-18 ***通信集团山东有限公司 Method and system for intrusion detection
CN107085164A (en) * 2017-03-22 2017-08-22 清华大学 A kind of electric network fault type determines method and device
CN108197295A (en) * 2018-01-22 2018-06-22 重庆邮电大学 Application process of the attribute reduction based on more granularity attribute trees in text classification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103870751A (en) * 2012-12-18 2014-06-18 ***通信集团山东有限公司 Method and system for intrusion detection
CN107085164A (en) * 2017-03-22 2017-08-22 清华大学 A kind of electric network fault type determines method and device
CN108197295A (en) * 2018-01-22 2018-06-22 重庆邮电大学 Application process of the attribute reduction based on more granularity attribute trees in text classification

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HONGXIN WAN et al.: "An Algorithm of LDA Topic Reduction Based on Rough Set", Applied Mechanics and Materials *
六月麦茬: "An overview of rough sets, neighborhood rough sets and real-domain rough sets", HTTPS://BLOG.CSDN.NET/LIUYUEMAICHA/ARTICLE/DETAILS/52355787 *
王春龙 et al.: "Application of an improved K-means algorithm based on LDA in text clustering", Journal of Computer Applications (《计算机应用》) *
靳红伟 et al.: "Text topic feature extraction based on neighborhood rough sets", Science Technology and Engineering (《科学技术与工程》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598192A (en) * 2019-06-28 2019-12-20 太原理工大学 Text feature reduction method based on neighborhood rough set
CN111078852A (en) * 2019-12-09 2020-04-28 武汉大学 College leading-edge scientific research team detection system based on machine learning
CN111259110A (en) * 2020-01-13 2020-06-09 武汉大学 College patent personalized recommendation system
CN112800253A (en) * 2021-04-09 2021-05-14 腾讯科技(深圳)有限公司 Data clustering method, related device and storage medium
CN112800253B (en) * 2021-04-09 2021-07-06 腾讯科技(深圳)有限公司 Data clustering method, related device and storage medium
CN117520529A (en) * 2023-12-04 2024-02-06 四川三江数智科技有限公司 Text subject mining method for power battery

Similar Documents

Publication Publication Date Title
CN109670037A (en) K-means Text Clustering Method based on topic model and rough set
CN106383877B (en) Social media online short text clustering and topic detection method
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN102289522B (en) Method of intelligently classifying texts
CN106709754A (en) Power user grouping method based on text mining
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN105956015A (en) Service platform integration method based on big data
CN103150374A (en) Method and system for identifying abnormal microblog users
CN105631479A (en) Imbalance-learning-based depth convolution network image marking method and apparatus
CN105678607A (en) Order batching method based on improved K-Means algorithm
CN109284626A (en) Random forests algorithm towards difference secret protection
CN111860981B (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN103049581B (en) A kind of web text classification method based on consistance cluster
CN109657063A (en) A kind of processing method and storage medium of magnanimity environment-protection artificial reported event data
CN109635010A (en) A kind of user characteristics and characterization factor extract, querying method and system
CN111079427A (en) Junk mail identification method and system
Dan et al. Research of text categorization on Weka
CN111641608A (en) Abnormal user identification method and device, electronic equipment and storage medium
CN106845536A (en) A kind of parallel clustering method based on image scaling
CN111191099A (en) User activity type identification method based on social media
CN105046323A (en) Regularization-based RBF network multi-label classification method
Abinaya et al. Spam detection on social media platforms
CN105005792A (en) KNN algorithm based article translation method
CN110084376B (en) Method and device for automatically separating data into boxes
CN109828995B (en) Visual feature-based graph data detection method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190423