CN100446032C - Self-organized mapping network based document clustering method - Google Patents

Self-organized mapping network based document clustering method

Info

Publication number
CN100446032C
CN100446032C CNB2006100097619A CN200610009761A
Authority
CN
China
Prior art keywords
self
neuron
output layer
mapping network
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100097619A
Other languages
Chinese (zh)
Other versions
CN1808474A (en)
Inventor
刘远超
关毅
徐志明
刘秉权
林磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CNB2006100097619A priority Critical patent/CN100446032C/en
Publication of CN1808474A publication Critical patent/CN1808474A/en
Application granted granted Critical
Publication of CN100446032C publication Critical patent/CN100446032C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2137 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a document clustering method, in particular to a document clustering method based on a self-organizing map network. The present invention overcomes problems that existing clustering methods based on self-organizing map networks find difficult to avoid: the lack of adaptation to the input document data, and the neuron underutilization, inaccurate network mapping and edge effects brought about by a fixed network structure. The method of the present invention comprises the steps of: finding all selected documents; initializing the output layer of the self-organizing map network as a ring structure; dividing the ring structure into at least two equal parts, each sector serving as one neuron; computing the R² clustering criterion coefficient of the current output layer; judging whether the clustering criterion coefficient R² is greater than a threshold μ, and if so, stopping the training of the self-organizing map network and classifying the selected documents according to the current self-organizing map network, and if not, finding the neuron with the largest within-class sum of squared deviations in the current output layer and interpolating a new neuron near it; and training all neurons of the current output layer.

Description

Document clustering method based on a self-organizing map network
Technical field
The present invention relates to a document clustering method.
Background technology
As an unsupervised machine learning method, clustering offers a high degree of automatic processing and has become an important means of organizing, summarizing and navigating textual information effectively. The purpose of document clustering is to uncover the structural information in a document collection by organizing it automatically, so that users can browse it conveniently and access information more efficiently. Its main applications include the automatic organization of digital library services, the grouping of search engine results, and the mining of user interests. Among the many document clustering methods, the Self-Organizing Map (SOM) proposed by T. Kohonen has attracted considerable attention from researchers. Document clustering is high-dimensional and semantically driven, and SOM can realize an order-preserving mapping of high-dimensional data onto a two-dimensional plane quite well. An order-preserving mapping means that documents with high mutual similarity tend to be mapped onto the same neuron of the SOM output layer or onto neurons adjacent to one another, so the visualization and navigation qualities of SOM clustering results are good. In addition, noise documents are easy to discover in the SOM output layer, which also gives the method strong noise resistance.
However, the network structure and the number of neurons of a SOM must be determined before training, so the network can hardly adapt to the input document data. The fixed structure of the SOM also brings problems such as neuron underutilization, inaccurate network mapping and edge effects: its rigid network structure can hardly reflect the structural information of the input data, which makes the method inflexible. In general, the number of output nodes is related to the number of pattern classes in the training samples. If there are more nodes than pattern classes, either the categories are divided too finely or dead nodes appear, i.e. nodes that never win during training and lie far from the winning nodes. If there are fewer nodes than pattern classes, the network cannot distinguish all the pattern classes, and training will merge similar pattern classes into one class. In practice the output layer of a SOM usually adopts a rectangular structure with as many nodes as possible, which makes neuron underutilization likely.
To obtain ideal results, the structure of the input data could be studied in advance, but this would compromise the unsupervised character of clustering. And in the vast majority of cases no prior knowledge allows the operator to choose a suitable network scale beforehand, which hampers the practical application of SOM. It is worth noting that some researchers have recognized this problem; a fairly typical approach is GHSOM. As shown in Figures 2 and 3, this model can expand the network by inserting rows or columns into the output layer, so as to reflect the topic structure of the input data adaptively. However, the rectangular structure it adopts easily makes the network scale expand too fast, so neuron underutilization occurs easily. Underutilization means that, because too many neurons are inserted, documents of the same class are mapped by several different neurons.
Summary of the invention
The invention provides a document clustering method based on a self-organizing map network, to overcome problems that existing SOM clustering methods find difficult to avoid: the lack of adaptation to the input document data, and the neuron underutilization, inaccurate network mapping and edge effects brought about by a fixed structure. The method of the present invention is realized by the following steps. One: use the query terms to find all selected documents within the scope specified by the searcher. Two: initialize the output layer of the self-organizing map network as a ring, divide the ring into at least two equal parts, and let each sector serve as one neuron. Three: input the selected documents, train the self-organizing map network, and compute the R² clustering criterion coefficient of the current output layer, expressed as

R² = 1 - P_c / T

where R² takes values in [0, 1], T is the total sum of squared deviations of all samples, and, the output layer having c neurons in total at time t, P_c = Σ_{k=1}^{c} S_k, with S_k the within-class sum of squared deviations of the samples mapped to neuron N_k. Four: judge whether the clustering criterion coefficient R² is greater than the threshold μ, with μ = 0.3. Five: if the result of step four is yes, stop the training of the self-organizing map network and classify the selected documents according to the output-layer neurons of the current self-organizing map network. Six: finish. Seven: if the result of step four is no, find the neuron with the largest within-class sum of squared deviations in the current output layer, insert a new neuron near it, initialize the weights of the ring output layer, and return to step three.
The method of the present invention adopts a closed ring output-layer structure. The advantage of this structure is that neurons can be added progressively, and the boundary-effect problems easily caused by rectangular and other structures can be overcome. In the output layer of the method each sector represents one neuron, as shown in Figure 4. Because the number of sectors can take any integer value, the ring can reflect the category distribution of the input document collection well. In addition, every sector in this model has the same number of adjacent neurons, which guarantees the symmetry of the structure and also avoids the edge-effect problem of the rectangular structure. When the output layer needs to be expanded, any number of neurons can be inserted, which helps to avoid neuron underutilization.
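As an illustration of this closed ring (a minimal Python sketch, not part of the patent text), the neurons can be kept in a circular list: every neuron then has exactly two neighbours whatever the ring size, and a new sector can be spliced in at any position.

    import numpy as np

    class RingLayer:
        """A closed ring of neurons; each sector holds one weight vector."""

        def __init__(self, dim, n_initial=2, rng=None):
            if rng is None:
                rng = np.random.default_rng(0)
            # start small: the decomposition strategy assumes at least two classes
            self.weights = [rng.random(dim) * 0.01 for _ in range(n_initial)]

        def neighbors(self, i):
            """The two ring neighbours of neuron i; indices wrap around."""
            n = len(self.weights)
            return (i - 1) % n, (i + 1) % n

        def insert(self, pos, vector):
            """Splice a new neuron into the ring at list position pos."""
            self.weights.insert(pos, vector)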
The method of the invention first initializes the network at a small scale, then dynamically adjusts the network structure under the guidance of a clustering criterion function, so as to truly reflect the topic distribution of the input documents. The decomposition strategy borrows the idea of top-down hierarchical clustering: assuming that all the documents can be divided into at least two classes, the output layer contains only two neurons at initialization. New neurons are subsequently grown near overutilized neurons, so as to refine the representation of the input data. The R² clustering criterion coefficient serves as the basis for judgment, seeking a balance between overutilization and underutilization of neurons, in order to determine an optimal network scale that truly reflects the structure of the input data. By evaluating the relation between neurons and documents, the clustering criterion function controls the network scale effectively and avoids unbounded growth.
The method of the present invention overcomes the neuron underutilization and overutilization that readily occur when the SOM model is traditionally used for document clustering, and its cluster F value is significantly higher than that of comparable methods.
Computation of the cluster F value: the overall quality of the document clustering is evaluated with the cluster F value. For a cluster category r produced by the clustering and an original predefined category s, recall and precision are defined respectively as:

recall(r, s) = n(r, s) / n_s    (5)

precision(r, s) = n(r, s) / n_r    (6)

where n(r, s) is the number of documents shared by cluster category r and predefined category s, n_r is the number of documents in cluster category r, and n_s is the number of documents in predefined category s. F(r, s) is then defined as

F(r, s) = (2 · recall(r, s) · precision(r, s)) / (precision(r, s) + recall(r, s))    (7)

and the overall evaluation function of the clustering result is

F = Σ_i (n_i / n) · max_j { F(i, j) }    (8)

where n is the number of input documents of the clustering and n_i denotes the number of documents in predefined category i.
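Computed directly from these definitions, the cluster F value needs only the cluster label and the predefined class label of each document. The Python sketch below is an illustration, not part of the patent text; it assumes two dicts, clusters and classes, mapping each document id to its cluster r and its class s.

    from collections import Counter

    def cluster_f_value(clusters, classes):
        n = len(classes)                      # n: number of input documents
        n_r = Counter(clusters.values())      # n_r: documents per cluster r
        n_s = Counter(classes.values())       # n_s: documents per class s
        n_rs = Counter((clusters[d], classes[d]) for d in classes)

        total = 0.0
        for s, ns in n_s.items():             # for each predefined class i ...
            best = 0.0
            for r, nr in n_r.items():         # ... take the best-matching cluster j
                n_common = n_rs.get((r, s), 0)
                if n_common == 0:
                    continue
                recall = n_common / ns                              # eq. (5)
                precision = n_common / nr                           # eq. (6)
                f = 2 * recall * precision / (precision + recall)   # eq. (7)
                best = max(best, f)
            total += (ns / n) * best          # eq. (8): weighted by class size
        return total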
Description of drawings
Fig. 1 is a schematic diagram of the method of the present invention; Fig. 2 is a schematic diagram of the rectangular structure adopted by the output layer of the existing GHSOM method; Fig. 3 is a schematic diagram of the GHSOM method inserting a new neuron; Fig. 4 is a schematic diagram of the ring structure adopted by the output layer of the method of the present invention; and Fig. 5 is a schematic diagram of the output layer of the method of the present invention inserting a new neuron.
Embodiment
The present embodiment is described in detail below with reference to Fig. 1 to Fig. 5. The method of the invention is realized by the following steps. One: use the query terms to find all selected documents within the scope specified by the searcher. Two: initialize the output layer of the self-organizing map network as a ring, divide the ring into at least two equal parts, and let each sector serve as one neuron. Three: input the selected documents, train the self-organizing map network, and compute the R² clustering criterion coefficient of the current output layer. Four: judge whether the clustering criterion coefficient R² is greater than the threshold μ. Five: if the result of step four is yes, stop the training of the self-organizing map network and classify the selected documents according to the output-layer neurons of the current self-organizing map network. Six: finish. Seven: if the result of step four is no, find the neuron with the largest within-class sum of squared deviations in the current output layer, insert a new neuron near it, initialize the weights of the ring output layer, and return to step three.
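To make steps one to seven concrete, the following Python sketch runs the grow-and-train loop with μ = 0.3. It is an illustration rather than the patented implementation: vectorize, train_step, r_squared and insert_position are helper functions sketched later in this description, and the training schedule (20 epochs per growth round, a learning rate decaying from 0.5) is an assumption, not taken from the patent.

    import numpy as np

    def cluster_documents(docs_tokens, mu=0.3, epochs=20):
        # steps one/two: vectorize the selected documents, start with a ring of 2
        X, _ = vectorize(docs_tokens)
        rng = np.random.default_rng(0)
        weights = [rng.random(X.shape[1]) * 0.01 for _ in range(2)]
        while True:
            eta, radius = 0.5, max(1, len(weights) // 2)   # assumed schedule
            for _ in range(epochs):                        # step three: training
                for doc in X:
                    train_step(weights, doc, eta, radius)
                eta *= 0.9                                 # both decay over time
                radius = max(0, radius - 1)
            assign = [int(np.argmin([np.linalg.norm(d - w) for w in weights]))
                      for d in X]
            if r_squared(X, assign, weights) > mu:         # step four
                return assign                              # step five: classify
            labels = np.array(assign)
            S = [float(((X[labels == k] - weights[k]) ** 2).sum())
                 for k in range(len(weights))]             # S_k of eq. (2)
            pos, w_new = insert_position(S, weights)       # step seven: grow
            weights.insert(pos, w_new)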
When the existing SOM method is used for document clustering, each neuron of the output layer is generally represented as a vector with the same dimensionality as the input documents, its weights initialized to small random numbers; the weight of an input document on each feature dimension depends on the frequency of occurrence of that feature dimension in the document. The feature dimensions generally consist of all the content words in the input document collection (meaningless stop words are filtered out) after feature selection. The purpose of feature selection is to keep only those words with strong power to separate the categories when constructing the clustering space. After sufficient training, the nodes of the SOM output layer become nerve cells sensitive to particular pattern classes, and the corresponding vectors become the center vectors of the input pattern classes, so the clustering effect can be achieved.
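A minimal sketch of this representation follows (an illustration, not from the patent; the tokenized input and the tiny stop-word list are stand-ins): each document becomes a term-frequency vector over the content words of the collection.

    import numpy as np
    from collections import Counter

    STOP_WORDS = {"the", "of", "and", "a", "to"}     # placeholder stop-word list

    def vectorize(docs_tokens, vocab=None):
        """docs_tokens: list of token lists, one per document."""
        if vocab is None:
            vocab = sorted({t for doc in docs_tokens for t in doc} - STOP_WORDS)
        index = {t: i for i, t in enumerate(vocab)}
        X = np.zeros((len(docs_tokens), len(vocab)))
        for row, doc in enumerate(docs_tokens):
            for term, freq in Counter(doc).items():
                if term in index:                    # content words only
                    X[row, index[term]] = freq       # weight = term frequency
        return X, vocab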
Self-organizing mapping has three main processes: competition, cooperation and synaptic adjustment. For each input document d_i, every neuron in the network computes its similarity to d_i. The neuron with the greatest similarity wins the competition and becomes the winning neuron. The winning neuron determines the topological neighborhood of excited neurons, which provides the basis for cooperation among adjacent neurons: only the winning neuron and the neurons in its neighborhood are entitled to adjust their weight vectors. The magnitude of the weight adjustment is controlled by the learning rate η(t), a parameter that decreases gradually as learning proceeds. The neighborhood radius r_j(t) also shrinks over time, so a large number of neurons have their weights adjusted when training begins, while at the end only the winner itself is adjusted.
The weights of a neuron are generally adjusted by the following formula:

n_j(t+1) = n_j(t) + η(t) · (d_i - n_j(t))  if n_j lies within the neighborhood r_j(t) of the winner, and n_j(t+1) = n_j(t) otherwise    (1)

where d_i - n_j(t) reflects the distance Dist(d_i, n_j(t)) between the document vector d_i and the neuron vector n_j(t), and n_j(t+1) and n_j(t) denote the weight vector of neuron n_j after and before the adjustment, respectively. η(t) is the learning-rate function and r_j(t) is the neighborhood function; both take relatively large initial values when network training begins and then decay gradually as training proceeds.
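A single competition-cooperation-adjustment step can be sketched as follows. This is an illustration, not the literal patented procedure: cosine similarity for the competition and a hard "bubble" neighbourhood on the ring are assumptions consistent with the description above.

    import numpy as np

    def train_step(weights, doc, eta, radius):
        """One update of eq. (1): find the winner, adjust it and its ring neighbours."""
        sims = [np.dot(w, doc) / (np.linalg.norm(w) * np.linalg.norm(doc) + 1e-12)
                for w in weights]
        winner = int(np.argmax(sims))                 # competition
        n = len(weights)
        for j in range(n):
            ring_dist = min(abs(j - winner), n - abs(j - winner))
            if ring_dist <= radius:                   # cooperation: neighbourhood only
                weights[j] = weights[j] + eta * (doc - weights[j])   # adjustment
        return winner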
Let |N_i(t)| denote the number of documents mapped to neuron N_i at time t, and let m_i be the vector corresponding to neuron N_i. The within-class sum of squared deviations of the samples mapped to N_i is then

S_i = Σ_{d_j → N_i} (d_j - m_i)^T (d_j - m_i)    (2)

The smaller S_i is, the "purer" the documents mapped to N_i are, and the more likely they are to come from the same topic.
At time t, suppose the output layer has c neurons, and define P_c = Σ_{k=1}^{c} S_k. Let T be the total sum of squared deviations of all samples:

T = Σ_{i=1}^{|D|} (d_i - x̄)^T (d_i - x̄)    (3)

where x̄ = (1/|D|) Σ_{i=1}^{|D|} d_i is the mean vector of all training samples and |D| is the number of input samples. Then

R² = 1 - P_c / T    (4)

The clustering criterion coefficient R² takes values in [0, 1], and its value generally increases monotonically as the network grows. A threshold μ is therefore needed to stop the growth of the network in good time and prevent neuron underutilization. If the value of R² is smaller than the threshold μ, a new neuron must be inserted near the neuron N_max with the largest within-class sum of squared deviations, so as to refine the representation of the input data. Concretely, the two neurons adjacent to N_max are examined; supposing that of the two the neuron N′ has the smaller within-class sum of squared deviations, a neuron N_new is inserted between N_max and N′, and the weight vector of N_new is initialized to the average of the vectors representing N_max and N′.
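Equations (2)-(4) and the insertion rule translate directly into code. In the sketch below (an illustration following the definitions above), X holds the document vectors as rows and assign[i] is the index of the neuron that document i is mapped to.

    import numpy as np

    def r_squared(X, assign, weights):
        """R² = 1 - P_c / T, per eqs. (2)-(4)."""
        x_bar = X.mean(axis=0)
        T = float(((X - x_bar) ** 2).sum())                 # eq. (3): total SS
        labels = np.array(assign)
        P_c = sum(float(((X[labels == k] - weights[k]) ** 2).sum())
                  for k in range(len(weights)))             # sum of S_k, eq. (2)
        return 1.0 - P_c / T                                # eq. (4)

    def insert_position(S, weights):
        """Insert N_new between N_max and its neighbour N' with the smaller S."""
        n = len(weights)
        k_max = int(np.argmax(S))                           # largest within-class SS
        left, right = (k_max - 1) % n, (k_max + 1) % n
        k_prime = left if S[left] <= S[right] else right    # the "purer" neighbour
        w_new = (weights[k_max] + weights[k_prime]) / 2.0   # average of the two
        pos = k_max if k_prime == left else right           # list index between them
        return pos, w_new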
The method of the present invention is applied as follows: the user submits query terms to a search engine; the search engine returns the results it finds by retrieval, and these returned documents serve as the input of the clustering method of the present invention. Through the clustering processing, the returned results are grouped into classes, which improves the visualization effect and thereby the retrieval efficiency.

Claims (1)

1. A document clustering method based on a self-organizing map network, characterized in that it is realized by the following steps: one, using query terms to find all selected documents within the scope specified by the searcher; two, initializing the output layer of the self-organizing map network as a ring, dividing the ring into at least two equal parts, each sector serving as one neuron; three, inputting the selected documents, training the self-organizing map network, and computing the R² clustering criterion coefficient of the current output layer, the clustering criterion coefficient being expressed as R² = 1 - P_c / T, where R² takes values in [0, 1], T is the total sum of squared deviations of all samples, and, the output layer having c neurons in total at time t, P_c = Σ_{k=1}^{c} S_k, where S_k denotes the within-class sum of squared deviations of the samples mapped to neuron N_k; four, judging whether the clustering criterion coefficient R² is greater than a threshold μ, with μ = 0.3; five, if the result of step four is yes, stopping the training of the self-organizing map network and classifying the selected documents according to the output-layer neurons of the current self-organizing map network; six, finishing; seven, if the result of step four is no, finding the neuron with the largest within-class sum of squared deviations in the current output layer, inserting a new neuron near it, initializing the weights of the ring output layer, and returning to step three.
CNB2006100097619A 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method Expired - Fee Related CN100446032C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100097619A CN100446032C (en) 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100097619A CN100446032C (en) 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method

Publications (2)

Publication Number Publication Date
CN1808474A CN1808474A (en) 2006-07-26
CN100446032C true CN100446032C (en) 2008-12-24

Family

ID=36840365

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100097619A Expired - Fee Related CN100446032C (en) 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method

Country Status (1)

Country Link
CN (1) CN100446032C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488662A (en) * 2013-04-01 2014-01-01 哈尔滨工业大学深圳研究生院 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
CN104731811B (en) * 2013-12-20 2018-10-09 北京师范大学珠海分校 A kind of clustering information evolution analysis method towards extensive dynamic short text
CN108427967B (en) * 2018-03-13 2021-08-27 中国人民解放军战略支援部队信息工程大学 Real-time image clustering method


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5897627A (en) * 1997-05-20 1999-04-27 Motorola, Inc. Method of determining statistically meaningful rules
JP2004062482A (en) * 2002-07-29 2004-02-26 Fuji Xerox Co Ltd Data classifier

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于SOM神经网络的医疗诊断专家系统研究 (Research on a medical diagnosis expert system based on a SOM neural network). 于子翊, 茹蓓. Journal of Xinxiang Teachers College, Vol. 19, No. 5, 2005. *
自组织特征映射网络的分析与应用 (Analysis and application of self-organizing feature map networks). 程勖, 杨毅恒, 陈薇伶. Journal of Changchun Normal University (Natural Science Edition), Vol. 24, No. 4, 2005. *

Also Published As

Publication number Publication date
CN1808474A (en) 2006-07-26

Similar Documents

Publication Publication Date Title
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN107247938A (en) A kind of method of high-resolution remote sensing image City Building function classification
CN102201236B (en) Speaker recognition method combining Gaussian mixture model and quantum neural network
CN106096727A (en) A kind of network model based on machine learning building method and device
CN101196905A (en) Intelligent pattern searching method
CN103080979B (en) From the system and method for photo synthesis portrait sketch
CN110503721B (en) Fracture terrain keeping method based on weighted radial basis function interpolation
CN101908213B (en) SAR image change detection method based on quantum-inspired immune clone
CN103544526A (en) Improved particle swarm algorithm and application thereof
CN100446032C (en) Self-organized mapping network based document clustering method
CN116166960B (en) Big data characteristic cleaning method and system for neural network training
Mac Parthaláin et al. Fuzzy-rough set bireducts for data reduction
CN110197202A (en) A kind of local feature fine granularity algorithm of target detection
CN102789493A (en) Self-adaptive dual-harmony optimization method
CN109840558B (en) Self-adaptive clustering method based on density peak value-core fusion
CN109766945A (en) The complex network construction method combined is clustered with density peaks based on mapping
CN103123685A (en) Text mode recognition method
CN111782904B (en) Unbalanced data set processing method and system based on improved SMOTE algorithm
CN113759336B (en) Sea clutter suppression method under graph feature learning
Cao et al. An optimized chameleon algorithm based on local features
CN115496138A (en) Self-adaptive density peak value clustering method based on natural neighbors
CN108717551A (en) A kind of fuzzy hierarchy clustering method based on maximum membership degree
CN104036024A (en) Spatial clustering method based on GACUC (greedy agglomerate category utility clustering) and Delaunay triangulation network
Cui et al. Weighted particle swarm clustering algorithm for self-organizing maps
CN104156423B (en) Multiple dimensioned video key frame extracting method based on integer programming

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081224

Termination date: 20100302