CN100446032C - Self-organized mapping network based document clustering method - Google Patents

Self-organized mapping network based document clustering method

Info

Publication number
CN100446032C
CN100446032C CNB2006100097619A CN200610009761A
Authority
CN
China
Prior art keywords
self
neuron
output layer
mapping network
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100097619A
Other languages
Chinese (zh)
Other versions
CN1808474A (en)
Inventor
刘远超
关毅
徐志明
刘秉权
林磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CNB2006100097619A priority Critical patent/CN100446032C/en
Publication of CN1808474A publication Critical patent/CN1808474A/en
Application granted granted Critical
Publication of CN100446032C publication Critical patent/CN100446032C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2137 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a document clustering method, in particular to a document clustering method based on a self-organizing map network. The present invention overcomes problems that existing clustering methods based on self-organizing map networks find difficult to avoid: the lack of adaptation to the input document data, and the neuron underutilization, inaccurate network mapping and edge effects brought about by a fixed network structure. The method of the present invention comprises the steps of: finding all selected documents; initializing the output layer of the self-organizing map network as a ring structure; dividing the ring structure into at least two equal parts, each sector serving as one neuron; computing the R² clustering criterion coefficient of the current output layer; judging whether the clustering criterion coefficient R² is greater than a threshold μ, and if so, stopping the training of the self-organizing map network and classifying the selected documents according to the current self-organizing map network, and if not, finding the neuron with the largest within-class sum of squared deviations in the current output layer and interpolating a new neuron near it; and training all neurons of the current output layer.

Description

Document clustering method based on a self-organizing map network
Technical field
The present invention relates to a document clustering method.
Background technology
As an unsupervised machine learning method, clustering offers a high degree of automatic processing and has become an important means of organizing, summarizing and navigating textual information effectively. The purpose of document clustering is to uncover the structural information in a document collection by organizing it automatically, so that users can browse it conveniently and access information more efficiently. Its main applications include the automatic organization of digital library services, the grouping of search engine results, and the mining of user interests. Among the many document clustering methods, the Self-Organizing Map (SOM) proposed by T. Kohonen has attracted considerable attention from researchers. Document clustering is high-dimensional and semantically driven, and SOM can realize an order-preserving mapping of high-dimensional data onto a two-dimensional plane quite well. An order-preserving mapping means that documents with high mutual similarity tend to be mapped onto the same neuron of the SOM output layer or onto neurons adjacent to one another, so the visualization and navigation qualities of SOM clustering results are good. In addition, noise documents are easy to discover in the SOM output layer, which also gives the method strong noise resistance.
However, the network structure and the number of neurons of a SOM must be determined before training, so the network can hardly adapt to the input document data. The fixed structure of the SOM also brings problems such as neuron underutilization, inaccurate network mapping and edge effects: its rigid network structure can hardly reflect the structural information of the input data, which makes the method inflexible. In general, the number of output nodes is related to the number of pattern classes in the training samples. If there are more nodes than pattern classes, either the categories are divided too finely or dead nodes appear, i.e. nodes that never win during training and lie far from the winning nodes. If there are fewer nodes than pattern classes, the network cannot distinguish all the pattern classes, and training will merge similar pattern classes into one class. In practice the output layer of a SOM usually adopts a rectangular structure with as many nodes as possible, which makes neuron underutilization likely.
To obtain ideal results, the structure of the input data could be studied in advance, but this would compromise the unsupervised character of clustering. And in the vast majority of cases no prior knowledge allows the operator to choose a suitable network scale beforehand, which hampers the practical application of SOM. It is worth noting that some researchers have recognized this problem; a fairly typical approach is GHSOM. As shown in Figures 2 and 3, this model can expand the network by inserting rows or columns into the output layer, so as to reflect the topic structure of the input data adaptively. However, the rectangular structure it adopts easily makes the network scale expand too fast, so neuron underutilization occurs easily. Underutilization means that, because too many neurons are inserted, documents of the same class are mapped by several different neurons.
Summary of the invention
The invention provides a document clustering method based on a self-organizing map network, to overcome problems that existing SOM clustering methods find difficult to avoid: the lack of adaptation to the input document data, and the neuron underutilization, inaccurate network mapping and edge effects brought about by a fixed structure. The method of the present invention is realized by the following steps. One: use the query terms to find all selected documents within the scope specified by the searcher. Two: initialize the output layer of the self-organizing map network as a ring, divide the ring into at least two equal parts, and let each sector serve as one neuron. Three: input the selected documents, train the self-organizing map network, and compute the R² clustering criterion coefficient of the current output layer, expressed as

R² = 1 - P_c / T

where R² takes values in [0, 1], T is the total sum of squared deviations of all samples, and, the output layer having c neurons in total at time t, P_c = Σ_{k=1}^{c} S_k, with S_k the within-class sum of squared deviations of the samples mapped to neuron N_k. Four: judge whether the clustering criterion coefficient R² is greater than the threshold μ, with μ = 0.3. Five: if the result of step four is yes, stop the training of the self-organizing map network and classify the selected documents according to the output-layer neurons of the current self-organizing map network. Six: finish. Seven: if the result of step four is no, find the neuron with the largest within-class sum of squared deviations in the current output layer, insert a new neuron near it, initialize the weights of the ring output layer, and return to step three.
The method of the present invention adopts a closed ring output-layer structure. The advantage of this structure is that neurons can be added progressively, and the boundary-effect problems easily caused by rectangular and other structures can be overcome. In the output layer of the method each sector represents one neuron, as shown in Figure 4. Because the number of sectors can take any integer value, the ring can reflect the category distribution of the input document collection well. In addition, every sector in this model has the same number of adjacent neurons, which guarantees the symmetry of the structure and also avoids the edge-effect problem of the rectangular structure. When the output layer needs to be expanded, any number of neurons can be inserted, which helps to avoid neuron underutilization.
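As an illustration of this closed ring (a minimal Python sketch, not part of the patent text), the neurons can be kept in a circular list: every neuron then has exactly two neighbours whatever the ring size, and a new sector can be spliced in at any position.

    import numpy as np

    class RingLayer:
        """A closed ring of neurons; each sector holds one weight vector."""

        def __init__(self, dim, n_initial=2, rng=None):
            if rng is None:
                rng = np.random.default_rng(0)
            # start small: the decomposition strategy assumes at least two classes
            self.weights = [rng.random(dim) * 0.01 for _ in range(n_initial)]

        def neighbors(self, i):
            """The two ring neighbours of neuron i; indices wrap around."""
            n = len(self.weights)
            return (i - 1) % n, (i + 1) % n

        def insert(self, pos, vector):
            """Splice a new neuron into the ring at list position pos."""
            self.weights.insert(pos, vector)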
The method of the invention first initializes the network at a small scale, then dynamically adjusts the network structure under the guidance of a clustering criterion function, so as to truly reflect the topic distribution of the input documents. The decomposition strategy borrows the idea of top-down hierarchical clustering: assuming that all the documents can be divided into at least two classes, the output layer contains only two neurons at initialization. New neurons are subsequently grown near overutilized neurons, so as to refine the representation of the input data. The R² clustering criterion coefficient serves as the basis for judgment, seeking a balance between overutilization and underutilization of neurons, in order to determine an optimal network scale that truly reflects the structure of the input data. By evaluating the relation between neurons and documents, the clustering criterion function controls the network scale effectively and avoids unbounded growth.
The method of the present invention overcomes the neuron underutilization and overutilization that readily occur when the SOM model is traditionally used for document clustering, and its cluster F value is significantly higher than that of comparable methods.
Computation of the cluster F value: the overall quality of the document clustering is evaluated with the cluster F value. For a cluster category r produced by the clustering and an original predefined category s, recall and precision are defined respectively as:

recall(r, s) = n(r, s) / n_s    (5)

precision(r, s) = n(r, s) / n_r    (6)

where n(r, s) is the number of documents shared by cluster category r and predefined category s, n_r is the number of documents in cluster category r, and n_s is the number of documents in predefined category s. F(r, s) is then defined as

F(r, s) = (2 · recall(r, s) · precision(r, s)) / (precision(r, s) + recall(r, s))    (7)

and the overall evaluation function of the clustering result is

F = Σ_i (n_i / n) · max_j { F(i, j) }    (8)

where n is the number of input documents of the clustering and n_i denotes the number of documents in predefined category i.
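Computed directly from these definitions, the cluster F value needs only the cluster label and the predefined class label of each document. The Python sketch below is an illustration, not part of the patent text; it assumes two dicts, clusters and classes, mapping each document id to its cluster r and its class s.

    from collections import Counter

    def cluster_f_value(clusters, classes):
        n = len(classes)                      # n: number of input documents
        n_r = Counter(clusters.values())      # n_r: documents per cluster r
        n_s = Counter(classes.values())       # n_s: documents per class s
        n_rs = Counter((clusters[d], classes[d]) for d in classes)

        total = 0.0
        for s, ns in n_s.items():             # for each predefined class i ...
            best = 0.0
            for r, nr in n_r.items():         # ... take the best-matching cluster j
                n_common = n_rs.get((r, s), 0)
                if n_common == 0:
                    continue
                recall = n_common / ns                              # eq. (5)
                precision = n_common / nr                           # eq. (6)
                f = 2 * recall * precision / (precision + recall)   # eq. (7)
                best = max(best, f)
            total += (ns / n) * best          # eq. (8): weighted by class size
        return total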
Description of drawings
Fig. 1 is a schematic diagram of the method of the present invention; Fig. 2 is a schematic diagram of the rectangular structure adopted by the output layer of the existing GHSOM method; Fig. 3 is a schematic diagram of the GHSOM method inserting a new neuron; Fig. 4 is a schematic diagram of the ring structure adopted by the output layer of the method of the present invention; and Fig. 5 is a schematic diagram of the output layer of the method of the present invention inserting a new neuron.
Embodiment
The present embodiment is described in detail below with reference to Fig. 1 to Fig. 5. The method of the invention is realized by the following steps. One: use the query terms to find all selected documents within the scope specified by the searcher. Two: initialize the output layer of the self-organizing map network as a ring, divide the ring into at least two equal parts, and let each sector serve as one neuron. Three: input the selected documents, train the self-organizing map network, and compute the R² clustering criterion coefficient of the current output layer. Four: judge whether the clustering criterion coefficient R² is greater than the threshold μ. Five: if the result of step four is yes, stop the training of the self-organizing map network and classify the selected documents according to the output-layer neurons of the current self-organizing map network. Six: finish. Seven: if the result of step four is no, find the neuron with the largest within-class sum of squared deviations in the current output layer, insert a new neuron near it, initialize the weights of the ring output layer, and return to step three.
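To make steps one to seven concrete, the following Python sketch runs the grow-and-train loop with μ = 0.3. It is an illustration rather than the patented implementation: vectorize, train_step, r_squared and insert_position are helper functions sketched later in this description, and the training schedule (20 epochs per growth round, a learning rate decaying from 0.5) is an assumption, not taken from the patent.

    import numpy as np

    def cluster_documents(docs_tokens, mu=0.3, epochs=20):
        # steps one/two: vectorize the selected documents, start with a ring of 2
        X, _ = vectorize(docs_tokens)
        rng = np.random.default_rng(0)
        weights = [rng.random(X.shape[1]) * 0.01 for _ in range(2)]
        while True:
            eta, radius = 0.5, max(1, len(weights) // 2)   # assumed schedule
            for _ in range(epochs):                        # step three: training
                for doc in X:
                    train_step(weights, doc, eta, radius)
                eta *= 0.9                                 # both decay over time
                radius = max(0, radius - 1)
            assign = [int(np.argmin([np.linalg.norm(d - w) for w in weights]))
                      for d in X]
            if r_squared(X, assign, weights) > mu:         # step four
                return assign                              # step five: classify
            labels = np.array(assign)
            S = [float(((X[labels == k] - weights[k]) ** 2).sum())
                 for k in range(len(weights))]             # S_k of eq. (2)
            pos, w_new = insert_position(S, weights)       # step seven: grow
            weights.insert(pos, w_new)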
When the existing SOM method is used for document clustering, each neuron of the output layer is generally represented as a vector with the same dimensionality as the input documents, its weights initialized to small random numbers; the weight of an input document on each feature dimension depends on the frequency of occurrence of that feature dimension in the document. The feature dimensions generally consist of all the content words in the input document collection (meaningless stop words are filtered out) after feature selection. The purpose of feature selection is to keep only those words with strong power to separate the categories when constructing the clustering space. After sufficient training, the nodes of the SOM output layer become nerve cells sensitive to particular pattern classes, and the corresponding vectors become the center vectors of the input pattern classes, so the clustering effect can be achieved.
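A minimal sketch of this representation follows (an illustration, not from the patent; the tokenized input and the tiny stop-word list are stand-ins): each document becomes a term-frequency vector over the content words of the collection.

    import numpy as np
    from collections import Counter

    STOP_WORDS = {"the", "of", "and", "a", "to"}     # placeholder stop-word list

    def vectorize(docs_tokens, vocab=None):
        """docs_tokens: list of token lists, one per document."""
        if vocab is None:
            vocab = sorted({t for doc in docs_tokens for t in doc} - STOP_WORDS)
        index = {t: i for i, t in enumerate(vocab)}
        X = np.zeros((len(docs_tokens), len(vocab)))
        for row, doc in enumerate(docs_tokens):
            for term, freq in Counter(doc).items():
                if term in index:                    # content words only
                    X[row, index[term]] = freq       # weight = term frequency
        return X, vocab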
Self-organizing mapping has three main processes: competition, cooperation and synaptic adjustment. For each input document d_i, every neuron in the network computes its similarity to d_i. The neuron with the greatest similarity wins the competition and becomes the winning neuron. The winning neuron determines the topological neighborhood of excited neurons, which provides the basis for cooperation among adjacent neurons: only the winning neuron and the neurons in its neighborhood are entitled to adjust their weight vectors. The magnitude of the weight adjustment is controlled by the learning rate η(t), a parameter that decreases gradually as learning proceeds. The neighborhood radius r_j(t) also shrinks over time, so a large number of neurons have their weights adjusted when training begins, while at the end only the winner itself is adjusted.
The weights of a neuron are generally adjusted by the following formula:

n_j(t+1) = n_j(t) + η(t) · (d_i - n_j(t))  if n_j lies within the neighborhood r_j(t) of the winner, and n_j(t+1) = n_j(t) otherwise    (1)

where d_i - n_j(t) reflects the distance Dist(d_i, n_j(t)) between the document vector d_i and the neuron vector n_j(t), and n_j(t+1) and n_j(t) denote the weight vector of neuron n_j after and before the adjustment, respectively. η(t) is the learning-rate function and r_j(t) is the neighborhood function; both take relatively large initial values when network training begins and then decay gradually as training proceeds.
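A single competition-cooperation-adjustment step can be sketched as follows. This is an illustration, not the literal patented procedure: cosine similarity for the competition and a hard "bubble" neighbourhood on the ring are assumptions consistent with the description above.

    import numpy as np

    def train_step(weights, doc, eta, radius):
        """One update of eq. (1): find the winner, adjust it and its ring neighbours."""
        sims = [np.dot(w, doc) / (np.linalg.norm(w) * np.linalg.norm(doc) + 1e-12)
                for w in weights]
        winner = int(np.argmax(sims))                 # competition
        n = len(weights)
        for j in range(n):
            ring_dist = min(abs(j - winner), n - abs(j - winner))
            if ring_dist <= radius:                   # cooperation: neighbourhood only
                weights[j] = weights[j] + eta * (doc - weights[j])   # adjustment
        return winner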
Let |N_i(t)| denote the number of documents mapped to neuron N_i at time t, and let m_i be the vector corresponding to neuron N_i. The within-class sum of squared deviations of the samples mapped to N_i is then

S_i = Σ_{d_j → N_i} (d_j - m_i)^T (d_j - m_i)    (2)

The smaller S_i is, the "purer" the documents mapped to N_i are, and the more likely they are to come from the same topic.
At time t, suppose the output layer has c neurons, and define P_c = Σ_{k=1}^{c} S_k. Let T be the total sum of squared deviations of all samples:

T = Σ_{i=1}^{|D|} (d_i - x̄)^T (d_i - x̄)    (3)

where x̄ = (1/|D|) Σ_{i=1}^{|D|} d_i is the mean vector of all training samples and |D| is the number of input samples. Then

R² = 1 - P_c / T    (4)

The clustering criterion coefficient R² takes values in [0, 1], and its value generally increases monotonically as the network grows. A threshold μ is therefore needed to stop the growth of the network in good time and prevent neuron underutilization. If the value of R² is smaller than the threshold μ, a new neuron must be inserted near the neuron N_max with the largest within-class sum of squared deviations, so as to refine the representation of the input data. Concretely, the two neurons adjacent to N_max are examined; supposing that of the two the neuron N′ has the smaller within-class sum of squared deviations, a neuron N_new is inserted between N_max and N′, and the weight vector of N_new is initialized to the average of the vectors representing N_max and N′.
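Equations (2)-(4) and the insertion rule translate directly into code. In the sketch below (an illustration following the definitions above), X holds the document vectors as rows and assign[i] is the index of the neuron that document i is mapped to.

    import numpy as np

    def r_squared(X, assign, weights):
        """R² = 1 - P_c / T, per eqs. (2)-(4)."""
        x_bar = X.mean(axis=0)
        T = float(((X - x_bar) ** 2).sum())                 # eq. (3): total SS
        labels = np.array(assign)
        P_c = sum(float(((X[labels == k] - weights[k]) ** 2).sum())
                  for k in range(len(weights)))             # sum of S_k, eq. (2)
        return 1.0 - P_c / T                                # eq. (4)

    def insert_position(S, weights):
        """Insert N_new between N_max and its neighbour N' with the smaller S."""
        n = len(weights)
        k_max = int(np.argmax(S))                           # largest within-class SS
        left, right = (k_max - 1) % n, (k_max + 1) % n
        k_prime = left if S[left] <= S[right] else right    # the "purer" neighbour
        w_new = (weights[k_max] + weights[k_prime]) / 2.0   # average of the two
        pos = k_max if k_prime == left else right           # list index between them
        return pos, w_new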
The method of the present invention is applied as follows: the user submits query terms to a search engine; the search engine returns the results it finds by retrieval, and these returned documents serve as the input of the clustering method of the present invention. Through the clustering processing, the returned results are grouped into classes, which improves the visualization effect and thereby the retrieval efficiency.

Claims (1)

1. A document clustering method based on a self-organizing map network, characterized in that it is realized by the following steps: one, using query terms to find all selected documents within the scope specified by the searcher; two, initializing the output layer of the self-organizing map network as a ring, dividing the ring into at least two equal parts, each sector serving as one neuron; three, inputting the selected documents, training the self-organizing map network, and computing the R² clustering criterion coefficient of the current output layer, the clustering criterion coefficient being expressed as R² = 1 - P_c / T, where R² takes values in [0, 1], T is the total sum of squared deviations of all samples, and, the output layer having c neurons in total at time t, P_c = Σ_{k=1}^{c} S_k, where S_k denotes the within-class sum of squared deviations of the samples mapped to neuron N_k; four, judging whether the clustering criterion coefficient R² is greater than a threshold μ, with μ = 0.3; five, if the result of step four is yes, stopping the training of the self-organizing map network and classifying the selected documents according to the output-layer neurons of the current self-organizing map network; six, finishing; seven, if the result of step four is no, finding the neuron with the largest within-class sum of squared deviations in the current output layer, inserting a new neuron near it, initializing the weights of the ring output layer, and returning to step three.
CNB2006100097619A 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method Expired - Fee Related CN100446032C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100097619A CN100446032C (en) 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100097619A CN100446032C (en) 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method

Publications (2)

Publication Number Publication Date
CN1808474A CN1808474A (en) 2006-07-26
CN100446032C true CN100446032C (en) 2008-12-24

Family

ID=36840365

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100097619A Expired - Fee Related CN100446032C (en) 2006-03-02 2006-03-02 Self-organized mapping network based document clustering method

Country Status (1)

Country Link
CN (1) CN100446032C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103488662A (en) * 2013-04-01 2014-01-01 哈尔滨工业大学深圳研究生院 Clustering method and system of parallelized self-organizing mapping neural network based on graphic processing unit
CN104731811B (en) * 2013-12-20 2018-10-09 北京师范大学珠海分校 A kind of clustering information evolution analysis method towards extensive dynamic short text
CN108427967B (en) * 2018-03-13 2021-08-27 中国人民解放军战略支援部队信息工程大学 Real-time image clustering method


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5897627A (en) * 1997-05-20 1999-04-27 Motorola, Inc. Method of determining statistically meaningful rules
JP2004062482A (en) * 2002-07-29 2004-02-26 Fuji Xerox Co Ltd Data classifier

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于SOM神经网络的医疗诊断专家系统研究 (Research on a medical diagnosis expert system based on a SOM neural network). 于子翊, 茹蓓. Journal of Xinxiang Teachers College, Vol. 19, No. 5, 2005. *
自组织特征映射网络的分析与应用 (Analysis and application of self-organizing feature map networks). 程勖, 杨毅恒, 陈薇伶. Journal of Changchun Normal University (Natural Science Edition), Vol. 24, No. 4, 2005. *

Also Published As

Publication number Publication date
CN1808474A (en) 2006-07-26

Similar Documents

Publication Publication Date Title
CN110443281B (en) Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN107247938A (en) A kind of method of high-resolution remote sensing image City Building function classification
CN102201236B (en) Speaker recognition method combining Gaussian mixture model and quantum neural network
CN106096727A (en) A kind of network model based on machine learning building method and device
CN101196905A (en) Intelligent pattern searching method
CN103080979B (en) From the system and method for photo synthesis portrait sketch
CN110503721B (en) Fracture terrain keeping method based on weighted radial basis function interpolation
CN101908213B (en) SAR image change detection method based on quantum-inspired immune clone
CN103544526A (en) Improved particle swarm algorithm and application thereof
CN100446032C (en) Self-organized mapping network based document clustering method
CN116166960B (en) Big data characteristic cleaning method and system for neural network training
Mac Parthaláin et al. Fuzzy-rough set bireducts for data reduction
CN110197202A (en) A kind of local feature fine granularity algorithm of target detection
CN102789493A (en) Self-adaptive dual-harmony optimization method
CN109840558B (en) Self-adaptive clustering method based on density peak value-core fusion
CN109766945A (en) The complex network construction method combined is clustered with density peaks based on mapping
CN103123685A (en) Text mode recognition method
CN111782904B (en) Unbalanced data set processing method and system based on improved SMOTE algorithm
CN113759336B (en) Sea clutter suppression method under graph feature learning
Cao et al. An optimized chameleon algorithm based on local features
CN115496138A (en) Self-adaptive density peak value clustering method based on natural neighbors
CN108717551A (en) A kind of fuzzy hierarchy clustering method based on maximum membership degree
CN104036024A (en) Spatial clustering method based on GACUC (greedy agglomerate category utility clustering) and Delaunay triangulation network
Cui et al. Weighted particle swarm clustering algorithm for self-organizing maps
CN104156423B (en) Multiple dimensioned video key frame extracting method based on integer programming

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081224

Termination date: 20100302