JP2008102737A - Stored document classification apparatus, stored document classification method, program, and recording medium

Info

Publication number: JP2008102737A
Application number: JP2006284659A
Authority: JP
Inventors: Yoshihide Sato; 吉秀佐藤; Harumi Kawashima; 晴美川島; Yuichiro Sekiguchi; 裕一郎関口; Hidenori Okuda; 英範奥田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-10-19
Filing date: 2006-10-19
Publication date: 2008-05-01
Anticipated expiration: 2026-10-19
Also published as: JP4807880B2

Abstract

PROBLEM TO BE SOLVED: To provide a stored document classification apparatus, a stored document classification method, a program, and a recording medium that take content bias in a document set into account and produce a document classification that is easy to survey as a whole, while reducing the user's burden. SOLUTION: When a document set is biased in content, for example when many documents relate to a specific topic and few relate to other topics, the cluster of documents related to the large topic is classified at a finer granularity than the clusters of other documents. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to clustering, which classifies a document set according to its content. More specifically, it relates to a technique that produces a document classification that is easy to survey as a whole when the document set is biased in content, for example when many documents concern a topic that attracts much attention and few documents concern topics that attract little attention, by classifying the documents related to the large topic at a finer granularity than the others.

Clustering methods commonly used to classify large amounts of data such as documents include the shortest distance method, the longest distance method, the group average method, Ward's method, and the k-means method. They fall broadly into hierarchical methods and partition-optimization methods (see, for example, Non-Patent Document 1).

Hierarchical methods are either bottom-up or top-down. A bottom-up method starts by treating each data item as its own cluster and repeatedly merges the two closest clusters. A top-down method starts from a single cluster containing all the data and repeatedly divides it. In either case, the resulting clusters form a tree-like hierarchy called a dendrogram: at the lowest level each data item forms its own cluster (the most finely divided state), and at the highest level all the data fall into one cluster (the most aggregated state). Choosing a level of the hierarchy yields a division into any desired number of clusters.

Partition-optimization methods, by contrast, take the number of clusters as given in advance and reassign individual data items among the clusters so as to optimize an evaluation function that expresses the quality of the partition.

In either kind of method, an inter-document distance expressing the dissimilarity of two documents is calculated from, for example, the number of occurrences of each word in the documents. The document set is then classified by merging documents that are close to each other and separating documents that are far apart.

When clustering is performed, the user obtains a classification at the desired granularity, coarse or fine, by specifying in advance either the number of clusters (for example, “divide the document set into three clusters”) or a distance threshold (for example, “merge only clusters whose inter-document distance is 0.9 or less”).

Non-Patent Document 1: Toshihiro Kamishima, “Clustering Techniques in the Data Mining Field (1): Let's Use Clustering!”, Journal of the Japanese Society for Artificial Intelligence, vol. 18, no. 1, pp. 59-65 (January 2003).

In the conventional techniques described above, however, clusters are formed according to document content. When many documents relate to a specific topic, an extremely large cluster (one containing an extremely large number of documents) is generated, while many comparatively tiny clusters (ones containing few documents) are generated as well. A large cluster arises because its documents are similar in content, but the more documents a cluster contains, the more the similarity between its documents varies: some pairs are highly similar and others somewhat less so. Subdividing such a cluster further therefore makes it easier to discover the various finer topics it contains.

However, if the number of clusters or the distance threshold is changed in order to subdivide a cluster, the change affects the entire document set. Clusters that were small to begin with are subdivided too, which makes it harder, not easier, to discover the various topics.

For a static document set, in which the number of documents does not change, this is not a serious problem: it can be solved by re-clustering only the cluster designated by the user and thereby subdividing it.

However, consider a document set that grows daily, such as news articles, and that is classified every day in order to discover topics. Whenever a major incident causes related articles to surge and an extremely large cluster is generated, the user must designate the cluster to be subdivided, which places a heavy burden on the user.

An object of the present invention is to provide a stored document classification apparatus, a stored document classification method, a program, and a recording medium that take content bias in a document set into account and achieve a document classification that is easy to survey as a whole with little burden on the user.

According to the present invention, when a document set is biased in content, for example when many documents relate to a specific topic and few relate to other topics, the cluster of documents related to the large topic is classified at a finer granularity than the clusters of other documents.

The present invention has the effect that a document classification that is easy to survey as a whole can be achieved with little burden on the user, taking content bias in the document set into account.

The best mode for carrying out the invention is described in the following embodiments.

FIG. 1 is a block diagram showing a stored document classification apparatus 100 according to Embodiment 1 of the present invention.

The stored document classification apparatus 100 is an embodiment that classifies documents that are added successively, such as news articles.

The stored document classification apparatus 100 includes a document recording unit 11, a document analysis unit 12, a document information management unit 13, an inter-document distance calculation unit 14, an inter-document distance recording unit 15, a clustering unit 16, a cluster recording unit 17, a document topic coefficient update unit 18, and a document topic coefficient recording unit 19.

FIG. 2 is a flowchart showing the operating principle of the stored document classification apparatus 100.

First, in S1, the distances between documents are calculated. In S2, for documents whose content is highly topical, the degree of topicality of the content is quantified as a document topic coefficient. In S3, the document topic coefficient of each document is treated like a repulsive force against other documents, and a biased inter-document distance is calculated by virtually enlarging the inter-document distance calculated in S1. Finally, in S4, documents that are close to each other are aggregated (clustered) on the basis of the biased inter-document distance.

FIG. 3 is a diagram showing an example of the data recorded in the document recording unit 11.

The document recording unit 11 records the document data to be analyzed as shown in FIG. 3. Each document is assigned a unique document ID such as “0001” or “0002”. When the creation time or collection time of a document can be obtained, that time is recorded as well, as time information attached to the document.

The document analysis unit 12 obtains the document ID, time information, and body text from the document recording unit 11 one document at a time, analyzes the body text, counts the words that appear in it and their numbers of occurrences, and records the result in the document information management unit 13 together with the document ID and time information.

FIG. 4 is a diagram showing an example of the data recorded in the document information management unit 13.

Text consists of various elements such as nouns, verbs, particles, interjections, and symbols. The example in FIG. 4 shows a case in which only nouns, such as “government” and “consumption tax”, are extracted from the body text.

Nouns are extracted from the body text with a document analysis technique such as morphological analysis. Morphological analysis segments a Japanese document into morphemes, the smallest units that make up the text, and assigns each morpheme a type such as “noun”, “verb”, “particle”, or “symbol”. The document analysis unit 12 keeps only the morphemes whose type is “noun”.
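The noun-filtering and counting step can be sketched as follows. This is only an illustration under stated assumptions: a morphological analyzer (not shown) is assumed to have already produced (surface form, type) pairs, and the function name `count_nouns` and the sample morphemes are hypothetical.

```python
from collections import Counter

def count_nouns(tagged_morphemes):
    """Count occurrences of morphemes whose type is "noun".

    tagged_morphemes: iterable of (surface, type) pairs, assumed to come
    from a morphological analyzer (the analyzer itself is not shown).
    """
    return Counter(surface for surface, kind in tagged_morphemes if kind == "noun")

# Hypothetical analyzer output for one short body text.
morphemes = [
    ("政府", "noun"),      # "government"
    ("は", "particle"),
    ("消費税", "noun"),    # "consumption tax"
    ("を", "particle"),
    ("上げる", "verb"),
    ("。", "symbol"),
    ("政府", "noun"),
]
word_counts = count_nouns(morphemes)
```

The resulting word-to-count mapping corresponds to the per-document record that the document analysis unit 12 stores in the document information management unit 13.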

For each document registered in the document information management unit 13, the inter-document distance calculation unit 14 obtains the words and their numbers of occurrences, calculates the distance between every pair of distinct documents by the method described below, and records it in the inter-document distance recording unit 15.

FIG. 5 is a diagram showing the inter-document distances recorded in the inter-document distance recording unit 15.

As shown in FIG. 5, the inter-document distance recording unit 15 records and manages the distance between each pair of distinct documents. The inter-document distance is calculated with the vector space model, which is known as a basic and accurate method. The vector space model is described in, for example, Kenji Kita, Kazuhiko Tsuda, and Masami Shishibori, “Information Retrieval Algorithms”, Kyoritsu Shuppan, pp. 60-63. In this model, a document is represented as a vector by quantifying the importance of each word that appears in it, and the cosine dissimilarity between the vectors of two different documents is taken as the distance between those documents. With this method, the distance between two documents with similar content is a small value, and the distance between two dissimilar documents is a large value.
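A minimal sketch of the cosine dissimilarity follows. The source says only that word importance is quantified; using the raw occurrence count as the importance weight is an assumption made here for brevity, as are the function name `cosine_distance` and the sample counts.

```python
import math

def cosine_distance(counts_a, counts_b):
    """Cosine dissimilarity (1 - cosine similarity) of two word-count maps.

    Assumption: the raw occurrence count serves as each word's weight.
    """
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: a document with no words is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical word counts for three documents.
doc_a = {"government": 3, "consumption tax": 2}
doc_b = {"government": 1, "consumption tax": 1, "election": 1}
doc_c = {"baseball": 4, "stadium": 1}

near = cosine_distance(doc_a, doc_b)  # shared vocabulary, small distance
far = cosine_distance(doc_a, doc_c)   # no shared words, distance 1.0
```

Because word counts are non-negative, the distance stays between 0 (same direction) and 1 (no words in common).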

The clustering unit 16 obtains the inter-document distances recorded in the inter-document distance recording unit 15 and performs clustering, which groups all the documents into sets of documents with similar content.

FIG. 6 is a diagram showing an example of the data recorded in the cluster recording unit 17.

The result of the processing by the clustering unit 16 is recorded in the cluster recording unit 17 as shown in FIG. 6. In the example of FIG. 6, each generated cluster is assigned a unique cluster ID such as “C001” or “C002”, and the document IDs of the documents belonging to each cluster are listed. The cluster recording unit 17 sends the list of document IDs in each cluster to the document topic coefficient update unit 18.

Next, the processing performed by the clustering unit 16 is described.

FIG. 7 is a flowchart showing the operation of the clustering unit 16 in the stored document classification apparatus 100.

The processing uses the information recorded in the document topic coefficient recording unit 19 in addition to the inter-document distances recorded in the inter-document distance recording unit 15.

FIG. 8 is a diagram showing an example of the data recorded in the document topic coefficient recording unit 19.

As shown in FIG. 8, the document topic coefficient recording unit 19 records each document topic coefficient, determined by the method described later, paired with its document ID.

In S11, all the distances between pairs of documents recorded in the inter-document distance recording unit 15 are obtained: “distance between 0001 and 0002 = 0.68”, “distance between 0001 and 0003 = 0.89”, and so on.

In S12, all pairs of document ID and document topic coefficient are obtained from the document topic coefficient recording unit 19 and matched against the inter-document distances obtained in S11 to calculate the biased distance between each pair of documents. For a first and a second document, the biased distance is the distance between them multiplied by the document topic coefficient of the first document. Alternatively, it may be regarded as the distance between the two documents multiplied by the larger of their two document topic coefficients.

Next, a concrete example of calculating the biased distance between the document with ID 0001 and the document with ID 0002 is described.

Suppose that the document topic coefficient of document 0001 is 1.2, that no document topic coefficient is recorded in the document topic coefficient recording unit 19 for document 0002, and that the distance between documents 0001 and 0002 is 0.68. The biased distance between these documents is then 0.68 multiplied by the document topic coefficient 1.2 of document 0001, that is, 0.816.

Documents 0001 and 0003 both have document topic coefficients recorded in the document topic coefficient recording unit 19. In this case, the larger of the two coefficients is used, because a larger coefficient makes the cluster concerned easier to split.

In Embodiment 1, both document topic coefficients are 1.2, so the biased distance between documents 0001 and 0003 is the distance between them, 0.89, multiplied by 1.2, that is, 1.068.

When neither document has a document topic coefficient, as with documents 0002 and 0004 in FIG. 8, the inter-document distance itself is used as the biased distance. The biased distances between all pairs of distinct documents are calculated in this way.
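The S12 calculation can be sketched as follows, using the variant that multiplies by the larger recorded coefficient and leaves the distance unchanged when neither document has one. The function name `biased_distances` and the dictionary layout are assumptions; the distance 0.75 for documents 0002 and 0004 is a made-up value, while the other numbers come from the worked examples in the text.

```python
def biased_distances(distances, coefficients):
    """Scale each inter-document distance by the larger recorded coefficient.

    distances:    {(id_a, id_b): distance}
    coefficients: {doc_id: document topic coefficient}; documents with no
                  recorded coefficient are simply absent.
    """
    result = {}
    for (id_a, id_b), distance in distances.items():
        recorded = [coefficients[d] for d in (id_a, id_b) if d in coefficients]
        # With no recorded coefficient, the distance is used as is.
        factor = max(recorded) if recorded else 1.0
        result[(id_a, id_b)] = distance * factor
    return result

distances = {
    ("0001", "0002"): 0.68,
    ("0001", "0003"): 0.89,
    ("0002", "0004"): 0.75,  # hypothetical value, not given in the text
}
coefficients = {"0001": 1.2, "0003": 1.2}
biased = biased_distances(distances, coefficients)
# biased[("0001", "0002")] is approximately 0.816
# biased[("0001", "0003")] is approximately 1.068
```

Floating-point multiplication may leave a tiny rounding error, so the results should be compared with a tolerance rather than for exact equality.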

In S13, clustering is performed using the biased distances. The clustering uses, for example, the technique called the longest distance method.

The longest distance method merges clusters that are close to each other, where the distance between a first cluster and a second cluster is taken to be the largest of the distances between a document in the first cluster and a document in the second cluster. Details of the longest distance method are given in Non-Patent Document 1 (Toshihiro Kamishima, “Clustering Techniques in the Data Mining Field (1): Let's Use Clustering!”, Journal of the Japanese Society for Artificial Intelligence, vol. 18, no. 1, pp. 59-65 (January 2003)).
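A small pure-Python sketch of the longest distance method is given below. The stopping rule is an assumption: merging stops once the closest pair of clusters is farther apart than a distance threshold, in the spirit of the threshold example in the background section. The function name `longest_distance_clustering` and the sample distances are hypothetical.

```python
from itertools import combinations

def longest_distance_clustering(doc_ids, dist, threshold):
    """Agglomerative clustering with the longest distance method.

    dist maps frozenset({a, b}) to the (biased) distance between a and b.
    Clusters are merged while the closest pair is within `threshold`.
    """
    clusters = [[d] for d in doc_ids]

    def cluster_distance(c1, c2):
        # Distance between clusters = farthest pair of documents across them.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

# Hypothetical distances: A and B are close, C and D are close.
dist = {
    frozenset(("A", "B")): 0.1, frozenset(("C", "D")): 0.2,
    frozenset(("A", "C")): 0.9, frozenset(("A", "D")): 0.9,
    frozenset(("B", "C")): 0.9, frozenset(("B", "D")): 0.9,
}
clusters = longest_distance_clustering(["A", "B", "C", "D"], dist, threshold=0.5)
```

This naive version recomputes every cluster distance on each pass, which is fine for a handful of documents but not for a large collection.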

The clustering produces, for example, a cluster consisting of documents 0001, 0003, 0022, 0025, and 0030, and likewise many other clusters, each consisting of one or more documents. Each generated cluster is assigned a cluster ID that identifies it, such as C001 or C002.

Finally, in S14, each cluster ID is paired with the list of document IDs belonging to that cluster and output to the cluster recording unit 17, as shown in FIG. 6.

Using the result of the clustering, the document topic coefficient update unit 18 calculates document topic coefficients, which indicate how topical the content of each document is, and updates the document topic coefficient of each document recorded in the document topic coefficient recording unit 19.

FIG. 9 is a flowchart showing the processing performed by the document topic coefficient update unit 18.

First, in S21, the list of document IDs contained in one cluster that has not yet been processed is obtained from the cluster recording unit 17, and the number of documents is counted.

For example, as shown in FIG. 6, obtaining the IDs of the documents in cluster C001, namely “0001”, “0003”, “0022”, “0025”, and “0030”, gives a document count of 5.

In S22, it is determined whether the number of documents counted in S21 is equal to or greater than a predetermined number (the set number). For example, if the set number is 10 documents, cluster C001 contains fewer than 10 documents (NO in S22), so steps S23 and S24 are skipped and processing proceeds to S25. In S25, it is determined whether all clusters have been processed; S21 to S24 are repeated until all clusters have been processed.

If it is determined in S22 that the number of documents is equal to or greater than the set number (YES in S22), the document topic coefficient of each document in the cluster is calculated in S23.

FIG. 10 is a diagram showing an example of how the document topic coefficient is determined.

For example, the document topic coefficient of every document in a cluster containing fewer than 10 documents is set to 1, and that of every document in a cluster containing 10 or more documents is set to 1.2. In other words, clusters with many documents are selected on the basis of the previous clustering result. A cluster is assumed to have many documents because those documents attract much attention, so the documents in such a cluster are given a larger document topic coefficient than the others.

Instead of assigning a constant value such as 1.2 when a cluster contains 10 or more documents, the document topic coefficient may be increased according to the number of documents.

In short, the document topic coefficient is a value used as a repulsive force: when the number of documents on a specific topic grows, it virtually enlarges the distances between those documents so that they are more readily subdivided. The document topic coefficient given to all documents in a cluster with 10 or more documents must therefore be greater than 1.
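The effect of a coefficient greater than 1 can be illustrated with a single pair of documents. This is a deliberately reduced sketch: the threshold 0.9 echoes the example in the background section, the distance 0.8 is a made-up value, and the function name `would_merge` is hypothetical.

```python
def would_merge(distance, threshold, coefficient=1.0):
    """Return True if two documents are still close enough to be merged
    after their distance is enlarged by the document topic coefficient."""
    return distance * coefficient <= threshold

# Without a coefficient the pair is merged; with 1.2 it is pushed apart.
before = would_merge(0.8, threshold=0.9)                  # 0.80 <= 0.9
after = would_merge(0.8, threshold=0.9, coefficient=1.2)  # 0.96 >  0.9
```

A coefficient of exactly 1 leaves every distance unchanged, which is why only values above 1 can cause a large cluster to be subdivided.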

In S24, the document topic coefficients recorded in the document topic coefficient recording unit 19 are updated on the basis of the coefficients obtained in S23. If a coefficient is already recorded for a document, the recorded value is multiplied by the newly obtained coefficient. For example, if the coefficient recorded for document 0018 is 1.2 and S23 again yields 1.2 for document 0018, their product, 1.44, is recorded in the document topic coefficient recording unit 19 as the updated coefficient of document 0018. If no coefficient is recorded for document 0018, the value 1.2 obtained in S23 is recorded as is.

When S21 to S24 have been performed for all clusters recorded in the cluster recording unit 17 (end of S25), the document topic coefficient update unit 18 ends its processing.
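Steps S21 to S25 can be sketched as a single loop. The set number 10 and the coefficient 1.2 are the example values from the text; the function name `update_topic_coefficients`, the dictionary layout, and the document IDs in cluster C002 are assumptions made for illustration.

```python
def update_topic_coefficients(clusters, recorded, set_number=10, coefficient=1.2):
    """Update document topic coefficients from one clustering result.

    clusters: {cluster_id: [doc_id, ...]}
    recorded: {doc_id: coefficient}, updated in place and returned.
    """
    for doc_ids in clusters.values():        # S21 / S25: visit every cluster
        if len(doc_ids) < set_number:        # S22: NO, skip small clusters
            continue
        for doc_id in doc_ids:               # S23: coefficient for each document
            if doc_id in recorded:           # S24: multiply into a recorded value
                recorded[doc_id] *= coefficient
            else:                            # S24: otherwise record it as is
                recorded[doc_id] = coefficient
    return recorded

clusters = {
    "C001": ["0001", "0003", "0022", "0025", "0030"],  # 5 documents
    "C002": [f"{n:04d}" for n in range(10, 20)],       # hypothetical, 10 documents
}
recorded = {"0018": 1.2}
update_topic_coefficients(clusters, recorded)
# recorded["0018"] is now approximately 1.44; "0010" is recorded as 1.2;
# the documents in C001 receive no coefficient.
```

Running the function again after each re-clustering makes the coefficients of persistently large clusters grow multiplicatively, as in the 1.2 to 1.44 example.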

In the embodiment above, when documents are added to the document recording unit 11, the series of processes is re-executed at some timing: after every document, after every few documents, every hour, every day, and so on. With this repetition, each time a cluster with many documents is generated, the document topic coefficients of the documents in that cluster are updated to successively larger values. Because the clustering unit 16 consults the document topic coefficient recording unit 19 every time it performs the processing shown in FIG. 7, the clustering always reflects the latest document topic coefficients.

FIG. 11 is a diagram showing how a cluster splits as the number of documents increases.

Suppose that clusters C001 to C003 exist, that C003 contains many documents (documents are shown as black dots in FIG. 11), and that S22 determines that its number of documents is at or above the set number. The documents in C003 are then given a document topic coefficient (1.2 in Embodiment 1), which acts as a repulsive force against other documents. At the second clustering, the distances between the documents in C003 are therefore enlarged, and so are the distances between the documents in C001 or C002 and those in C003. As a result, the documents in C003 split into two clusters, C003 and C004, in the second clustering.

If, for example, the number of documents highly similar to those in C004 continues to grow, the documents in C004 are given a still larger document topic coefficient and may split again in a later clustering.

FIG. 12 is a block diagram showing a stored document classification apparatus 200 according to Embodiment 2 of the present invention.

Reference numerals 11 to 19 denote the same units as in Embodiment 1.

The stored document classification apparatus 200 includes a document recording unit 11, a document analysis unit 12, a document information management unit 13, an inter-document distance calculation unit 14, an inter-document distance recording unit 15, a clustering unit 16, a cluster recording unit 17, a document topic coefficient update unit 18, a document topic coefficient recording unit 19, a word topic degree calculation unit 20, and a word topic degree recording unit 21.

The document topic coefficient update unit 18 obtains document IDs and word lists from the document information management unit 13. The word topic degree calculation unit 20 obtains the time information of each document recorded in the document information management unit 13 and the words appearing in that document, quantifies how topical each word is at the current time, and outputs the result to the word topic degree recording unit 21.

Next, the operation of the stored document classification apparatus 200 is described.

The procedure by which the document analysis unit 12 divides the documents recorded in the document recording unit 11 into words, counts their occurrences, and records them in the document information management unit 13, and the procedure by which the inter-document distance calculation unit 14 calculates the distances between the documents recorded in the document information management unit 13 and records them in the inter-document distance recording unit 15, are the same as in Embodiment 1.

FIG. 13 is a flowchart showing the operation of the word topic degree calculation unit 20.

In S31, the word topic degree calculation unit 20 deletes any content left in the word topic degree recording unit 21 by past processing. Past results are deleted so that word topic degrees are always calculated from the latest data.

Ｓ３２で、文書情報管理部１３に記録された文書のうちで、最新ｍ日以内の時刻情報を持つ１文書について、上記時刻情報と、単語と、その出現回数とを取得する。 In S32, the time information, the word, and the number of appearances are acquired for one document having time information within the latest m days among the documents recorded in the document information management unit 13.

続いて、Ｓ３２で取得した各単語について、最新ｍ日での総出現回数と、最新ｎ日（ｍ＞ｎとする）での総出現回数とをそれぞれ集計し、バッファに保持する（Ｓ３３）。 Subsequently, for each word acquired in S32, the total number of appearances over the latest m days and the total number of appearances over the latest n days (where m > n) are each tallied and held in a buffer (S33).

最新ｍ日以内の時刻情報を持つ文書の処理が終わるまで（Ｓ３４のＹＥＳ）、Ｓ３２とＳ３３との処理を繰り返す。ここまでの処理を終えると、「政府」や「消費税」等、各単語が出現する回数の合計値（総出現回数）が、最新ｍ日、最新ｎ日のそれぞれについて集計される。 Until the processing of the document having time information within the latest m days is completed (YES in S34), the processing in S32 and S33 is repeated. When the processing so far is completed, the total number of times each word appears, such as “government” and “consumption tax” (total number of appearances), is counted for each of the latest m days and the latest n days.

Ｓ３５で、各単語の最新ｎ日での総出現回数を、最新ｍ日での総出現回数で割って、単語話題度を算出する。ｍやｎの値を固定値としてもよいが、ここでは、ｍとして、入力文書のうちで最も古い文書と最も新しい文書との時刻情報の差分を与え、さらに、ｎ＝ｍ／４としてｎの値を決定とする。たとえば、最も古い文書が８月１日であり、最も新しい文書が８月２８日であったとすると、ｍ＝２８、ｎ＝２８／４＝７となる。したがって、最新７日での総出現回数と最新２８日（すなわち全期間）での総出現回数との比を得る。この比は、直近７日間における当該単語の話題の程度を示すものであり、着目している単語について、上記比の推移を見ると、上記着目している単語についての話題性の変化を認識することができる。 In S35, the word topic level is calculated by dividing each word's total number of appearances over the latest n days by its total number of appearances over the latest m days. The values of m and n may be fixed, but here m is set to the difference between the time information of the oldest and the newest documents among the input documents, and n is determined as n = m/4. For example, if the oldest document is dated August 1 and the newest is dated August 28, then m = 28 and n = 28/4 = 7. The result is therefore the ratio of the total number of appearances in the latest 7 days to the total number of appearances in the latest 28 days (that is, the entire period). This ratio indicates how much of a topic the word has been during the most recent 7 days, and tracking the ratio over time for a given word reveals how that word's topicality changes.

最後に、Ｓ３６で、単語話題度が一定値以上の単語についてのみ、単語と単語話題度とを、単語話題度記録部２１に記録する。 Finally, in S36, the word and its word topic level are recorded in the word topic level recording unit 21 only for words whose word topic level is at or above a certain value.
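The flow of S32 to S36 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `word_topic_levels`, the input layout (pairs of a date and a word-count mapping), and the half-open window convention (a document counts as within the latest k days if its date is later than `now` minus k days) are all assumptions made for the example. The default threshold of 0.3 mirrors the value used in the FIG. 14 example.

```python
from collections import Counter
from datetime import date, timedelta


def word_topic_levels(docs, now, m_days, n_days, threshold=0.3):
    """Return {word: topic level} for words at or above the threshold.

    docs: iterable of (doc_date, {word: count}) pairs (hypothetical layout).
    The window convention (now - k days, now] is an assumption.
    """
    if not 0 < n_days < m_days:
        raise ValueError("require 0 < n_days < m_days")
    m_start = now - timedelta(days=m_days)
    n_start = now - timedelta(days=n_days)
    total_m, total_n = Counter(), Counter()

    # S32 to S34: visit every document dated within the latest m days.
    for doc_date, counts in docs:
        if not (m_start < doc_date <= now):
            continue
        # S33: tally appearances for the m-day and n-day windows.
        for word, count in counts.items():
            total_m[word] += count
            if doc_date > n_start:
                total_n[word] += count

    # S35: ratio of n-day total to m-day total.
    # S36: keep only words whose ratio reaches the threshold.
    levels = {}
    for word, m_total in total_m.items():
        level = total_n[word] / m_total
        if level >= threshold:
            levels[word] = level
    return levels
```

With documents dated August 1 to August 28, `now = date(2006, 8, 28)`, `m_days = 28` and `n_days = 7`, a word that appears only in the final week gets a level of 1.0, while a word spread evenly over the four weeks gets 0.25 and is dropped.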

図１４は、実施例２における単語話題度記録２１におけるデータ例を示す図である。 FIG. 14 is a diagram showing an example of data in the word topic level recording unit 21 in the second embodiment.

図１４に示す例は、単語話題度が０．３未満の単語を無視し、０．３以上の単語についてのみ記録した例である。 The example shown in FIG. 14 is an example in which words having a word topic level of less than 0.3 are ignored and only words of 0.3 or more are recorded.

「今日」や「これ」のような一般的な単語が、時期によらず一定の頻度で出現すると仮定すると、７日間に１００回出現した単語は、２８日間では４００回出現することになり、単語話題度は１／４、すなわち０．２５になる。図１４に示す例では、単語話題度が０．３以上の単語のみを記録しているので、最近７日間で、以前に比べて出現回数が増加した単語が選択される。 Assuming that common words like “today” and “this” appear at a certain frequency regardless of time, a word that appears 100 times in 7 days will appear 400 times in 28 days, The word topic level is 1/4, that is, 0.25. In the example shown in FIG. 14, since only words having a word topic degree of 0.3 or more are recorded, words that have increased in the number of appearances over the last seven days are selected.
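The baseline in the preceding paragraph can be written out as a short calculation. It only restates the text under the same assumption of a constant appearance rate; the symbols r (appearances per day) and T (word topic level) are introduced here for illustration and do not appear in the original.

```latex
T = \frac{\text{appearances in the latest } n \text{ days}}
         {\text{appearances in the latest } m \text{ days}}
  = \frac{r\,n}{r\,m} = \frac{n}{m} = \frac{7}{28} = 0.25 < 0.3
```

A word therefore clears the 0.3 threshold only when its appearances are more concentrated in the latest n days than a uniform rate would produce.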

続いて、実施例２における文書話題係数更新部１８が行う処理について説明する。 Next, processing performed by the document topic coefficient update unit 18 according to the second embodiment will be described.

図１５は、実施例２における文書話題係数更新部１８が行う処理を示すフローチャートである。 FIG. 15 is a flowchart illustrating processing performed by the document topic coefficient update unit 18 according to the second embodiment.

まず、Ｓ４１で、文書話題係数記録部１９に記録されている内容があれば、それを全て削除する。このように、過去の文書話題係数を記録から削除するのは、常に最新の文書話題係数を記録するためである。Ｓ４２で、文書情報管理部１３中の１文書について、文書ＩＤと単語との一覧を取得する。Ｓ４３では、Ｓ４２で取得した単語を、単語話題度記録部２１に照会し、同一の単語が、単語話題度記録部２１中に存在するか否かを確認する。重複がある場合にのみ（Ｓ４３のＹＥＳ）、文書話題係数を算出するＳ４４に移る。 First, in S41, any content recorded in the document topic coefficient recording unit 19 is deleted in its entirety. Past document topic coefficients are deleted so that only the latest document topic coefficients are ever recorded. In S42, the document ID and the list of words are acquired for one document in the document information management unit 13. In S43, the words acquired in S42 are looked up in the word topic level recording unit 21 to check whether any of them is also present there. Only when there is a match (YES in S43) does the process move to S44, where the document topic coefficient is calculated.

Ｓ４４では、Ｓ４３で重複があると判定した文書について、文書話題係数を算出する。文書話題係数を決定する最も単純な方法は、文書話題係数を、全て１．２のような固定値にする方法である。すなわち、Ｓ４３で重複があると判定された文書の文書話題係数を、全て１．２とする方法が、最も単純な方法である。算出した文書話題係数を文書ＩＤとともに、文書話題係数記録部１９に、実施例１と同様に、図８に示すように記録する。 In S44, a document topic coefficient is calculated for each document found to have a match in S43. The simplest way to determine the coefficient is to use a fixed value such as 1.2 for all of them; that is, every document found to have a match in S43 is given a document topic coefficient of 1.2. The calculated document topic coefficient is recorded, together with the document ID, in the document topic coefficient recording unit 19 as shown in FIG. 8, in the same way as in the first embodiment.
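The fixed-value variant of S41 to S44 might look like the following Python sketch. The function name and the dictionary layouts are hypothetical stand-ins for the document information management unit 13, the word topic level recording unit 21, and the document topic coefficient recording unit 19; only the fixed value 1.2 comes from the text.

```python
def update_document_topic_coefficients(doc_words, hot_words, coefficient=1.2):
    """Return {doc_id: coefficient} for documents containing a hot word.

    doc_words: {doc_id: iterable of words} (hypothetical layout).
    hot_words: words currently held in the word topic level record.
    """
    hot = set(hot_words)
    # S41: start from an empty record, discarding any earlier results.
    coefficients = {}
    # S42: take each document's ID and word list in turn.
    for doc_id, words in doc_words.items():
        # S43: does the document share at least one word with the record?
        if hot.intersection(words):
            # S44: simplest rule, the same fixed coefficient for every match.
            coefficients[doc_id] = coefficient
    return coefficients
```

Documents without a matching word get no entry at all, which corresponds to the recording unit holding coefficients only for documents that contain highly topical words.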

なお、文書話題係数の決定方法が唯一ではなく、単語の盛り上がりの度合いを表わす数値である単語話題度を用い、単語話題度の大きい単語を含む文書ほど、もしくは単語話題度の大きい単語を数多く含む文書ほど、大きな文書話題係数を付与する方法もある。 Note that this is not the only way to determine the document topic coefficient. Using the word topic level, a numerical value expressing how strongly a word is surging in use, a larger document topic coefficient may instead be assigned to documents that contain words with a higher word topic level, or to documents that contain more words with a high word topic level.
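One way such a graded coefficient could be computed is sketched below in Python. The text gives no formula for this variant, so the linear rule `1.0 + scale * strength`, the `scale` parameter, and the `mode` switch are purely illustrative assumptions: `mode="max"` rewards the single most topical word, and `mode="sum"` rewards documents containing many topical words.

```python
def graded_topic_coefficient(words, topic_levels, scale=0.5, mode="max"):
    """Return a coefficient that grows with the topic levels of a document's words.

    words: iterable of words in one document.
    topic_levels: {word: word topic level} for recorded (topical) words.
    Returns None when the document contains no recorded word.
    The linear formula is an assumption, not taken from the source.
    """
    levels = [topic_levels[w] for w in set(words) if w in topic_levels]
    if not levels:
        return None  # no coefficient is recorded for this document
    if mode == "max":
        strength = max(levels)   # the most topical word decides
    elif mode == "sum":
        strength = sum(levels)   # many topical words add up
    else:
        raise ValueError("mode must be 'max' or 'sum'")
    return 1.0 + scale * strength
```

Either mode keeps the coefficient above 1.0 for any matching document, so the effect remains an expansion of distances, as with the fixed value of 1.2.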

Ｓ４２〜Ｓ４４の処理を、全ての文書について繰り返すと、処理が終了する（Ｓ４５の終了）。 When the processes of S42 to S44 are repeated for all documents, the process ends (end of S45).

図１５に示す処理によって、文書話題係数記録部１９には、単語話題度記録部２１に記録された単語（話題性の高い単語）を含む文書の文書話題係数のみが記録される。 With the processing shown in FIG. 15, the document topic coefficient recording unit 19 records only the document topic coefficient of the document including the word (word with high topicality) recorded in the word topic degree recording unit 21.

実施例１のように文書話題係数を乗算で更新するのではなく処理の度に、Ｓ４１で削除する理由は、単語話題度が現在時刻を基点として算出する値であり、時間経過後に算出する場合、その時刻を基点として、新規に算出し直した値を用いる必要があるためである。 The document topic coefficient is deleted in S41 on every run, rather than being updated by multiplication as in the first embodiment, because the word topic level is a value calculated relative to the current time; when it is calculated again after time has passed, a value newly recalculated relative to that later time must be used.

クラスタリング部１６が行う処理は、実施例１における処理と同様である。 The processing performed by the clustering unit 16 is the same as the processing in the first embodiment.

上記実施例によれば、大きなクラスタに含まれる文書に対して、他の文書との間に斥力を作用させることによって、他の文書との距離を仮想的に増加させ、文書が集中した高密度の領域のみを細かい粒度で、他の領域は通常の粒度で分類することができる。 According to the above-described embodiment, a repulsive force is exerted on a document included in a large cluster with another document to virtually increase the distance from the other document, and the high density where the documents are concentrated. Only the region can be classified with a fine granularity, and the other regions can be classified with a normal granularity.

また、上記実施例によれば、文書が逐次増加する状況において、最新の偏り状況に基づいて、文書間距離を自動的に拡大するので、適切な粒度で分類するために必要であった利用者の作業が不要になる。 Further, according to the above-described embodiment, since the distance between documents is automatically expanded based on the latest biased situation in a situation where the number of documents sequentially increases, the user required to classify with an appropriate granularity. Work is no longer necessary.

さらに、上記実施例によれば、変化しない文書間距離を記録し、時間経過と共に変化する文書話題度だけを更新するので、毎回文書間距離を計算する方法に比べ、処理が高速である。 Furthermore, according to the above embodiment, since the inter-document distance that does not change is recorded and only the document topic level that changes with the passage of time is updated, the processing is faster than the method of calculating the inter-document distance every time.
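The effects described above can be illustrated with a small Python sketch that keeps the base distances fixed and applies the current coefficients on top of them. The exact formula for the biased distance is defined in the first embodiment and is not reproduced in this section, so the multiplicative rule used here (base distance times both documents' coefficients, with 1.0 for documents that have no recorded coefficient) is an assumption. The threshold-based single-linkage grouping is likewise a simple stand-in for the clustering unit 16, not its actual algorithm.

```python
def biased_distances(distances, coefficients):
    """Expand stored base distances using the current topic coefficients.

    distances: {(doc_a, doc_b): base distance}, computed once and reused.
    coefficients: {doc_id: coefficient}; missing documents count as 1.0.
    The multiplicative rule is an assumption made for this sketch.
    """
    return {
        (a, b): d * coefficients.get(a, 1.0) * coefficients.get(b, 1.0)
        for (a, b), d in distances.items()
    }


def cluster(doc_ids, biased, threshold):
    """Group documents linked by a biased distance at or below the threshold."""
    parent = {d: d for d in doc_ids}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in biased.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())
```

Because only `coefficients` changes between runs, the stored `distances` never need to be recomputed; raising the coefficients of documents in a dense region pushes them apart and splits that region into finer clusters while other regions are left unchanged.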

上記実施例を方法の発明として把握することができる。つまり、上記実施例は、文書間の距離を算出し、記憶装置に記憶する文書間距離算出工程と、各文書に記述されている内容の話題性の大きさを数値化し、この数値化した文書話題係数を更新し、記憶装置に記憶する文書話題係数更新工程と、上記文書間の距離と、上記文書における文書話題係数とに基づいて、各文書間距離を仮想的に拡大した文書間偏重距離を算出し、記憶装置に記憶する文書間偏重距離算出工程と、上記文書間偏重距離に基づいて、近隣の文書同士を集約し、記憶装置に記憶するクラスタリング工程とを有する蓄積文書分類方法の例である。 The above embodiment can also be understood as an invention of a method. That is, the above embodiment is an example of an accumulated document classification method comprising: an inter-document distance calculation step of calculating the distances between documents and storing them in a storage device; a document topic coefficient update step of quantifying the topicality of the content described in each document, updating the resulting document topic coefficient, and storing it in the storage device; an inter-document biased distance calculation step of calculating, based on the distances between the documents and the document topic coefficients of the documents, inter-document biased distances in which each inter-document distance is virtually expanded, and storing them in the storage device; and a clustering step of aggregating neighboring documents based on the inter-document biased distances and storing the result in the storage device.

また、上記実施例をプログラムの発明として把握することができる。つまり、上記実施例は、文書間の距離を算出し、記憶装置に記憶する文書間距離算出手順と、各文書に記述されている内容の話題性の大きさを数値化し、この数値化した文書話題係数を更新し、記憶装置に記憶する文書話題係数更新手順と、上記文書間の距離と、上記文書における文書話題係数とに基づいて、各文書間距離を仮想的に拡大した文書間偏重距離を算出し、記憶装置に記憶する文書間偏重距離算出手順と、上記文書間偏重距離に基づいて、近隣の文書同士を集約し、記憶装置に記憶するクラスタリング手順とをコンピュータに実行させるプログラムの例である。 The above embodiment can also be understood as an invention of a program. That is, the above embodiment is an example of a program that causes a computer to execute: an inter-document distance calculation procedure of calculating the distances between documents and storing them in a storage device; a document topic coefficient update procedure of quantifying the topicality of the content described in each document, updating the resulting document topic coefficient, and storing it in the storage device; an inter-document biased distance calculation procedure of calculating, based on the distances between the documents and the document topic coefficients of the documents, inter-document biased distances in which each inter-document distance is virtually expanded, and storing them in the storage device; and a clustering procedure of aggregating neighboring documents based on the inter-document biased distances and storing the result in the storage device.

さらに、上記プログラムを、ＣＤ、ＤＶＤ、ＨＤ、半導体メモリ等の記録媒体に記録するようにしてもよい。
Furthermore, the program may be recorded on a recording medium such as a CD, DVD, HD, or semiconductor memory.

本発明の実施例１である蓄積文書分類装置１００を示すブロック図である。 FIG. 1 is a block diagram showing an accumulated document classification device 100 that is the first embodiment of the present invention.
蓄積文書分類装置１００の動作原理を示すフローチャートである。 FIG. 2 is a flowchart showing the operating principle of the accumulated document classification device 100.
文書記録部１１に記録されているデータ例を示す図である。 FIG. 3 is a diagram showing an example of data recorded in the document recording unit 11.
文書情報管理部１３に記録されているデータの例を示す図である。 FIG. 4 is a diagram showing an example of data recorded in the document information management unit 13.
文書間距離記録部１５に記録されている文書間の距離を示す図である。 FIG. 5 is a diagram showing the distances between documents recorded in the inter-document distance recording unit 15.
クラスタ記録部１７に記録されているデータ例を示す図である。 FIG. 6 is a diagram showing an example of data recorded in the cluster recording unit 17.
蓄積文書分類装置１００におけるクラスタリング部１６の動作を示すフローチャートである。 FIG. 7 is a flowchart showing the operation of the clustering unit 16 in the accumulated document classification device 100.
文書話題係数記録部１９に記録されているデータ例を示す図である。 FIG. 8 is a diagram showing an example of data recorded in the document topic coefficient recording unit 19.
文書話題係数更新部１８が行う処理を示すフローチャートである。 FIG. 9 is a flowchart showing the processing performed by the document topic coefficient update unit 18.
文書話題係数の決定方法の一例を示す図である。 FIG. 10 is a diagram showing an example of a method of determining the document topic coefficient.
文書増加によって、クラスタが***する様子を示す図である。 FIG. 11 is a diagram showing how a cluster is divided as the number of documents increases.
本発明の実施例２である蓄積文書分類装置２００を示すブロック図である。 FIG. 12 is a block diagram showing an accumulated document classification device 200 that is the second embodiment of the present invention.
単語話題度算出部２０における動作を示すフローチャートである。 FIG. 13 is a flowchart showing the operation of the word topic level calculation unit 20.
実施例２における単語話題度記録２１におけるデータ例を示す図である。 FIG. 14 is a diagram showing an example of data in the word topic level recording unit 21 in the second embodiment.
実施例２における文書話題係数更新部１８が行う処理を示すフローチャートである。 FIG. 15 is a flowchart showing the processing performed by the document topic coefficient update unit 18 in the second embodiment.

符号の説明 Explanation of symbols

１００…蓄積文書分類装置 100: accumulated document classification device
１１…文書記録部 11: document recording unit
１２…文書解析部 12: document analysis unit
１３…文書情報管理部 13: document information management unit
１４…文書間距離算出部 14: inter-document distance calculation unit
１５…文書間距離記録部 15: inter-document distance recording unit
１６…クラスタリング部 16: clustering unit
１７…クラスタ記録部 17: cluster recording unit
１８…文書話題係数更新部 18: document topic coefficient update unit
１９…文書話題係数記録部 19: document topic coefficient recording unit
２０…単語話題度算出部 20: word topic level calculation unit
２１…単語話題度記録部 21: word topic level recording unit

Claims

文書間の距離を算出する文書間距離算出手段と；
各文書に記述されている内容の話題性の大きさを数値化し、この数値化した文書話題係数を更新する文書話題係数更新手段と；
上記文書間の距離と、上記文書における文書話題係数とに基づいて、各文書間距離を仮想的に拡大した文書間偏重距離を算出する文書間偏重距離算出手段と；
上記文書間偏重距離に基づいて、近隣の文書同士を集約するクラスタリング手段と；
を有することを特徴とする蓄積文書分類装置。 An inter-document distance calculating means for calculating a distance between documents;
Document topic coefficient updating means for quantifying the topic level of the contents described in each document and updating the quantified document topic coefficient;
An inter-document biased distance calculating means for calculating an inter-document biased distance obtained by virtually expanding each inter-document distance based on the distance between the documents and the document topic coefficient in the document;
Clustering means for aggregating neighboring documents based on the inter-document biased distance;
An accumulated document classification device characterized by comprising:

請求項１において、
上記文書話題係数更新手段は、文書数が所定数よりも多いクラスタを選択し、この選択されたクラスタ以外のクラスタに含まれている文書の文書話題係数よりも大きな文書話題係数を、上記選択されたクラスタに含まれている文書に与える手段であることを特徴とする蓄積文書分類装置。
The accumulated document classification device according to claim 1, wherein the document topic coefficient updating means is means for selecting a cluster containing more documents than a predetermined number and giving the documents included in the selected cluster a document topic coefficient larger than that of documents included in clusters other than the selected cluster.

請求項１において、
上記文書話題係数更新手段は、入力文書集合中の単語の出現頻度の時間的変化を検出し、この検出の結果、以前よりも出現頻度が増加した単語を含む文書について、他の文書よりも大きな文書話題係数を与える手段であることを特徴とする蓄積文書分類装置。
The accumulated document classification device according to claim 1, wherein the document topic coefficient updating means is means for detecting temporal changes in the appearance frequency of words in the input document set and, as a result of this detection, giving documents that contain a word whose appearance frequency has increased a document topic coefficient larger than that of other documents.

文書間の距離を算出し、記憶装置に記憶する文書間距離算出工程と；
各文書に記述されている内容の話題性の大きさを数値化し、この数値化した文書話題係数を更新し、記憶装置に記憶する文書話題係数更新工程と；
上記文書間の距離と、上記文書における文書話題係数とに基づいて、各文書間距離を仮想的に拡大した文書間偏重距離を算出し、記憶装置に記憶する文書間偏重距離算出工程と；
上記文書間偏重距離に基づいて、近隣の文書同士を集約し、記憶装置に記憶するクラスタリング工程と；
を有することを特徴とする蓄積文書分類方法。 An inter-document distance calculating step of calculating a distance between documents and storing it in a storage device;
A document topic coefficient updating step of quantifying the degree of topicality of the contents described in each document, updating the digitized document topic coefficient, and storing it in a storage device;
An inter-document biased distance calculation step of calculating an inter-document biased distance obtained by virtually expanding each inter-document distance based on the distance between the documents and the document topic coefficient in the document, and storing it in a storage device;
A clustering step of aggregating neighboring documents based on the inter-document biased distance and storing them in a storage device;
An accumulated document classification method characterized by comprising:

文書間の距離を算出し、記憶装置に記憶する文書間距離算出手順と；
各文書に記述されている内容の話題性の大きさを数値化し、この数値化した文書話題係数を更新し、記憶装置に記憶する文書話題係数更新手順と；
上記文書間の距離と、上記文書における文書話題係数とに基づいて、各文書間距離を仮想的に拡大した文書間偏重距離を算出し、記憶装置に記憶する文書間偏重距離算出手順と；
上記文書間偏重距離に基づいて、近隣の文書同士を集約し、記憶装置に記憶するクラスタリング手順と；
をコンピュータに実行させるプログラム。 An inter-document distance calculation procedure for calculating a distance between documents and storing it in a storage device;
A document topic coefficient update procedure for quantifying the degree of topicality of the contents described in each document, updating the digitized document topic coefficient, and storing it in a storage device;
An inter-document biased distance calculation procedure for calculating an inter-document biased distance obtained by virtually expanding each inter-document distance based on the distance between the documents and the document topic coefficient in the document, and storing it in a storage device;
A clustering procedure for aggregating neighboring documents based on the inter-document biased distance and storing them in a storage device;
A program that causes a computer to execute.

文書間の距離を算出し、記憶装置に記憶する文書間距離算出手順と；
各文書に記述されている内容の話題性の大きさを数値化し、この数値化した文書話題係数を更新し、記憶装置に記憶する文書話題係数更新手順と；
上記文書間の距離と、上記文書における文書話題係数とに基づいて、各文書間距離を仮想的に拡大した文書間偏重距離を算出し、記憶装置に記憶する文書間偏重距離算出手順と；
上記文書間偏重距離に基づいて、近隣の文書同士を集約し、記憶装置に記憶するクラスタリング手順と；
をコンピュータに実行させるプログラムを記録したコンピュータ読み取り可能な記録媒体。 An inter-document distance calculation procedure for calculating a distance between documents and storing it in a storage device;
A document topic coefficient update procedure for quantifying the degree of topicality of the contents described in each document, updating the digitized document topic coefficient, and storing it in a storage device;
An inter-document biased distance calculation procedure for calculating an inter-document biased distance obtained by virtually expanding each inter-document distance based on the distance between the documents and the document topic coefficient in the document, and storing it in a storage device;
A clustering procedure for aggregating neighboring documents based on the inter-document biased distance and storing them in a storage device;
The computer-readable recording medium which recorded the program which makes a computer perform.