JP2008171336A

JP2008171336A - Document cluster processing apparatus, document cluster processing method, and program

Info

Publication number: JP2008171336A
Application number: JP2007006064A
Authority: JP
Inventors: Harumi Kawashima; 晴美川島; Yoshihide Sato; 吉秀佐藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-01-15
Filing date: 2007-01-15
Publication date: 2008-07-24

Abstract

PROBLEM TO BE SOLVED: To provide a document cluster processing apparatus, a document cluster processing method, and a program that can assign an evaluation value matching human intuition, the evaluation value indicating how cohesive a document cluster is.

SOLUTION: The document cluster processing apparatus has a splitting means that divides a document into words or character strings; a document analysis means that, for every word or character string appearing in the document cluster, assigns "1" to that word or character string for each document in the cluster in which it appears and "0" for each document in which it does not; a variance calculation means that computes, for each word or character string, its variance across the documents in the cluster; and an evaluation value calculation means that computes the average of the variances of all words or character strings appearing in the document cluster as the evaluation value of that cluster.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to document cluster processing that makes document information on a specific topic browsable in units of document clusters grouped by theme. In particular, it relates to document cluster processing that uses the skew in how words appear within a document cluster to calculate an evaluation value indicating how cohesive the cluster is, so that the outline of the document cluster can be grasped easily.

In recent years, with the development of computer networks such as the Internet, large amounts of digitized information are continually being published. Consequently, a person who wants information on a certain topic must browse, one by one, the web pages published by multiple information sources, which takes considerable effort.

Conventionally, in the fields of natural language processing and information retrieval, a technique is known in which a digitized text is represented by a vector of the words that appear in it, texts with similar word vectors are grouped into one document cluster, and the document clusters are displayed as a list (see, for example, Patent Document 1).

This prior art creates document clusters based on the similarity between documents, assigns a high evaluation value to a document that is highly similar to the other documents in its cluster and was created recently, and displays the document clusters in order, starting with those that contain documents with high evaluation values. Besides this invention, another widely used approach computes the average similarity of the document set in each cluster (the similarity of every pair of documents is computed, and the sum of these similarities is divided by the number of document pairs) and displays the document clusters in descending order of average similarity.

For the similarity between documents in these inventions, a widely used method represents each document as a document vector based on word occurrence frequencies and applies the cosine similarity between document vectors. That is, document d_n is represented by the document vector

$$\vec{d_n} = (x_{n1}, x_{n2}, \ldots, x_{nv})$$

where v is the total number of words in the word set W = {w_1, w_2, ..., w_v} and x_ni is the weight of word w_i in document d_n. The similarity between documents d_j and d_k is then expressed through the angle θ formed by their document vectors:

$$\mathrm{sim}(d_j, d_k) = \cos\theta = \frac{\sum_{i=1}^{v} x_{ji}\,x_{ki}}{\sqrt{\sum_{i=1}^{v} x_{ji}^{2}}\;\sqrt{\sum_{i=1}^{v} x_{ki}^{2}}}$$
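The cosine similarity described above can be sketched in a few lines of Python. This is only an illustration: the function name `cosine_similarity`, the example vectors, and the choice to return 0.0 for an all-zero vector are assumptions, not part of the publication.

```python
import math


def cosine_similarity(x_j, x_k):
    """Cosine of the angle between two equal-length document vectors."""
    dot = sum(a * b for a, b in zip(x_j, x_k))
    norm_j = math.sqrt(sum(a * a for a in x_j))
    norm_k = math.sqrt(sum(b * b for b in x_k))
    if norm_j == 0 or norm_k == 0:
        return 0.0  # assumed convention for a document with no weighted words
    return dot / (norm_j * norm_k)


# Hypothetical term-frequency vectors over a 4-word vocabulary.
d1 = [2, 1, 0, 0]
d2 = [4, 2, 0, 0]  # same direction as d1, so the similarity is close to 1
d3 = [0, 0, 3, 1]  # shares no words with d1, so the similarity is 0

print(cosine_similarity(d1, d2))
print(cosine_similarity(d1, d3))
```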

The weight of word w_i is either the occurrence frequency tf (term frequency) of the word in the document, used as is, or tf·idf (term frequency / inverse document frequency), which is tf multiplied by idf (the logarithm of the value obtained by dividing the number of word occurrences by the total number of documents). In other words, highly similar documents are documents whose word weights show similar tendencies.

Patent Document 1: JP 2006-120069 A

When a person looks at computer-generated document clusters to grasp their outline, human perception and the computed score can disagree: documents with a high average similarity may not feel similar to a person. Conversely, a document cluster that a person judges to be cohesive, having found the words that matter in it, may have a low average similarity.

That is, in the conventional approach described above, when multiple document clusters are presented at once in descending order of average similarity, the order may not match the order in which a person would rank their cohesiveness.

FIG. 10 is a diagram illustrating the cohesiveness of document clusters.

Document cluster 1 consists of three documents D1, D2, and D3, and the words w_1, w_2, ..., w_10 appear in the cluster. For simplicity, the weight of a word here is "1" if the word appears in the document and "0" if it does not.

Document cluster 2 consists of three documents D4, D5, and D6.

In document cluster 1, the three words w_1, w_2, and w_3 appear in every document in the cluster. Because three words appear in all of its documents, a person judges document cluster 1 to be a cohesive document cluster.

In document cluster 2, no word appears in every document in the cluster, but many words appear in two of the documents.

When presented with several documents per document cluster, a person first skims what each document says and picks out the important words. If important words appear in common across many of the documents, the person judges the cluster to be cohesive. Infrequent words that appear in only one document tend to be ignored.

For example, when browsing a set of documents about a specific hair care product, a person regards words related to the hair care product as more important, and judges the cluster to be cohesive if those words appear in many of the documents. By contrast, a common word such as "today" is not felt to be important even if it appears in many documents in the cluster, and it does not affect the perceived cohesiveness.

Computing the average similarity for these two document clusters gives the following.

Average similarity of document cluster 1 = 3
Average similarity of document cluster 2 = 3.33

That is, the average similarity of document cluster 2 is higher than that of document cluster 1.
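The two averages can be reproduced with a short Python sketch. FIG. 10 is not reproduced in this text, so the 0/1 matrices below are an assumed reconstruction chosen to agree with the description (cluster 1: three words in all documents and seven words in one document each; cluster 2: every word in exactly two documents). Using the unnormalized inner product as the pairwise similarity is also an assumption; it is what yields 3 and 3.33 for binary weights.

```python
from itertools import combinations

# Assumed reconstruction of FIG. 10: rows are documents, columns are w_1..w_10.
cluster_1 = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # D1
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # D2
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1],  # D3
]
cluster_2 = [
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # D4
    [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],  # D5
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # D6
]


def average_similarity(docs):
    """Sum of pairwise inner products divided by the number of document pairs."""
    pairs = list(combinations(docs, 2))
    total = sum(sum(a * b for a, b in zip(p, q)) for p, q in pairs)
    return total / len(pairs)


print(average_similarity(cluster_1))
print(round(average_similarity(cluster_2), 2))
```

Under these assumptions cluster 2 scores higher than cluster 1, even though only cluster 1 has words shared by all of its documents.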

A person judges document cluster 1 to be more cohesive than document cluster 2, yet the average similarity of document cluster 2 is higher than that of document cluster 1. The problem is therefore that the average similarity does not agree with the cohesiveness of document clusters as judged by a person.

An object of the present invention is to provide a document cluster processing apparatus, a document cluster processing method, and a program that can assign an evaluation value that represents the cohesiveness of a document cluster and matches human intuition.

The present invention is a document cluster processing apparatus that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once. The apparatus comprises: dividing means for dividing a document into words or character strings; document analysis means for assigning, for each of the words or character strings appearing in the document cluster and for each document in the cluster, "1" to the word or character string if it appears in that document and "0" if it does not; variance calculation means for calculating, for each word or character string, its variance across the documents in the document cluster; and evaluation value calculation means for calculating the average of the variances of all words or character strings appearing in the document cluster as the evaluation value of that document cluster.

According to the present invention, an evaluation value that represents the cohesiveness of a document cluster and matches human intuition can be assigned.

The best mode for carrying out the invention is given by the following embodiments.

FIG. 1 is a block diagram showing the configuration of a document cluster processing apparatus 20 according to Embodiment 1 of the present invention.

Embodiment 1 focuses on how each word in a document cluster is distributed over the documents in that cluster, and calculates as the evaluation value a measure of how many documents each word appears in, or does not appear in.

The document cluster processing apparatus 20 includes a document analysis unit 21, a word storage unit 22, an evaluation value calculation unit 23, and an evaluation value storage unit 24.

The document cluster processing apparatus 20 is connected to a document storage unit 11, a document cluster recording unit 12, and a document cluster display unit 13.

The document storage unit 11 stores documents and sends them to the document cluster processing apparatus 20. The document cluster display unit 13 displays document clusters based on the evaluation values output by the document cluster processing apparatus 20.

Based on the information recorded in the document cluster recording unit 12, the document analysis unit 21 acquires the text of each document from the document storage unit 11, extracts the words whose parts of speech have been specified in advance, and saves them in the word storage unit 22.

The word storage unit 22 holds the words extracted by the document analysis unit 21 and, for each extracted word, its weight in each document.

The evaluation value calculation unit 23 acquires the words and their weights from the word storage unit 22 and calculates the evaluation values.

The evaluation value storage unit 24 stores the evaluation value of each document cluster.

FIG. 2 is a diagram showing an example of the data stored in the document storage unit 11.

As shown in FIG. 2, the document storage unit 11 stores a title 112 and a text 113 for each document ID 111.

FIG. 3 is a diagram showing an example of the data recorded in the document cluster recording unit 12.

As shown in FIG. 3, the document cluster recording unit 12 records a document cluster ID 121, which uniquely identifies a document cluster, in association with a plurality of document IDs 122.

Next, the processing of the document analysis unit 21 will be described.

FIG. 4 is a flowchart showing the operation of the document analysis unit 21.

The document analysis unit 21 acquires all the document cluster IDs 121 recorded in the document cluster recording unit 12 (S10), selects one unprocessed document cluster ID (S11), and acquires the plurality of document IDs 122 associated with the selected document cluster ID (S12). It then acquires from the document storage unit 11 the text 113 corresponding to each document ID 111 among the acquired document IDs 122 (S13).

Next, processing such as morphological analysis or n-gram analysis is applied to the text to divide it into words or character strings. The following describes the case where the text is divided into morphemes by morphological analysis. The morphemes obtained by the division (hereinafter "words") are recorded in the word storage unit 22 (S14). At this point, part-of-speech information may be attached to each word and the parts of speech to be used may be determined in advance, so that only words with the specified parts of speech are recorded in the word storage unit 22.

Processing then returns to S11. If an unprocessed document cluster remains (YES in S11), steps S12 to S14 are executed for that cluster. Once steps S12 to S14 have been executed for all document clusters (NO in S11), the processing of the document analysis unit 21 is complete.
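The loop S10 to S14 can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionaries standing in for the document cluster recording unit 12, the document storage unit 11, and the word storage unit 22 are hypothetical, the sample texts are invented, and whitespace splitting is only a stand-in for the morphological analysis that the description calls for.

```python
def analyze_clusters(cluster_record, document_store, tokenize=str.split):
    """Collect the set of words of every document, grouped by cluster (S10-S14)."""
    word_store = {}
    for cluster_id, doc_ids in cluster_record.items():  # S10-S12
        for doc_id in doc_ids:
            text = document_store[doc_id]["text"]  # S13
            # S14: whitespace splitting stands in for morphological analysis.
            word_store[(cluster_id, doc_id)] = set(tokenize(text))
    return word_store


# Hypothetical contents of the recording and storage units.
cluster_record = {"C1": ["00001", "00024"]}
document_store = {
    "00001": {"title": "t1", "text": "hair care sample"},
    "00024": {"title": "t2", "text": "hair care package"},
}

word_store = analyze_clusters(cluster_record, document_store)
print(sorted(word_store[("C1", "00001")]))
```

A part-of-speech filter, as mentioned in the text, could be added inside the `tokenize` callable without changing the loop.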

FIG. 5 is a diagram showing an example of the data stored in the word storage unit 22.

As shown in FIG. 5, the word storage unit 22 holds the words extracted in S14 and, for each extracted word, its weight in each document.

In this example, the "word weight" is defined as follows: if a given word appears in a given document, the weight of that word for that document is "1"; if it does not appear, the weight is "0".

Alternatively, only the words that actually appear in each document may be recorded. In that case, when the evaluation value of the document cluster is calculated, any word not recorded for a document is treated as having weight 0 in that document. This keeps the required storage capacity small.
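The storage-saving variant can be illustrated with a small Python sketch in which each document records only the words it contains, and the 0/1 weights are expanded on demand. The function name `weights_for_cluster` and the sample words are assumptions made for this example.

```python
def weights_for_cluster(docs):
    """Expand per-document word sets into 0/1 weight lists per word.

    docs maps a document ID to the set of words recorded for it; a word
    that is not recorded for a document gets weight 0 in that document.
    """
    doc_ids = sorted(docs)
    vocabulary = sorted(set().union(*docs.values()))
    return {w: [1 if w in docs[d] else 0 for d in doc_ids] for w in vocabulary}


# Hypothetical sparse records: only the words that appear are stored.
docs = {
    "D1": {"shampoo", "sample"},
    "D2": {"shampoo"},
    "D3": {"shampoo", "today"},
}

weights = weights_for_cluster(docs)
print(weights)
```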

When the processing of the document analysis unit 21 has finished, the evaluation value calculation unit 23 acquires the words and their weights from the word storage unit 22, calculates an evaluation value for each document cluster, and records it in the evaluation value storage unit 24.

The evaluation value can be calculated using, for example, the following word variance, Equation (1).

$$\mathrm{Var}(w_j) = \frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - m\right)^{2} \qquad (1)$$

Here, x_ij is the weight of word w_j in document d_i, n is the total number of documents in the document cluster, and m is the mean weight of word w_j in the document cluster.
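As a minimal sketch, the word variance can be written as a function of the list of weights that one word has across the documents of a cluster. It divides by the number of documents n (a population variance), which matches the worked example later in the description; the function name `word_variance` is an assumption.

```python
def word_variance(weights):
    """Population variance of one word's weights across the cluster's documents."""
    n = len(weights)
    m = sum(weights) / n  # mean weight of the word in the cluster
    return sum((x - m) ** 2 for x in weights) / n


print(word_variance([1, 1, 1]))  # word present in every document
print(round(word_variance([1, 0, 0]), 4))  # word present in one of three documents
```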

Under this word variance, a word that appears in many of the documents and a word that appears in very few of them are both judged to be weakly scattered, and both yield a small word variance. For example, the variance of a word that appears in 4 of 5 documents equals the variance of a word that appears in 1 of 5 documents, and both are small. Therefore, the more words with small word variance a document cluster contains, the smaller the average word variance, and the more cohesive the cluster can be considered.
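The equality of the two variances follows directly from the 0/1 weights. The short derivation below is added for illustration; the symbols a (the number of documents containing the word) and p = a/n are introduced here and do not appear in the publication.

```latex
% A word appears in a of the n documents, so its mean weight is m = p = a/n.
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-m)^{2}
  &= \frac{1}{n}\Bigl[a\,(1-p)^{2} + (n-a)\,p^{2}\Bigr] \\
  &= p\,(1-p)^{2} + (1-p)\,p^{2} \\
  &= p\,(1-p).
\end{aligned}
% p = 4/5 gives 0.8 \times 0.2 = 0.16, and p = 1/5 gives 0.2 \times 0.8 = 0.16.
```

Because p(1 - p) is symmetric about p = 1/2 and largest there, words found in roughly half of the documents contribute the most to the evaluation value, while words found in nearly all or nearly none of the documents contribute the least.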

Accordingly, the average word variance of the words appearing in a document cluster is used as the evaluation value representing the cohesiveness of the cluster. The evaluation value of document cluster C_k can be obtained by Equation (2).

$$E(C_k) = \frac{1}{v_k}\sum_{j=1}^{v_k}\left\{\frac{1}{n_k}\sum_{i=1}^{n_k}\left(x_{kij} - m_{kj}\right)^{2}\right\} \qquad (2)$$

Here, document cluster C_k contains the documents d_ki (i = 1 to n_k), and x_kij is the weight of word w_j in document d_ki. n_k is the total number of documents in C_k, v_k is the total number of words in C_k, and m_kj is the mean weight of word w_j in C_k.
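The evaluation value can be sketched as a function over a document-by-word weight matrix. The layout `matrix[i][j]` (weight of word j in document i), the function name `cluster_evaluation`, and the tiny example matrix are assumptions chosen for illustration.

```python
def cluster_evaluation(matrix):
    """Average, over all words, of each word's variance across the documents.

    matrix[i][j] is the weight of word j in document i of one cluster.
    """
    n_k = len(matrix)  # number of documents in the cluster
    v_k = len(matrix[0])  # number of words in the cluster
    total = 0.0
    for j in range(v_k):
        column = [matrix[i][j] for i in range(n_k)]
        m_kj = sum(column) / n_k  # mean weight of word j
        total += sum((x - m_kj) ** 2 for x in column) / n_k
    return total / v_k


# Two documents, two words: word 0 is in both documents, word 1 only in the first.
print(cluster_evaluation([[1, 1], [1, 0]]))
```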

As a concrete case, in the example shown in FIG. 10, the word variance is calculated with Equation (1) for each of the words w_1, w_2, ..., w_10. First, the word variance of word w_1 in document cluster 1 is obtained.

Mean weight of word w_1 = (1 + 1 + 1) / 3 = 1
Word variance of w_1 = ((1 - 1)^2 + (1 - 1)^2 + (1 - 1)^2) / 3 = 0

In document cluster 1, word w_1 appears in every document, so its word variance is small; its value is 0.

In the same way, the word variance of each of the words w_1, w_2, ..., w_10 in document cluster 1 is obtained with Equation (1), and the evaluation value of document cluster 1, the average of these word variances, is calculated with Equation (2).

Evaluation value of document cluster 1 = (0 + 0 + 0 + 0.2222 + 0.2222 + ... + 0.2222) / 10 = 0.1556

For document cluster 2, the word variance of each word is obtained likewise, and the evaluation value of document cluster 2, the average of these word variances, is calculated.

Evaluation value of document cluster 2 = (0.2222 + 0.2222 + ... + 0.2222) / 10 = 0.2222

As a result, the evaluation value of document cluster 1 is smaller than that of document cluster 2, so document cluster 1 can be judged to be the more cohesive document cluster.
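The two evaluation values can be checked numerically. FIG. 10 is not reproduced in this text, so the 0/1 matrices below are an assumed reconstruction that agrees with the description (cluster 1: three words in all documents, seven words in one document each; cluster 2: all ten words with variance 0.2222, here each placed in exactly two documents). The sketch uses `statistics.pvariance` from the Python standard library for the per-word population variance.

```python
from statistics import mean, pvariance

# Assumed reconstruction of FIG. 10: rows are documents, columns are w_1..w_10.
cluster_1 = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # D1
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 0],  # D2
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1],  # D3
]
cluster_2 = [
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # D4
    [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],  # D5
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # D6
]


def evaluation_value(cluster):
    """Mean of the per-word population variances (columns of the matrix)."""
    return mean([pvariance(column) for column in zip(*cluster)])


print(round(evaluation_value(cluster_1), 4))
print(round(evaluation_value(cluster_2), 4))
```

With these assumed matrices the ranking by evaluation value puts cluster 1 first, the opposite of the ranking by average similarity.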

FIG. 6 is a diagram showing an example of the data stored in the evaluation value storage unit 24.

The evaluation value 242 calculated for each of the document clusters C1, C2, C3, ... is recorded in the evaluation value storage unit 24.

The document cluster display unit 13 acquires the data from the evaluation value storage unit 24 and displays the document clusters in order of cohesiveness, that is, in ascending order of evaluation value 242. The document cluster display unit 13 can also sort the clusters by the number of documents they contain, in descending or ascending order, and may allow switching among several sort orders. Furthermore, when presenting document clusters to the user, the document cluster display unit 13 extracts the title, an excerpt, and the like of each document from the document storage unit 11 and presents them.
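The display ordering amounts to a sort on the stored values. In the sketch below, the dictionaries are hypothetical stand-ins for the evaluation value storage unit 24 and the document cluster recording unit 12; the values for C1 and C2 are taken from the worked example, while C3 and all document counts are invented for illustration.

```python
# Hypothetical stored evaluation values (C3 is invented).
evaluation_store = {"C1": 0.1556, "C2": 0.2222, "C3": 0.0900}
# Hypothetical number of documents per cluster.
document_counts = {"C1": 3, "C2": 3, "C3": 5}

# Most cohesive first: ascending evaluation value.
by_cohesiveness = sorted(evaluation_store, key=evaluation_store.get)

# Alternative order: largest clusters first, ties broken by evaluation value.
by_size = sorted(
    evaluation_store,
    key=lambda c: (-document_counts[c], evaluation_store[c]),
)

print(by_cohesiveness)
print(by_size)
```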

In Embodiment 1, when the document analysis unit 21 performs the text analysis with n-gram analysis instead of morphological analysis, the text is divided into a plurality of character strings of n characters each. The resulting character strings are treated in the same way as words and recorded in the word storage unit 22. Therefore, even when the text analysis is performed with n-gram analysis, the evaluation value calculation unit 23 can calculate the evaluation value of a document cluster exactly as in the description above that uses morphological analysis.
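A character n-gram split can be sketched as below. The description only says that the text is divided into strings of n characters; the overlapping sliding window used here is the common reading of "n-gram" and is an assumption, as is the function name `char_ngrams`.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams of text (sliding window of width n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The resulting strings are then treated exactly like words.
print(char_ngrams("ヘアケア", 2))
```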

In Embodiment 1, the evaluation value of a document cluster is calculated with the weight of each word set to "1" if the word appears in a document and "0" if it does not. This makes it possible to assign a smaller evaluation value the more words there are that commonly appear (or commonly do not appear) across the documents in the cluster. In other words, the evaluation value of a cluster also becomes small when it contains many words that appear in only a very few documents.

Alternatively, the number of times a word occurs in a document (its occurrence count) may be used as the word weight. In that case, a smaller evaluation value is assigned to a document cluster the more words it contains whose occurrence counts vary little across documents.
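The effect of using occurrence counts as weights can be seen in a two-line comparison. The counts below are invented for illustration, and `statistics.pvariance` from the standard library plays the role of the word variance.

```python
from statistics import pvariance

# Hypothetical occurrence counts of two words across three documents.
steady_word = [2, 2, 2]  # occurs equally often in every document
uneven_word = [5, 0, 1]  # concentrated in one document

print(pvariance(steady_word))
print(pvariance(uneven_word))
```

The steady word contributes nothing to the evaluation value, while the uneven word contributes a large variance, so clusters dominated by evenly used words receive smaller values.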

FIG. 7 is a block diagram showing a document cluster processing apparatus 30 according to Embodiment 2 of the present invention.

The document cluster processing apparatus 30 includes an evaluation value calculation unit 23a and an evaluation value storage unit 24.

The document cluster processing apparatus 30 is connected to the document cluster recording unit 12 and a document vector storage unit 14, which supply its input, and to the document cluster display unit 13, which displays document clusters based on the evaluation values that the apparatus outputs. The document cluster display unit 13 is connected to the document storage unit 11, which supplies the input for displaying document titles, excerpts, and the like.

The document storage unit 11, the document cluster recording unit 12, and the document cluster display unit 13 are the same as in Embodiment 1. Embodiment 2 differs from Embodiment 1 in that the document cluster processing apparatus 30 takes document vectors as input.

FIG. 8 is a diagram showing an example of the data stored in the document vector storage unit 14 in Embodiment 2.

The document vectors that represent the documents are stored in the document vector storage unit 14 and have, for example, the structure shown in FIG. 8. For each document ID 141, the elements that make up the text are stored. For example, the elements of document ID = 00001 are "hair care", "sample", "package", and so on. A weight has already been assigned to each element, for example a tf·idf value. Specifically, "hair care" has been given the weight "1.53". This weight is a value determined during the document clustering process.

Based on the document cluster information recorded in the document cluster recording unit 12, the evaluation value calculation unit 23a acquires the document vectors from the document vector storage unit 14, calculates the evaluation value of each document cluster, and records it in the evaluation value storage unit 24.

Next, the process by which the evaluation value calculation unit 23a calculates the evaluation value of a document cluster in Embodiment 2 will be described.

FIG. 9 is a flowchart showing the operation by which the evaluation value calculation unit 23a calculates the evaluation value of a document cluster in Embodiment 2.

First, all the document cluster IDs recorded in the document cluster recording unit 12 are acquired (S20), one unprocessed cluster ID is selected (S21), and the plurality of document IDs corresponding to the selected document cluster ID are acquired (S22). In the example shown in FIG. 3, when document cluster ID C1 is selected, the corresponding document IDs (00001, 00024, 00025) are acquired. Based on the acquired document IDs, the document vector corresponding to each document ID is acquired from the document vector storage unit 14 (S23).

Next, using the acquired document vectors, all the vector elements appearing in the document cluster are treated in the same way as the words in Embodiment 1, and the evaluation value is calculated with Equation (2) (S24).

If the stored word weights are used as they are to calculate the evaluation value of the document cluster, words with large weights tend to have large word variances and words with small weights tend to have small ones. The word variances of heavily weighted words can therefore dominate the evaluation value. To avoid this, the evaluation value of the document cluster may be calculated without using the word weights recorded in the document vector storage unit, setting the weight to "1" if a word appears in a document and "0" if it does not.

The evaluation value calculated in this way is recorded in the evaluation value storage unit 24 in association with the document cluster ID (S25).

Processing then returns to S21. If an unprocessed document cluster remains (YES in S21), steps S22 to S25 are executed for that cluster. Once all document clusters have been processed (NO in S21), the evaluation value calculation unit 23a ends its processing.
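Steps S20 to S25 can be sketched as below, including the option of replacing the stored weights with 0/1 values. The dictionaries are hypothetical stand-ins for the document cluster recording unit 12 and the document vector storage unit 14. The cluster C1 with documents 00001, 00024, and 00025 and the weight 1.53 for "hair care" come from the description; every other element and weight is invented, and each cluster is assumed to contain at least one vector element.

```python
def evaluate_from_vectors(cluster_record, vector_store, binarize=True):
    """Evaluation value per cluster from sparse document vectors (S20-S25)."""
    evaluation_store = {}
    for cluster_id, doc_ids in cluster_record.items():  # S20-S22
        vectors = [vector_store[d] for d in doc_ids]  # S23
        elements = set().union(*vectors)  # every element seen in the cluster
        n = len(vectors)
        total = 0.0
        for element in elements:  # S24
            ws = [v.get(element, 0.0) for v in vectors]  # missing element: weight 0
            if binarize:
                ws = [1.0 if w else 0.0 for w in ws]
            m = sum(ws) / n
            total += sum((w - m) ** 2 for w in ws) / n
        evaluation_store[cluster_id] = total / len(elements)  # S25
    return evaluation_store


cluster_record = {"C1": ["00001", "00024", "00025"]}
vector_store = {
    "00001": {"hair care": 1.53, "sample": 0.80, "package": 0.40},
    "00024": {"hair care": 1.20, "sample": 0.50},
    "00025": {"hair care": 0.90, "today": 0.10},
}

print(evaluate_from_vectors(cluster_record, vector_store))
print(evaluate_from_vectors(cluster_record, vector_store, binarize=False))
```

With `binarize=False` the stored weights are used directly, which lets heavily weighted elements dominate the result, as the description cautions.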

なお、ベクトル表現された文書は、本文だけであるとは限らず、タイトルと本文との両方を対象として、ベクトル表現された値であってもよい。つまり、図２において、本文１１３だけでなく、タイトル１１２からも、単語を抽出するようにしてもよい。つまり、Ｓ１３の処理を、本文だけでなく、タイトル＋本文に実行するようにしてもよい。 Note that the vector-represented document is not limited to the text only, and may be a vector-represented value for both the title and the text. That is, in FIG. 2, words may be extracted not only from the text 113 but also from the title 112. That is, the processing of S13 may be executed not only on the text but also on the title + text.

The above embodiments focus on the fact that important words are shared by many documents. A computer cannot always properly extract the words that a human considers important. Even in that case, using the word variance as described above makes it possible to calculate an evaluation value that represents how coherent a document cluster is.

FIG. 11 is a diagram showing another example of data stored in the word storage unit 22.

In the above description, only the elements that appear in a document are stored together with their weights. Alternatively, as shown in FIG. 11, matrix data holding a word weight for every word or character string that appears in the entire document set may be input.
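As a rough illustration of the two input forms, the sketch below expands sparse per-document weights into a dense matrix. The layout (one row per document, one column per word, 0 for absent words) and the helper name `to_weight_matrix` are assumptions; the actual layout of FIG. 11 is not reproduced here.

```python
def to_weight_matrix(doc_vectors):
    """Expand sparse {doc ID: {word: weight}} data into a dense matrix.

    Returns (doc_ids, vocabulary, matrix), where matrix[i][j] is the weight of
    vocabulary[j] in document doc_ids[i], or 0.0 if the word is absent.
    """
    doc_ids = list(doc_vectors)
    vocabulary = sorted({w for vec in doc_vectors.values() for w in vec})
    matrix = [
        [doc_vectors[doc_id].get(w, 0.0) for w in vocabulary]
        for doc_id in doc_ids
    ]
    return doc_ids, vocabulary, matrix

# Hypothetical sparse input: only the elements that appear are stored.
sparse = {
    "d1": {"budget": 0.7, "tax": 0.3},
    "d2": {"tax": 0.9},
}
doc_ids, vocabulary, matrix = to_weight_matrix(sparse)
```

With the dense form, the variance of a word is simply the variance of one column, at the cost of storing a zero for every word a document does not contain.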

In other words, the above embodiments are examples of a document cluster processing apparatus that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once. The document analysis unit 21 is an example of dividing means for dividing a document into words or character strings. The document analysis unit 21 is also an example of document analysis means that, for each of all words or character strings appearing in the document cluster, sets the weight of that word or character string in a document included in the document cluster to "1" if it appears in that document and to "0" if it does not.

The evaluation value calculation unit 23 is an example of variance calculation means that calculates, for each word or character string appearing in the document cluster, its variance over the documents in the document cluster. It is also an example of evaluation value calculation means that calculates the average of the variances of all words or character strings appearing in the document cluster as the evaluation value of that document cluster.

In this case, the document analysis means counts the number of times each word or character string appears in each document, and the evaluation value calculation means calculates the variance of each word or character string from those per-document occurrence counts.
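The count-based variant can be sketched as follows. Whitespace splitting is only a stand-in for the dividing means (real Japanese text would need a morphological analyzer), and the function name `count_based_evaluation` and the use of population variance are assumptions made for illustration.

```python
from collections import Counter
from statistics import pvariance

def count_based_evaluation(documents):
    """Average, over all words in the cluster, of the variance of each word's
    per-document occurrence count."""
    # Count occurrences of each word in each document (whitespace tokens here).
    counts = [Counter(text.split()) for text in documents]
    vocabulary = sorted(set().union(*counts))
    if not vocabulary:
        return 0.0
    # A Counter returns 0 for words that do not occur in a document.
    variances = [pvariance([c[word] for c in counts]) for word in vocabulary]
    return sum(variances) / len(variances)

# Hypothetical cluster of three short documents.
documents = ["tax tax budget", "tax budget", "budget vote"]
score = count_based_evaluation(documents)
```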

The evaluation value calculation unit 23a is an example of variance calculation means that receives a document vector for each document and calculates, for every vector element appearing in the document cluster, its variance over the documents in the document cluster. It is also an example of evaluation value calculation means that calculates the average of the variances of all vector elements appearing in the document cluster as the evaluation value of that document cluster.

In this case, the variance calculation means converts each vector element to "1" if the element appears in the document and to "0" if it does not, and then calculates the variance of the vector elements.

The above embodiments can also be understood as a method invention. That is, the above embodiments are examples of a document cluster processing method that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once, the method comprising: a dividing step of dividing a document into words or character strings and storing them in a storage device; a document analysis step of, for each of all words or character strings appearing in the document cluster, setting the weight of that word or character string in a document included in the document cluster to "1" if it appears in that document and to "0" if it does not, and storing the weight in the storage device; a variance calculation step of calculating, for each word or character string appearing in the document cluster, its variance over the documents in the document cluster and storing the variance in the storage device; and an evaluation value calculation step of calculating the average of the variances of all words or character strings appearing in the document cluster as the evaluation value of that document cluster and storing it in the storage device.

Furthermore, the above embodiments are examples of a document cluster processing method that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once, the method comprising: a variance calculation step of receiving a document vector for each document, calculating, for every vector element appearing in the document cluster, its variance over the documents in the document cluster, and storing the variance in a storage device; and an evaluation value calculation step of calculating the average of the variances of all vector elements appearing in the document cluster as the evaluation value of that document cluster and storing it in the storage device.

The above embodiments are also examples of a program that causes a computer to execute the document cluster processing method.

FIG. 1 is a block diagram showing the configuration of the document cluster processing apparatus 20 according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of data stored in the document storage unit 11.
FIG. 3 is a diagram showing an example of data recorded in the document cluster recording unit 12.
FIG. 4 is a flowchart showing the operation of the document analysis unit 21.
FIG. 5 is a diagram showing an example of data stored in the word storage unit 22.
FIG. 6 is a diagram showing an example of data stored in the evaluation value storage unit 24.
FIG. 7 is a block diagram showing the document cluster processing apparatus 30 according to Embodiment 2 of the present invention.
FIG. 8 is a diagram showing an example of data stored in the document vector storage unit 14 in Embodiment 2.
FIG. 9 is a flowchart showing the operation by which the evaluation value calculation unit 23a calculates the evaluation value of a document cluster in Embodiment 2.
FIG. 10 is a diagram explaining the coherence of a document cluster.
FIG. 11 is a diagram showing another example of data stored in the word storage unit 22.

Explanation of symbols

11 ... document storage unit,
12 ... document cluster recording unit,
13 ... document cluster display unit,
20 ... document cluster processing apparatus,
21 ... document analysis unit,
22 ... word storage unit,
23 ... evaluation value calculation unit,
24 ... evaluation value storage unit,
30 ... document cluster processing apparatus,
23a ... evaluation value calculation unit,
14 ... document vector storage unit.

Claims

A document cluster processing apparatus that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once, the apparatus comprising:
dividing means for dividing a document into words or character strings;
document analysis means for, for each of all words or character strings appearing in the document cluster, setting the weight of that word or character string in a document included in the document cluster to "1" if it appears in that document and to "0" if it does not;
variance calculation means for calculating, for each word or character string appearing in the document cluster, its variance over the documents in the document cluster; and
evaluation value calculation means for calculating the average of the variances of all words or character strings appearing in the document cluster as the evaluation value of that document cluster.

The document cluster processing apparatus according to claim 1, wherein
the document analysis means is means for counting the number of times a word or character string appears in each document, and
the evaluation value calculation means is means for calculating the variance of the word or character string using the number of times it appears in each document.

A document cluster processing apparatus that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once, the apparatus comprising:
variance calculation means for receiving a document vector for each document and calculating, for every vector element appearing in the document cluster, its variance over the documents in the document cluster; and
evaluation value calculation means for calculating the average of the variances of all vector elements appearing in the document cluster as the evaluation value of that document cluster.

The document cluster processing apparatus according to claim 3, wherein
the variance calculation means is means for converting each vector element to "1" if the vector element appears in the document and to "0" if it does not, and then calculating the variance of the vector elements.

A document cluster processing method that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once, the method comprising:
a dividing step of dividing a document into words or character strings and storing them in a storage device;
a document analysis step of, for each of all words or character strings appearing in the document cluster, setting the weight of that word or character string in a document included in the document cluster to "1" if it appears in that document and to "0" if it does not, and storing the weight in the storage device;
a variance calculation step of calculating, for each word or character string appearing in the document cluster, its variance over the documents in the document cluster and storing the variance in the storage device; and
an evaluation value calculation step of calculating the average of the variances of all words or character strings appearing in the document cluster as the evaluation value of that document cluster and storing it in the storage device.

A document cluster processing method that assigns an evaluation value to each document cluster, a document cluster being a set of documents whose contents are similar to one another, and displays a plurality of document clusters at once, the method comprising:
a variance calculation step of receiving a document vector for each document, calculating, for every vector element appearing in the document cluster, its variance over the documents in the document cluster, and storing the variance in a storage device; and
an evaluation value calculation step of calculating the average of the variances of all vector elements appearing in the document cluster as the evaluation value of that document cluster and storing it in the storage device.

A program that causes a computer to execute the method according to claim 5 or claim 6.