JP4972271B2

JP4972271B2 - Search result presentation device

Info

Publication number: JP4972271B2
Application number: JP2004167287A
Authority: JP
Inventors: 祐一小川; 菅谷　　奈津子; 忠孝松林; 隆明弥生; 正明原
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-06-04
Filing date: 2004-06-04
Publication date: 2012-07-11
Anticipated expiration: 2024-06-04
Also published as: JP2005346560A

Description

The present invention relates to a search result presentation method and apparatus for efficiently browsing search results in a document search that retrieves the documents a user is looking for from a large number of electronic documents, and to a recording medium storing a search result presentation program.

In recent years, with the spread of personal computers and the Internet, electronic documents have come to exist in large numbers. Document search techniques that efficiently retrieve the document a user is looking for (hereinafter, the target document) from such a large collection have been actively developed. Among them, similar document search, which retrieves documents similar to a text entered as the search condition (hereinafter, the seed text), has been attracting attention.

One similar document search method is the technique disclosed in Japanese Patent Laid-Open No. 2002-73681 (hereinafter, prior art 1). In prior art 1, a plurality of characteristic words (hereinafter, feature words) are extracted from the seed text specified as the search condition, and documents similar to the seed text are retrieved using those feature words.

JP 2002-73681 A

In general, a text often contains a plurality of subtopics. A subtopic is a partial concept or piece of content included in the concept of the text. For example, when the text is "Company H's plasma television compatible with terrestrial digital broadcasting", its subtopics include (1) "television", (2) "plasma television", (3) "Company H", and (4) "compatible with terrestrial digital broadcasting". However, in a similar document search that uses a seed text as the search condition, if the seed text contains a plurality of subtopics, the set of documents obtained by the search (hereinafter, matching documents) is a mixture of documents about each of those subtopics. Suppose, for example, that a searcher who wants information on (2) "plasma television" and (4) "compatible with terrestrial digital broadcasting" selects "Company H's plasma television compatible with terrestrial digital broadcasting" as the seed text. In this case, the set of matching documents mixes documents that contain one or more of the four subtopics above. As a result, when the search results obtained with prior art 1 are examined in order from the top, the searcher must also judge, one document at a time, whether matching documents about subtopics other than (2) and (4), which the searcher does not need, are the target document. In other words, it takes a very long time to reach the target document.

To solve the above problem, an object of the present invention is to provide a search result presentation method that allows the target document to be found quickly by grouping and presenting matching documents that matched on the same subtopic.

To achieve the above object, the present invention provides a search result presentation device that classifies and displays a set of search result documents obtained as the result of a search for a specified search condition, the device comprising: feature word extraction means for extracting feature words from the specified search condition; related word list generation means for determining the relatedness between the feature words extracted by the feature word extraction means and generating related word lists, each of which groups mutually related feature words; related word list fitness calculation means for calculating the related word list fitness of a matching document for each related word list generated by the related word list generation means; classification determination means for determining, from the related word list fitness calculated by the related word list fitness calculation means, the suitability of the matching document for the related word list and, when the suitability is determined to be high, holding the matching document in association with that related word list; classification identification information assignment means for assigning classification identification information to the set of matching documents associated with each related word list by the classification determination means; and search result display means for displaying the search result document set with the identification information generated by the classification identification information assignment means assigned to each classification.

According to the present invention, even when the set of matching documents mixes the plurality of subtopics contained in the search condition, the matching documents are displayed by subtopic, so the target document can be found efficiently.

Embodiments of the search result presentation method and apparatus according to the present invention, and of the recording medium storing the search result presentation program, will be described with reference to the drawings.

[First Embodiment]
A first embodiment according to the present invention will be described with reference to FIGS. 1 to 9.

FIG. 1A is a diagram showing the overall configuration of the document search system according to the first embodiment of the present invention, centered on its programs, and FIG. 1B is a diagram showing the overall configuration of the same system functionally. The first embodiment comprises a display 100, a keyboard 101, a central processing unit (CPU) 102, a magnetic disk device 103, a flexible disk drive (FDD) 104, a main memory 105, a bus 106 connecting these, and a network 107 connecting this system to other equipment.

The magnetic disk device 103 is a secondary storage device and stores the text 180. Information stored on the flexible disk 108 is read into the main memory 105 or the magnetic disk device 103 via the FDD 104.

The main memory 105 holds a system control program 110, a registration control program 120, a search control program 130, a search result classification control program 140, a subtopic label generation control program 150, a document file acquisition program 121, a text registration program 122, a search condition acquisition program 131, a feature word extraction program 132, a text reading program 133, a search result output program 134, a subtopic extraction program 141, a classification determination program 142, a label feature word extraction program 151, a shared library 160, and a work area 170.

The shared library 160 consists of a fitness calculation program 161.
The system control program 110 consists of the registration control program 120 and the search control program 130.
The registration control program 120 consists of the document file acquisition program 121 and the text registration program 122.

The search control program 130 consists of the search condition acquisition program 131, the feature word extraction program 132, the text reading program 133, the search result output program 134, the search result classification control program 140, and the subtopic label generation control program 150, and it also calls the fitness calculation program 161.

The search result classification control program 140 consists of the subtopic extraction program 141 and the classification determination program 142, and it also calls the fitness calculation program 161.
The subtopic label generation control program 150 consists of the label feature word extraction program 151.

The registration control program 120 and the search control program 130 are started by the system control program 110 in response to user input from the keyboard 101. The registration control program 120 controls the document file acquisition program 121 and the text registration program 122. The search control program 130 controls the search condition acquisition program 131, the feature word extraction program 132, the text reading program 133, the search result output program 134, the search result classification control program 140, the subtopic label generation control program 150, and the fitness calculation program 161.

In this embodiment, the registration control program 120 and the search control program 130 are started by a command entered from the keyboard 101, but they may instead be started by a command or event entered through another input device.

These programs can also be stored on a storage medium such as the magnetic disk 103, the flexible disk 108, an MO, a CD-ROM, or a DVD (not shown in FIGS. 1A and 1B), read into the main memory 105 via a drive, and executed by the CPU 102. They can also be read into the main memory 105 via the network 107 and executed by the CPU 102. In this case, the CPU 102 contains the functional units realized by executing the programs 110, 120 (121 to 122), 130 (131 to 134, 140 (141 to 142), 150 (151)), and 161. 102-110 is the system control unit, 102-120 is the registration control unit, 102-130 is the search control unit, 102-140 is the search result classification control unit, and 102-150 is the subtopic label generation control unit. In addition, 102-161 is the fitness calculation unit.

In this embodiment the text 180 is stored in the magnetic disk device 103, but it may instead be stored on a storage medium such as the flexible disk 108, an MO, a CD-ROM, or a DVD (not shown in FIGS. 1A and 1B) and read into the main memory 105 via a drive for use, or it may be stored on a storage medium connected to another system via the network 107 (not shown in FIGS. 1A and 1B).
It may also be stored on a storage medium connected directly to the network 107.
This concludes the description of the configuration of the document search system in the first embodiment.

Next, the processing procedure of the document search system in the first embodiment will be described.

First, the processing procedure of the system control unit 102-110 based on the system control program 110 will be described.
Based on the system control program 110, the system control unit 102-110 first analyzes the command entered from the keyboard 101. If the command is found to be a registration command, the system control unit 102-110 starts the registration control program 120 and registers the document.

If the command is found to be a search command, the system control unit 102-110 starts the search control program 130 and searches for documents related to the search condition, which may be a logical expression using keywords, or a set of words, a sentence, a passage, or a document (hereinafter collectively called the seed text).
This is the processing procedure based on the system control program 110.

Next, the processing procedure of the registration control unit 102-120 based on the registration control program 120, which is started by the system control program 110, will be described.

Based on the registration control program 120, the registration control unit 102-120 first starts the document file acquisition program 121 and reads the document file stored on the flexible disk 108 via the FDD 104.

Next, the registration control unit 102-120 starts the text registration program 122, extracts the text from the document file read by the document file acquisition program 121, and stores it in the magnetic disk device 103 as text 180.
This is the processing procedure based on the registration control program 120.

Although the document file is assumed here to be stored on the flexible disk 108, it may instead be stored on a storage medium such as an MO, a CD-ROM, or a DVD (not shown in FIGS. 1A and 1B), or on a storage medium connected to another system via the network 107 (not shown in FIGS. 1A and 1B).

The document file read by the document file acquisition program 121 may be any file from which text can be extracted. It may be saved as a plain text file or in the native format of an application.

Next, the processing procedure of the search control unit 102-130 based on the search control program 130, which is started by the system control program 110, will be described with reference to the PAD diagram shown in FIG. 2.

Based on the search control program 130, the search control unit 102-130 first starts the search condition acquisition program 131, reads the search condition, and stores it in the work area 170 (step 200).

Next, the search control unit 102-130 starts the feature word extraction program 132, extracts from the search condition acquired by the search condition acquisition program 131 the character strings that characterize it (hereinafter, feature words 500), and stores them in the work area 170 (step 210).

Next, the search control unit 102-130 repeats steps 221 to 222 for every text contained in the text 180 (step 220). First, the search control unit 102-130 starts the text reading program 133, reads one text from the text 180 stored in the magnetic disk device 103, and stores it in the work area 170 (step 221). Next, the search control unit 102-130 starts the fitness calculation program 161, calculates the fitness of the text read by the text reading program 133 for the search condition, for example as described in prior art 1, and stores the calculation result 501 in the work area 170 (step 222).

Next, the search control unit 102-130 causes the search result classification control unit 102-140 to start the search result classification control program 140, which extracts the subtopics of the search condition from the feature words 500 extracted by the feature word extraction program 132 and generates subtopic profiles. Using the subtopic profiles 502, it then determines, for each text whose fitness calculated by the fitness calculation program 161 in the fitness calculation unit 102-161 is at or above a preset suitability determination threshold (hereinafter, a matching text), which of the classifications for the subtopics contained in the search condition the text belongs to, and stores the classification determination result 506 in the work area 170 (step 230).

Next, the search control unit 102-130 causes the subtopic label generation control unit 102-150 to repeat step 241 for every subtopic extracted by the search result classification control program 140 (step 240).

The search control unit 102-130 causes the subtopic label generation control unit 102-150 to start the subtopic label generation control program 150, which extracts the important feature words from the subtopic profile 502 generated by the search result classification control program 140 and stores them in the work area 170 as the label of the subtopic (hereinafter, subtopic label) 503 (step 241).

Then, the search control unit 102-130 starts the search result output program 134 and, based on the classification determination result 506 produced for each matching text by the search result classification control program 140 (142), displays the matching texts 504 and the subtopic labels 503 by subtopic (step 250).

This is the processing procedure based on the search control program 130.

The fitness calculation program 161 executed by the fitness calculation unit 102-161 is assumed here to use, for example, prior art 1, but other fitness calculation methods, such as one using the cosine measure of the vector space model, may be applied. When the search condition is a logical expression using keywords, the processing of the feature word extraction program 132 may be skipped and the fitness for the search condition may be calculated using the methods disclosed in Japanese Patent Laid-Open No. 11-154164 or Japanese Patent Laid-Open No. 2001-84255.

In step 220 above, steps 221 to 222 are repeated for every text contained in the text 180, but they may instead be repeated only for a subset of the texts in the text 180, selected by attribute information assigned in advance, such as a date. This shortens the search processing time.

When the search condition is a seed text, the feature words 500 extracted by the feature word extraction program 132 may be character strings split at character type boundaries, such as between kanji and katakana; character strings split at delimiters present in the text, such as spaces; words extracted by morphological analysis; character strings extracted as n-grams; or character strings extracted by any other method. When the search condition is a logical expression using keywords, the keywords used in it may serve as the feature words.
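Two of the extraction options named above, splitting at character type boundaries and character n-grams, can be illustrated with a minimal Python sketch. The function names, the Unicode ranges chosen for kanji and katakana, and the decision to drop hiragana are assumptions made for illustration; the patent does not specify an implementation.

```python
import re

# Hypothetical helper: one run of kanji, katakana, or ASCII alphanumerics
# is treated as one candidate feature word. Hiragana and punctuation are
# not matched, so they act as separators (an assumption of this sketch).
_SCRIPT_RUNS = re.compile(r"[\u4e00-\u9fff]+|[\u30a0-\u30ff]+|[A-Za-z0-9]+")


def split_by_script(text):
    """Split text at character type boundaries (kanji / katakana / alphanumeric)."""
    return _SCRIPT_RUNS.findall(text)


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(split_by_script("地上デジタル放送対応プラズマテレビ"))
    print(char_ngrams("plasma", 3))
```

Morphological analysis, the third option mentioned, would normally rely on an external analyzer and is not sketched here.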

In this embodiment the fitness is calculated over the whole of the text read by the text reading program 133, but it need not cover the whole text. For structured text such as SGML (Standard Generalized Markup Language) or XML (Extensible Markup Language), for example, only part of the structure of the text may be targeted. This reduces the load of the fitness calculation for the text and shortens the search processing time.

Next, the processing procedure of the search result classification control unit 102-140 based on the search result classification control program 140, which is started by the search control program 130 in step 230 of FIG. 2, will be described with reference to the PAD diagram shown in FIG. 3.

First, the search result classification control unit 102-140 starts the subtopic extraction program 141, which takes the feature words extracted from the search condition by the feature word extraction program 132, extracts the subtopics and the feature words of each subtopic by considering the relatedness between the feature words, and stores them in the work area 170 as subtopic profiles 502 (step 300).

Next, the search result classification control unit 102-140 repeats step 320 for every matching text (step 310).
Within that loop, the search result classification control unit 102-140 repeats steps 321 to 322 for every subtopic extracted by the subtopic extraction program 141 (step 320).

First, the search result classification control unit 102-140 starts the fitness calculation program 161 in the fitness calculation unit 102-161. Using the total number of feature words in the subtopic profile of the subtopic and the number of those feature words contained in the matching text, it calculates the fitness of the matching text for the subtopic (hereinafter, subtopic-specific fitness) 504 by formula (1) below and stores the result in the work area 170 (step 321).

Fitness of a matching text for a subtopic = (number of feature words contained in the target text) / (total number of feature words)   (1)

Next, the classification determination program 142 is started and compares the subtopic-specific fitness 504 calculated by the fitness calculation program 161 with the reference value used to judge suitability for the subtopic (hereinafter, subtopic suitability determination threshold) 505. If the fitness is at or above the subtopic suitability determination threshold, the matching text is determined to belong to the classification of that subtopic, and the classification determination result 506 is stored in the work area 170 (step 322).
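Steps 321 and 322 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents and subtopic profiles are represented as collections of feature words, the function names are hypothetical, and the threshold value 0.5 is chosen arbitrarily, since the patent only says the threshold is preset.

```python
def subtopic_fitness(text_words, profile):
    """Formula (1): share of the profile's feature words that occur in the text."""
    profile = set(profile)
    if not profile:
        return 0.0
    return len(profile & set(text_words)) / len(profile)


def classify(text_words, profiles, threshold=0.5):
    """Return every subtopic whose fitness is at or above the threshold (step 322)."""
    return [name for name, words in profiles.items()
            if subtopic_fitness(text_words, words) >= threshold]


if __name__ == "__main__":
    profiles = {
        "plasma TV": ["plasma", "television"],
        "digital broadcast": ["terrestrial", "digital", "broadcast"],
    }
    doc = ["plasma", "television", "digital"]
    print(classify(doc, profiles))  # fitness 1.0 for "plasma TV", 1/3 for the other
```

Because each subtopic is tested independently, a single matching text can be assigned to several subtopic classifications, or to none.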

This is the processing procedure based on the search result classification control program 140.

Formula (1) is used above to calculate the subtopic-specific fitness in step 321, but other fitness formulas, such as the cosine measure of the vector space model, may be applied.

In step 322 above, the subtopic suitability determination threshold is used to determine which subtopic classifications a matching text belongs to. Alternatively, a predetermined number of matching texts, taken in descending order of subtopic-specific fitness for a subtopic, may be determined to belong to the classification of that subtopic.
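The top-N alternative to the threshold test can be sketched as follows. The function name and the dictionary representation of the fitness values are assumptions made for illustration.

```python
def top_n_for_subtopic(fitness_by_doc, n):
    """Pick the n documents with the highest subtopic-specific fitness.

    fitness_by_doc maps a document identifier to its fitness for one subtopic.
    Ties keep their original insertion order because sorted() is stable.
    """
    ranked = sorted(fitness_by_doc.items(), key=lambda kv: kv[1], reverse=True)
    return [doc for doc, _ in ranked[:n]]


if __name__ == "__main__":
    print(top_n_for_subtopic({"doc1": 0.2, "doc2": 0.9, "doc3": 0.5}, 2))
```

Unlike a fixed threshold, this variant always yields the same number of texts per subtopic, provided enough matching texts exist.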

Next, the processing procedure of the search result classification control unit 102-140 based on the subtopic extraction program 141, which is started by the search result classification control program 140, will be described with reference to the PAD diagram shown in FIG. 4.

First, the search result classification control unit 102-140 repeats step 401 for every feature word 500 extracted by the feature word extraction program 132 (step 400).
In that step, the appearance pattern generation process 600 shown in FIG. 6 generates an appearance pattern 610 that records as "1" or "0" whether the feature word 500 (H-company, satellite, digital, plasma, television, broadcast, ...) appears in each of the matching texts 501 (document 1, document 2, ...), and stores it in the work area 170 (step 401).
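A minimal sketch of the appearance pattern generation in step 401 follows. It assumes each matching text is already available as a list of words; the function name and data layout are hypothetical.

```python
def appearance_patterns(feature_words, docs):
    """Map each feature word to a 0/1 list with one entry per document.

    An entry is 1 if the feature word occurs in that document, else 0.
    """
    doc_sets = [set(d) for d in docs]
    return {w: [1 if w in s else 0 for s in doc_sets] for w in feature_words}


if __name__ == "__main__":
    docs = [
        ["plasma", "television"],            # document 1
        ["digital", "broadcast", "plasma"],  # document 2
    ]
    print(appearance_patterns(["plasma", "digital"], docs))
```

Each resulting list can be read as one row of a word-by-document table, with documents as columns.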

Next, the search result classification control unit 102-140 generates every combination of two feature words from all the feature words, without duplication, and repeats steps 411 to 414 for each combination (step 410). In the following description, the two feature words in a combination are called feature word A and feature word B.

First, in the inter-word relatedness calculation process 601 shown in FIG. 6, the search result classification control unit 102-140 uses the appearance patterns 610 of feature word A and feature word B to calculate the relatedness between feature word A and feature word B (hereinafter, inter-word relatedness) by the cosine measure based on formula (4), described later, and stores it in the work area 170 (step 411).
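The cosine measure applied to two appearance patterns can be sketched as below. Formula (4) itself is not reproduced in this part of the description, so the code assumes the standard cosine of two vectors; the function name is hypothetical.

```python
import math


def cosine(a, b):
    """Standard cosine similarity between two equal-length numeric vectors.

    Returns 0.0 when either vector is all zeros (a convention of this sketch).
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    plasma = [1, 1, 0, 1]      # appears in documents 1, 2 and 4
    television = [1, 1, 0, 0]  # appears in documents 1 and 2
    print(round(cosine(plasma, television), 3))
```

For 0/1 appearance patterns, the value is the number of documents containing both words divided by the geometric mean of the two words' document counts, so words that tend to occur in the same matching texts score close to 1.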

Next, as shown in FIG. 6, if the inter-word relatedness between feature word A and feature word B is at or above a preset relatedness determination threshold, steps 413 to 414 are executed as the grouping process 602 to create the related word lists 612 (step 412).
First, feature word B is added to the word list for feature word A (hereinafter, related word list), which is stored in the work area 170 (step 413).
Next, feature word A is added to the related word list for feature word B, which is stored in the work area 170 (step 414).
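The pairwise loop of steps 410 to 414 can be sketched as follows. The relatedness function is passed in as a parameter so that the sketch does not depend on any particular measure; the function names, the set-based lists, and the threshold value are assumptions for illustration.

```python
from itertools import combinations


def related_word_lists(words, relatedness, threshold):
    """Build one related word list per feature word (steps 410 to 414).

    relatedness(a, b) is any symmetric score; pairs at or above the
    threshold are added to each other's list.
    """
    lists = {w: set() for w in words}
    for a, b in combinations(words, 2):   # each unordered pair once (step 410)
        if relatedness(a, b) >= threshold:  # step 412
            lists[a].add(b)                 # step 413
            lists[b].add(a)                 # step 414
    return lists


if __name__ == "__main__":
    scores = {frozenset(("plasma", "television")): 0.9,
              frozenset(("digital", "broadcast")): 0.8}
    rel = lambda a, b: scores.get(frozenset((a, b)), 0.0)
    words = ["plasma", "television", "digital", "broadcast"]
    print(related_word_lists(words, rel, 0.5))
```

Using `itertools.combinations` guarantees that each pair is visited exactly once, which matches the requirement that combinations be generated without duplication.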

Next, as shown in FIG. 6, the feature words contained in the related word lists of the individual feature words are compared across lists. Related word lists that contain the same feature words are merged into a single related word list by the deduplication process 603. The related word lists finally obtained are taken as the subtopic profiles 613 and stored in the work area 170 (step 420).
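The deduplication in step 420 can be sketched as below. One assumption is made explicit: each related word list is treated as also containing its own key word, so that the lists for two mutually related words compare as equal. The description does not state this detail, and the function name is hypothetical.

```python
def subtopic_profiles(lists):
    """Collapse related word lists with identical contents into one profile each.

    lists maps a feature word to the set of words related to it.
    Assumption: the key word itself is counted as a member of its own list.
    """
    seen = set()
    profiles = []
    for word, related in lists.items():
        group = frozenset(related | {word})
        if group not in seen:  # keep only the first copy of each distinct list
            seen.add(group)
            profiles.append(sorted(group))
    return profiles


if __name__ == "__main__":
    lists = {
        "plasma": {"television"},
        "television": {"plasma"},
        "digital": {"broadcast"},
        "broadcast": {"digital"},
        "H-company": set(),
    }
    print(subtopic_profiles(lists))
```

Lists that overlap only partially are kept as separate profiles in this sketch, since the text describes merging only lists whose contents are the same.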

以上が、サブトピック抽出プログラム１４１での処理手順である。 The processing procedure in the subtopic extraction program 141 has been described above.

なお、上記ステップ４１１における単語間関連度の算出方法については余弦尺度を用いたが、他の単語間関連度の算出方法を適用してもよい。
また、サブトピック抽出プログラム１４１では特徴語間のグルーピングを行なうために、適合テキストにおける特徴語の出現パターンから特徴語間の単語間関連度を算出したが、検索条件がキーワードを用いた論理演算式の場合は、論理演算式からand関係やor関係などの特徴語間の論理関係を解析することで、特徴語間の単語間関連度算出およびグルーピングを行なってもよい。また、検索条件が種文章の場合は、特徴語間の出現位置や修飾関係を解析することで、特徴語間の単語間関連度算出およびグルーピングを行なってもよい。また、検索条件や適合テキストだけでなく、関連語辞書を用いて特徴語間の単語関連度算出およびグルーピングを行なってもよい。 Note that the cosine scale is used as the method for calculating the degree of association between words in step 411, but other methods for calculating the degree of association between words may be applied.
Further, to group the feature words, the subtopic extraction program 141 calculates the degree of association between words from the appearance patterns of the feature words in the matching text. However, when the search condition is a logical expression using keywords, the degree of association between feature words may be calculated, and the feature words grouped, by analyzing the logical relationships between them, such as AND and OR relationships, in the logical expression. When the search condition is a seed sentence, the degree of association may be calculated and the feature words grouped by analyzing the appearance positions of the feature words and the modification relationships between them. Furthermore, a related word dictionary may be used, in addition to the search condition and the matching text, to calculate the degree of association between feature words and to group them.

また、特徴語間のグルーピングには、各特徴語に関する関連単語リストを生成する方法で行なうものとしたが、予め設定されたグループ数に基づいて、一般的なクラスタリング手法である最小距離法、最大距離法、群平均法およびK-Means法を用いて特徴語間をグルーピングしてもよいし、その他のグルーピング手法を用いてもよい。 The feature words are grouped here by generating a related word list for each feature word. Alternatively, based on a preset number of groups, the feature words may be grouped using a general clustering method such as the minimum distance method, the maximum distance method, the group average method, or the K-Means method, or another grouping method may be used.

以下、本実施の形態における文書検索システムにおいて、検索結果分類制御プログラム１４０およびサブトピックラベル生成制御プログラム１５０に基づく具体的な処理の流れを図５を用いて説明する。 Hereinafter, a specific processing flow based on the search result classification control program 140 and the subtopic label generation control program 150 in the document search system according to the present embodiment will be described with reference to FIG. 5.

図５に示した実施例は、文書１「In recent years the pace of development toward digital video and satellite digital broadcasting has been rapid. This is producing a global expansion of the market for large-display home theater systems used with AV sources such as DVD that deliver high-quality, digital sound and vision. The 52-inch display of the AAA provides easy viewing pleasure in a living-room for the whole family. There are two sets of component inputs for interfacing with future digital broadcast devices and digital video equipment.」（タイトル：The 52-inch display of the AAA）、文書２「The ultimate in plasma television technology. This flagship of the plasma line is a blend of performance, style and usability, featuring a Learning AV NET that puts complete control of an entire home theater system in the palm of your hand. The ultra-thin, sculpted lines and high-gloss titanium finish of this "best-in-class" series is perfect for the widescreen enthusiast who demands unparalleled performance in a sleek elegant design. The H-company BBB's Series with technology is truly the ultimate in plasma television.」（タイトル：H-company plasma television technology）および文書３が磁気ディスク装置１０３に格納された文書検索システムにおいて、検索者が「H社のプラズマテレビ」に関する情報を知るために種文章５１０「H-company has become the first manufacturer in the world to perfect broadcast satellite digital high-definition plasma television in 37V. The television's high-definition plasma display panel (PDP) uses the alternate lighting of surfaces (ALIS) format and is the first to enable such high-resolution definition in the 37V, which has until now been difficult with this size. 
It is configured in the consumer industry's smallest pixel pitch of 0.81mm × 0.45mm and delivers the high resolution of 1,024 pixels horizontally and 1,024 pixels vertically, thereby allowing the maximum enjoyment of the superior picture quality of digital high-definition television viewing.」が選択された結果、特徴語抽出プログラム１３２（例えば従来技術１に記載された方法）により種文章５１０のプロファイルとして特徴語５００、適合度算出プログラム１６１により検索結果として文書１および文書２の適合テキスト５０１が得られた状態である。 In the example shown in FIG. 5, the following documents are stored on the magnetic disk device 103 of the document search system: Document 1, “In recent years the pace of development toward digital video and satellite digital broadcasting has been rapid. This is producing a global expansion of the market for large-display home theater systems used with AV sources such as DVD that deliver high-quality, digital sound and vision. The 52-inch display of the AAA provides easy viewing pleasure in a living-room for the whole family. There are two sets of component inputs for interfacing with future digital broadcast devices and digital video equipment.” (title: The 52-inch display of the AAA); Document 2, “The ultimate in plasma television technology. This flagship of the plasma line is a blend of performance, style and usability, featuring a Learning AV NET that puts complete control of an entire home theater system in the palm of your hand. The ultra-thin, sculpted lines and high-gloss titanium finish of this "best-in-class" series is perfect for the widescreen enthusiast who demands unparalleled performance in a sleek elegant design. The H-company BBB's Series with technology is truly the ultimate in plasma television.” (title: H-company plasma television technology); and Document 3. To find information on “H-company's plasma television”, the searcher has selected the seed sentence 510, “H-company has become the first manufacturer in the world to perfect broadcast satellite digital high-definition plasma television in 37V. The television's high-definition plasma display panel (PDP) uses the alternate lighting of surfaces (ALIS) format and is the first to enable such high-resolution definition in the 37V, which has until now been difficult with this size. It is configured in the consumer industry's smallest pixel pitch of 0.81mm × 0.45mm and delivers the high resolution of 1,024 pixels horizontally and 1,024 pixels vertically, thereby allowing the maximum enjoyment of the superior picture quality of digital high-definition television viewing.” As a result, the feature words 500 have been obtained as the profile of the seed sentence 510 by the feature word extraction program 132 (for example, by the method described in prior art 1), and the matching texts 501 of Document 1 and Document 2 have been obtained as search results by the fitness calculation program 161.

まず、検索結果分類制御部１０２−１４０において、サブトピック抽出プログラム１４１が実行され、適合テキスト５０１における特徴語５００の出現パターン６１０から、単語間関連度算出処理６０１により、各特徴語間の単語間関連度を算出する。そこで、算出された各特徴語間の単語間関連度から特徴語５００に含まれる特徴語間を、グルーピング処理６０２によりグルーピングし、種文章に関するサブトピックプロファイル５０２を生成する。本図に示した実施例では、特徴語５００から３つのサブトピックが抽出されており、それぞれ「H-company」「plasma」「television」を要素とするサブトピックプロファイル１、「satellite」「digital」「broadcast」を要素とするサブトピックプロファイル２、「plasma」「display」「panel」を要素とするサブトピックプロファイル３、…が生成されている。 First, in the search result classification control unit 102-140, the subtopic extraction program 141 is executed, and the inter-word relevance calculation process 601 calculates the degree of association between words for each pair of feature words from the appearance patterns 610 of the feature words 500 in the matching texts 501. Then, based on the calculated degrees of association, the grouping process 602 groups the feature words included in the feature words 500 and generates the subtopic profiles 502 for the seed sentence. In the embodiment shown in the figure, three subtopics are extracted from the feature words 500: subtopic profile 1 with the elements “H-company”, “plasma”, and “television”; subtopic profile 2 with the elements “satellite”, “digital”, and “broadcast”; subtopic profile 3 with the elements “plasma”, “display”, and “panel”; and so on.

次に、サブトピックラベル生成制御部１０２−１５０において、すべてのサブトピック（Ｓ１、Ｓ２、Ｓ３、…）に対して、サブトピックラベル生成制御プログラム１５０が実行され、各サブトピックプロファイル５０２から重要な特徴語を抽出して、サブトピックの内容を示すサブトピックラベル５０３を生成する。本図に示した実施例では、サブトピック１（Ｓ１）については「H-company」「plasma」「television」、サブトピック２（Ｓ２）については「satellite」「digital」「broadcast」、サブトピック３（Ｓ３）については「plasma」「display」「panel」が、それぞれ抽出され、サブトピックラベル５０３として生成されている。 Next, in the subtopic label generation control unit 102-150, the subtopic label generation control program 150 is executed for all subtopics (S1, S2, S3, ...). It extracts the important feature words from each subtopic profile 502 and generates a subtopic label 503 indicating the content of the subtopic. In the embodiment shown in the figure, “H-company”, “plasma”, and “television” are extracted for subtopic 1 (S1); “satellite”, “digital”, and “broadcast” for subtopic 2 (S2); and “plasma”, “display”, and “panel” for subtopic 3 (S3). Each set is generated as a subtopic label 503.

次に、適合テキスト５０１に対して適合度算出プログラム１６１が実行され、上記（１）式によりサブトピック別適合度５０４を算出する。本図に示した実施例では、文書１については、サブトピック１〜サブトピック３(本図中ではＳ１〜Ｓ３と表示)に対するサブトピック別適合度がそれぞれ、“0.0”、“1.0”、“0.3”と算出されている。また、文書２については、サブトピック１〜サブトピック３(本図中ではＳ１〜Ｓ３と表示)に対するサブトピック別適合度がそれぞれ、“1.0”、“0.0”、“0.3”と算出されている。 Next, the fitness calculation program 161 is executed for the matching texts 501, and the subtopic-specific fitness 504 is calculated by equation (1) above. In the embodiment shown in the figure, the subtopic-specific fitness values of document 1 for subtopic 1 to subtopic 3 (shown as S1 to S3 in the figure) are calculated as “0.0”, “1.0”, and “0.3”, respectively. For document 2, the subtopic-specific fitness values for subtopic 1 to subtopic 3 are calculated as “1.0”, “0.0”, and “0.3”, respectively.

次に、分類判定プログラム１４２が実行され、適合テキスト５０１に対してサブトピック別適合度５０４およびサブトピック適合性判定閾値５０５から、該適合テキストがどこのサブトピックの分類に属するかを判定する。本図の実施例では、各サブトピックのサブトピック適合性判定閾値を“0.5”としているため、文書１はサブトピック２、文書２はサブトピック１の分類に属するものと判定される。以上が、検索結果分類制御プログラム１４０およびサブトピックラベル生成制御プログラム１５０の具体的な処理の流れである。 Next, the classification determination program 142 is executed to determine which subtopic category the matching text belongs to from the subtopic matching level 504 and the subtopic matching threshold 505 for the matching text 501. In the example of this figure, since the subtopic suitability determination threshold value of each subtopic is “0.5”, it is determined that document 1 belongs to subtopic 2 and document 2 belongs to the subtopic 1 classification. The above is the specific processing flow of the search result classification control program 140 and the subtopic label generation control program 150.
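
The classification step of FIG. 5 can be illustrated with a minimal Python sketch. The fitness values and the threshold of 0.5 are the ones quoted above; the function name `classify`, the dictionary layout, and the use of "greater than or equal to" for the threshold comparison are assumptions made for illustration, since the text does not state whether the comparison is inclusive.

```python
def classify(fitness_by_subtopic, threshold=0.5):
    """Return the subtopics whose subtopic-specific fitness reaches the threshold.

    The inclusive comparison (>=) is an assumption; the description only
    says that the threshold is used for the judgment.
    """
    return [name for name, fitness in fitness_by_subtopic.items()
            if fitness >= threshold]


# Subtopic-specific fitness values 504 quoted for FIG. 5.
document_1 = {"S1": 0.0, "S2": 1.0, "S3": 0.3}
document_2 = {"S1": 1.0, "S2": 0.0, "S3": 0.3}

print(classify(document_1))  # ['S2']
print(classify(document_2))  # ['S1']
```

Returning a list rather than a single label leaves room for a text that reaches the threshold for more than one subtopic.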

以下、図５に示したサブトピック抽出プログラム１４１の具体的な処理の流れについて図６を用いて説明する。
まず、出現パターン生成処理６００により、適合テキスト５０１における特徴語５００の出現パターン６１０を生成する。例えば文書１〜文書６に対して、特徴語「plasma」は文書１、文書３および文書６に出現している場合、出現パターンとして次に示す（２）式を生成する。また、特徴語「television」は文書１、文書３、文書４、文書５および文書６に出現している場合、出現パターンとして次に示す（３）式を生成する。 The specific processing flow of the subtopic extraction program 141 shown in FIG. 5 will be described below with reference to FIG.
First, the appearance pattern generation processing 600 generates an appearance pattern 610 of the feature word 500 in the matching text 501. For example, when the characteristic word “plasma” appears in the documents 1, 3, and 6 with respect to the documents 1 to 6, the following expression (2) is generated as an appearance pattern. When the feature word “television” appears in the document 1, document 3, document 4, document 5, and document 6, the following expression (3) is generated as an appearance pattern.

「plasma」の出現パターン＝（１，０，１，０，０，１）（２）
「television」の出現パターン＝（１，０，１，１，１，１）（３）
次に、単語間関連度算出処理６０１により、出現パターン６１０から各特徴語間の関連度６１１を算出する。特徴語間の関連度算出方法は、各特徴語の出現パターンを特徴ベクトルと考えて、余弦尺度より算出する。例えば、特徴語「plasma」と特徴語「television」の出現パターンがそれぞれ（２）式、（３）式であった場合、特徴語「plasma」と特徴語「television」間の単語間関連度は次の（４）式より“0.77”となる。 Appearance pattern of “plasma” = (1, 0, 1, 0, 0, 1) (2)
Appearance pattern of “television” = (1,0,1,1,1,1) (3)
Next, the inter-word relevance calculation process 601 calculates the degree of association 611 between feature words from the appearance patterns 610. The degree of association is calculated with the cosine measure, treating the appearance pattern of each feature word as a feature vector. For example, when the appearance patterns of the feature words “plasma” and “television” are given by equations (2) and (3), respectively, the degree of association between words for “plasma” and “television” is “0.77” according to the following equation (4).
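
As a check on the value quoted for equation (4): the two patterns share three documents, and their lengths are √3 and √5, so the cosine measure is 3 / (√3 × √5) ≈ 0.77. The following Python sketch reproduces this. The function names `appearance_pattern` and `cosine`, and the representation of each document as a set of words, are illustrative assumptions rather than part of the described system.

```python
import math


def appearance_pattern(word, documents):
    """Build a 0/1 vector: 1 where the word appears in the document.

    Each document is assumed to be given as a set of words.
    """
    return [1 if word in doc else 0 for doc in documents]


def cosine(u, v):
    """Cosine measure between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


plasma = [1, 0, 1, 0, 0, 1]      # equation (2)
television = [1, 0, 1, 1, 1, 1]  # equation (3)

print(round(cosine(plasma, television), 2))  # 0.77
```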

次に、グルーピング処理６０２より、特徴語間の関連度６１１を用いて特徴語５００に含まれる各特徴語別に関連単語リストを生成する。この結果、各特徴語に関する関連単語リスト６１２が得られる。本図の実施例では、関連単語リストに含まれる単語間関連度の閾値を“0.5”として、「H-company 」に関する関連単語リストは「H-company」「plasma」「television」、「satellite 」に関する関連単語リストは「satellite」「digital」「broadcast」、「broadcast」に関する関連単語リスト「satellite」「digital」「broadcast」が生成されている。

Next, the grouping process 602 uses the degrees of association 611 between feature words to generate a related word list for each feature word included in the feature words 500. As a result, the related word lists 612 for the feature words are obtained. In the embodiment of the figure, with the threshold for the degree of association between words set to “0.5”, the related word list for “H-company” is “H-company”, “plasma”, “television”; the related word list for “satellite” is “satellite”, “digital”, “broadcast”; and the related word list for “broadcast” is “satellite”, “digital”, “broadcast”.

次に、重複排除処理６０３により、関連単語リスト６１２から関連単語リスト間を比較することで、含まれる特徴語の構成が同じである関連単語リスト間を１つにまとめる。この結果、最終的に得られる関連単語リストをサブトピックプロファイルとして、サブトピックプロファイル６１３が得られる。本図の実施例では、「satellite」と「broadcast」に関する関連単語リストについて特徴語の構成が同じであるため、それらの単語関連リストを１つにまとめる。この結果、関連単語リスト「H-company」「plasma」「television」と「satellite」「digital」「broadcast」がそれぞれサブトピックプロファイル１、サブトピックプロファイル２として生成されている。 Next, the deduplication process 603 compares the related word lists 612 with one another and merges the lists that consist of the same feature words into one. The related word lists finally obtained are taken as the subtopic profiles 613. In the example of this figure, the related word lists for “satellite” and “broadcast” consist of the same feature words, so they are merged into one. As a result, the related word list “H-company”, “plasma”, “television” is generated as subtopic profile 1, and the related word list “satellite”, “digital”, “broadcast” is generated as subtopic profile 2.
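
The grouping process 602 and the deduplication process 603 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `frozenset`-keyed relevance table, and all relevance values other than 0.77 for “plasma”/“television” are hypothetical. Each list is seeded with its own head word because the lists quoted for FIG. 6 include it.

```python
from itertools import combinations


def related_word_lists(words, relevance, threshold=0.5):
    """Grouping: build one related word list per feature word."""
    lists = {word: {word} for word in words}
    for a, b in combinations(words, 2):
        # Pairs missing from the table are treated as unrelated (assumption).
        if relevance.get(frozenset((a, b)), 0.0) >= threshold:
            lists[a].add(b)  # corresponds to step 413
            lists[b].add(a)  # corresponds to step 414
    return lists


def deduplicate(lists):
    """Deduplication: keep one copy of lists that contain the same words."""
    profiles = []
    for members in lists.values():
        if members not in profiles:
            profiles.append(members)
    return profiles


words = ["H-company", "plasma", "television", "satellite", "digital", "broadcast"]
relevance = {  # hypothetical values, except 0.77
    frozenset(("H-company", "plasma")): 0.8,
    frozenset(("H-company", "television")): 0.7,
    frozenset(("plasma", "television")): 0.77,
    frozenset(("satellite", "digital")): 0.9,
    frozenset(("satellite", "broadcast")): 0.8,
    frozenset(("digital", "broadcast")): 0.85,
    frozenset(("H-company", "satellite")): 0.15,
}

for profile in deduplicate(related_word_lists(words, relevance)):
    print(sorted(profile))
```

With these values the six per-word lists collapse into two subtopic profiles, matching the outcome described for FIG. 6.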

以上が、サブトピック抽出プログラム１４１の具体的な処理の流れである。
なお、検索条件がキーワードを用いた論理演算式の場合は、and関係又はor関係のキーワードをまとめて、単語関連リストを生成してもよい。（５）式の例では、and関係のキーワードをまとめて、それぞれ「H-company」「plasma」「television」、「satellite」「digital」「broadcast」および「plasma」「display」「panel」の３つの関連単語リストが生成される。
(“H-company” and “plasma” and “television”) or (“plasma” and “display” and “panel”) or (“satellite” and “digital” and “broadcast”) （５）
以下、本実施の形態における文書検索システムにおいて、検索結果出力プログラム１３４によって提示される検索結果の具体的な提示例を図７〜図９を用いて説明する。 The above is the specific processing flow of the subtopic extraction program 141.
When the search condition is a logical expression using keywords, the keywords joined by an AND relationship or an OR relationship may be gathered to generate the related word lists. In the example of expression (5), gathering the keywords joined by AND generates three related word lists: “H-company”, “plasma”, “television”; “satellite”, “digital”, “broadcast”; and “plasma”, “display”, “panel”.
(“H-company” and “plasma” and “television”) or (“plasma” and “display” and “panel”) or (“satellite” and “digital” and “broadcast”) (5)
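
For an expression shaped like (5), gathering the AND-related keywords can be sketched in Python as below. The function name `and_groups` is hypothetical, and the sketch assumes a flat OR of parenthesized AND clauses with keywords in straight double quotes; it is not a general parser for logical expressions.

```python
import re


def and_groups(expression):
    """Split a flat '(... and ...) or (... and ...)' expression into keyword groups."""
    groups = []
    for clause in re.split(r"\)\s*or\s*\(", expression.strip()):
        keywords = re.findall(r'"([^"]+)"', clause)
        if keywords:
            groups.append(keywords)
    return groups


expression = (
    '("H-company" and "plasma" and "television") or '
    '("plasma" and "display" and "panel") or '
    '("satellite" and "digital" and "broadcast")'
)

for group in and_groups(expression):
    print(group)
```

Each returned group plays the role of one related word list.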
Hereinafter, specific examples of how the search result output program 134 presents search results in the document search system according to the present embodiment will be described with reference to FIGS. 7 to 9.

図７に示した検索結果一覧表示の実施例では、図５に示した適合テキスト５０１をサブトピック別に種文章に対する適合度の降順で出力されている（７００）。また、各サブトピックにはサブトピックラベルが出力されている。この結果、文書１についてはサブトピック２「satellite、digital、broadcast」の３番目、文書２についてはサブトピック１「H-company、plasma、television」の１番目に出力されており、それぞれ種文章に対する適合度、サブトピック別適合度およびタイトルが出力されている。 In the search result list display of FIG. 7, the matching texts 501 shown in FIG. 5 are output for each subtopic in descending order of fitness to the seed sentence (700). A subtopic label is also output for each subtopic. As a result, document 1 is output third under subtopic 2 “satellite, digital, broadcast”, and document 2 is output first under subtopic 1 “H-company, plasma, television”, each with its fitness to the seed sentence, its subtopic-specific fitness, and its title.

ここで、検索者が「H社のプラズマテレビ」に関する情報を知るために図５で示した種文章が選択されたとした場合、図７に示されている各サブトピックラベルより検索者は目的文書がサブトピック１「H-company、plasma、television」の分類に属する適合文書の中に存在すると判断できる。この結果、検索者は適合文書集合の中からサブトピック１「H-company、plasma、television」の分類に属する適合文書のみを参照すればよいため、目的文書を素早く探し出すことができる。 Here, suppose the searcher selected the seed sentence shown in FIG. 5 to find information on “H-company's plasma television”. From the subtopic labels shown in FIG. 7, the searcher can judge that the target document is among the matching documents classified under subtopic 1 “H-company, plasma, television”. The searcher therefore only needs to look at the matching documents classified under subtopic 1, out of the whole set of matching documents, and can find the target document quickly.

なお、図７に示した実施例では、各適合テキストに対して、種文章に対する適合度、サブトピック別適合度およびタイトルを出力するものとしたが、登録処理時に日付など各文書の属性情報も登録しておき、それらの情報を出力してもよい。
また、図７に示した実施例では、各適合テキストの出力順を種文章に対する適合度の降順で出力するものとしたが、サブトピック別適合度の降順で出力するものとしてもよいし、これらを図８に示すように表示オプションで選択できるようにしておいてもよい（８００）。 In the embodiment shown in FIG. 7, the fitness to the seed sentence, the subtopic-specific fitness, and the title are output for each matching text. However, attribute information of each document, such as its date, may also be registered during the registration process and output as well.
In the embodiment shown in FIG. 7, the matching texts are output in descending order of fitness to the seed sentence. However, they may instead be output in descending order of subtopic-specific fitness, or the order may be made selectable through a display option as shown in FIG. 8 (800).

図８に示した実施例では、表示オプションとして種文章に対する適合度の降順で出力するかあるいはサブトピック別適合度の降順で出力するかを選択可能としたインターフェースを備えており、図８ではサブトピック別適合度順が選択されていることにより、サブトピック別適合度の降順で適合テキストが出力されている。この結果、文書１についてはサブトピック２「satellite、digital、broadcast」の１番目、文書２についてはサブトピック１「H-company、plasma、television」の１番目に出力されている。これにより、各サブトピックの情報に特化した文書を素早く探し出すことができる。 The embodiment shown in FIG. 8 is provided with an interface that can select whether to output in descending order of suitability for the seed text or in descending order of suitability for each subtopic as a display option. Since the topic-specific suitability order is selected, the suitability text is output in descending order of the subtopic fit. As a result, the document 1 is output as the first sub-topic 2 “satellite, digital, broadcast”, and the document 2 is output as the first sub-topic 1 “H-company, plasma, television”. As a result, it is possible to quickly find a document specialized in the information on each subtopic.
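
The display option of FIG. 8 amounts to choosing the sort key for the texts within one subtopic. The Python sketch below illustrates this under stated assumptions: the function name `order_results`, the field names, the documents X and Y, and all fitness values except the subtopic-specific fitness of 1.0 for document 1 are hypothetical.

```python
from operator import itemgetter


def order_results(results, by="seed"):
    """Sort matching texts in descending order of the selected fitness."""
    keys = {"seed": "seed_fitness", "subtopic": "subtopic_fitness"}
    return sorted(results, key=itemgetter(keys[by]), reverse=True)


# Matching texts classified under subtopic 2 (values are illustrative).
subtopic_2 = [
    {"title": "Document 1", "seed_fitness": 0.4, "subtopic_fitness": 1.0},
    {"title": "Document X", "seed_fitness": 0.9, "subtopic_fitness": 0.6},
    {"title": "Document Y", "seed_fitness": 0.7, "subtopic_fitness": 0.8},
]

print([r["title"] for r in order_results(subtopic_2, by="seed")])
print([r["title"] for r in order_results(subtopic_2, by="subtopic")])
```

With these values document 1 appears third when ordered by fitness to the seed sentence and first when ordered by subtopic-specific fitness, which mirrors its positions in FIG. 7 and FIG. 8.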

以上説明したように、図７または図８に示すように、検索結果の表示時に適合文書を、適合度算出プログラム１６１による検索条件適合度算出ステップで算出された検索条件適合度または適合度算出プログラム１６１による関連単語リスト適合度算出ステップで算出された関連単語リスト適合度のいずれかを降順で表示することを特徴とする。 As described above and shown in FIG. 7 or FIG. 8, when the search results are displayed, the matching documents are displayed in descending order of either the search condition fitness calculated in the search condition fitness calculation step by the fitness calculation program 161, or the related word list fitness calculated in the related word list fitness calculation step by the fitness calculation program 161.

また、図７および図８に示した実施例では、サブトピック別に適合テキストの一覧表示として出力しているが、図９に示すように、各サブトピックに関してそれぞれ何件の適合テキストが存在するかを示し（９００）、知りたい情報に関するサブトピックを検索者に選択させた上で、そのサブトピックの分類に属する適合テキストのみを出力（９０１）してもよい。本図の実施例では、１番目のサブトピック「H-company、plasma、television」に１０３件、２番目のサブトピック「satellite、digital、broadcast」に４５件、３番目のサブトピック「plasma、display、panel」に６７件の適合文書が適合しており、各サブトピックの分類に属する適合文書数がそれぞれ示されている。即ち、図９に示すように、分類判定プログラム１４２により判定された結果に基づいて、関連単語リスト生成処理６００〜６０２で生成された関連単語リスト別（１番目のサブトピック、２番目のサブトピック、３番目のサブトピック、…）にそれぞれ関連付けられた適合文書の件数を表示することに特徴を有する。 In the embodiments shown in FIG. 7 and FIG. 8, the matching texts are output as a list for each subtopic. Alternatively, as shown in FIG. 9, the number of matching texts that exist for each subtopic may be shown (900), the searcher may be asked to select the subtopic related to the desired information, and only the matching texts classified under that subtopic may then be output (901). In the example of this figure, 103 matching documents belong to the first subtopic “H-company, plasma, television”, 45 to the second subtopic “satellite, digital, broadcast”, and 67 to the third subtopic “plasma, display, panel”, and the number of matching documents classified under each subtopic is shown. That is, as shown in FIG. 9, the feature is that, based on the result determined by the classification determination program 142, the number of matching documents associated with each related word list (first subtopic, second subtopic, third subtopic, ...) generated by the related word list generation processes 600 to 602 is displayed.

また、検索者によって１番目のサブトピック「H-company、plasma、television 」が選択されており、この結果、サブトピック「H-company、plasma、television」の分類に属する適合文書に関する検索結果一覧表示が示されている。これにより、容易にどのような検索結果が得られたかを大枠で把握することができ、かつ目的文書を効率よく探し出すことができる。 In addition, the first subtopic “H-company, plasma, television” is selected by the searcher, and as a result, a list of search results for conforming documents belonging to the classification of the subtopic “H-company, plasma, television” is displayed. It is shown. As a result, it is possible to easily grasp what kind of search results have been obtained, and to efficiently search for the target document.
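
The two-stage presentation of FIG. 9, counts per subtopic first and then the documents of the selected subtopic, can be sketched in Python as follows. The function names and the `assignments` mapping are illustrative assumptions; the two sample entries reuse the classification result described for FIG. 5 rather than the counts shown in FIG. 9.

```python
from collections import Counter


def count_by_subtopic(assignments):
    """Count how many matching documents were classified under each subtopic."""
    return Counter(s for subtopics in assignments.values() for s in subtopics)


def documents_in(assignments, subtopic):
    """List the matching documents classified under the selected subtopic."""
    return [doc for doc, subtopics in assignments.items() if subtopic in subtopics]


# Classification result of FIG. 5: document 1 -> S2, document 2 -> S1.
assignments = {"Document 1": ["S2"], "Document 2": ["S1"]}

print(count_by_subtopic(assignments))   # one document each for S2 and S1
print(documents_in(assignments, "S1"))  # ['Document 2']
```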

以上が、検索結果出力プログラム１３４によって提示される検索結果の具体的な提示の実施例である。 The above is an example of specific presentation of search results presented by the search result output program 134.

以上が、本実施の形態における文書検索システムの処理手順である。 The above is the processing procedure of the document search system in the present embodiment.

以上説明したように、本発明の第１の実施の形態によれば、適合文書集合を検索条件に関するサブトピック別にグルーピングして提示することで、目的文書であるかどうかの判断の対象となる適合文書を少なくすることができることから、検索者は目的文書を素早く探し出すことができる。
［第２の実施の形態］
次に、本発明に係る第２の実施の形態について図１０および図１１を用いて説明する。 As described above, according to the first embodiment of the present invention, the set of matching documents is grouped by subtopic of the search condition and presented. This reduces the number of matching documents that must be examined to decide whether they are the target document, so the searcher can find the target document quickly.
[Second Embodiment]
Next, a second embodiment according to the present invention will be described with reference to FIGS. 10 and 11.

第１の実施の形態におけるサブトピック抽出プログラム１４１では、図６に示すように、特徴語間の関連性判定を適合テキスト５０１における特徴語の出現パターン６１０から特徴語間の関連度を算出することで行った。しかし、関連語辞書を用いることでより精度の高い特徴語間の関連性判定を行なうことができる。このため、本第２の実施の形態では、関連語辞書を用いることで特徴語間の関連性判定を行なう。 In the subtopic extraction program 141 according to the first embodiment, as shown in FIG. 6, the relevance between feature words was determined by calculating the degree of association between them from the appearance patterns 610 of the feature words in the matching texts 501. However, using a related word dictionary allows the relevance between feature words to be determined more accurately. For this reason, the second embodiment determines the relevance between feature words by using a related word dictionary.

即ち、本第２の実施の形態は、図１に示した第１の実施の形態とほぼ同様な構成を取るが、検索結果分類制御部１０２−１４０でのサブトピック抽出プログラム１４１の処理手順が異なる。
以下、第２の実施の形態である第１の実施の形態とは異なるサブトピック抽出プログラム１４１ａの処理手順について図１０に示すＰＡＤ図を用いて説明する。
まず、検索結果分類制御部１０２−１４０は、すべての特徴語の中から２つの特徴語の組み合わせを重複なく生成し、各組み合わせについてステップ１０１１〜ステップ１０１４を繰り返し実行する（ステップ１０１０）。ここで、各組み合わせに含まれる特徴語を、それぞれ特徴語Ａと特徴語Ｂとして、以下説明する。 That is, the second embodiment has almost the same configuration as the first embodiment shown in FIG. 1, but the processing procedure of the subtopic extraction program 141 in the search result classification control unit 102-140 differs.
Hereinafter, the processing procedure of the subtopic extraction program 141a of the second embodiment, which differs from that of the first embodiment, will be described with reference to the PAD diagram shown in FIG. 10.
First, the search result classification control unit 102-140 generates a combination of two feature words from all the feature words without duplication, and repeatedly executes Step 1011 to Step 1014 for each combination (Step 1010). Here, feature words included in each combination will be described below as feature words A and B, respectively.

まず、図１１に示すように、単語間関連度取得処理１１０１により、関連語辞書１１１１を参照することで特徴語Ａと特徴語Ｂ間の単語間関連度を取得し、ワークエリア１７０に格納する。なお、関連語辞書に単語間関連度の記載がなく、関連性のある単語間のみが記載されている場合は、関連性のある単語間の単語関連度を“１”、関連性のない単語間の単語間関連度を“０”とする（ステップ１０１１）。 First, as shown in FIG. 11, the inter-word relevance acquisition process 1101 obtains the degree of association between words for feature word A and feature word B by referring to the related word dictionary 1111, and stores it in the work area 170. When the related word dictionary does not record degrees of association and lists only the pairs of related words, the degree of association between words is set to “1” for related words and to “0” for unrelated words (step 1011).
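
Step 1011 can be sketched in Python as a dictionary lookup with the 1/0 fallback described above. The function name `dictionary_relevance` and the data layout are assumptions: each pair is keyed by a `frozenset`, and a value of `None` marks a pair that the dictionary lists as related without recording a degree. The value 0.15 for “H-company”/“satellite” is the one quoted for FIG. 11; the other entry is hypothetical.

```python
def dictionary_relevance(word_a, word_b, dictionary):
    """Look up the degree of association between two feature words.

    Pairs absent from the dictionary are treated as unrelated (0.0);
    pairs listed without a degree (None) are treated as related (1.0).
    """
    key = frozenset((word_a, word_b))
    if key not in dictionary:
        return 0.0
    degree = dictionary[key]
    return 1.0 if degree is None else degree


related_word_dictionary = {
    frozenset(("H-company", "satellite")): 0.15,  # value quoted for FIG. 11
    frozenset(("plasma", "display")): None,       # hypothetical: listed as related only
}

print(dictionary_relevance("satellite", "H-company", related_word_dictionary))  # 0.15
print(dictionary_relevance("plasma", "display", related_word_dictionary))       # 1.0
print(dictionary_relevance("plasma", "digital", related_word_dictionary))       # 0.0
```

Keying on a `frozenset` makes the lookup independent of the order of the two words, which matches generating each combination only once in step 1010.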

次に、図１１に示すように、特徴語Ａと特徴語Ｂ間の単語間関連度が、予め設定された関連性判定閾値以上であった場合、ステップ１０１３〜ステップ１０１４を実行してグルーピング処理６０２を行って関連単語リスト６１２を作成する（ステップ１０１２）。
まず、特徴語Ｂを特徴語Ａに関する関連単語リストに入れ、ワークエリア１７０に格納する（ステップ１０１３）。
次に、特徴語Ａを特徴語Ｂに関する関連単語リストに入れ、ワークエリア１７０に格納する（ステップ１０１４）。 Next, as shown in FIG. 11, when the degree of association between words for feature word A and feature word B is equal to or higher than a preset relevance determination threshold, steps 1013 and 1014 are executed as the grouping process 602 to create the related word lists 612 (step 1012).
First, the feature word B is put in the related word list for the feature word A and stored in the work area 170 (step 1013).
Next, the feature word A is put into the related word list for the feature word B and stored in the work area 170 (step 1014).

次に、図１１に示すように、各特徴語に関する関連単語リスト間でリストに含まれる特徴語を比較する。この結果、含まれる特徴語が同じである関連単語リスト間については重複排除処理６０３をして１つの関連単語リストにまとめる。この結果、最終的に得られた関連単語リストをサブトピックプロファイル６１３とし、ワークエリア１７０に格納する（ステップ１０２０）。 Next, as shown in FIG. 11, the feature words contained in the related word lists of the individual feature words are compared. Related word lists that contain the same feature words are merged into one related word list by the deduplication process 603. The related word lists finally obtained are taken as the subtopic profiles 613 and stored in the work area 170 (step 1020).

以上が、サブトピック抽出プログラム１４１ａでの処理手順である。 The processing procedure in the subtopic extraction program 141a has been described above.

なお、特徴語間のグルーピングについては、第１の実施の形態と同様に、各特徴語に関する関連単語リストを生成する方法で行なうものとしたが、予め設定されたグループ数に基づいて、一般的なクラスタリング手法である最小距離法、最大距離法、群平均法およびK-Means法を用いて特徴語間をグルーピングしてもよいし、その他のグルーピング手法を用いてもよい。 As in the first embodiment, the feature words are grouped by generating a related word list for each feature word. Alternatively, based on a preset number of groups, the feature words may be grouped using a general clustering method such as the minimum distance method, the maximum distance method, the group average method, or the K-Means method, or another grouping method may be used.

次に、検索結果分類制御部１０２−１４０における、図１０に示したサブトピック抽出プログラム１４１ａの具体的な処理の流れについて図１１を用いて説明する。まず、単語間関連度取得処理１１０１により、関連度辞書１１１１を参照することで、各特徴語間の単語間関連度１１１２を取得する。本図の実施例では、特徴語「H-company」と特徴語「satellite」の単語間関連度は、関連語辞書１１１１から“0.15”となる。以降、グルーピング処理６０２および重複排除処理６０３については、第１の実施の形態と同様な処理を行なう。 Next, a specific processing flow of the subtopic extraction program 141a shown in FIG. 10 in the search result classification control unit 102-140 will be described with reference to FIG. First, the word-to-word association degree acquisition process 1101 refers to the degree-of-relationship dictionary 1111 to obtain the degree of association between words 1112 between feature words. In the example of this figure, the degree of association between words of the feature word “H-company” and the feature word “satellite” is “0.15” from the related word dictionary 1111. Thereafter, grouping processing 602 and deduplication processing 603 are performed in the same manner as in the first embodiment.

以上が、サブトピック抽出プログラム１４１ａの具体的な処理の流れである。 The above is the specific processing flow of the subtopic extraction program 141a.

［第３の実施の形態］
次に、本発明に係る第３の実施の形態について図１２、図１３および図１４を用いて説明する。 [Third Embodiment]
Next, a third embodiment according to the present invention will be described with reference to FIG. 12, FIG. 13, and FIG.

第１の実施の形態におけるサブトピックラベル生成制御プログラム１５０では、サブトピックラベルの生成方法として、サブトピックプロファイルに含まれる特徴語を単に抽出するだけのものであった。しかし、単なる特徴語の羅列よりも文章の形で提示した方が特徴語間の関係が分かるため、サブトピックの内容が把握しやすい。このため、本発明に係る第３の実施の形態におけるサブトピックラベル生成制御プログラム１５０ａでは、サブトピックの内容が理解しやすいように、サブトピックラベルを文、段落、節および章のような文章の形で生成する。即ち、サブトピックラベル生成制御プログラム１５０ａで生成された各サブトピック（各関連単語リスト）に含まれる特徴語を用いて、文、段落、節および章のうち少なくとも１つ以上を、各分類に対する識別情報とする。 In the subtopic label generation control program 150 in the first embodiment, the feature word included in the subtopic profile is simply extracted as a method for generating the subtopic label. However, it is easier to grasp the contents of the subtopics because the relationship between the feature words can be understood by presenting them in the form of sentences rather than simply enumerating the feature words. Therefore, in the subtopic label generation control program 150a according to the third embodiment of the present invention, the subtopic label is changed to a sentence such as a sentence, paragraph, section, and chapter so that the contents of the subtopic can be easily understood. Generate in the form. That is, using the feature words included in each subtopic (each related word list) generated by the subtopic label generation control program 150a, at least one of sentences, paragraphs, sections and chapters is identified for each category. Information.

本第３の実施の形態では、図１に示した第１の実施の形態とほぼ同様の構成を取るが、サブトピックラベル生成制御部１０２−１５０でのサブトピックラベル生成制御プログラム１５０の構成が異なる。図１２に示すように本第３の実施の形態におけるサブトピックラベル生成制御プログラム１５０ａには、ラベル用特徴語抽出プログラム１５１の代わりに、テキストブロック分割プログラム１２０１とラベル用ブロック抽出プログラム１２０２が新たに加わるとともに、適合度算出プログラム１６１を呼び出す構成をとる。 The third embodiment has almost the same configuration as the first embodiment shown in FIG. 1, but the configuration of the subtopic label generation control program 150 in the subtopic label generation control unit 102-150 differs. As shown in FIG. 12, the subtopic label generation control program 150a of the third embodiment newly includes a text block division program 1201 and a label block extraction program 1202 in place of the label feature word extraction program 151, and is configured to call the fitness calculation program 161.

以下、サブトピックラベル生成制御部１０２−１５０における、第１の実施の形態とは異なるサブトピックラベル生成制御プログラム１５０ａの処理手順について、図１３に示すＰＡＤ図を用いて説明する。 Hereinafter, the processing procedure of the subtopic label generation control program 150a different from that of the first embodiment in the subtopic label generation control unit 102-150 will be described with reference to the PAD diagram shown in FIG.

まず、すべてのサブトピックについて、ステップ１３１０およびステップ１３２０を繰り返し実行する（ステップ１３００）
次に、該サブトピックに分類されたすべての適合テキストについて、ステップ１３１１〜ステップ１３１２を繰り返し実行する（ステップ１３１０）。 First, step 1310 and step 1320 are repeatedly executed for all subtopics (step 1300).
Next, Step 1311 to Step 1312 are repeatedly executed for all the matching text classified into the subtopic (Step 1310).

まず、テキストブロック分割プログラム１２０１を起動し、適合テキストを文などのブロックに分割する（ステップ１３１１）。 First, the text block division program 1201 is activated to divide the conforming text into blocks such as sentences (step 1311).

次に、該適合テキストのすべてのブロックについて、ステップ１３１３を繰り返し実行する（ステップ１３１２）。 Next, step 1313 is repeatedly executed for all blocks of the matching text (step 1312).

The fitness calculation program 161 is started and, using the total number of feature words in the subtopic profile and the number of feature words contained in the block, calculates the fitness of the block with respect to the subtopic (hereinafter, the block-specific fitness) by equation (6) below (step 1313).

Block-specific fitness for a subtopic = (number of feature words contained in the block) / (total number of feature words in the subtopic profile)   (6)
Next, the label block extraction program 1202 is started, and the block given the highest block-specific fitness for the subtopic is taken as the subtopic label of that subtopic (step 1320).
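Steps 1300 to 1320 for a single subtopic can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names, the period-based sentence splitting, the lowercase tokenisation, and the sample profile and texts are all assumptions made for the example. It also assumes that "number of feature words contained in the block" counts each distinct feature word at most once.

```python
import re

def split_into_blocks(text):
    """Step 1311 (assumed rule): split a matching text into blocks after each period."""
    return [b.strip() for b in re.split(r"(?<=\.)\s+", text) if b.strip()]

def block_fitness(block, profile_words):
    """Step 1313, equation (6): feature words found in the block / total feature words."""
    if not profile_words:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", block.lower()))  # assumed tokenisation
    hits = sum(1 for word in profile_words if word.lower() in tokens)
    return hits / len(profile_words)

def generate_subtopic_label(profile_words, matching_texts):
    """Steps 1310 to 1320: return the best-scoring block and its fitness."""
    best_block, best_score = None, -1.0
    for text in matching_texts:                  # step 1310: every matching text
        for block in split_into_blocks(text):    # steps 1311 and 1312: every block
            score = block_fitness(block, profile_words)
            if score > best_score:               # first block wins on ties
                best_block, best_score = block, score
    return best_block, best_score                # step 1320

# Hypothetical data, not taken from the patent figures.
profile = ["plasma", "television", "technology", "panel", "contrast"]
texts = [
    "The panel is thin. Plasma television technology improves contrast.",
    "Prices fell last year.",
]
label, score = generate_subtopic_label(profile, texts)
print(label, score)
```

Running the outer loop of step 1300 simply means calling `generate_subtopic_label` once per subtopic with that subtopic's profile and matching texts.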

The above is the processing procedure of the subtopic label generation control program 150a.
Although equation (6) is used to calculate the block-specific fitness in step 1313, another fitness formula, such as the cosine measure of the vector space model, may be applied instead.
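As an illustration of the cosine alternative, the sketch below scores a block against a subtopic profile by treating both as term vectors. The function name, the use of raw term counts for the block, and the per-word profile weights are assumptions for the example; the patent does not specify how the vectors would be built.

```python
import math
from collections import Counter

def cosine_fitness(block_tokens, profile_weights):
    """Cosine of the angle between a block's term-count vector and a profile vector.

    block_tokens: list of tokens in the block (assumed already tokenised).
    profile_weights: dict mapping each feature word to a weight (assumed; use 1.0
    for every word if the profile carries no weights).
    """
    block_vec = Counter(block_tokens)
    dot = sum(block_vec[word] * weight for word, weight in profile_weights.items())
    norm_block = math.sqrt(sum(count * count for count in block_vec.values()))
    norm_profile = math.sqrt(sum(w * w for w in profile_weights.values()))
    if norm_block == 0 or norm_profile == 0:
        return 0.0  # an empty block or empty profile has no defined direction
    return dot / (norm_block * norm_profile)

# Hypothetical profile and blocks.
weights = {"plasma": 1.0, "television": 1.0}
print(cosine_fitness(["plasma", "television"], weights))
print(cosine_fitness(["price", "drop"], weights))
```

Unlike equation (6), this measure also penalises long blocks that contain many words outside the profile, which may or may not be desirable for label selection.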

In step 1320, the block given the highest block-specific fitness for each subtopic is taken as the subtopic label. If more than one block has that highest value, a single block may be chosen using, for example, the fitness of its matching text to the search condition, the subtopic-specific fitness, and the position at which the block appears, or by some other method. Alternatively, several blocks whose block-specific fitness exceeds a preset block-specific fitness determination threshold may all be used as subtopic labels. This allows the content of the subtopic to be presented in detail.
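The two variations of step 1320 can be sketched as follows. The `Candidate` record, the function names, and above all the order in which the tie-breaking criteria are applied are assumptions for illustration; the text only says that these quantities may be used, not how they are combined.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    block: str               # text of the block
    block_fitness: float     # block-specific fitness, e.g. from equation (6)
    query_fitness: float     # fitness of the source text to the search condition
    subtopic_fitness: float  # subtopic-specific fitness of the source text
    position: int            # index of the block within its text

def pick_single_label(candidates):
    """Choose one block; ties on block fitness fall back to the other criteria.

    Assumed priority: search-condition fitness, then subtopic-specific fitness,
    then earlier position in the text.
    """
    best = max(
        candidates,
        key=lambda c: (c.block_fitness, c.query_fitness, c.subtopic_fitness, -c.position),
    )
    return best.block

def labels_above_threshold(candidates, threshold):
    """Return every block whose block fitness exceeds the threshold, best first."""
    chosen = [c for c in candidates if c.block_fitness > threshold]
    chosen.sort(key=lambda c: c.block_fitness, reverse=True)
    return [c.block for c in chosen]

# Hypothetical candidates.
candidates = [
    Candidate("Block A.", 0.6, 0.7, 0.5, 2),
    Candidate("Block B.", 0.6, 0.9, 0.4, 0),
    Candidate("Block C.", 0.3, 0.9, 0.9, 1),
]
print(pick_single_label(candidates))
print(labels_above_threshold(candidates, 0.5))
```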

Although the subtopic label generation control program 150a uses all matching texts as the information source for generating subtopic labels, thresholds may be set for the fitness to the search condition and for the subtopic-specific fitness calculated by the fitness calculation program 161, and only the matching texts that exceed the respective thresholds may be used. This makes it possible to present the searcher with more accurate topic labels.
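A minimal sketch of this filtering is shown below. The `MatchingText` record and the function name are hypothetical, and treating a threshold of `None` as "not set" is an assumption made so that either threshold can be used on its own.

```python
from typing import NamedTuple

class MatchingText(NamedTuple):
    doc_id: int
    text: str
    query_fitness: float     # fitness to the search condition
    subtopic_fitness: float  # subtopic-specific fitness

def select_label_sources(texts, query_threshold=None, subtopic_threshold=None):
    """Keep only matching texts that exceed every threshold that has been set."""
    selected = []
    for t in texts:
        if query_threshold is not None and t.query_fitness <= query_threshold:
            continue
        if subtopic_threshold is not None and t.subtopic_fitness <= subtopic_threshold:
            continue
        selected.append(t)
    return selected

# Hypothetical matching texts for one subtopic.
sources = [
    MatchingText(1, "first text", 0.8, 0.7),
    MatchingText(2, "second text", 0.4, 0.9),
    MatchingText(3, "third text", 0.9, 0.2),
]
print([t.doc_id for t in select_label_sources(sources, 0.5, 0.5)])
```

Only the texts returned here would then be split into blocks and scored for label selection.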

When the search condition is a seed text, not only the matching texts but also the seed text may be used as an information source for generating subtopic labels. This makes it possible to present topic labels that suit the searcher's purpose. That is, in the topic label generation step (identification information assigning step), when the search condition is a seed text, at least one of the seed text and the matching documents (matching texts) associated with the related word list is used to perform the element-specific (block-specific) fitness calculation based on the fitness calculation program 161 and the classification identification element determination based on the label element extraction program.

A concrete processing flow of the subtopic label generation control program 150a in the subtopic label generation control unit 102-150, which differs from that of the first embodiment, is described below with reference to FIG. 14.

First, to generate the subtopic label for subtopic 1, the figure shows that the subtopic profile 1413 of subtopic 1 has been selected from the subtopic profiles 502, and that the matching text 1414 of document 2, which belongs to the classification of subtopic 1, has been selected from the classification determination result 506.

Next, the text block division program 1201 is started and divides the matching text into blocks. In the example in the figure, document 2 is divided into blocks using the period as the block boundary string. As a result, it is divided into four blocks, block 1 to block 4, giving the block division result 1410.

Next, the fitness calculation program 161 is started and calculates, for each block of the matching text, the block-specific fitness with respect to the subtopic profile using equation (6) above. In the example in the figure, the values "0.6", "0.3", "0.0", and "0.1" are calculated as the block-specific fitness of blocks 1 to 4 of document 2 for subtopic 1.

The block division and the block-specific fitness calculation described above are performed for every matching text belonging to the classification of the subtopic. As a result, the figure shows the block-specific fitness result 1411 for all matching texts belonging to the classification of subtopic 1. In the block-specific fitness result 1411 in the figure, "D" denotes the document number and "B" denotes the block number.

Next, the label block extraction program 1202 is started, extracts the block with the highest block-specific fitness from the block-specific fitness result 1411, and takes the extracted block as the subtopic label. In the example in the figure, block 4 of document 2 has the highest block-specific fitness, so the subtopic label 1412 of subtopic 1 is "H-company BBB's Series with technology is truly the ultimate in plasma television."

The above is the concrete processing flow of the subtopic label generation control program 150a. This series of processes is performed for every subtopic.

As described above, according to the third embodiment of the present invention, the searcher can easily understand the content of each subtopic and can therefore find the target document efficiently and accurately.

As described above, the embodiments of the present invention are characterized by having an identification information assigning step that, when the documents of a search result set are classified and displayed, assigns identification information about each classification (subtopic profile 502, subtopic label 503, search condition fitness 501, subtopic-specific fitness 504, fitness determination threshold 505, and so on).

Further, in the identification information assigning step, the feature words contained in each related word list generated in the related word list generation step are used as the identification information for each classification.

Further, in the identification information assigning step, the feature words contained in each related word list generated in the related word list generation step are used to select at least one of a sentence, paragraph, section, and chapter as the identification information for each classification.

Further, the identification information assigning step includes: an element-specific fitness calculation step of calculating, for the elements (sentences, paragraphs, sections, and chapters) contained in the matching documents that are associated with the related word list generated in the related word list generation step as the classification determination result of the classification determination step, an element-specific fitness 1411 with respect to the related word list based on the fitness calculation program 161; and a classification identification element determination step of determining, from the element-specific fitness 1411 calculated in the element-specific fitness calculation step, for example by means of the label block extraction program 1202, the element 1412 to be used as the identification information of the classification for the related word list.

Further, in the identification information assigning step (label generation step), when the search condition is a seed text, the element-specific fitness calculation step and the classification identification element determination step are performed using at least one of the seed text and the matching documents associated with the related word list.

A diagram showing the overall configuration of the document search system in the first embodiment of the present invention, centered on its programs.
A diagram showing the overall configuration of the document search system in the first embodiment of the present invention in functional terms.
A PAD explaining the search control program 130 executed by the search control unit in the first embodiment of the present invention.
A PAD explaining the search result classification control program 140 executed by the search result classification control unit in the first embodiment of the present invention.
A PAD explaining the subtopic extraction program 141 executed by the search result classification control unit in the first embodiment of the present invention.
A diagram for explaining the concrete processing flow of the search control program 130 in the search control unit and related units in the first embodiment of the present invention.
A diagram for explaining the concrete processing flow of the subtopic extraction program 141 in the search result classification control unit in the first embodiment of the present invention.
A diagram showing a search result output screen as an output example of the search result output program 134 in the first embodiment of the present invention.
A diagram showing, as an output example of the search result output program 134 in the first embodiment of the present invention, a search result output screen with an interface for choosing between ordering by fitness to the seed text and ordering by subtopic-specific fitness.
A diagram showing, as an output example of the search result output program 134 in the first embodiment of the present invention, a search result output screen indicating how many matching documents exist for each subtopic, and a search result output screen for the subtopic selected by the searcher.
A PAD explaining the subtopic extraction program 141a executed by the search result classification control unit in the second embodiment of the present invention.
A diagram for explaining the concrete processing flow of the subtopic extraction program 141a in the search result classification control unit in the second embodiment of the present invention.
A diagram showing the configuration of the subtopic label generation control program 150a in the subtopic label generation control unit in the third embodiment of the present invention.
A PAD explaining the subtopic label generation control program 150a executed by the subtopic label generation control unit in the third embodiment of the present invention.
A diagram explaining the concrete processing flow of the subtopic label generation control program 150a in the subtopic label generation control unit in the third embodiment of the present invention.

Explanation of symbols

100: display, 101: keyboard, 102: central processing unit (CPU), 102-110: system control unit, 102-120: registration control unit, 102-130: search control unit, 102-140: search result classification control unit, 102-150: subtopic label generation control unit, 102-161: fitness calculation unit, 103: magnetic disk device, 104: flexible disk drive (FDD), 105: main memory, 106: bus, 107: network, 108: flexible disk,
110: system control program, 120: registration control program, 130: search control program, 121: document file acquisition file, 122: text registration program, 131: search condition acquisition program, 132: feature word extraction program, 133: text reading program, 134: search result output program, 140: search result classification control program, 141: subtopic extraction program, 142: classification determination program, 150: subtopic label generation control program, 151: label feature word extraction program, 160: shared library, 161: fitness calculation program, 170: work area, 180: text, 150a: subtopic label generation control program, 500: feature words, 501: matching text, 502: subtopic profile, 503: subtopic label, 504: subtopic-specific fitness, 505: subtopic fitness determination threshold, 506: classification determination result, 510: seed text, 600: appearance pattern generation process, 601: inter-word relevance calculation process, 602: grouping process, 603: deduplication process, 610: appearance pattern, 611: inter-word relevance, 612: related word list, 613: subtopic profile, 700, 800: search result list display, 900: search result, 901: search result list display for subtopic 1, 1101: inter-word relevance acquisition, 1111: related word dictionary, 1201: block division program, 1202: label block extraction program, 1410: block division result, 1411: block-specific fitness result, 1412: subtopic label of subtopic 1, 1413: selected subtopic profile, 1414: selected matching text of document 2.

Claims

A search result presentation device that classifies and displays search results for a search condition, comprising:
feature word extraction means for extracting a plurality of feature words from a document input as the search condition;
related word list generation means for searching the documents to be searched using the feature words extracted by the feature word extraction means, thereby acquiring a plurality of matching documents containing the feature words, determining, for each of the acquired matching documents, whether each combination of a plurality of the feature words extracted by the feature word extraction means appears, determining the relatedness between the feature words based on the result of that appearance determination, and generating related word lists that each group related feature words;
related word list fitness calculation means for calculating a related word list fitness of each matching document with respect to each related word list generated by the related word list generation means;
classification determination means for determining, from the related word list fitness calculated by the related word list fitness calculation means, the degree to which the matching document fits the related word list and, when the fitness is determined to be high, holding the matching document in association with the related word list;
classification identification information assigning means for assigning classification identification information to the set of matching documents associated with each related word list by the classification determination means; and
search result display means for displaying the document set with the identification information generated by the classification identification information assigning means assigned to each classification.

The search result presentation device according to claim 1, further comprising:
search condition fitness calculation means for calculating a search condition fitness of each matching document with respect to the specified search condition; and
means for displaying, when the search results are displayed, the matching documents for each related word list, based on the result determined by the classification determination means, in descending order of either the search condition fitness calculated by the search condition fitness calculation means or the related word list fitness calculated by the related word list fitness calculation means.

The search result presentation device according to claim 1, further comprising per-list document count display means for displaying, based on the result determined by the classification determination means, the number of matching documents associated with each related word list.

The search result presentation device according to claim 1, wherein the classification identification information assigning means has means for using the feature words contained in each related word list generated by the related word list generation means as the identification information of each classification.

The search result presentation device according to claim 1, wherein the classification identification information assigning means has:
element-specific fitness calculation means for calculating, for the elements (sentences, paragraphs, sections, and chapters) contained in the matching documents associated with the related word list by the classification determination means, an element-specific fitness with respect to the related word list; and
classification identification information element determination means for determining, from the element-specific fitness with respect to the related word list calculated by the element-specific fitness calculation means, the element to be used as the identification information of each classification.

The search result presentation device according to claim 1, wherein the classification identification information assigning means has:
element-specific fitness calculation means for calculating, when the search condition is a seed text, an element-specific fitness with respect to the related word list for the elements (sentences, paragraphs, sections, and chapters) contained in the seed text; and
classification identification information element determination means for determining, from the element-specific fitness with respect to the related word list calculated by the element-specific fitness calculation means, the element to be used as the identification information of each classification.

The search result presentation device according to claim 1, wherein, when the search condition is a logical expression containing feature words and the logical relationships between them, the related word list generation means converts the logical expression into sum-of-products (disjunctive) normal form and generates related word lists that each group the set of keywords joined by a product in the converted sum-of-products normal form.
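The conversion described in the last claim can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the claimed means: the nested-tuple representation of the logical expression and the function name `to_related_word_lists` are hypothetical, and only AND and OR are handled because the claim does not say how negation would be treated.

```python
from itertools import product

def to_related_word_lists(expr):
    """Convert an AND/OR keyword expression to sum-of-products form.

    expr is either a keyword string or a tuple ("and" | "or", operand, ...).
    Returns a list of frozensets; each frozenset is one product term, that is,
    one related word list of keywords that must appear together.
    """
    if isinstance(expr, str):
        return [frozenset([expr])]
    op, *operands = expr
    parts = [to_related_word_lists(operand) for operand in operands]
    if op == "or":
        # A sum of sums is just the concatenation of all product terms.
        terms = [term for part in parts for term in part]
    elif op == "and":
        # Distribute AND over OR: one term per combination of operand terms.
        terms = [frozenset().union(*combo) for combo in product(*parts)]
    else:
        raise ValueError(f"unsupported operator: {op!r}")
    unique, seen = [], set()
    for term in terms:  # drop duplicate product terms, keeping first-seen order
        if term not in seen:
            seen.add(term)
            unique.append(term)
    return unique

# Hypothetical search condition: plasma AND (television OR display)
condition = ("and", "plasma", ("or", "television", "display"))
for words in to_related_word_lists(condition):
    print(sorted(words))
```

For this condition the result contains two related word lists, one for "plasma" with "television" and one for "plasma" with "display".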