JP3848014B2

JP3848014B2 - Document search method and document search apparatus

Info

Publication number: JP3848014B2
Application number: JP15253999A
Authority: JP
Inventors: 達也出羽
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1999-05-31
Filing date: 1999-05-31
Publication date: 2006-11-22
Anticipated expiration: 2019-05-31
Also published as: JP2000339342A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の文書の中から検索要求に応じた文書を検索（キーワード検索、類似文書検索）するための文書検索方法およびそれを用いた文書検索装置に関する。
【０００２】
【従来の技術】
近年のパーソナルコンピューターの普及に伴って大量の電子化文書が作成されるようになり、更にコンピューターネットワークの普及に伴ってそれらの大量の電子化文書へのアクセスが容易になってきた。しかし、アクセス可能な文書が増えれば増えるほど、その中からユーザが必要とする文書を探し出すのが困難になり、折角の情報が活用されないという事態になりかねない。そこで、大量の文書の中からユーザが必要としているものを選び出す文書検索装置、とりわけ、タイトルや作成者といった書誌情報だけでなく、文書の内容を利用した全文検索技術を用いた文書検索装置への需要が高まりつつある。
【０００３】
【発明が解決しようとする課題】
従来の文書検索装置においては、検索対象となる文書に対し、形態素解析処理を施す等して語句を抽出し、抽出した語句を文書内出現頻度や出現文書数で重み付けすることにより索引を作成することが一般的に行われている。このような文書全体から語句を抽出し索引を作成する方法は、特許明細書や学術論文といった長い文書を対象とした場合、重要でない（当該文書の内容的な特徴を表した箇所でない）箇所に出現する語句を抽出してしまうという問題がある。
【０００４】
このような問題を回避するため、特に、構成が定型化された（構造化された）文書（構造化された文書の例として、特許明細書や学術論文等があり、特許明細書の場合、「特許請求の範囲」「発明の詳細な説明」「発明の実施の形態」等の項目毎の構成要素があり、学術論文の場合、「アブストラクト」「本文」等の構成要素がある）では、その文書の構成要素のうち、特許明細書であれば請求項、学術論文等であればアブストラクト等、その文書の要旨を簡潔に表現した主構成要素だけから語句を抽出して索引を作成するという方法がとられることもある。しかし、このような部分はより抽象度の高い語で記述されることが多いため、ユーザの検索要求がより具体的な語句で記述された場合には、検索結果から洩れてしまう危険が大きい。
【０００５】
一方、ユーザがそのような危険を考慮して抽象的な言葉で検索要求を記述した場合には、不要な文書が多数マッチしてしまうという問題がある。
【０００６】
本発明はこのような実情に鑑みてなされたものであり、文書の内容を的確に表した語句を抽出して当該文書の検索のために用いる索引を作成することにより、文書の内容に即した精度の高い文書の検索を可能にする文書検索方法およびそれを用いた文書検索装置を提供することを目的とする。
【０００７】
【課題を解決するための手段】
（１）本発明の文書検索方法は、複数の文書の中から入力された検索要求に応じた文書を検索する文書検索方法において、前記文書は複数の構成要素で構造化された文書であって、前記文書の予め定められた主たる構成要素の中から第１の語句を抽出し、さらに、前記文書の前記主たる構成要素以外の構成要素の中から前記第１の語句との間で所定の条件を満たす第２の語句を抽出し、前記複数の文書のそれぞれから抽出された前記第１および第２の語句と前記検索要求とに基づき文書を検索することを特徴とする。
【０００８】
本発明の文書検索方法は、複数の文書の中から入力された文書に類似する文書を検索するための文書検索方法において、前記文書は複数の構成要素で構造化された文書であって、前記入力された文書と検索対象の前記複数の文書のそれぞれから、該文書の予め定められた主たる構成要素の中から第１の語句を抽出し、さらに、該文書の前記主たる構成要素以外の構成要素の中から前記第１の語句との間で所定の条件を満たす第２の語句を抽出し、前記入力された文書と前記検索対象の複数の文書との間で、そのそれぞれから抽出された前記第１および第２の語句の類似度を求めて、前記入力された文書に類似する文書を前記検索対象の複数の文書の中から検索することを特徴とする。
【０００９】
本発明によれば、文書の内容を的確に表した第１の語句（基本語）と第２の語句（拡張語）を抽出して当該文書を検索するために用いる索引を作成することにより、文書の内容に即した精度の高い文書の検索を可能にする。
【００１０】
好ましくは、予め定められた言語表現にて前記第１の語句に関連付けられた語句を第２の語句として抽出する。
【００１１】
また、好ましくは、前記第１の語句を項とする述語と同じ述語の項になっている語句を第２の語句として抽出する。
【００１２】
（２）本発明の文書検索装置は、複数の文書の中から入力された検索要求に応じた文書を検索する文書検索装置において、前記文書は複数の構成要素で構造化された文書であって、
前記文書の予め定められた主たる構成要素の中から第１の語句を抽出する第１の抽出手段と、
前記文書の前記主たる構成要素以外の構成要素の中から前記第１の語句との間で所定の条件を満たす第２の語句を抽出する第２の抽出手段と、
前記複数の文書のそれぞれから抽出された前記第１および第２の語句と前記検索要求とに基づき文書を検索する検索手段と、
を具備したことを特徴とする。
【００１３】
本発明の文書検索装置は、複数の文書の中から入力された文書に類似する文書を検索するための文書検索装置において、前記文書は複数の構成要素で構造化された文書であって、
前記入力された文書と検索対象の前記複数の文書のそれぞれから、該文書の予め定められた主たる構成要素の中から第１の語句を抽出する第１の抽出手段と、前記入力された文書と前記検索対象の複数の文書のそれぞれから、前記主たる構成要素以外の構成要素の中から前記第１の語句との間で所定の条件を満たす第２の語句を抽出する第２の抽出手段と、
前記入力された文書と前記検索対象の複数の文書との間で、そのそれぞれから抽出された前記第１および第２の語句の類似度を求めて、前記入力された文書に類似する文書を前記検索対象の複数の文書の中から検索する検索手段と、
を具備したことを特徴とする。
【００１４】
本発明によれば、文書の内容を的確に表した第１の語句（基本語）と第２の語句（拡張語）を抽出して当該文書を検索するために用いる索引を作成することにより、文書の内容に即した精度の高い類似文書の検索を可能にする。
【００１５】
好ましくは、予め定められた言語表現にて前記第１の語句に関連付けられた語句を第２の語句として抽出する。
【００１６】
また、好ましくは、前記第１の語句を項とする述語と同じ述語の項になっている語句を第２の語句として抽出する。
【００１７】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態について説明する。
【００１８】
図１に、本実施形態にかかる文書検索装置の機器構成例を示したものである。図１に示すように、この実施形態の文書検索装置は、本発明の文書検索処理を実行するためのプログラムや各種データを記憶する外部記憶装置１０２、外部記憶装置１０２に記憶されたプログラムを実行するＣＰＵ１０１、他のコンピュータから公衆網、専用線等の通信ネットワークを介して所定のデータを読み込む通信装置１０３、検索要求等ユーザからの指示を入力するためのキーボード１０４、マウス１０５、検索結果等を表示する表示装置１０６をバスを介して互いに接続してなる。
【００１９】
図２は、本実施形態にかかる文書検索装置の機能ブロック図である。図２に示すように、この実施形態の文書検索装置は、検索要求等ユーザからの指示を入力する入力部２０３、検索結果を表示する出力部２０１、検索対象となる文書群を格納する文書格納部２１１、文書群から語句を抽出して索引を作成する索引作成部２０４、文書群を検索するための索引を記憶する索引格納部２０９、索引格納部２０９に格納された索引を参照してユーザからの検索要求に適合した文書を選択する文書検索部２０８、索引を作成したり、ユーザの検索要求から語句を抽出するための語句抽出部２０５、検索対象となる構造化された文書の構成要素を認識する文書構造認識部２１０、ユーザからの指示により索引作成部２０４や文書検索部２０８を起動する制御部２０２からなる。
【００２０】
語句抽出部２０５は、検索対象の文書の主たる構成要素から基本語を抽出する基本語抽出部２０６と、基本語に追加するための語句を主たる構成要素以外の要素から抽出する拡張語抽出部２０７からなる。
【００２１】
図２の各構成部（出力部２０１、制御部２０２、入力部２０３、索引作成部２０４、語句抽出部２０５、文書検索部２０８、および文書構造認識部２１０）は、図１の外部記憶装置１０２に記録されてＣＰＵ１０１によって実行制御されるプログラムとして構成され、また、索引格納部２０９および文書格納部２１１は、外部記憶装置１０２または、通信装置１０３を介してつながっている他のコンピューターの外部記憶装置上に構築されていてもよい。この場合、入力部２０３は図１のキーボード１０４およびマウス１０５を介して入力された検索要求等のユーザからの指示を受け取り、また、出力部２０１は検索結果を図１の表示装置１０６に表示するためのものである。
【００２２】
以上のような構成により、図２に示す文書検索装置は、入力した文書の内容に類似する文書を検索する（類似文書検索）。
【００２３】
なお、ここでは、検索対象の文書として、例えば、図３（ａ）に示すような特許明細書を入力部２０３から入力し、類似特許検索を行う場合の各部の動作について説明する。
【００２４】
検索に先立って、文書格納部２１１に既に格納されている複数の特許明細書のそれぞれから索引を作成しておく。索引の作成は、制御部２０２が索引作成部２０４を呼び出すことにより行われる。
【００２５】
索引作成部２０４は、語句抽出部２０５を呼び出して、文書格納部２１１に格納されている特許明細書から語句を抽出し、抽出した語句から索引を作成する。語句抽出部２０５は、基本語抽出部２０６と拡張語抽出部２０７とからなる。基本語抽出部２０６は、文書構造認識部２１０を呼び出して、特許明細書の構成要素のうち、「特許請求の範囲」という項目の構成要素の文章のみを取り出し、取り出された文章全体から索引語句を抽出する。拡張語抽出部２０７は、文書構造認識部２１０を呼び出して、特許明細書の構成要素のうち、「発明の実施の形態」という項目の構成要素の文章のみを取り出し、基本語抽出部２０６により抽出された索引語句を拡張するための語句を抽出する。
【００２６】
制御部２０２は、入力部２０３から特許明細書が入力されると、文書検索部２０８を呼び出す。文書検索部２０８は、語句抽出部２０５を呼び出すことにより、入力された特許明細書から語句を抽出する。さらに、文書検索部２０８は、抽出された語句と、索引格納部２０９に格納された索引を参照することにより、入力された特許明細書と文書格納部２１１に格納された各文書との間の類似度を計算する。制御部２０２は、類似度の高い特許明細書のリストを出力部２０１よりユーザに呈示する。
【００２７】
次に、索引作成部２０４の処理について、図３（ｂ）に示す特許明細書の索引を作成する場合を例にとり詳述する。
【００２８】
図４に索引作成部２０４の処理の流れを示す。索引作成部２０４は、文書格納部２１１から特許明細書を１つずつ取り出して、索引を作成する。文書格納部２１１では、１つの特許明細書が１つのファイルとして格納されており、各特許明細書には固有のファイル名が付けられている。例えば、図３（ｂ）に示す特許明細書には「特開平０１−９９９９９９．ｔｘｔ」というファイル名が付けられている。
【００２９】
索引格納部２０９では、各ファイルは番号で管理されているため、各ファイルに番号を付けて登録する（ステップＳ２）。次に、基本語の抽出を行う（ステップＳ３）。
【００３０】
図５に図４のステップＳ３の処理の流れを示す。
【００３１】
図５において、まず、文書構造認識部２１０が呼び出されて、特許明細書の構成要素のうち、「特許請求の範囲」という項目の構成要素の文章のみが取り出される（ステップＳ１１）。図３（ｂ）の特許明細書から取り出された「特許請求の範囲」に書かれた文章の例を図７に示す。この「特許請求の範囲」に書かれた文章に対し、形態素解析を施す（ステップＳ１２）。形態素解析の方法については広く公知であるのでここでは詳述しない。
【００３２】
図７の文章に対して形態素解析を施した結果の一部を図８に示す。図８では、１行に１形態素の情報が出力されており、行頭からスペースで区切られて、形態素表記、読み、基本形表記、品詞、品詞番号、細品詞、細品詞番号、活用型、活用型番号、活用形、活用形番号が並んでいる。情報がない場合は、「＊」が記されている。
【００３３】
ステップＳ１２の形態素解析の結果から、名詞、動詞、形容詞、記号、未知語等の品詞を持った語を索引語として抽出する（ステップＳ１３）。図９は、図８の形態素解析の結果から抽出された索引語のリストの一部を示したものである。
【００３４】
一方、ステップＳ１２の形態素解析の結果からは名詞句リストも抽出される（ステップＳ１４）。名詞句とは、ここでは、名詞、記号、未知語、形容詞語幹、形容詞連体形の連接、あるいは助詞「の」を介した連続を指す。図１０は、図８の形態素解析の結果から抽出された名詞句のリストの一部を示したものである。図１０において、左側が名詞句の表記で、名詞句が複数の形態素から構成される場合は、形態素間の境界を「／」で示している。右側は名詞句を構成する形態素の品詞を記している。表記の場合と同様、形態素間の境界は「／」で示している。
【００３５】
さらに、ステップＳ１２の形態素解析の結果から、対応関係にある述語と項のリスト（述語−項リスト）を抽出する（ステップＳ１５）。なお、ここでは、述語は動詞に限定しており、項は名詞句の形で抽出する。図１１は、図８の形態素解析の結果から抽出された述語−項リストの一部を示したものである。図１１において、抽出された各述語に対し、その各述語が取る項を右側に記している。一つの述語が複数の項を取る場合は、項と項の間を「；」で区切っている。各項は、「表記（品詞）」の形で記されており、１つの項が複数の形態素で構成される場合は、形態素間の境界は「／」で示している。
【００３６】
図４の説明に戻る。ステップＳ３で基本語の抽出が終了したら、次に、図５のステップＳ１４、ステップＳ１５で抽出したリストを基に、ステップＳ１３で抽出された索引語リストの拡張を行う（ステップＳ４）。
【００３７】
図６は、図４のステップＳ４の処理の流れを示したフローチャートである。
【００３８】
まず、文書構造認識部２１０が呼び出されて、特許明細書の構成要素のうち、「発明の実施の形態」に書かれた文章のみが取り出される（ステップＳ２１）。図３（ｂ）の特許明細書から取り出された「発明の実施の形態」に書かれた文章の一部を図１２に示す。この「発明の実施の形態」に書かれた文章に対し、形態素解析を施す（ステップＳ２２）。形態素解析処理については、図５のステップＳ１２と同様でよい。
【００３９】
形態素解析を施した結果から図５のステップＳ１５と同様にして、述語−項リストを抽出する（ステップＳ２３）。図１２の文章に対して形態素解析を施した結果から抽出した述語−項リストを図１３に示す。形式は図１１と同じである。
【００４０】
続いて、名詞句の言い換えを抽出する（ステップＳ２４）。名詞句の言い換えは、図１４に示すような文字列のパターンを予め用意しておき、これと文章との照合を行うことにより実現する。例えば、パターン番号「１」の文字列のパターンは、「としての」という文字列の前後に名詞句が出現したとき、前に出現した名詞句を拡張元名詞句として、後に出現した名詞句を拡張名詞句として抽出する。このパターンを用いて、例えば、図１２の文章中にある「広域の地名としての都道府県名」という部分から、拡張元名詞句として「広域の地名」が、拡張名詞句として「都道府県名」が抽出される。図１２の文章から抽出された拡張元名詞句と拡張名詞句との対応を示したリスト、すなわち、名詞句の言い換えリストを図１５に示す。
【００４１】
次に、ステップＳ２３で抽出された述語−項リストと、ステップＳ２４で抽出された名詞句言い換えリストを用いて、拡張語の選択を行う（ステップＳ２５）。
【００４２】
図１９は、図６のステップＳ２５の拡張語選択処理の流れを示したフローチャートである。拡張語選択処理としては、述語−項リストを用いる場合と、名詞句の言い換えリストを用いる場合とがあるが、ここでは、図１９では、述語−項リストを用いる場合を例にとり説明する。
【００４３】
「発明の実施の形態」の文章から抽出された述語−項リスト中の述語と項との組を１組ずつ取り出し、全ての組について、図５のステップＳ１５で「特許請求の範囲」の文章から抽出された述語−項リスト中の述語と照合する（ステップＳ３１）。すなわち、「発明の実施の形態」の文章から抽出された述語−項リストから述語と項との組を１組取り出す（ステップＳ３２）。そして、この述語と同じ述語が「特許請求の範囲」の文章から抽出された述語−項リスト中の述語にあるかどうか調べる。同じものがない場合は、ステップＳ３１に戻り、「発明の実施の形態」の文章から抽出された述語−項リストから次の述語と項との組を取り出す。
【００４４】
同じものがあった場合は（ステップＳ３３）、「発明の実施の形態」の文章から抽出された述語−項リスト中の当該述語の項を１つずつ調べる（ステップＳ３４、ステップＳ３５）。すなわち、図５のステップＳ１４で「特許請求の範囲」の文章から抽出した名詞句リストに「発明の実施の形態」の文章から抽出された述語−項リスト中の当該述語の項と同じものがないかどうか調べる（ステップＳ３６）。同じものがあった場合は、ステップＳ３４に戻って次の項をチェックする。同じものがなかった場合は、ステップＳ３７に進み、当該項を拡張語として登録し、再びステップＳ３４に戻って次の項をチェックする。ステップＳ３５〜ステップＳ３７の処理を「発明の実施の形態」の文章から抽出された述語−項リスト中の当該述語の項がなくなるまで繰り返す。
【００４５】
以上のようにして、図１３に示した「発明の実施の形態」の文章から抽出された述語−項リストのみを用いて拡張語を選択してもよいし、これに換えて、図１５の名詞句の言い換えリストを用いて拡張語を選択してもよい。さらに、述語−項リストと名詞句の言い換えリストを両方用いて拡張語を選択してもよい。
【００４６】
すなわち、図１９に示すような手順にて拡張語を選択した後、次に、図６のステップＳ２４にて「発明の実施の形態」の文章から抽出された図１５に示したような名詞句の言い換えリストにある拡張元名詞句と同じ名詞句が図５のステップＳ１４で「特許請求の範囲」の文章から抽出した名詞句リストに存在するか否か調べる。同じものがあった場合は、その名詞句の言い換えリストの拡張元名詞句に対応する拡張名詞句を拡張語とする。このとき、すでに拡張語として選択済みの拡張名詞句は無視する。
【００４７】
なお、図１５に示すような名詞句の言い換えリストを用いて拡張語を選択した後に、図１３に示すような述語−項リストを用いて拡張語を選択してもよい。
【００４８】
図１３の述語−項リストと図１５の名詞句の言い換えリストとを両方用いて得られた拡張語を図１６に示す。
【００４９】
図６の説明に戻り、ステップＳ２６では、ステップＳ２５で選択された、名詞句の形の拡張語を単語に展開する。例えば、図１６の拡張語を単語に展開したものを図１７に示す。ここでは、拡張語の出現頻度を一律「１」であるとして、展開された単語の頻度を計算している。
【００５０】
図５のステップＳ１３で抽出された図９に示したような索引語リストに図６のステップＳ２５〜ステップＳ２６で抽出されて、単語に展開された拡張語（図１７参照）を追加したものを図１８に示す。図１８において、索引語番号「１８」〜「２０」が新たに追加された語、すなわち、拡張語である。
【００５１】
ここで再び図４の説明に戻る。次の処理はステップＳ５である。ステップＳ５では、ステップＳ４までに得られた索引語とその出現頻度の情報を頻度表に書き出す。ここで作成される頻度表が索引格納部２０９に格納される索引に相当する。
【００５２】
頻度表の例を図２０に示す。縦軸に各文書の格納されているファイルを識別するためのファイル番号、横軸に基本語および拡張語として抽出された単語のそれぞれを識別するための単語番号が取られ、どのファイルに、どの単語が何回出現したかが記されている。
【００５３】
以上のような処理（図４参照）を、文書格納部２１１中の全ての特許明細書のファイルに対して実行する。文書格納部２１１中の全ての特許明細書のファイルに対して処理を終えると、ステップＳ６で、各索引語の文書頻度を数える。各索引語の文書頻度は、頻度表を縦に読んで、出現頻度が「１」以上のファイルの数を数えることにより得られる。各索引語の文書頻度を算出した例を図２１に示す。
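As a minimal sketch of the frequency table and the document-frequency count described in the preceding paragraphs (Python): the nested-dictionary layout, the sample counts, and the names `freq_table` and `document_frequencies` are assumptions made for illustration and do not come from the specification.

```python
from collections import Counter

def document_frequencies(freq_table):
    """Count, for each index word, the files in which it occurs at least once.

    freq_table maps a file number to a {index word: occurrence count} dict,
    i.e. one row of the frequency table per file (hypothetical layout).
    """
    df = Counter()
    for word_counts in freq_table.values():
        for word, count in word_counts.items():
            if count >= 1:  # "read the table vertically" and count non-zero cells
                df[word] += 1
    return dict(df)

# Made-up example table with three files.
freq_table = {
    1: {"地名": 2, "辞書": 1},
    2: {"地名": 1, "入力": 3},
    3: {"辞書": 0, "入力": 1},
}
df = document_frequencies(freq_table)
```

Storing only non-zero cells per file keeps the table sparse, which is one plausible design for a large collection; the specification itself does not prescribe a storage format.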
【００５４】
次に、文書検索部２０８の処理について、図３（ａ）に示す特許明細書に類似する特許明細書を検索する場合を例にとり詳述する。図２２に文書検索部２０８の処理の流れを示す。
【００５５】
入力部２０３より入力された図３（ａ）に示したような特許明細書に対し、基本語が抽出される（ステップＳ４１）。ステップＳ４１での処理は図４のステップＳ３の処理と同様である。図３（ａ）に示す特許明細書を入力としたとき、基本語抽出処理の過程で得られる情報を図２３〜２６に示す。図２３は、図３（ａ）に示す特許明細書から抽出された「特許請求の範囲」の文章である。図２４は、図２３の文章に形態素解析を施した結果から抽出した索引語リストである。図２５は、図２３の文章に形態素解析を施した結果から抽出した名詞句リストである。図２６は、図２３の文章に形態素解析を施した結果から抽出した述語−項リストである。同じく入力部２０３より入力された特許明細書に対して、ステップＳ４２で拡張語が抽出される。ステップＳ４２での処理は図４のステップＳ４と同様である。
【００５６】
図３（ａ）に示す特許明細書を入力としたとき、拡張語抽出処理の過程で得られるデータを図２７〜３１に示す。図２７は、図３（ａ）に示す特許明細書から抽出された「発明の実施の形態」の文章である。図２８は、図２７の文章から抽出した述語−項のリストである。図２７の文章からは、名詞句の言い換えは１つも抽出されなかった。図２９は、図２８の述語−項リストを用いて選択された拡張語である。図３０は、図２９の拡張語を単語に展開したものである。図３１は、図２４の索引語リストに、図３０の拡張語を加えたものである。
【００５７】
ステップＳ４３では、ステップＳ４１、Ｓ４２で抽出された語句を用い、索引格納部２０９に格納されている索引語リストを参照して、文書格納部２１１に格納された各特許明細書との類似度を計算する。例えば、図３（ａ）の特許明細書と図３（ｂ）の特許明細書との類似度の計算は、図３１の索引語と図１８の索引語とを比較することにより行われる。文書中に出現する単語の頻度情報を用いた文書間類似度の計算方法には様々な方法が知られているが、ここではどのようなものを用いてもよい。
【００５８】
例えば、各文書毎にその各索引語のｔｆ・ｉｄｆ値を次式から求める。
【００５９】
ｔｆ・ｌｏｇ（Ｎ／ｄｆ）
ｔｆ：当該索引語の当該文書中における出現頻度
Ｎ：総文書数
ｄｆ：当該索引語の文書頻度
そして、索引語番号を次元にとり索引語番号に対応する索引語のｔｆ・ｉｄｆ値を各次元の要素とする特徴ベクトルを求める。入力された文書（すなわち、ここでは、図３（ａ）の特許明細書）と検索対象の文書（例えば、図３（ｂ）の特許明細書）のそれぞれについて、特徴ベクトルを求める。あるいは、ｔｆ・ｉｄｆ値の替わりに、各文書毎に索引語番号に対するその文書内での当該索引語の出現頻度を特徴ベクトルの要素としてもよい。
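The weighting tf・log(N/df) and the resulting feature vector can be sketched as follows (Python). The natural logarithm, the treatment of words missing from the index as zero, and the names `tfidf_vector` and `vocabulary` are assumptions for illustration; the specification does not fix the logarithm base.

```python
import math

def tfidf_vector(tf, df, n_docs, vocabulary):
    """Build a feature vector with one tf * log(N / df) element per index word.

    tf:         {index word: frequency in this document}
    df:         {index word: number of documents containing the word}
    n_docs:     total number of documents N
    vocabulary: ordered list of index words (plays the role of index word numbers)
    """
    vector = []
    for word in vocabulary:
        if word in df and df[word] > 0:
            vector.append(tf.get(word, 0) * math.log(n_docs / df[word]))
        else:
            vector.append(0.0)  # assumption: words unknown to the index contribute nothing
    return vector

# Made-up numbers: 4 documents in total.
vec = tfidf_vector({"地名": 3, "辞書": 1}, {"地名": 2, "辞書": 4}, 4, ["地名", "辞書", "入力"])
```

A word that occurs in every document (here "辞書") receives weight 0 because log(N/df) = log(1) = 0, which is the intended effect of the idf factor.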
【００６０】
図２４に示した索引語リストを基に索引語の出現頻度を用いて作成された図３（ｂ）に示した特許明細書の特徴ベクトルの一部を次式に示す。
【００６１】
【数１】

【００６２】
そして、入力された文書（すなわち、ここでは、図３（ａ）の特許明細書）と検索対象の文書（例えば、図３（ｂ）の特許明細書）のそれぞれについて、特徴ベクトルを求めて、これらの間で内積を算出して、それを入力された文書と文書格納部２１１に格納されている各文書との間の類似度としてもよい。なお、内積の代わりにコサイン距離を求めてもよい。この場合、類似度の値が大きいほど類似度が高くなる。
【００６３】
ステップＳ４３で計算された入力された文書と文書格納部２１１に格納されている各文書との間の類似度は、その値が大きい順にソートされ（ステップＳ４４）、上位ｎ位（ｎは正の整数）の特許明細書のファイル名が出力部２０１に出力される（ステップＳ４５）。
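Steps S43 to S45 (similarity calculation, sorting, and output of the top n files) can be sketched as follows (Python). The cosine measure is used here as one of the options the text allows; the file names, vectors, and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two feature vectors (larger means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n_similar(query_vec, doc_vecs, n):
    """Return the names of the n stored documents most similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)  # step S44: descending similarity
    return [name for name, _ in scored[:n]]              # step S45: top n file names

# Made-up feature vectors over a three-word vocabulary.
doc_vecs = {"a.txt": [1.0, 1.0, 0.0], "b.txt": [0.0, 0.0, 1.0], "c.txt": [1.0, 0.0, 0.0]}
ranking = top_n_similar([1.0, 1.0, 0.0], doc_vecs, 2)
```

Replacing `cosine` with the plain inner product gives the other variant mentioned in the text; the cosine form additionally normalizes for document length.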
【００６４】
上述したような処理によれば、文書検索の精度を向上させることができる。例えば、図３（ａ）の特許明細書と、図３（ｂ）の特許明細書とでは、どちらも都道府県名と市区郡名とを対応付けて辞書に格納しているにもかかわらず、どちらの特許請求の範囲にも「都道府県名」「市区郡名」という語は出現しない。そのため、拡張語抽出部２０７を持たない従来の文書検索装置では、特許請求の範囲の文章だけから索引語を抽出した場合、図３（ａ）の特許明細書と図３（ｂ）の特許明細書との間の高い類似度は得られない。これに対して本発明の文書検索装置では、図３（ａ）の特許明細書においても、図３（ｂ）の特許明細書においても、特許請求の範囲に出現しない「都道府県名」と「市区郡名」という語が拡張語として索引語に追加されるため両者の間で高い類似度が得られる。
【００６５】
なお、上記実施形態では、検索要求として入力した文書に類似する文書の検索要求の場合を例にとり説明したが、この場合に限らず、種々変形して応用可能である。例えば、文書検索部２０８での検索処理は、入力されたキーワードに合致する文書の検索要求の場合であっても、上記同様にして（すなわち、キーワードと索引語との類似度を求める）文書を検索することが可能である。
【００６６】
また、上記実施形態では、索引語として、主に名詞、動詞のみを抽出しているが、この場合に限るものではなく、種々変形して応用可能である。例えば、これらに加えて例えば形容詞、副詞等を選択してもよいし、動詞を選択しなくてもよい。
【００６７】
また、ここでは、検索対象の文書が特許明細書である場合を例にとり説明しているため、その内容の特徴を最も適切に記述している「特許請求の範囲」という項目の文章から基本語を抽出し、基本語に関連する拡張語（例えば、基本語をより具体化して表現している拡張語）を「発明の実施の形態」という項目の文章から抽出しているが、この場合に限るものではない。また、検索対象の文書が学術論文であれば、基本語を「アブストラクト」から抽出し、その基本語に関連する拡張語を本文から抽出するようにしてもよい。このように、検索対象の文書がどのような文書であるにしろ、基本語は、その文書の内容の特徴を最も適切に記述している構成要素から抽出し、拡張語は、それより詳細な記述がなされている構成要素から抽出することが望ましい。
【００６８】
図３２は、上記実施形態で説明した文書検索装置を適用した類似文書検索を行う他の文書検索装置の構成を概略的に示したものである。図３２に示した類似文書検索装置では、まず、入力した文書の類似文書を検索するに先だって、当該文書の大まかな分類を行う。例えば、文書の内容に応じて複数のクラス（例えば、電気、機械、化学等）が用意されているとする。各クラスは、例えば、そのクラスに属する文書にてよく使われる単語を羅列した辞書を有し、この辞書の単語と入力された文書内の単語とを照合して（類似度を算出して）、最も類似するクラスを特定する。このとき求めることができる当該文書中に出現する単語と、その出現頻度は、先に説明した図９に示したような索引語リストの作成の際に用いてもよい。
【００６９】
この大分け分類処理部においては、クラスの特定された文書は、例えば、その文書中の単語と出現頻度とに基づき、より詳細なサブクラスに分類され、さらに、サブクラスの特定された文書はより詳細なグループに分類され、さらに、グループの特定された文書はより詳細なサブグループに分類されてもよい。
【００７０】
次に、類似文書検索処理部において、上記実施形態にて説明した類似文書の検索を行い、検索された類似文書のリストを出力する。
【００７１】
なお、ここでの検索結果を大分け分類処理および類似文書検索処理にフィードバックすることにより、より精度の高い（ヒット率の高い）類似文書の検索が可能になる。すなわち、例えば、大分け分類処理において得られた入力された文書から抽出された単語を当該文書の属するサブグループ、グループ、サブクラス、クラスの辞書に追加する。また、検索された類似文書の索引語リストに当該入力文書にはあって類似文書にはない単語を追加する。
【００７２】
図３２に示した文書検索装置もコンピュータに実行させることのできるプログラムとして、磁気ディスク（フロッピーディスク、ハードディスクなど）、光ディスク（ＣＤ−ＲＯＭ、ＤＶＤなど）、半導体メモリなどの記録媒体に格納して頒布することもできる。
【００７３】
以上説明したように上記実施形態によれば、文書中の予め定められた構成要素の文章から索引語を抽出し、他の構成要素の文章中から適切な語を拡張語として抽出して索引語に追加することにより、大量の文書の中からユーザの検索要求に合致する文書を高精度に選択できるようになる。
【００７４】
【発明の効果】
以上説明したように、本発明によれば、文書の内容に即した精度の高い文書の検索を可能にする。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る文書検索装置の機器構成例を示した図。
【図２】本発明の一実施形態に係る文書検索装置の機能ブロック図。
【図３】入力部より入力される特許明細書の例と、文書格納部に格納される特許明細書の具体例を示した図。
【図４】索引作成部における索引作成処理動作を説明するためのフローチャート。
【図５】基本語抽出処理動作を説明するためのフローチャート。
【図６】拡張語抽出処理動作を説明するためのフローチャート。
【図７】図３（ｂ）の特許明細書中の「特許請求の範囲」という項目に書かれた文章を示した図。
【図８】図７の文章に対し形態素解析を行った結果を示した図。
【図９】図８の形態素解析の結果から抽出した索引語リストの一例を示した図。
【図１０】図８の形態素解析の結果から抽出した名詞句リストの一例を示した図。
【図１１】図８の形態素解析の結果から抽出した述語−項リストの一例を示した図。
【図１２】図３（ｂ）の特許明細書中の「発明の実施の形態」という項目に書かれた文章を示した図。
【図１３】図１２の文章に対し形態素解析を施した結果から抽出した述語−項リストの一例を示した図。
【図１４】名詞句の言い換えを抽出するための文字列パターンの一例を示した図。
【図１５】図１２の文章に形態素解析を施した結果から抽出された名詞句の言い換えリストの一例を示した図。
【図１６】図１３の述語−項リストと図１５の名詞句の言い換えリストとを用いて選択した拡張語の一例を示した図。
【図１７】図１６の拡張語を単語に展開した場合を示した図。
【図１８】図９の索引語リストに拡張語を追加して得られた索引語リストの一例を示した図。
【図１９】拡張語選択処理動作を説明するためのフローチャート。
【図２０】頻度表の一例を示した図。
【図２１】索引語の文書頻度の算出結果を示した図。
【図２２】文書検索部における文書検索処理動作を説明するためのフローチャート。
【図２３】図３（ａ）に示す特許明細書中の「特許請求の範囲」という項目に書かれた文章を示した図。
【図２４】図２３の文章に対し形態素解析を施した結果から抽出した索引語リストの一例を示した図。
【図２５】図２３の文章に形態素解析を施した結果から抽出した名詞句リストの一例を示した図。
【図２６】図２３の文章に形態素解析を施した結果から抽出した述語−項リストの一例を示した図。
【図２７】図３（ａ）に示した特許明細書中の「発明の実施の形態」という項目の文章を示した図。
【図２８】図２７の文章に形態素解析を施した結果から抽出した述語−項リストの一例を示した図。
【図２９】図２８の述語−項リストを用いて選択された拡張語の一例を示した図。
【図３０】図２９の拡張語を単語に展開した場合を示した図。
【図３１】図２４の索引語リストに、図３０の拡張語を追加して得られた索引語リストの一例を示した図。
【図３２】他の文書検索装置の構成例を示した図。
【符号の説明】
２０１…出力部
２０２…制御部
２０３…入力部
２０４…索引作成部
２０５…語句抽出部
２０６…基本語抽出部
２０７…拡張語抽出部
２０８…文書検索部
２０９…索引格納部
２１０…文書構造認識部
２１１…文書格納部
[0001]
[Technical Field of the Invention]
The present invention relates to a document search method for retrieving, from among a plurality of documents, documents that match a search request (keyword search or similar-document search), and to a document search apparatus using the method.
[0002]
[Prior art]
With the spread of personal computers in recent years, large numbers of electronic documents are being created, and with the spread of computer networks, access to these documents has become easier. However, the more documents that can be accessed, the harder it becomes to find the documents a user needs, and valuable information may end up going unused. Demand is therefore growing for document search apparatuses that select the documents a user needs from a large collection, and in particular for apparatuses that use full-text search technology, which draws on the content of a document and not only on bibliographic information such as its title and author.
[0003]
[Problems to be solved by the invention]
In conventional document search apparatuses, an index is generally created by extracting words and phrases from each document to be searched, for example by morphological analysis, and weighting the extracted phrases by their frequency within the document or by the number of documents in which they appear. When the documents are long, such as patent specifications or academic papers, this approach of extracting phrases from the entire document has the problem that phrases appearing in unimportant passages (passages that do not characterize the content of the document) are also extracted.
[0004]
To avoid this problem, particularly for documents with a standardized (structured) composition, an index is sometimes created by extracting phrases only from the main component that concisely expresses the gist of the document, such as the claims of a patent specification or the abstract of an academic paper. (Patent specifications and academic papers are examples of structured documents: a patent specification has a component for each item such as “Claims”, “Detailed Description of the Invention”, and “Embodiment of the Invention”, and an academic paper has components such as “Abstract” and “Body”.) However, because such parts are often written in more abstract terms, a relevant document is at high risk of being missed from the search results when the user's search request is written in more specific terms.
[0005]
Conversely, when a user takes this risk into account and writes the search request in abstract terms, many unwanted documents are matched.
[0006]
The present invention has been made in view of these circumstances. Its object is to provide a document search method, and a document search apparatus using the method, that enable highly accurate searches in line with document content by extracting phrases that accurately represent the content of a document and creating from them the index used to search for that document.
[0007]
[Means for Solving the Problems]
(1) The document search method of the present invention is a method for retrieving, from among a plurality of documents, documents that match an input search request, wherein each document is structured into a plurality of components, and the method comprises: extracting a first phrase from a predetermined main component of the document; extracting, from components of the document other than the main component, a second phrase that satisfies a predetermined condition with respect to the first phrase; and searching for documents based on the search request and the first and second phrases extracted from each of the plurality of documents.
[0008]
The document search method of the present invention is also a method for retrieving, from among a plurality of documents, documents similar to an input document, wherein each document is structured into a plurality of components, and the method comprises: extracting, from the input document and from each of the plurality of documents to be searched, a first phrase from a predetermined main component of that document; extracting, from components of that document other than the main component, a second phrase that satisfies a predetermined condition with respect to the first phrase; obtaining the similarity between the first and second phrases extracted from the input document and those extracted from each of the documents to be searched; and retrieving documents similar to the input document from among the documents to be searched.
[0009]
According to the present invention, first phrases (basic words) and second phrases (extended words) that accurately represent the content of a document are extracted and used to create the index for searching that document, which enables highly accurate document searches in line with document content.
[0010]
Preferably, a phrase that is associated with the first phrase by a predetermined linguistic expression is extracted as the second phrase.
[0011]
Also preferably, a phrase that is a term of the same predicate as a predicate taking the first phrase as a term is extracted as the second phrase.
[0012]
(2) The document search apparatus of the present invention is an apparatus for retrieving, from among a plurality of documents, documents that match an input search request, wherein each document is structured into a plurality of components, the apparatus comprising:
first extraction means for extracting a first phrase from a predetermined main component of the document;
second extraction means for extracting, from components of the document other than the main component, a second phrase that satisfies a predetermined condition with respect to the first phrase; and
search means for searching for documents based on the search request and the first and second phrases extracted from each of the plurality of documents.
[0013]
The document search apparatus of the present invention is also an apparatus for retrieving, from among a plurality of documents, documents similar to an input document, wherein each document is structured into a plurality of components, the apparatus comprising:
first extraction means for extracting, from the input document and from each of the plurality of documents to be searched, a first phrase from a predetermined main component of that document; second extraction means for extracting, from the input document and from each of the documents to be searched, a second phrase that satisfies a predetermined condition with respect to the first phrase from components other than the main component; and
search means for obtaining the similarity between the first and second phrases extracted from the input document and those extracted from each of the documents to be searched, and for retrieving documents similar to the input document from among the documents to be searched.
[0014]
According to the present invention, first phrases (basic words) and second phrases (extended words) that accurately represent the content of a document are extracted and used to create the index for searching that document, which enables highly accurate searches for similar documents in line with document content.
[0015]
Preferably, a phrase that is associated with the first phrase by a predetermined linguistic expression is extracted as the second phrase.
[0016]
Also preferably, a phrase that is a term of the same predicate as a predicate taking the first phrase as a term is extracted as the second phrase.
[0017]
[Embodiment of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0018]
FIG. 1 shows an example of the hardware configuration of the document search apparatus according to this embodiment. As shown in FIG. 1, the apparatus comprises an external storage device 102 that stores the program for executing the document search process of the present invention together with various data, a CPU 101 that executes the program stored in the external storage device 102, a communication device 103 that reads predetermined data from other computers via a communication network such as a public network or a dedicated line, a keyboard 104 and a mouse 105 for entering user instructions such as search requests, and a display device 106 that displays search results and the like, all connected to one another via a bus.
[0019]
FIG. 2 is a functional block diagram of the document search apparatus according to this embodiment. As shown in FIG. 2, the apparatus comprises: an input unit 203 for entering user instructions such as search requests; an output unit 201 that displays search results; a document storage unit 211 that stores the group of documents to be searched; an index creation unit 204 that extracts phrases from the document group and creates an index; an index storage unit 209 that stores the index used to search the document group; a document search unit 208 that refers to the index stored in the index storage unit 209 and selects documents matching the user's search request; a phrase extraction unit 205 that extracts phrases both for creating the index and from the user's search request; a document structure recognition unit 210 that recognizes the components of the structured documents to be searched; and a control unit 202 that activates the index creation unit 204 and the document search unit 208 in response to user instructions.
[0020]
The phrase extraction unit 205 consists of a basic word extraction unit 206, which extracts basic words from the main component of a document to be searched, and an extended word extraction unit 207, which extracts phrases to be added to the basic words from components other than the main component.
[0021]
Each component in FIG. 2 (the output unit 201, control unit 202, input unit 203, index creation unit 204, phrase extraction unit 205, document search unit 208, and document structure recognition unit 210) is implemented as a program that is recorded in the external storage device 102 of FIG. 1 and executed under the control of the CPU 101. The index storage unit 209 and the document storage unit 211 may be built on the external storage device 102 or on an external storage device of another computer connected via the communication device 103. In this case, the input unit 203 receives user instructions such as search requests entered via the keyboard 104 and the mouse 105 of FIG. 1, and the output unit 201 displays search results on the display device 106 of FIG. 1.
[0022]
With the above configuration, the document search apparatus shown in FIG. 2 searches for a document similar to the content of the input document (similar document search).
[0023]
The following describes the operation of each unit for the case where a patent specification such as the one shown in FIG. 3(a) is input from the input unit 203 and a similar-patent search is performed.
[0024]
Prior to the search, an index is created from each of a plurality of patent specifications already stored in the document storage unit 211. The index is created when the control unit 202 calls the index creation unit 204.
[0025]
The index creation unit 204 calls the phrase extraction unit 205 to extract phrases from the patent specifications stored in the document storage unit 211, and creates an index from the extracted phrases. The phrase extraction unit 205 consists of the basic word extraction unit 206 and the extended word extraction unit 207. The basic word extraction unit 206 calls the document structure recognition unit 210 to take out only the text of the “Claims” component of the patent specification, and extracts index phrases from the whole of that text. The extended word extraction unit 207 calls the document structure recognition unit 210 to take out only the text of the “Embodiment of the Invention” component, and extracts phrases for expanding the index phrases extracted by the basic word extraction unit 206.
[0026]
When a patent specification is input from the input unit 203, the control unit 202 calls the document search unit 208. The document search unit 208 calls the phrase extraction unit 205 to extract phrases from the input patent specification. The document search unit 208 then uses the extracted phrases and the index stored in the index storage unit 209 to calculate the similarity between the input patent specification and each document stored in the document storage unit 211. The control unit 202 presents a list of the patent specifications with high similarity to the user through the output unit 201.
[0027]
Next, the processing of the index creation unit 204 is described in detail, taking as an example the creation of an index for the patent specification shown in FIG. 3(b).
[0028]
FIG. 4 shows the processing flow of the index creation unit 204. The index creation unit 204 takes patent specifications out of the document storage unit 211 one at a time and creates an index. In the document storage unit 211, each patent specification is stored as one file, and each file has a unique file name. For example, the patent specification shown in FIG. 3(b) has the file name “特開平０１−９９９９９９．ｔｘｔ”.
[0029]
Because the index storage unit 209 manages files by number, each file is assigned a number and registered (step S2). Next, basic words are extracted (step S3).
[0030]
FIG. 5 shows the flow of processing in step S3 of FIG. 4.
[0031]
In FIG. 5, the document structure recognition unit 210 is first called to take out only the text of the “Claims” component of the patent specification (step S11). FIG. 7 shows an example of the “Claims” text taken from the patent specification of FIG. 3(b). Morphological analysis is then performed on this “Claims” text (step S12). Methods of morphological analysis are widely known and are not described in detail here.
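Step S11 can be illustrated with a minimal sketch of a document structure recognizer (Python). It assumes, purely for illustration, that each component starts with a header written in 【…】 brackets on its own line and that purely numeric brackets are paragraph numbers; the function name `split_sections` and the sample text are hypothetical.

```python
import re

# Assumption: section headers appear on their own line inside 【…】 brackets.
HEADER = re.compile(r"^【(.+?)】[ \t]*$", re.MULTILINE)

def split_sections(text: str) -> dict:
    """Map each section header to the text that follows it, up to the next header."""
    # Skip paragraph-number markers such as 【０００１】 so they stay inside their section.
    headers = [m for m in HEADER.finditer(text) if not m.group(1).isdigit()]
    sections = {}
    for i, m in enumerate(headers):
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections

# Made-up miniature specification.
sample = (
    "【特許請求の範囲】\n"
    "地名を入力する入力手段を備える。\n"
    "【発明の実施の形態】\n"
    "【０００１】\n"
    "広域の地名としての都道府県名を格納する。\n"
)
sections = split_sections(sample)
claims_text = sections["特許請求の範囲"]
```

The same function would serve step S21 by looking up the “発明の実施の形態” key instead.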
[0032]
FIG. 8 shows part of the result of performing morphological analysis on the text of FIG. 7. In FIG. 8, the information for one morpheme is output on each line. From the start of the line, separated by spaces, the fields are: surface form, reading, base form, part of speech, part-of-speech number, sub-part of speech, sub-part-of-speech number, conjugation type, conjugation type number, conjugation form, and conjugation form number. Where there is no information, “*” is written.
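A line in the eleven-field format described above could be parsed as in the following sketch (Python). The English field names and the sample line are assumptions for illustration; FIG. 8 itself is not reproduced in the text.

```python
# Field order as described for FIG. 8; the English names are hypothetical.
FIELDS = (
    "surface", "reading", "base", "pos", "pos_id", "subpos", "subpos_id",
    "conj_type", "conj_type_id", "conj_form", "conj_form_id",
)

def parse_morpheme_line(line: str) -> dict:
    """Split one analyzer output line into named fields; '*' means 'no information'."""
    values = line.split()  # splits on ASCII and full-width spaces alike
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return {
        name: (None if value in ("*", "＊") else value)
        for name, value in zip(FIELDS, values)
    }

# Made-up line for the noun 地名.
morpheme = parse_morpheme_line("地名 ちめい 地名 名詞 6 普通名詞 1 * 0 * 0")
```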
[0033]
From the result of the morphological analysis in step S12, words whose part of speech is noun, verb, adjective, symbol, unknown word, or the like are extracted as index words (step S13). FIG. 9 shows part of the list of index words extracted from the morphological analysis result of FIG. 8.
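Step S13 amounts to a part-of-speech filter followed by a frequency count, as in this minimal sketch (Python). Using the base form as the index word, the exact tag strings, and the sample morphemes are assumptions for illustration.

```python
from collections import Counter

# Parts of speech kept as index words (tag strings are hypothetical).
INDEX_POS = {"名詞", "動詞", "形容詞", "記号", "未知語"}

def extract_index_words(morphemes):
    """Count the index words among (base_form, part_of_speech) pairs."""
    return Counter(base for base, pos in morphemes if pos in INDEX_POS)

# Made-up analysis result for a short clause.
morphemes = [
    ("地名", "名詞"), ("を", "助詞"), ("入力", "名詞"),
    ("する", "動詞"), ("地名", "名詞"),
]
index_words = extract_index_words(morphemes)
```

The resulting counts correspond to the per-document occurrence frequencies that are later written to the frequency table.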
[0034]
Separately, a noun phrase list is also extracted from the result of the morphological analysis in step S12 (step S14). Here, a noun phrase is a run of consecutive nouns, symbols, unknown words, adjective stems, and adjectives in attributive form, or a sequence of these joined by the particle “の” (no). FIG. 10 shows part of the list of noun phrases extracted from the morphological analysis result of FIG. 8. In FIG. 10, the left side gives the written form of each noun phrase; when a noun phrase consists of several morphemes, the boundaries between them are marked with “/”. The right side gives the parts of speech of the morphemes that make up the noun phrase, again with “/” marking the boundaries between morphemes.
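The noun phrase definition above can be sketched as a simple chunker over the morpheme sequence (Python). The single-tag representation of adjective stems and attributive forms, and the sample input, are simplifying assumptions; a real analyzer would report conjugation form in a separate field.

```python
# Tags allowed inside a noun phrase (hypothetical tag strings).
NP_POS = {"名詞", "記号", "未知語", "形容詞語幹", "形容詞連体形"}

def extract_noun_phrases(morphemes):
    """Group (surface, pos) pairs into noun phrases, written with '/' between morphemes."""
    phrases, current = [], []
    for i, (surface, pos) in enumerate(morphemes):
        next_is_np = i + 1 < len(morphemes) and morphemes[i + 1][1] in NP_POS
        if pos in NP_POS:
            current.append(surface)
        elif surface == "の" and pos == "助詞" and current and next_is_np:
            current.append(surface)  # the particle joins two noun-phrase elements
        else:
            if current:
                phrases.append("/".join(current))
            current = []
    if current:
        phrases.append("/".join(current))
    return phrases

morphemes = [
    ("広域", "名詞"), ("の", "助詞"), ("地名", "名詞"), ("を", "助詞"),
    ("入力", "名詞"), ("する", "動詞"),
]
noun_phrases = extract_noun_phrases(morphemes)
```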
[0035]
Further, a list of predicates and their corresponding terms (a predicate-term list) is extracted from the morphological analysis result of step S12 (step S15). Here, predicates are limited to verbs, and terms are extracted in the form of noun phrases. FIG. 11 shows part of the predicate-term list extracted from the morphological analysis result of FIG. 8. In FIG. 11, the terms taken by each extracted predicate are listed to its right. When one predicate takes several terms, the terms are separated by “;”. Each term is written as “written form (part of speech)”, and when a term consists of several morphemes, the boundaries between them are marked with “/”.
[0036]
Returning to FIG. 4: once basic word extraction is completed in step S3, the index word list extracted in step S13 is expanded on the basis of the lists extracted in steps S14 and S15 of FIG. 5 (step S4).
[0037]
FIG. 6 is a flowchart showing the flow of the process in step S4 of FIG.
[0038]
First, the document structure recognition unit 210 is called to take out only the text written under “Embodiment of the Invention” from the components of the patent specification (step S21). FIG. 12 shows part of the “Embodiment of the Invention” text taken from the patent specification of FIG. 3(b). Morphological analysis is performed on this text (step S22). The morphological analysis may be the same as in step S12 of FIG. 5.
[0039]
A predicate-term list is extracted from the result of the morphological analysis in the same manner as in step S15 in FIG. 5 (step S23). FIG. 13 shows a predicate-term list extracted from the result of performing the morphological analysis on the sentence of FIG. The format is the same as in FIG.
[0040]
Subsequently, paraphrases of noun phrases are extracted (step S24). The extraction is realized by preparing character string patterns as shown in FIG. 14 and collating them with the sentences. For example, with the character string pattern of pattern number “1”, when noun phrases appear before and after the character string “as” (in the word order of the original Japanese sentence), the noun phrase that appears before it is extracted as the extension-source noun phrase, and the noun phrase that appears after it is extracted as the extended noun phrase. By using this pattern, for example, from the part “prefecture name as a wide-area place name” in the sentence of FIG. 12, “wide-area place name” is extracted as the extension-source noun phrase and “prefecture name” as the extended noun phrase. FIG. 15 shows a list of the correspondences between the extension-source noun phrases and the extended noun phrases extracted from the sentence of FIG. 12, that is, a paraphrase list of noun phrases.
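A minimal sketch of this pattern collation follows, assuming the sentence has already been chunked into noun phrases ("NP") and other strings ("STR"). The pattern table, the romanized connector "toshiteno" standing in for the string rendered as “as”, and the function name are all hypothetical; the actual patterns are those of FIG. 14.

```python
# Hypothetical pattern table: connector string -> side holding the
# extension-source noun phrase (an assumption in the spirit of FIG. 14).
PATTERNS = {
    "toshiteno": "before",  # "X toshiteno Y": X is the source, Y the extended phrase
}


def extract_paraphrases(chunks):
    """Return (extension_source, extended) noun phrase pairs.

    chunks is a list of ("NP", text) or ("STR", text) tuples in
    sentence order.
    """
    pairs = []
    for i in range(1, len(chunks) - 1):
        kind, text = chunks[i]
        if kind != "STR" or text not in PATTERNS:
            continue
        (left_kind, left), (right_kind, right) = chunks[i - 1], chunks[i + 1]
        if left_kind == right_kind == "NP":
            if PATTERNS[text] == "before":
                pairs.append((left, right))
            else:
                pairs.append((right, left))
    return pairs
```

Keeping the source side in the table lets further patterns with the opposite orientation be added without changing the matching loop.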
[0041]
Next, an extended word is selected using the predicate-term list extracted in step S23 and the noun phrase paraphrase list extracted in step S24 (step S25).
[0042]
FIG. 19 is a flowchart showing the extended word selection process in step S25 of FIG. The extended word selection process may use the predicate-term list or the paraphrase list of noun phrases. Here, with reference to FIG. 19, the case where the predicate-term list is used will be described as an example.
[0043]
Sets of a predicate and its terms are taken one at a time from the predicate-term list extracted from the text of the “Embodiment of the Invention”, and each predicate is collated with the predicates in the predicate-term list extracted from the text of “Claims” in step S15 of FIG. 5 (step S31). That is, one set of a predicate and its terms is taken from the predicate-term list extracted from the text of the “Embodiment of the Invention” (step S32). Then, it is checked whether the same predicate exists among the predicates in the predicate-term list extracted from the text of “Claims”. If it does not, the process returns to step S31, and the next set of a predicate and its terms is taken from the predicate-term list extracted from the text of the “Embodiment of the Invention”.
[0044]
If the same predicate exists (step S33), the terms of that predicate in the predicate-term list extracted from the text of the “Embodiment of the Invention” are examined one by one (step S34, step S35). That is, it is checked whether the noun phrase list extracted from the text of “Claims” in step S14 of FIG. 5 contains the same noun phrase as the term (step S36). If it does, the process returns to step S34 to examine the next term. If it does not, the process proceeds to step S37, where the term is registered as an extended word, and then returns to step S34 to examine the next term. Steps S35 to S37 are repeated until no terms of the predicate remain in the predicate-term list extracted from the text of the “Embodiment of the Invention”.
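The loop of steps S31 to S37 can be sketched as below. The data shapes (a list of predicate and term-list pairs, plus sets for the “Claims” side) and the function name are assumptions made for illustration. The duplicate check before registration is also an added assumption, not something stated in the flowchart description.

```python
def select_by_predicates(embodiment_pt, claim_predicates, claim_noun_phrases):
    """Select extended words using predicate-term lists.

    embodiment_pt: list of (predicate, [terms]) from the detailed text.
    claim_predicates: set of predicates found in the main component.
    claim_noun_phrases: set of noun phrases found in the main component.
    """
    extended = []
    for predicate, terms in embodiment_pt:        # steps S31-S32
        if predicate not in claim_predicates:     # step S33
            continue
        for term in terms:                        # steps S34-S35
            if term in claim_noun_phrases:        # step S36: already a basic term
                continue
            if term not in extended:              # assumption: avoid duplicates
                extended.append(term)             # step S37
    return extended
```

For example, if both texts use the predicate "store" and the detailed text stores a "prefecture name" that the claims never mention, "prefecture name" is registered as an extended word.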
[0045]
As described above, an extended word may be selected using only the predicate-term list extracted from the text of the “Embodiment of the Invention” shown in FIG. Alternatively, an extended word may be selected using the paraphrase list of noun phrases. Furthermore, an extended word may be selected using both the predicate-term list and the paraphrase list of noun phrases.
[0046]
That is, after extended words are selected by the procedure shown in FIG. 19, it is next checked whether the same noun phrase as an extension-source noun phrase in the paraphrase list of noun phrases (as shown in FIG. 15), extracted from the text of the “Embodiment of the Invention” in step S24 of FIG. 6, exists in the noun phrase list extracted from the text of “Claims” in step S14 of FIG. 5. If it does, the extended noun phrase corresponding to that extension-source noun phrase in the paraphrase list is set as an extended word. At this time, any extended noun phrase already selected as an extended word is ignored.
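The paraphrase-based selection can be sketched in the same style. The function name and data shapes are hypothetical: the paraphrase list is assumed to be a list of (extension-source, extended) noun phrase pairs, and previously selected extended words are passed in so that they can be ignored.

```python
def select_by_paraphrases(paraphrases, claim_noun_phrases, already_selected=()):
    """Add extended noun phrases whose source appears in the main component.

    paraphrases: list of (extension_source, extended) noun phrase pairs.
    claim_noun_phrases: set of noun phrases found in the main component.
    already_selected: extended words chosen earlier, kept in order.
    """
    extended = list(already_selected)
    for source, target in paraphrases:
        # Skip targets that were already selected as extended words.
        if source in claim_noun_phrases and target not in extended:
            extended.append(target)
    return extended
```

Calling this after the predicate-based selection, with its result as `already_selected`, corresponds to using both lists in sequence.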
[0047]
In addition, after an extended word is selected using a paraphrase list of noun phrases as shown in FIG. 15, an extended word may be selected using a predicate-term list as shown in FIG.
[0048]
FIG. 16 shows an extended word obtained by using both the predicate-term list of FIG. 13 and the paraphrase list of the noun phrase of FIG.
[0049]
Returning to the description of FIG. 6, in step S26, the extended words selected in step S25, which are in the form of noun phrases, are decomposed into individual words. For example, FIG. 17 shows the extended words of FIG. 16 decomposed into words. Here, the appearance frequency of every extended word is uniformly taken to be “1”.
[0050]
The index word list as shown in FIG. 9, extracted in step S13 of FIG. 5, and the extended words (see FIG. 17) extracted in steps S25 to S26 are combined into a single index word list. In that list, shown in FIG. 18, index word numbers “18” to “20” are the newly added words, that is, the extended words.
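Decomposing the extended noun phrases into words and merging them into the index word list might look like the following sketch. The dictionary representation and the function name are assumptions. Leaving the frequency of a word that is already an index word unchanged is also an assumption, since the text only describes newly added words.

```python
def merge_extended_words(index_words, extended_phrases):
    """Merge extended words into an index word list.

    index_words: dict mapping word -> appearance frequency (basic words).
    extended_phrases: noun phrases in "a/b" notation, morphemes split by "/".
    """
    merged = dict(index_words)
    for phrase in extended_phrases:
        for word in phrase.split("/"):
            # Newly added extended words get a uniform frequency of 1.
            merged.setdefault(word, 1)
    return merged
```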
[0051]
The description now returns to FIG. 4. The next process is step S5. In step S5, the index words obtained up to step S4 and their appearance frequencies are written into the frequency table. The frequency table created here corresponds to the index stored in the index storage unit 209.
[0052]
An example of the frequency table is shown in FIG. The vertical axis takes the file number identifying the file in which each document is stored, and the horizontal axis takes the word number identifying each word extracted as a basic word or an extended word. Each cell shows how many times the word appears in the file.
[0053]
The above processing (see FIG. 4) is executed for all patent specification files in the document storage unit 211. When the processing is completed for all the patent specification files in the document storage unit 211, the document frequency of each index word is counted in step S6. The document frequency of each index word is obtained by reading the frequency table vertically and counting the number of files having an appearance frequency of “1” or more. An example of calculating the document frequency of each index word is shown in FIG.
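Counting the document frequency from the frequency table can be sketched as follows, assuming the table is held as a nested dictionary (file number to word number to appearance frequency). The representation and the function name are illustrative only.

```python
def document_frequency(freq_table):
    """Count, for each word number, the files where it appears at least once.

    freq_table: dict mapping file_no -> dict mapping word_no -> frequency.
    """
    df = {}
    for counts in freq_table.values():
        for word_no, count in counts.items():
            if count >= 1:
                df[word_no] = df.get(word_no, 0) + 1
    return df
```

This corresponds to reading each column of the table vertically and counting the non-zero cells.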
[0054]
Next, the processing of the document search unit 208 will be described in detail by taking as an example a case where a patent specification similar to the patent specification shown in FIG. FIG. 22 shows a processing flow of the document search unit 208.
[0055]
A basic word is extracted from the patent specification as shown in FIG. 3A input from the input unit 203 (step S41). The process in step S41 is the same as the process in step S3 of FIG. When the patent specification shown in FIG. 3A is used as input, information obtained in the basic word extraction process is shown in FIGS. FIG. 23 is a sentence “claims” extracted from the patent specification shown in FIG. FIG. 24 is an index word list extracted from the result of performing morphological analysis on the sentence of FIG. FIG. 25 is a noun phrase list extracted from the result of performing morphological analysis on the sentence of FIG. FIG. 26 is a predicate-term list extracted from the result of performing the morphological analysis on the sentence of FIG. Similarly, an extended word is extracted from the patent specification input from the input unit 203 in step S42. The process in step S42 is the same as step S4 in FIG.
[0056]
When the patent specification shown in FIG. 3A is used as input, data obtained in the process of the extended word extraction process are shown in FIGS. FIG. 27 is a text of an “Embodiment of the Invention” extracted from the patent specification shown in FIG. FIG. 28 is a list of predicates-terms extracted from the text of FIG. No paraphrase of noun phrase was extracted from the text of FIG. FIG. 29 is an extension word selected using the predicate-term list in FIG. FIG. 30 is an expanded word of FIG. 29 expanded into words. FIG. 31 is obtained by adding the extended word of FIG. 30 to the index word list of FIG.
[0057]
In step S43, using the words extracted in steps S41 and S42 and referring to the index word lists stored in the index storage unit 209, the similarity with each patent specification stored in the document storage unit 211 is calculated. For example, the similarity between the patent specification of FIG. 3A and the patent specification of FIG. 3B is calculated by comparing the index word of FIG. 31 with the index word of FIG. Various methods for calculating the similarity between documents using frequency information of the words appearing in the documents are known, and any of them may be used here.
[0058]
For example, for each document, the tf · idf value of each index word is obtained from the following equation.
[0059]
tf · log (N / df)
tf: appearance frequency of the index word in the document
N: Total number of documents
df: document frequency of the index word
Then, a feature vector having the index word number as a dimension and the tf · idf value of the index word corresponding to the index word number as an element of each dimension is obtained. A feature vector is obtained for each of the input document (that is, here, the patent specification in FIG. 3A) and the document to be searched (for example, the patent specification in FIG. 3B). Alternatively, instead of the tf · idf value, the appearance frequency of the index word in the document with respect to the index word number for each document may be used as an element of the feature vector.
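A sparse feature vector built from the tf · log (N / df) weighting can be sketched as below. The natural logarithm is an assumption, since the text does not state the base. The function name and the dictionary representation are also illustrative.

```python
import math


def tfidf_vector(tf, df, n_docs):
    """Build a sparse tf-idf feature vector for one document.

    tf: dict mapping index word number -> appearance frequency in the document.
    df: dict mapping index word number -> document frequency.
    n_docs: total number of documents N.
    """
    return {
        word_no: freq * math.log(n_docs / df[word_no])
        for word_no, freq in tf.items()
        if freq > 0 and df.get(word_no, 0) > 0
    }
```

A word that appears in every document gets weight 0, so it contributes nothing to an inner product between vectors.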
[0060]
A part of the feature vector of the patent specification shown in FIG. 3B created using the appearance frequency of the index word based on the index word list shown in FIG.
[0061]
[Expression 1]

[0062]
Then, a feature vector is obtained for each of the input document (that is, here, the patent specification of FIG. 3A) and the document to be searched (for example, the patent specification of FIG. 3B), and the inner product of the two vectors may be calculated as the similarity between the input document and each document stored in the document storage unit 211. Note that the cosine of the angle between the two vectors may be obtained instead of the inner product. In this case, the greater the value, the higher the similarity.
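Both similarity measures can be sketched over sparse vectors held as dictionaries keyed by index word number. The function names are illustrative, and returning 0.0 for a zero-length vector is a convention chosen here rather than anything stated in the text.

```python
import math


def inner_product(u, v):
    """Inner product of two sparse vectors (dict: index word number -> weight)."""
    return sum(weight * v[key] for key, weight in u.items() if key in v)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    norm = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0
```

The cosine normalizes by vector length, so a long document does not score higher merely because its frequencies are larger.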
[0063]
The similarities, calculated in step S43, between the input document and each document stored in the document storage unit 211 are sorted in descending order of value (step S44). The file names of a predetermined (integer) number of the highest-ranked patent specifications are output to the output unit 201 (step S45).
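Steps S44 and S45 amount to a sort followed by a cut-off, as in the sketch below. The parameter `top_n` stands for the predetermined integer, and both it and the function name are hypothetical.

```python
def rank_documents(similarities, top_n):
    """Return the file names of the top_n most similar documents.

    similarities: dict mapping file name -> similarity to the input document.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]
```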
[0064]
According to the processing described above, the accuracy of document search can be improved. For example, in both the patent specification of FIG. 3A and the patent specification of FIG. 3B, the prefecture name and the city name are associated with each other and stored in the dictionary, yet the words “prefecture name” and “city name” do not appear in the claims of either. Therefore, in a conventional document search apparatus that does not have the extended word extraction unit 207, when index words are extracted only from the text of the claims, a high degree of similarity cannot be obtained between the patent specification of FIG. 3A and the patent specification of FIG. 3B. On the other hand, in the document search apparatus of the present invention, the words “prefecture name” and “city name”, which do not appear in the claims of either the patent specification of FIG. 3A or the patent specification of FIG. 3B, are added to the index words as extended words, so that a high degree of similarity can be obtained between the two.
[0065]
In the above embodiment, the case of a search request for documents similar to a document input as the search request has been described as an example. However, the present invention is not limited to this case, and various modifications can be applied. For example, even in the case of a search request for documents that match an input keyword, the search can be performed by the document search unit 208 in the same manner as described above (that is, by obtaining the degree of similarity between the keyword and the index words).
[0066]
In the above embodiment, only nouns and verbs are extracted as index words. However, the present invention is not limited to this case, and various modifications can be applied. For example, adjectives, adverbs, and the like may be selected in addition to these, or verbs may be excluded.
[0067]
In addition, since the case where the documents to be searched are patent specifications has been described here as an example, the basic words are extracted from the text of the item “Claims”, which most appropriately describes the characteristics of the content, and the extended words related to the basic words (for example, extended words expressing the basic words more specifically) are extracted from the text of the item “Embodiment of the Invention”. However, the present invention is not limited to this case. If the documents to be searched are academic papers, the basic words may be extracted from the “abstract”, and the extended words related to the basic words may be extracted from the body text. Thus, whatever the documents to be searched are, it is desirable to extract the basic words from the component that best describes the characteristics of the content of the document, and to extract the extended words from a component in which the content is described in more detail.
[0068]
FIG. 32 schematically shows the configuration of another document search apparatus that performs similar document search, to which the document search apparatus described in the above embodiment is applied. In the similar document search apparatus shown in FIG. 32, the documents are first roughly classified before similar documents of the input document are searched for. For example, it is assumed that a plurality of classes (for example, electricity, machinery, chemistry, etc.) are prepared according to the contents of documents. Each class has, for example, a dictionary listing words frequently used in documents belonging to that class; the words in each dictionary are collated with the words in the input document (that is, a similarity is calculated), and the most similar class is identified. The words appearing in the document and their appearance frequencies obtained at this time may be used when creating the index word list as shown in FIG. 9 described above.
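The class identification can be illustrated with a very simple score. The text does not specify how the similarity to a class dictionary is calculated, so summing the frequencies of the input document's words that appear in the dictionary is purely an assumption, as are the function name and data shapes.

```python
def identify_class(doc_words, class_dictionaries):
    """Pick the class whose dictionary best matches the document.

    doc_words: dict mapping word -> appearance frequency in the input document.
    class_dictionaries: dict mapping class name -> set of frequent words.
    The score (an assumption) is the total frequency of dictionary words.
    """
    def score(dictionary):
        return sum(freq for word, freq in doc_words.items() if word in dictionary)

    return max(class_dictionaries, key=lambda name: score(class_dictionaries[name]))
```

The same function could be applied again with subclass dictionaries to refine the classification step by step.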
[0069]
In this broad classification processing unit, a document whose class has been identified may be classified into a more detailed subclass based on, for example, the words in the document and their appearance frequencies. Further, a document whose subclass has been identified may be classified into a more detailed group, and a document whose group has been identified may be classified into a still more detailed subgroup.
[0070]
Next, the similar document search processing unit searches for similar documents described in the above embodiment, and outputs a list of searched similar documents.
[0071]
By feeding back the search results here to the broad classification processing and similar document search processing, it is possible to search for similar documents with higher accuracy (high hit rate). That is, for example, a word extracted from an input document obtained in the broad classification process is added to a dictionary of subgroups, groups, subclasses, and classes to which the document belongs. In addition, a word that is present in the input document but not in the similar document is added to the index word list of the retrieved similar document.
[0072]
The document search apparatus shown in FIG. 32 can also be implemented as a program executable by a computer and stored on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory.
[0073]
As described above, according to the above-described embodiment, an index word is extracted from a sentence of a predetermined component in a document, and an appropriate word is extracted as an extension word from the sentence of another component element. As a result, it is possible to select a document that matches the user's search request from a large number of documents with high accuracy.
[0074]
[Effect of the Invention]
As described above, according to the present invention, it is possible to search for a document with high accuracy in accordance with the content of the document.
[Brief description of the drawings]
FIG. 1 is a diagram showing an example of the device configuration of a document search apparatus according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of a document search apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of a patent specification input from an input unit and a specific example of a patent specification stored in a document storage unit.
FIG. 4 is a flowchart for explaining an index creation processing operation in an index creation unit.
FIG. 5 is a flowchart for explaining a basic word extraction processing operation.
FIG. 6 is a flowchart for explaining an extended word extraction processing operation.
FIG. 7 is a diagram showing a sentence written in the item “Claims” in the patent specification of FIG. 3B.
FIG. 8 is a diagram showing a result of performing morphological analysis on the sentence of FIG. 7.
FIG. 9 is a diagram showing an example of an index word list extracted from the result of the morphological analysis of FIG.
FIG. 10 is a diagram showing an example of a noun phrase list extracted from the result of the morphological analysis in FIG. 8.
FIG. 11 is a diagram showing an example of a predicate-term list extracted from the result of morphological analysis in FIG. 8.
FIG. 12 is a diagram showing a sentence written in the item “Embodiment of the Invention” in the patent specification of FIG. 3B.
FIG. 13 is a diagram showing an example of a predicate-term list extracted from the result of performing morphological analysis on the sentence of FIG.
FIG. 14 is a diagram showing an example of a character string pattern for extracting paraphrases of noun phrases.
FIG. 15 is a diagram showing an example of a paraphrase list of noun phrases extracted from the result of performing morphological analysis on the sentence of FIG.
FIG. 16 is a diagram showing an example of an extended word selected using the predicate-term list of FIG. 13 and the noun phrase paraphrase list of FIG. 15.
FIG. 17 is a diagram showing a case where the extended word in FIG. 16 is expanded into words.
FIG. 18 is a diagram showing an example of an index word list obtained by adding an extended word to the index word list of FIG.
FIG. 19 is a flowchart for explaining an extended word selection processing operation.
FIG. 20 is a diagram showing an example of a frequency table.
FIG. 21 is a diagram showing a calculation result of index word document frequency.
FIG. 22 is a flowchart for explaining a document search processing operation in a document search unit.
FIG. 23 is a diagram showing a sentence written in the item “Claims” in the patent specification shown in FIG.
FIG. 24 is a diagram showing an example of an index word list extracted from a result of performing morphological analysis on the sentence of FIG. 23.
FIG. 25 is a diagram showing an example of a noun phrase list extracted from the result of performing morphological analysis on the sentence of FIG. 23.
FIG. 26 is a diagram showing an example of a predicate-term list extracted from the result of performing morphological analysis on the sentence of FIG.
FIG. 27 is a diagram showing a sentence of an item “Embodiment of the Invention” in the patent specification shown in FIG.
FIG. 28 is a diagram showing an example of a predicate-term list extracted from the result of performing morphological analysis on the sentence in FIG. 27.
FIG. 29 is a diagram showing an example of an extended word selected using the predicate-term list in FIG. 28.
FIG. 30 is a diagram showing a case where the extended word in FIG. 29 is expanded into words.
FIG. 31 is a diagram showing an example of an index word list obtained by adding the extended word of FIG. 30 to the index word list of FIG.
FIG. 32 is a diagram showing a configuration example of another document search apparatus.
[Explanation of symbols]
201 ... Output unit
202 ... Control unit
203 ... Input unit
204 ... Index creation unit
205 ... Phrase extraction unit
206 ... Basic word extraction unit
207 ... Extended word extraction unit
208 ... Document search unit
209 ... Index storage unit
210 ... Document structure recognition unit
211 ... Document storage unit

Claims

A document search method in a document search apparatus comprising:
document storage means for storing a plurality of documents each structured by a plurality of constituent elements;
index creation means for creating an index word list from each document stored in the document storage means;
index storage means for storing the index word list of each document created by the index creation means; and
search means for searching, from among the documents stored in the storage means, for a document corresponding to an input search request,
the method comprising:
a first extraction step in which the index creation means extracts a first phrase as an index word from a predetermined main constituent element of the document;
a second extraction step in which the index creation means extracts, as an index word, from constituent elements of the document other than the main constituent element, a second phrase associated with the first phrase by a predetermined linguistic expression;
a step in which the index creation means creates an index word list including the first phrase and the second phrase and the appearance frequencies of the first phrase and the second phrase in the document, and stores the index word list in the index storage means; and
a search step in which the search means, using the index word list of each document stored in the index word storage means, calculates a similarity between a keyword input as the search request and the index words in each index word list, thereby searching the plurality of documents stored in the document storage means for a document that matches the input keyword.

A document search method in a document search apparatus comprising:
document storage means for storing a plurality of documents each structured by a plurality of constituent elements;
index creation means for creating an index word list from each document stored in the document storage means;
index storage means for storing the index word list of each document created by the index creation means; and
search means for searching, from among the documents stored in the storage means, for a document corresponding to an input search request,
the method comprising:
a first extraction step in which the index creation means extracts a first phrase as an index word from a predetermined main constituent element of the document;
a second extraction step in which the index creation means extracts, as an index word, from constituent elements of the document other than the main constituent element, a second phrase associated with the first phrase by a predetermined linguistic expression;
a step in which the index creation means creates an index word list including the first phrase and the second phrase and the appearance frequencies of the first phrase and the second phrase in the document, and stores the index word list in the index storage means;
a third extraction step in which the search means extracts a third phrase as an index word from the main constituent element of a document that is input as the search request and is structured by a plurality of constituent elements;
a fourth extraction step in which the search means extracts, as an index word, from constituent elements of the input document other than the main constituent element, a fourth phrase associated with the third phrase by a predetermined linguistic expression;
a step in which the search means creates an index word list including the third phrase and the fourth phrase and the appearance frequencies of the third phrase and the fourth phrase in the input document; and
a search step in which the search means calculates a similarity between the index words and their appearance frequencies in the index list created from the input document and the index words and their appearance frequencies in the index word list of each document stored in the index word storage means, thereby searching the plurality of documents stored in the document storage means for a document similar to the input document.

The document search method according to claim 1 or 2, wherein the second extraction step extracts a second phrase that is a term of the same predicate as a predicate having the first phrase as a term.

The document search method according to claim 2, wherein the fourth extraction step extracts a fourth phrase that is a term of the same predicate as a predicate having the third phrase as a term.

A document search apparatus comprising:
document storage means for storing a plurality of documents each structured by a plurality of constituent elements;
extraction means for extracting a first phrase as an index word from a predetermined main constituent element of each document stored in the document storage means, and for extracting, as an index word, from constituent elements of the document other than the main constituent element, a second phrase associated with the first phrase by a predetermined linguistic expression;
creation means for creating, for each document stored in the document storage means, an index word list including the first phrase and the second phrase extracted from the document and the appearance frequencies of the first phrase and the second phrase in the document;
index word storage means for storing, for each document stored in the document storage means, the index word list created by the creation means; and
search means for calculating, using the index word list of each document stored in the index word storage means, a similarity between a keyword input as a search request and the index words in each index word list, thereby searching the plurality of documents stored in the document storage means for a document that matches the input keyword.

A document search apparatus comprising:
document storage means for storing a plurality of documents each structured by a plurality of constituent elements;
first extraction means for extracting a first phrase as an index word from a predetermined main constituent element of each document stored in the document storage means, and for extracting, as an index word, from constituent elements of the document other than the main constituent element, a second phrase associated with the first phrase by a predetermined linguistic expression;
first creation means for creating, for each document stored in the document storage means, an index word list including the first phrase and the second phrase extracted from the document and the appearance frequencies of the first phrase and the second phrase in the document;
index word storage means for storing, for each document stored in the document storage means, the index word list created by the first creation means;
second extraction means for extracting a third phrase as an index word from the main constituent element of a document that is input as a search request and is structured by a plurality of constituent elements, and for extracting, as an index word, from constituent elements of that document other than the main constituent element, a fourth phrase associated with the third phrase by a predetermined linguistic expression;
second creation means for creating an index word list including the third phrase and the fourth phrase and the appearance frequencies of the third phrase and the fourth phrase in the input document; and
search means for calculating a similarity between the index words and their appearance frequencies in the index list created from the input document and the index words and their appearance frequencies in the index word list of each document stored in the index word storage means, thereby searching the plurality of documents stored in the document storage means for a document similar to the input document.

The document search apparatus according to claim 5, wherein the extraction means extracts a second phrase that is a term of the same predicate as a predicate having the first phrase as a term.

The document search apparatus according to claim 6, wherein the first extraction means extracts a second phrase that is a term of the same predicate as a predicate having the first phrase as a term, and
the second extraction means extracts a fourth phrase that is a term of the same predicate as a predicate having the third phrase as a term.

A computer-readable storage medium storing a program for causing a computer to function as:
document storage means for storing a plurality of documents each structured by a plurality of constituent elements;
extraction means for extracting, as an index word, a first word or phrase from a predetermined main constituent element of each document stored in the document storage means, and for extracting, as an index word, from the constituent elements of that document other than the main constituent element, a second word or phrase associated with the first word or phrase by a predetermined linguistic expression;
creation means for creating, for each document stored in the document storage means, an index word list containing the first word or phrase and the second word or phrase extracted from that document, together with the frequencies of appearance of the first and second words or phrases in that document;
index word storage means for storing, for each document stored in the document storage means, the index word list created by the creation means; and
search means for searching the plurality of documents stored in the document storage means for documents matching a keyword input as a search request, by using the index word list of each document stored in the index word storage means to calculate a degree of similarity between the input keyword and the index words in each index word list.
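
The creation means and the keyword search means recited above can be sketched together as follows. This is a minimal sketch under stated assumptions: documents are taken to be already tokenized, the extracted index words are given as input, and the score (the share of total index word frequency covered by the query keywords) is a stand-in chosen for illustration, since the claim does not specify how the degree of similarity is computed. All function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_index_word_list(
    index_words: Iterable[str], document_tokens: List[str]
) -> Counter:
    """Index word list: each extracted index word with its frequency in the document."""
    wanted = set(index_words)
    return Counter(tok for tok in document_tokens if tok in wanted)


def keyword_score(keywords: Iterable[str], index_word_list: Counter) -> float:
    """Assumed similarity: fraction of index word frequency matched by the keywords."""
    total = sum(index_word_list.values())
    if total == 0:
        return 0.0
    matched = sum(index_word_list[k] for k in set(keywords))
    return matched / total


def search_keywords(
    keywords: Iterable[str], index_lists: Dict[str, Counter]
) -> List[str]:
    """Return IDs of documents with a non-zero score, best match first."""
    keywords = list(keywords)
    scores = {doc: keyword_score(keywords, idx) for doc, idx in index_lists.items()}
    hits = [doc for doc, score in scores.items() if score > 0]
    return sorted(hits, key=lambda doc: scores[doc], reverse=True)


# Hypothetical tokenized documents and the index words extracted from each.
index_lists = {
    "doc1": build_index_word_list(
        {"index", "search"}, ["index", "word", "index", "search", "noise"]
    ),
    "doc2": build_index_word_list({"battery"}, ["battery", "electrode", "battery"]),
}
hits = search_keywords(["index"], index_lists)
```

Because only extracted index words are counted, tokens such as "noise" never contribute to a match, which reflects the aim of indexing only words that characterize the document.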

A computer-readable storage medium storing a program for causing a computer to function as:
document storage means for storing a plurality of documents each structured by a plurality of constituent elements;
first extraction means for extracting, as an index word, a first word or phrase from a predetermined main constituent element of each document stored in the document storage means, and for extracting, as an index word, from the constituent elements of that document other than the main constituent element, a second word or phrase associated with the first word or phrase by a predetermined linguistic expression;
first creation means for creating, for each document stored in the document storage means, an index word list containing the first word or phrase and the second word or phrase extracted from that document, together with the frequencies of appearance of the first and second words or phrases in that document;
index word storage means for storing, for each document stored in the document storage means, the index word list created by the first creation means;
second extraction means for extracting, as an index word, a third word or phrase from the main constituent element of a document that is structured by a plurality of constituent elements and is input as a search request, and for extracting, as an index word, from the constituent elements of that document other than the main constituent element, a fourth word or phrase associated with the third word or phrase by a predetermined linguistic expression;
second creation means for creating an index word list containing the third word or phrase and the fourth word or phrase, together with the frequencies of appearance of the third and fourth words or phrases in the input document; and
search means for searching the plurality of documents stored in the document storage means for documents similar to the input document, by calculating a degree of similarity between the index words and their frequencies of appearance in the index word list created from the input document and the index words and their frequencies of appearance in the index word list of each document stored in the index word storage means.

The storage medium according to claim 9, wherein the extraction means extracts a second word or phrase that is an argument of the same predicate as a predicate that takes the first word or phrase as an argument.

The storage medium according to claim 10, wherein the first extraction means extracts a second word or phrase that is an argument of the same predicate as a predicate that takes the first word or phrase as an argument, and
the second extraction means extracts a fourth word or phrase that is an argument of the same predicate as a predicate that takes the third word or phrase as an argument.