JP3859044B2

JP3859044B2 - Index creation method and search method

Info

Publication number: JP3859044B2
Application number: JP27655398A
Authority: JP
Inventors: 一樹安松; 章文関島
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-09-11
Filing date: 1998-09-11
Publication date: 2006-12-20
Anticipated expiration: 2018-09-11
Also published as: JP2000090115A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば文書に対する全文検索のためのインデクスを高速に生成し、また、当該インデクスを用いて高速な検索を実現する方法に関し、特に、検索等に用いるキーの構成に関する。
【０００２】
【従来の技術】
大量の文書に対する全文検索の方法として、シグネチャ・ファイルと呼ばれるデータ構造を用いる方法がある。特開平７-２４４６７１号公報に示されている方法では、文書における文字の出現をビットで表すインデクスを構成している。この方法では、格納されている文書数に影響されずに、比較的高速な検索が可能である。
しかしながら、いくつかの異なる語に対して１つのビットを割り当てているため、指定した以外の語が含まれている文書が検索される可能性があり、正確な検索が行えないという問題があった。また、生成や検索のアルゴリズムが複雑であり、既存のデータベース管理システムの上で実現することが困難であった。
【０００３】
このような問題に対して、文献「Compression and Fast Indexing for Multi-Gigabyte Text Databases」には、一般的なデータベース管理システムが提供しているハッシュ表やＢ＋木などのインデクス手法を用いて、高速な全文検索の機能を実現する方法が提案されている。この方法では、インデクスのキーとなる語と値となる文書に識別番号を割り付け、それらを圧縮して格納している。
これにより、検索に必要となるディスクの読み出しページ数を減らし、高速に検索が可能となる。また、異なる語に異なる識別番号を割り付けられるため、正確な検索が可能となる。なお、この文献は、この方法を用いて、約７０万件の文書に対する検索が高速に行えることを示している。
【０００４】
しかしながら、上記の文献で述べられている方法では、インデクスの新規作成や更新処理の性能が考慮されていないため、高い性能が得られないという問題があった。特に、更新時に或る語に対して同じ文書を重複して登録しないようにするための確認の処理は、文書識別番号の集まりに対する繰り返し処理により実現しなければならないため、効率よく実現することができない。
【０００５】
このような問題に対して、本出願人は特願平１０−２６６９１号に処理の効率化を図ることができるインデクス作成方法および検索方法を提案している。この方法では、Ｂ＋木のキーとして語の識別番号の後ろに文書の識別番号をつなげて配置したものを用いて、Ｂ＋木を、語の識別番号に或るハッシュ関数を適用して得られるハッシュ値と文書識別番号に別のハッシュ関数を適用して得られるハッシュ値などによって分割して二次元の配列に配置することにより、新規生成時や更新時には、文書識別番号のハッシュ値などが同じになるものをまとめて登録することで、書き込みページ数を少なくし、処理効率を高めている。
【０００６】
また、上記の文献で述べられている方法では、語の識別番号を必要とする。異なる語に異なる識別番号を与えるには、何らかの方法で語を管理する必要がある。１つの語の管理（追加、検索）に要する時間をＴｗとし、１個の文書の中に１００個の異なる語が含まれているとすると、例えば語の管理では、検索時には１Ｔｗ、登録時には１００Ｔｗの時間を要する。語の数が増加するに伴い、Ｔｗは増加し、その結果、インデクスの新規作成や更新処理の性能は大きく低下することとなる。
【０００７】
具体例として、インターネットのＷＷＷページに対する全文検索を行う場合などは、対象となる文書の数が数百万件となり、その文書中に出現する異なる語の数は数千万件となる。インターネットの場合には、固有名詞の豊富さや、スペル・ミスや、多言語での記述などの特性により、異なる語の数は文書数に比例して増加する傾向がある。数千万件の異なる語をＢ＋木で管理する場合、１つの語の追加や検索に要する時間Ｔｗとしては、０．１秒から１秒の時間が必要となる。これは、Ｂ＋木に対する１個の語の追加や検索処理が木の高さ＋１だけの回数のハードディスクに対するアクセスを必要とすることや、Ｂ＋木が巨大になるとディスクのメモリ中へのキャッシュの効果が得られず、ハードディスクに対するほとんどすべてのアクセスが実際にハードディスクを読み込む処理を必要とすることに起因する。
【０００８】
例えば、１千万件の異なる語をＢ＋木で管理する場合に、各ノードの分岐の数が５００であるとするとＢ＋木の高さはｌｏｇ500（１０，０００，０００）−１＝１．５９となり、１個の語と文書の組を追加、検索するために平均で約２．６回のディスク・アクセスが必要となる。ハードディスクを１回アクセスするために数十ミリ秒から数百ミリ秒かかるため、１個の語と文書の組の追加や検索には０．１秒から１秒の時間が必要となる。よって、１個の文書の中に１００個の異なる語が含まれているとすると、１個の文書を登録するために１００Ｔｗ、すなわち、１０秒から１００秒の時間を必要とすることになる。
【０００９】
【発明が解決しようとする課題】
ところで、上記従来例では語の識別番号の後ろに文書の識別番号をつなげて配置したものをＢ＋木のキーとして用いて検索等を行う構成を示したが、このように語に識別番号を与える代わりに、例えば語の文字列を用いてキーを作成することも考えられる。このように語の文字列を用いてキーを作成すれば、語と識別番号との対応関係を管理等する必要がなくなるため、例えば異なる語の数が多くなっても、語の管理に要する負担を少なくすることができると考えられる。
【００１０】
しかしながら、Ｂ＋木のキーの長さは一般に固定長であるため、例えば文書中で用いられる種々な長さの語のすべてに対応するためには、これらすべての語を含めることが可能なようにキーの長さを十分に長く取る必要がある。ところが、キーの長さが大きくなるとデータベースのサイズが大きくなってしまうため、ハードディスクを読み込む回数が多くなると同時にメモリ中のキャッシュのヒット率が低下し、全体的な性能が低下するといった不具合が生じてしまう。
また、例えばキーの長さを可変長にしたとしても、この場合には、処理が複雑になると同時にディスクのフラグメントが生じてしまうため、アクセス効率等が悪くなって、全体的な性能が低下するといった不具合が生じてしまう。
【００１１】
本発明は、このような従来の事情に鑑みなされたもので、語の文字列からキーを決定する仕方を工夫して、例えば大量の異なる語を含む大量の文書に対する全文検索のためのインデクスを高速に作成する方法を提供することを目的とする。また、本発明は、このように作成されたインデクスを用いて、高速な検索を実現する方法を提供することを目的とする。
また、本発明は、上記のような方法を実行するための装置や、また、プログラムを記憶した記憶媒体を提供することを目的とする。
【００１２】
【課題を解決するための手段】
上記目的を達成するため、本発明に係るインデクス作成方法では、指定された語から決定されるキーを用いて検索対象値を検索するために、キーと検索対象値とを対応させたインデクスをインデクス作成装置が作成するに際して、前記インデクス作成装置の登録手段が、語の長さが設定された閾値以下の場合には当該語の文字列を含むキー文字列と検索対象値との組からなるキーを記憶手段に登録する一方、語の長さが前記閾値を越える場合には当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を当該文字列に代えて含ませたキー文字列と検索対象値との組からなるキーを記憶手段に登録する。
このように、本発明では、前記閾値より長い語については当該文字列のハッシュ値を用いてキーを構成するため、例えばキーの長さを短い固定長に制限することができ、これにより、データベースサイズの増加を防ぎ、処理効率を高めることができる。
【００１３】
また、本発明では、前記インデクス作成装置の登録手段は、前記閾値を越える長さの語と当該語を一意に特定するための登録番号とを対応させて登録テーブルに登録し、当該語から決定されるキー文字列に前記登録番号を付加することで、前記閾値を越える長さの語の間でハッシュ値が重なってしまった場合でも、各語を区別できる構成とした。
また、本発明では、前記インデクス作成装置の登録手段は、前記閾値以下の長さの語から決定されるキー文字列と前記閾値を越える長さの語から決定されるキー文字列に各々を区別するフラグを付加することで、これらのキー文字列を区別できる構成とした。
【００１４】
また、本発明に係るインデクス作成方法では、上記のように語の長さに閾値を設定してインデクスを作成するに際して、キー文字列と検索対象値との組からなるキーを登録するインデクスを複数のサブインデクスにより構成し、登録するキーに対応する検索対象値に所定の関数を適用して決まる関数値と登録するキーに対応する語に所定の関数を適用して決まる関数値によって参照される二次元配列位置にサブインデクスを格納する。
ここで、本発明では、上記したキーと対応させる検索対象値として、例えば当該キーを決定する語を含んでいる１つの文書の文書識別情報を用いる。
【００１５】
上記したサブインデクスを構成する一例として、本発明では、文書に一意に識別する文書識別番号を与え、当該文書識別番号を文書識別情報として用いて、文書識別情報に適用する関数として文書識別番号を二次元配列の一の方向の位置を示すハッシュ値にマップするハッシュ関数と、語に適用する関数として語を二次元配列の他の方向の位置を示すハッシュ値にマップするハッシュ関数とを用意し、前記インデクス作成装置の登録手段は、文書における語の出現をその文書識別番号およびその語の各々にハッシュ関数を適用して得られたハッシュ値を用いて対応するサブインデクスに登録する。
【００１６】
また、本発明では、このようなサブインデクスの構成を実施するに際して、前記インデクス作成装置の登録手段は、複数の文書における語の出現を一括して登録する場合に、それらの文書の文書識別番号にハッシュ関数を適用して決まるハッシュ値が同じになるものを１つのグループにまとめて、グループごとに語の出現を登録する。
このように、本発明では、例えばインデクスの新規生成時や更新時に、文書識別番号のハッシュ値が同じになるものをまとめて登録することができるため、書き込みページ数を少なくし、処理効率を高めることができる。
【００１７】
また、本発明では、前記インデクス作成装置の登録手段は、上記のようにして１つのグループにまとめられた文書におけるすべての語の出現を登録する場合に、各語の出現を語にハッシュ関数を適用して決まるハッシュ値が同じになるものを一つのグループにまとめて、グループごとに語の出現を登録することで、処理効率を更に高めることもできる。
【００１８】
また、上記したサブインデクスを構成する他の例として、本発明では、語の文書における出現回数を文書識別情報として用いて、文書識別情報に適用する関数として語の文書における出現回数を二次元配列の一の方向の位置を示すハッシュ値にマップするハッシュ関数と、語に適用する関数として語を二次元配列の他の方向の位置を示すハッシュ値にマップするハッシュ関数を用意し、前記インデクス作成装置の登録手段は、或る文書における或る語の出現をその語の出現回数およびその語の各々にハッシュ関数を適用して得られたハッシュ値を用いて対応するサブインデクスに登録する。
【００１９】
この構成においても、上記と同様に、前記インデクス作成装置の登録手段は、複数の文書における語の出現を一括して登録する場合に、各語の出現回数にハッシュ関数を適用して決まるハッシュ値が同じになるものを１つのグループにまとめて、グループごとに語の出現を登録することで、処理効率を高めることができる。また、上記と同様に、前記インデクス作成装置の登録手段は、１つのグループにまとめられた文書におけるすべての語の出現を登録する場合に、各語の出現を語にハッシュ関数を適用して決まるハッシュ値が同じになるものを一つのグループにまとめて、グループごとに語の出現を登録することで、処理効率を更に高めることができる。
【００２０】
また、上記したサブインデクスを構成する他の例として、本発明では、語の文書における出現頻度を文書識別情報として用いて、文書識別情報に適用する関数として語の文書における出現頻度を二次元配列の一の方向の位置を示すハッシュ値にマップするハッシュ関数と、語に適用する関数として語を二次元配列の他の方向の位置を示すハッシュ値にマップするハッシュ関数を用意し、前記インデクス作成装置の登録手段は、或る文書における或る語の出現をその語の出現頻度およびその語の各々にハッシュ関数を適用して得られたハッシュ値を用いて対応するサブインデクスに登録する。
【００２１】
この構成においても、上記と同様に、前記インデクス作成装置の登録手段は、複数の文書における語の出現を一括して登録する場合に、各語の出現頻度にハッシュ関数を適用して決まるハッシュ値が同じになるものを１つのグループにまとめて、グループごとに語の出現を登録することで、処理効率を高めることができる。また、上記と同様に、前記インデクス作成装置の登録手段は、１つのグループにまとめられた文書におけるすべての語の出現を登録する場合に、各語の出現を語にハッシュ関数を適用して決まるハッシュ値が同じになるものを一つのグループにまとめて、グループごとに語の出現を登録することで、処理効率を更に高めることができる。
また、本発明では、以上の登録に際して、例えば主記憶装置に用意した少なくとも１つのサブインデクスが格納できるページキャッシュを用いることで、処理の高速化を図った。
【００２２】
また、本発明では、以上に示したサブインデクスとしてＢ＋木構造を用いた。本発明では、文書識別情報をキーに付加することで、例えば異なる文書中の同じ語から決定されるキーを文書毎に区別することを可能にし、これにより、例えば異なる文書中の同じ語から決定されるキーが重なってしまってＢ＋木構造中で衝突してしまう（例えば両者の区別ができずに一方が他方に上書きされてしまう）ことを防ぐことができる。
なお、文書の一意な文書識別情報としては、例えば上記した文書識別番号や語の出現回数や語の出現頻度といった情報を用いることができる。
【００２３】
また、本発明に係るインデクス検索方法では、文書名と当該文書に含まれる語から決定されるキーとを対応させたインデクスをＢ＋木構造により構成して、指定された語から決定されるキーを用いて対応する文書名をインデクス検索装置が得る検索を行うに際して、文書名に一意に識別する文書識別情報を与え、前記インデクス検索装置の記憶手段には、語の長さが設定された閾値以下の場合には当該語の文字列を含むキー文字列と文書識別情報との組からなるキーが登録されている一方、語の長さが前記閾値を越える場合には当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を含むキー文字列と文書識別情報との組からなるキーが登録されており、前記インデクス検索装置の検索手段が、指定された語の長さが設定された閾値以下の場合には当該語の文字列を含むキー文字列に検索範囲を指定する文書識別情報を結合したものをキーとして用いて記憶手段から当該キーと一致するキーに対応する文書識別情報を検索する一方、語の長さが前記閾値を越える場合には当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を含むキー文字列に前記文書識別情報を結合したものをキーとして用いて記憶手段から当該キーと一致するキーに対応する文書識別情報を検索する。
【００２４】
このように、本発明では、例えば語の文字列或いは文字列のハッシュ値を含むキー文字列の後ろに検索範囲を指定する文書識別情報を結合したものをキーとして用いるため、或る文書における或る語の出現をＢ＋木インデクスに対する１回の検索で見つけることを実現することができる。
なお、文書識別情報としては、上記と同様に、例えば文書識別番号や語の出現回数や語の出現頻度といった情報を用いることができる。
【００２５】
また、本発明では、上記のインデクス検索方法において、前記閾値を越える長さの語から決定されるキー文字列に当該語を一意に特定するための登録番号を付加するとともに、当該登録番号と当該語とを対応させて前記インデクス検索装置の登録テーブルに登録しておき、上記検索に際して、前記インデクス検索装置の検索手段は、前記閾値を越える長さの語から決定されるキー文字列に検索範囲を指定する登録番号を付加したキーを用いて検索を行った後に、更に当該キーのキー文字列に付加された登録番号と対応して登録されている語を登録テーブルに基づいて特定し、特定した語と検索対象の語との対応に基づいて、検索された文書名集合から該当する文書名を特定する。
【００２６】
このように、本発明では、前記閾値を越える長さの語から決定されるキーを用いて検索を行うに際して、例えば同一の文書中の異なる語の間でキーを構成するハッシュ値が重なってしまった場合であっても、まず、登録番号以外のハッシュ値や文書識別情報から文書名集合を検索した後に、更に、検索した文書名集合から登録番号を用いて該当する文書名を特定するようにしたため、検索対象と一致する語を一意に特定することができる。
【００２７】
また、本発明は、以上に示した方法を実行する装置や、以上に示した方法を実行するためのプログラムを記憶した記憶媒体として構成することもできる。
例えば、本発明に係るインデクス作成装置では、指定された語から決定されるキーを用いて検索対象値を検索するために、キーと検索対象値とを対応させたインデクスを作成するに際して、記憶手段がインデクスを記憶し、登録手段が語の長さが設定された閾値以下であることに応じて当該語の文字列を含むキー文字列と検索対象値との組からなるキーを記憶手段に登録する一方、語の長さが前記閾値を越えることに応じて当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を含むキー文字列と検索対象値との組からなるキーを記憶手段に登録する。
【００２８】
また、本発明に係る記憶媒体では、指定された語から決定されるキーを用いて検索対象値を検索するために、キーと検索対象値とを対応させたインデクスの作成処理を、コンピュータに実行させるプログラムを当該コンピュータに読み取り可能に記憶した構成において、前記プログラムは、語の長さが設定された閾値以下であることに応じて当該語の文字列を含むキー文字列と検索対象値との組からなるキーをインデクスメモリに登録する一方、語の長さが前記閾値を越えることに応じて当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を含むキー文字列と検索対象値との組からなるキーをインデクスメモリに登録する処理を、前記コンピュータに実行させる。
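[0029]
[Embodiments of the Invention]
A first embodiment of the present invention is described with reference to the drawings.
FIG. 1 shows an example configuration of an apparatus that executes the method according to the present invention. The apparatus is implemented by executing, on computer hardware resources, a program for carrying out the invention.
[0030]
The document storage unit 1 consists of external memory such as a hard disk device. It stores and manages the documents to be registered or searched, each associated with its document name and document identification number. The document identification number uniquely identifies a document; for example, each document is given a different number.
The document sorting unit 2 sorts the names of the documents to be registered in the index so that documents whose identification numbers yield the same hash value under a predefined hash function are grouped together.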
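[0031]
The morphological analysis unit 3 analyzes the full text of a specified document and extracts its words.
The key string creation unit 4 creates a key string from a given word.
The long-word management unit 5 manages long-word tables, each of which holds a list of the words longer than a predetermined threshold. In this example, one long-word table is prepared per document, and each word held in a document's long-word table is given a unique registration number.
Specific examples of key strings and registration numbers are given later.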
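[0032]
The index registration unit 6 obtains the document identification number of a given document name together with a word, and registers the occurrence of the word in the document in the B+ tree selected by the index selection unit 8 (described below), using as the key the value obtained by concatenating the key string and the document identification number.
The index storage unit 7 consists of external memory such as a hard disk device and stores B+ trees in a two-dimensional array of predetermined size (D×W in this example, where D and W are integers of 1 or more). The index storage unit 7 also stores the correspondence between document names and document identification numbers.
[0033]
The index selection unit 8 applies predetermined hash functions to the given document identification number and to the character string of the word (for example, its character codes), and uses the resulting hash values to select, from the index table stored in the index storage unit 7, the identification number of the B+ tree in which the occurrence of the word is registered.
The hash function Hd applied to document identification numbers and the hash function Hw applied to word character strings, both used by the document sorting unit 2 and the index selection unit 8, are defined so that, where id is a document identification number and s is the character string of a word, their hash values are integers satisfying 0 ≤ Hd(id) < D and 0 ≤ Hw(s) < W, respectively.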
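The range constraints on Hd and Hw can be illustrated with a minimal Python sketch. The patent does not specify the hash functions themselves, so the modulo and CRC-32 choices, the values of `D` and `W`, and the helper name `select_subindex` are all assumptions made for illustration.

```python
import zlib

D = 4  # partitions along the document axis (illustrative value, an assumption)
W = 8  # partitions along the word axis (illustrative value, an assumption)

def h_d(doc_id: int) -> int:
    """Hd: map a document identification number to an integer in [0, D)."""
    return doc_id % D

def h_w(word: str) -> int:
    """Hw: map a word's character string to an integer in [0, W)."""
    return zlib.crc32(word.encode("utf-8")) % W

def select_subindex(word: str, doc_id: int) -> tuple[int, int]:
    """Return the (Hw(s), Hd(id)) position of the B+ tree to use (hypothetical helper)."""
    return (h_w(word), h_d(doc_id))
```

Any pair of functions that stays within these ranges would serve; the point is only that every (word, document) pair lands on exactly one cell of the D×W array.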
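[0034]
The query input unit 9 accepts a search request from the user and generates a search expression in which, for example, words are combined with AND or OR.
The search execution unit 10 takes the character strings of the words contained in the given search expression, obtains the B+ trees to be searched through the index selection unit 8, and performs the search.
The result output unit 11 presents the search results obtained by the search execution unit 10 to the user, for example on a display.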
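[0035]
FIG. 2 shows an example of the structure of a B+ tree key stored in the index storage unit 7.
As shown, the key consists of a key string followed by a document identification number. In this example, 9 bytes are allocated to the key string and 4 bytes to the document identification number.
[0036]
FIG. 3 shows an example of the structure of the key string.
As shown, the key string takes one of two structures depending on whether the target word is shorter or longer than a predetermined threshold. In this example the word-length threshold is set to 8 bytes. When the target word is 8 bytes or shorter (that is, at or below the threshold), the character string of the word is included in the key string as is. When the target word is longer than 8 bytes (that is, above the threshold), the hash value Hl(s) obtained by applying a hash function Hl to the word's character string s is included in the key string instead.
[0037]
The hash function Hl is defined so that, where s is the character string of the word and n is the threshold, its hash value is an integer satisfying 0 ≤ Hl(s) < 2^(8(n-1)). In other words, the value returned by Hl (the hash value) is one byte shorter than the threshold (7 bytes in this example).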
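[0038]
When words are longer than 8 bytes, hash values may collide between different words. In this example, therefore, the key string is the hash value concatenated with the registration number of the word in the long-word table (shown in FIG. 3 as a 2-byte registration number part containing a 15-bit registration number, as described below). As a result, key strings do not collide even when hash values collide between long words appearing in the same document. Moreover, since the actual B+ tree key is the key string concatenated with the document identification number, as shown in FIG. 2, B+ tree keys do not collide even when key strings collide between words contained in different documents.
[0039]
The character string of a word of 8 bytes or fewer may also coincide with the hash value of a word longer than 8 bytes. In this example, therefore, when the word is 8 bytes or shorter, the 8th byte of the word's character string is moved to the 9th byte of the key string, and the 8th byte of the key string is set to 0. For example, when the target word is 'internet', the key string is 696E7465726E650074 (hexadecimal).
[0040]
For words longer than 8 bytes, on the other hand, the maximum registration number is 32767 (that is, 2^15 - 1), and the most significant bit of the area that stores the registration number (the registration number part) is always set to 1. Consequently, the most significant bit of the 8th byte from the start of the key string is always 0 when the word is 8 bytes or shorter, and always 1 when the word is longer than 8 bytes. Key strings therefore do not collide even if the character string of a word at or below the threshold coincides with the hash value of a word above the threshold.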
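[0041]
FIG. 4 shows an example of the procedure for creating a key string.
When the given word is 8 bytes or shorter (step S1), the first 7 bytes of the word's character string s are copied into the key string iw (step S2), the 8th byte of iw is set to 0 (step S3), and the 8th byte of s is copied into the 9th byte of iw (step S4), which completes the key string iw.
[0042]
When the given word is longer than 8 bytes (step S1), the hash value Hl(s) of the word's character string s is copied into the key string iw (step S5), and the bitwise OR of the given registration number and 1000000000000000 (binary) is copied into iw from the 8th byte onward (step S6), which completes the key string iw.
In FIG. 4, << denotes a left bit-shift operation.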
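The key string procedure of FIG. 4 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patented implementation: the patent only requires Hl to return a 7-byte value, so BLAKE2b with a 7-byte digest is a stand-in, and zero padding of words shorter than 7 bytes is inferred from the 'test' key string 746573740000000000 used in the FIG. 6 example. The names `h_l` and `make_key_string` are hypothetical.

```python
import hashlib

THRESHOLD = 8        # word-length threshold n, in bytes
KEY_STRING_LEN = 9   # fixed key string length, in bytes

def h_l(word: bytes) -> bytes:
    """Stand-in for Hl: a 7-byte value, i.e. 0 <= Hl(s) < 2**(8*(n-1))."""
    return hashlib.blake2b(word, digest_size=THRESHOLD - 1).digest()

def make_key_string(word: bytes, reg_no: int = 0) -> bytes:
    """Build the 9-byte key string for a word (steps S1 to S6 of FIG. 4)."""
    if len(word) <= THRESHOLD:                      # S1: at or below the threshold
        head = word[:7].ljust(7, b"\x00")           # S2: first 7 bytes, zero-padded
        ninth = word[7:8].ljust(1, b"\x00")         # S4: 8th byte of the word, if any
        return head + b"\x00" + ninth               # S3: 8th byte of the key string is 0
    if not 0 <= reg_no <= 0x7FFF:
        raise ValueError("registration number must fit in 15 bits")
    # S5: 7-byte hash; S6: registration number with the top bit forced to 1
    return h_l(word) + (reg_no | 0x8000).to_bytes(2, "big")

print(make_key_string(b"internet").hex().upper())
```

For `b"internet"` this prints 696E7465726E650074, matching the example in paragraph [0039]. For a long word, byte 8 of the result always has its most significant bit set, which is the flag described in paragraph [0040].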
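[0043]
Thus, in the index creation method of this example, when the length of a word is at or below the set threshold, a pair consisting of a key string containing the word's character string and a value is registered (in this example, the value is the document identification information of one document containing the word that determines the key string). When the length of the word exceeds the threshold, the pair registered consists of a value of the same kind and a key string that contains, in place of the character string, the hash value determined by applying a predetermined hash function to it. The key length can therefore be limited to, for example, a short fixed length, which prevents the database from growing and improves processing efficiency.
[0044]
Any value may be set as the word-length threshold. Preferably, however, it is chosen with regard to, for example, the proportion of words at or below the threshold and the proportion of words above it, so that the index stays small and registration and search take little time.
[0045]
As described above, in this example each word longer than the set threshold is registered in a registration table (the long-word table in this example) in association with a registration number that uniquely identifies it, and that registration number is appended to the key string determined from the word. The words can therefore be distinguished even when hash values collide between words longer than the threshold.
Also, in this example, a flag distinguishing the two kinds is added to the key strings determined from words at or below the threshold and to those determined from words above it, so that they can be told apart. As a preferred form, this example uses as the flag the most significant bit of the 8th byte from the start of the key string, which is the boundary between the hash value and the registration number, but the form of the flag is not particularly limited.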
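[0046]
FIG. 5 shows an example of the procedure for searching a B+ tree for occurrences of a word.
In a search for the documents containing a given word, a key string iw is first created from the word (step S11). Then, for every B+ tree containing occurrences of the word, the start point is set to the key string iw concatenated with the minimum document identification number delimiting the search range (here 0, as 32 bits) (step S12), and the end point is set to the key string concatenated with the maximum document identification number delimiting the search range (here FFFFFFFF in hexadecimal) (step S14).
[0047]
When the given word is longer than the threshold, the key string iw is the hash value of the word with a registration number appended, so a search range is set for the registration number in the same way as for the document identification number. Specifically, in this example the minimum registration number delimiting the search range (here 0) is supplied when creating the key string iw for the lower bound of the search range (step S11), and the maximum registration number (here 7FFF in hexadecimal) is supplied when creating the key string iw for the upper bound (step S13). The search thus covers the range from the minimum to the maximum registration number given to the key string iw.
[0048]
By searching from the start point to the end point (step S15), when the given word is at or below the threshold (8 bytes in this example) (step S16), all occurrences of the word are obtained in ascending order of document identification number (step S17).
When the given word is longer than the threshold (step S16), every occurrence that falls within the registration-number range given to the key string iw is retrieved, so the occurrences that truly match the search target must be picked out from those retrieved.
[0049]
Specifically, in this example the registration number and document identification number of the word are extracted from each search result, and the long-word table corresponding to that document identification number is consulted (step S18) to identify the word associated with the registration number. The identified word is then compared with the search target word, for example, to verify whether the search target word is really contained in the retrieved document (step S19). If the search target word is contained in the document, the occurrence of the word is returned (step S17); if not, NULL, for example, is returned (step S20).
As before, << in FIG. 5 denotes a left bit-shift operation.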
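The range search of FIG. 5 can be sketched as below. As a simplifying assumption, a sorted Python list of 13-byte integer keys searched with `bisect` stands in for one B+ tree, and each long-word table is a dictionary from registration number to word. The function names and the table layout are hypothetical.

```python
from bisect import bisect_left, bisect_right

DOC_BITS = 32  # the document identification number occupies the low 4 bytes

def scan(keys: list[int], lo_ks: int, hi_ks: int) -> list[int]:
    """Return the keys between (lo_ks, doc 0) and (hi_ks, doc 0xFFFFFFFF)."""
    start = (lo_ks << DOC_BITS) | 0x00000000                       # step S12
    end = (hi_ks << DOC_BITS) | 0xFFFFFFFF                         # step S14
    return keys[bisect_left(keys, start):bisect_right(keys, end)]  # step S15

def search_short(keys: list[int], key_string: int) -> list[int]:
    """Word at or below the threshold: hits arrive in ascending document-id order."""
    return [k & 0xFFFFFFFF for k in scan(keys, key_string, key_string)]

def search_long(keys: list[int], word: str, hash7: int,
                long_word_tables: dict[int, dict[int, str]]) -> list[int]:
    """Word above the threshold: scan all registration numbers, then verify."""
    lo_ks = (hash7 << 16) | 0x8000 | 0x0000   # registration number 0 (step S11)
    hi_ks = (hash7 << 16) | 0x8000 | 0x7FFF   # registration number 7FFF (step S13)
    docs = []
    for k in scan(keys, lo_ks, hi_ks):
        doc_id = k & 0xFFFFFFFF
        reg_no = (k >> DOC_BITS) & 0x7FFF
        # Steps S18 and S19: look up the registered word and compare it.
        if long_word_tables[doc_id].get(reg_no) == word:
            docs.append(doc_id)
    return docs
```

The verification step matters only for long words: a hash collision can place another word's occurrences inside the scanned range, and the long-word table is what filters them out.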
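[0050]
As an example, suppose that after the word occurrences of several documents have been registered, part of a B+ tree is in the state shown in FIG. 6. To search in this state for the documents containing the word whose key string is 746573740000000000 (hexadecimal), the keys whose values lie between 74657374000000000000000000 (hexadecimal) and 746573740000000000FFFFFFFF (hexadecimal) are searched, which yields the desired word occurrences (O4 and O5).
[0051]
To check whether the word whose key string is 746573740000000000 (hexadecimal) is contained in the document whose document identification number is 7, the key 74657374000000000000000007 (hexadecimal) is used and the entry with a matching key value is searched for, which yields the word occurrence (O5). In this example, a word occurrence includes information such as the page and line of the document where the word appears, and the occurrence count and occurrence frequency of the word.
[0052]
Thus, in the index search method of this example, an index that associates document names with keys determined from the words contained in the documents is built as a B+ tree structure, and a search that obtains the corresponding document names is performed using a key determined from a word. For this search, each document name is given document identification information that uniquely identifies it (a document identification number in this example). When the length of the word is at or below the set threshold, the key is the key string containing the word's character string concatenated with the document identification information delimiting the search range. When the length of the word exceeds the threshold, the key is the key string containing the hash value determined by applying a predetermined hash function to the word's character string, concatenated with the document identification information. In this way, the occurrence of a given word in a given document can be found with a single search of the B+ tree index.
[0053]
Also, in this example, a registration number that uniquely identifies the word is appended to each key string determined from a word longer than the threshold, and the registration number and the word are registered in association with each other in the long-word table. In the search, a value formed by appending the registration numbers delimiting the search range to the key string determined from a word longer than the threshold is first used to search; the word registered under the registration number appended to each retrieved key string is then identified, and the matching document names are selected from the retrieved set of document names by comparing the identified word with the search target word. As described above, this makes it possible to uniquely identify the word that matches the search target.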
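[0054]
FIG. 7 shows the storage structure of the B+ trees in the index storage unit 7.
As shown, in this example a D×W two-dimensional array provides D×W sub-indexes, and a B+ tree structure is used for each sub-index. For example, an occurrence of a word whose document identification number is id and whose character string is s is registered in the B+ tree corresponding to sub-index B+ tree (Hw(s), Hd(id)).
Accordingly, to search for occurrences of the word with character string s, the processing shown in FIG. 5 is executed on the B+ trees selected by the procedure shown in FIG. 8, described below.
[0055]
FIG. 8 shows an example of the procedure for searching for the documents in which one specified word appears.
In this procedure, the hash value Hw(s) obtained by applying the hash function Hw to the character string s of the given word is first assigned to w (step S31). The variables i and r are then initialized to 0 (step S32), and, while i is incremented by 1 (step S35) until it reaches D (step S36), the word is searched for in B+ tree (w, i) (step S33) and the results are appended to the array R[r, r+r'] (step S34). Here r' is assigned the number of search results, and r is replaced by r + r' each time i is incremented.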
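The loop of FIG. 8 can be sketched as follows. This is a minimal illustration, not the patent's implementation: each B+ tree sub-index is modelled as a plain dictionary from word to a list of document identification numbers, and `h_w`, `h_d`, `D`, `W`, `build_grid`, and `search_word` are hypothetical names and values.

```python
import zlib

D, W = 3, 4  # illustrative array dimensions (assumed values)

def h_w(word: str) -> int:
    return zlib.crc32(word.encode("utf-8")) % W

def h_d(doc_id: int) -> int:
    return doc_id % D

def build_grid(occurrences):
    """Place each (word, doc_id) occurrence in sub-index (Hw(s), Hd(id))."""
    grid = {(w, d): {} for w in range(W) for d in range(D)}
    for word, doc_id in occurrences:
        grid[(h_w(word), h_d(doc_id))].setdefault(word, []).append(doc_id)
    return grid

def search_word(grid, word: str) -> list[int]:
    w = h_w(word)                            # step S31
    results = []                             # R; r corresponds to len(results) (S32)
    for i in range(D):                       # steps S32, S35, S36
        hits = grid[(w, i)].get(word, [])    # step S33: search B+ tree (w, i)
        results.extend(hits)                 # step S34: append the r' new results
    return results
```

Only the D sub-indexes sharing the word's hash value w are visited; the other (W - 1) × D sub-indexes cannot contain the word and are never touched.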
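[0056]
This searches the group of B+ trees stored in the sub-indexes of one row of the two-dimensional array shown in FIG. 7. Even though the search is confined to one row, occurrences of the target word are not contained in any other B+ tree, so the documents found in this way are exactly the documents that contain the target word. In the search processing of this example, the B+ trees to be searched are thus limited and each B+ tree is used in order, which raises the hit rate of the cache holding the sub-index being searched and lets the search run efficiently. In the search of this example, the cache in the main memory used by the search execution unit 10 holds at least one B+ tree sub-index.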
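[0057]
FIG. 9 shows an example of the procedure for registering the word occurrences of multiple documents in a batch.
In this procedure, the documents are first divided into D groups by the hash value Hd(id) obtained by applying the hash function Hd to each document identification number id, and the grouped documents are stored, group by group, in the arrays G(0) to G(D-1) (step S41). The variable d is then initialized to 0 (step S42), and, while d is incremented by 1 (step S49) until it reaches D (step S50), the following registration processing is performed for each array G(d).
[0058]
In this registration processing, for each array G(d), the word occurrences (document-word pairs) are extracted from all documents stored in G(d) (step S43), divided into W groups by the hash value Hw(s) obtained by applying the hash function Hw to the character string s of each word, and stored, group by group, in the arrays O(0) to O(W-1) (step S44). The variable w is then initialized to 0 (step S45), and, while w is incremented by 1 (step S47) until it reaches W (step S48), the word occurrences belonging to each group O(w) for the array G(d) are registered (step S46). The procedure for registering a word occurrence is shown in FIG. 10, described below.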
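The two-level grouping of FIG. 9 can be sketched as follows, assuming Hd is a simple modulo and Hw is CRC-32 modulo W (neither is specified in the patent). The `register` callback is a hypothetical stand-in for the per-occurrence registration of FIG. 10.

```python
from collections import defaultdict
import zlib

def h_w(word: str, W: int) -> int:
    return zlib.crc32(word.encode("utf-8")) % W

def register_batch(documents: dict[int, list[str]], D: int, W: int, register) -> None:
    """documents maps a document id to its words; register(w, d, word, doc_id) stores one occurrence."""
    groups = defaultdict(list)                     # G(0)..G(D-1), step S41
    for doc_id in sorted(documents):
        groups[doc_id % D].append(doc_id)          # Hd(id) = id mod D (assumed)
    for d in range(D):                             # steps S42, S49, S50
        buckets = defaultdict(list)                # O(0)..O(W-1), steps S43 and S44
        for doc_id in groups[d]:
            for word in sorted(set(documents[doc_id])):
                buckets[h_w(word, W)].append((word, doc_id))
        for w in range(W):                         # steps S45, S47, S48
            for word, doc_id in buckets[w]:
                register(w, d, word, doc_id)       # step S46
```

Because all occurrences for sub-index (w, d) are written consecutively, each sub-index is visited in a single run and is not revisited later in the batch, which is the access pattern the cache argument in the text relies on.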
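[0059]
With the above processing, word occurrences are stored in order in the B+ tree sub-indexes running downward from the top left of the array shown in FIG. 7; once the bottom B+ tree sub-index has been filled, storage continues in the column one to the right, again from top to bottom. Multiple B+ tree sub-indexes are therefore never referenced alternately, and the page-cache hit rate rises.
Furthermore, if main memory has enough space to hold the contents of one B+ tree sub-index, all storage processing for that sub-index can be carried out in main memory, so storage runs extremely fast.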
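[0060]
FIG. 10 shows an example of the procedure for registering the occurrence of a given word in a given document.
This procedure obtains, for example, the character string s of a word in the document and the document identification number id of that document, applies the hash functions Hw and Hd to them, and holds the resulting hash values Hw(s) and Hd(id) in the variables w and d, respectively (steps S61 and S62).
[0061]
Next, a key string iw is created from the target word. When the word is at or below the set threshold (8 bytes in this example) (step S63), a key string iw containing the word's character string is created (step S64). When the word is longer than the threshold (step S63), the word is registered in the long-word table for the document to obtain a registration number (step S67), and the key string iw is created from that registration number and the word (step S68).
[0062]
The value of the key string iw shifted left by 32 bits, plus the value of the document identification number id, is then assigned to the variable k (step S65). The occurrence of the word is registered, with k as the key, in sub-index B+ tree (w, d) of the array shown in FIG. 7 (step S66).
As before, << in FIG. 10 denotes a left bit-shift operation.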
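Step S65 amounts to one shift and one addition. The sketch below reproduces it in Python with arbitrary-precision integers; the function name `make_btree_key` is hypothetical, and the example values are the 'test' key string and document identification number 7 from the FIG. 6 discussion.

```python
def make_btree_key(key_string: bytes, doc_id: int) -> int:
    """Compose the 13-byte B+ tree key k = (iw << 32) + id (step S65)."""
    if len(key_string) != 9:
        raise ValueError("key string must be exactly 9 bytes")
    if not 0 <= doc_id <= 0xFFFFFFFF:
        raise ValueError("document identification number must fit in 4 bytes")
    iw = int.from_bytes(key_string, "big")
    return (iw << 32) + doc_id

k = make_btree_key(bytes.fromhex("746573740000000000"), 7)
print(f"{k:026X}")
```

The printed value is the 26-digit hexadecimal key 74657374000000000000000007 quoted in the exact-match example. Placing the document identification number in the low-order bytes keeps all keys for one word contiguous and ordered by document.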
【００６３】
このように、本例では、指定された語から決定されるキー文字列を含むキーを用いて値（本例では、当該キー文字列を決定する語を含んでいる１つの文書の文書識別情報）を検索するためにキーと値（本例では、キー文字列を決定する語を含んでいる１つの文書の文書識別情報）とを対応させたインデクスを作成するに際して、キー文字列と値（本例では、当該キー文字列を決定する語を含んでいる１つの文書の文書識別情報）との組を登録するインデクスを複数のサブインデクスにより構成し、登録する値（本例では、キー文字列を決定する語を含んでいる１つの文書の文書識別情報）に所定の関数を適用して決まる関数値と語（本例では語の文字列）に所定の関数を適用して決まる関数値によって参照される二次元配列位置にサブインデクスを格納する構成を用い、また、サブインデクスとしてＢ＋木構造を用いることで、処理の効率化を図っている。
【００６４】
なお、具体的には、本例では上記したように、文書に適用する関数として文書識別番号を二次元配列の一の方向の位置を示すハッシュ値にマップするハッシュ関数と、語に適用する関数として語を二次元配列の他の方向の位置を示すハッシュ値にマップするハッシュ関数とを用意し、文書における語の出現をその文書識別番号およびその語の各々にハッシュ関数を適用して得られたハッシュ値を用いて対応するサブインデクスに登録するようにした。
【００６５】
そして、本例では、このようなサブインデクスの構成を用いて、複数の文書における語の出現を一括して登録する場合に、それらの文書の文書識別番号にハッシュ関数を適用して決まるハッシュ値が同じになるものを１つのグループにまとめて、グループごとに語の出現を登録することにより、インデクスの新規生成時や更新時における書き込みページ数を少なくし、処理効率を高めた。
更に、本例では、上記のようにして１つのグループにまとめられた文書におけるすべての語の出現を登録する場合に、各語の出現を語にハッシュ関数を適用して決まるハッシュ値が同じになるものを一つのグループにまとめて、グループごとに語の出現を登録することにより、処理効率を更に高めた。
【００６６】
また、本例では、サブインデクスとしてＢ＋木構造を用いるに際して、キーに当該キーと対応する文書の一意な識別情報（本例では文書識別情報）を付加することにより、例えば異なる文書中の同じ語から決定されるキー文字列が重なってしまった場合であっても、これらのキー文字列を文書識別情報により区別可能な構成とすることで、両者がＢ＋木構造中で衝突してしまうことを防いだ。
また、本例では、上記したように、主記憶装置に用意した少なくとも１つのサブインデクスが格納できるページキャッシュを用いているため、処理の高速化を図ることができる。
【００６７】
以上のように、本例のインデクス作成方法やインデクス検索方法では、例えば大量の異なる語を含む大量の文書に対する全文検索のためのインデクスを作成する処理や、作成したインデクスを用いて検索を行う処理を実行するに際して、語の文字列或いは当該文字列のハッシュ値を用いてキーを作成するようにしたため、異なる語の数によって受ける影響を小さくすることができ、例えば数百万件の文書に対する全文検索のためのインデクスを高速に生成、更新、検索することができる。
【００６８】
また、本例の方法では、例えば各語と識別番号との対応関係を管理するといったことを行わずとも、設定された閾値を越える長さの語に適用するハッシュ関数や登録番号を記憶等しておけば、語とキーとの対応が付けられるため、例えば異なる語の数が多くなっても語の管理に要する負担を少なくすることができ、これにより、１つの語の管理（追加、検索）に要する時間Ｔｗをゼロに近づけることができる。
【００６９】
また、本例の方法では、上記したサブインデクスの構成やＢ＋木の構造を採用しているため、文書の格納処理や検索処理に必要となる更新ページ数や読み出しページ数を削減することができ、登録処理や検索処理を高速に実行することができる。なお、本例では、サブインデクスとしてＢ＋木を用いることで、例えば木のルートからの検索を短いパスで実現することや、新たな語の出現を容易に追加等することができる。
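[0070]
FIG. 11 shows, as a second embodiment of the present invention, an example of the key structure used when the B+ tree index is partitioned in the vertical direction by a function value obtained by applying a function to the number of occurrences of a word. As shown, the key in this example consists of a key string followed by the number of occurrences of the word expressed as an integer, with 9 bytes allocated to the key string and 4 bytes to the occurrence count.
[0071]
FIG. 12 shows an example of a B+ tree in which word occurrences have been registered using the key structure of FIG. 11.
As FIG. 12 shows, the multiple occurrences of the same word are arranged, for example, in descending order of occurrence count. This makes it easy for search processing to return results in descending order of occurrence count.
[0072]
The procedure for searching for word occurrences in the second embodiment is the same as the search procedure of the first embodiment shown in FIG. 8. The registration procedure can be realized by, among other changes, replacing the step in FIG. 9 that groups documents by document identification number (step S41) with a step that groups word occurrences by occurrence count, and replacing the steps in FIG. 10 that generate the key value from the document identification number (steps S62 and S65) with steps that generate it from the occurrence count.
[0073]
Thus, the present invention can also use a configuration in which a hash function that maps the number of occurrences of a word in a document to a hash value indicating a position along one direction of the two-dimensional array is prepared as the function applied to word occurrences, a hash function that maps a word to a hash value indicating a position along the other direction is prepared as the function applied to words, and the occurrence of a word in a document is registered in the sub-index that corresponds to the hash values obtained by applying these hash functions to the occurrence count and to the word, respectively.
[0074]
In this configuration too, as in the first embodiment, processing efficiency can be improved when word occurrences in multiple documents are registered in a batch: occurrences whose occurrence counts hash to the same value are gathered into one group and registered group by group, and, further, occurrences whose words hash to the same value are gathered into one group and registered group by group.
[0075]
FIG. 13 shows, as a third embodiment of the present invention, an example of the key structure used when the B+ tree is partitioned in the vertical direction by a function value obtained by applying a function to the occurrence frequency of a word.
As shown, the key in this example consists of a key string followed by the occurrence frequency of the word expressed as an integer, with 9 bytes allocated to the key string and 1 byte to the occurrence frequency.
In this example, the occurrence frequency of a word occurrence is expressed as the number of times the word appears in the document, divided by the total number of words in the document, multiplied by 100.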
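The frequency value and the resulting 10-byte key can be sketched as follows. The patent states the formula but not how the result is rounded, so truncation toward zero is an assumption here, and both function names are hypothetical.

```python
def occurrence_frequency(count: int, total_words: int) -> int:
    """Occurrences divided by total words, times 100, truncated to an integer in 0..100."""
    if total_words <= 0 or not 0 <= count <= total_words:
        raise ValueError("count must lie between 0 and total_words")
    return count * 100 // total_words

def make_frequency_key(key_string: bytes, count: int, total_words: int) -> bytes:
    """Key of the third embodiment: 9-byte key string followed by a 1-byte frequency."""
    if len(key_string) != 9:
        raise ValueError("key string must be exactly 9 bytes")
    return key_string + bytes([occurrence_frequency(count, total_words)])
```

Since the value never exceeds 100, it fits in the single byte that FIG. 13 allocates to it.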
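[0076]
FIG. 14 shows an example of a B+ tree in which word occurrences have been registered using the key structure of FIG. 13.
As FIG. 14 shows, the multiple occurrences of the same word are arranged, for example, in descending order of occurrence frequency. This makes it easy for search processing to return results in descending order of occurrence frequency.
[0077]
The procedure for searching for word occurrences in the third embodiment is the same as the search procedure of the first embodiment shown in FIG. 8. The registration procedure can be realized by, among other changes, replacing the step in FIG. 9 that groups documents by document identification number (step S41) with a step that groups word occurrences by occurrence frequency, and replacing the steps in FIG. 10 that generate the key value from the document identification number (steps S62 and S65) with steps that generate it from the occurrence frequency.
[0078]
Thus, the present invention can also use a configuration in which a hash function that maps the occurrence frequency of a word in a document to a hash value indicating a position along one direction of the two-dimensional array is prepared as the function applied to word occurrences, a hash function that maps a word to a hash value indicating a position along the other direction is prepared as the function applied to words, and the occurrence of a word in a document is registered in the sub-index that corresponds to the hash values obtained by applying these hash functions to the occurrence frequency and to the word, respectively.
[0079]
In this configuration too, as in the first and second embodiments, processing efficiency can be improved when word occurrences in multiple documents are registered in a batch: occurrences whose occurrence frequencies hash to the same value are gathered into one group and registered group by group, and, further, occurrences whose words hash to the same value are gathered into one group and registered group by group.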
【００８０】
ここで、本発明は、以上に示した方法を実行する装置として把握することもできる。
一例として、本発明に係るインデクス作成装置では、指定された語から決定されるキーを用いて検索対象値を検索するためにキーと検索対象値とを対応させたインデクスを作成するに際して、ハードディスク等から成る記憶手段がインデクスを記憶し、登録手段が語の長さが設定された閾値以下であることに応じて当該語の文字列を含むキー文字列と検索対象値との組からなるキーを記憶手段に登録する一方、語の長さが前記閾値を越えることに応じて当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を含むキー文字列と検索対象値との組からなるキーを記憶手段に登録することにより、処理の効率化を図ることができる。
【００８１】
また、上記したように以上では、上記図１に示した装置に備えられた各機能手段により行われる処理は、例えばプロセッサやメモリ等を備えたハードウエア資源においてプロセッサが制御プログラムを実行することにより構成されるが、本発明では、これらの各機能手段を独立したハードウエア回路として構成してもよい。
また、本発明は上記の制御プログラムを格納したフロッピーディスクやＣＤ−ＲＯＭ等のコンピュータにより読み取り可能な記憶媒体として把握することもでき、当該制御プログラムを記憶媒体からコンピュータに入力してプロセッサに実行させることにより、本発明に係る処理を遂行させることができる。
【００８２】
一例として、本発明に係る記憶媒体では、指定された語から決定されるキーを用いて検索対象値を検索するためにキーと検索対象値とを対応させたインデクスの作成処理を、コンピュータに実行させるプログラムを当該コンピュータに読み取り可能に記憶した構成において、前記プログラムが、語の長さが設定された閾値以下であることに応じて当該語の文字列を含むキー文字列と検索対象値との組からなるキーをインデクスメモリに登録する一方、語の長さが前記閾値を越えることに応じて当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を含むキー文字列と検索対象値との組からなるキーをインデクスメモリに登録する処理を前記コンピュータに実行させることにより、処理の効率化を図ることができる。
【００８３】
なお、上記したインデクスメモリとしては、例えば上記プログラムを格納した記憶媒体とは別個なハードディスク装置等として設けることができるばかりでなく、例えば当該記憶媒体の中に設けられてもよい。また、インデクスメモリを記憶媒体中に設ける場合には、例えばキーと検索対象値とを対応させたインデクスを上記プログラム中に記憶してもよく、また、このようなインデクスを上記プログラムとは別に記憶していてもよい。
【００８４】
【発明の効果】
以上説明したように、本発明によると、指定された語から決定されるキーを用いて検索対象値を検索するためにキーと検索対象値とを対応させたインデクスを作成するに際して、語の長さが設定された閾値以下の場合には当該語の文字列を含むキー文字列と検索対象値との組からなるキーを登録する一方、語の長さが前記閾値を越える場合には当該語の文字列に所定のハッシュ関数を適用して決まるハッシュ値を当該文字列に代えて含ませたキー文字列と検索対象値との組からなるキーを登録するようにしたため、例えば異なる語の数が多い場合でも、キーの長さを短い固定長に制限することができ、これにより、データベースサイズの増加を防ぎ、処理効率を高めることができる。
【００８５】
また、本発明では、前記閾値を越える長さの語を一意に特定するための登録番号を当該語から決定されるキー文字列に付加することや、前記閾値以下の長さの語から決定されるキー文字列と前記閾値を越える長さの語から決定されるキー文字列に各々を区別するフラグを付加することを行うようにしたため、登録時や検索時において異なる語から決定されるキーを確実に区別することができる。
【００８６】
また、本発明では、キーと検索対象値との組を登録するインデクスを複数のサブインデクスにより構成し、例えば文書を特定する文書識別番号や語の出現回数や語の出現頻度を二次元配列の一の方向の位置を示すハッシュ値にマップするハッシュ関数と、語を二次元配列の他の方向の位置を示すハッシュ値にマップするハッシュ関数とを用意して、文書における語の出現をその文書識別番号等およびその語の各々にハッシュ関数を適用して得られたハッシュ値を用いて対応するサブインデクスに登録するようにしたため、例えば文書識別番号等や語にハッシュ関数を適用して決まるハッシュ値が同じになるものを１つのグループにまとめて処理することで、複数の文書における語の出現を一括して登録する処理等の効率を高めることができる。
【００８７】
また、本発明では、例えば主記憶装置に用意した少なくとも１つのサブインデクスが格納できるページキャッシュを用いることで、処理の高速化を実現した。
また、本発明では、サブインデクスとしてＢ＋木構造を用いることで、インデクス更新処理や検索処理の高速化を図るとともに、キーに当該キーと対応する文書の一意な文書識別情報を付加することで、或る文書における或る語の出現をＢ＋木インデクスに対する１回の検索で見つけること等を実現した。
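[Brief Description of the Drawings]
[FIG. 1] A diagram showing an example configuration of an apparatus according to an embodiment of the present invention.
[FIG. 2] A diagram showing an example of the key structure according to the first embodiment of the present invention.
[FIG. 3] A diagram showing an example of the key string structure.
[FIG. 4] A flowchart showing an example of the procedure for creating a key string.
[FIG. 5] A flowchart showing an example of the procedure for searching for occurrences of a word.
[FIG. 6] A diagram illustrating part of the contents of a B+ tree.
[FIG. 7] A diagram showing an example of the structure of the B+ tree index array.
[FIG. 8] A flowchart showing an example of the procedure for searching for occurrences of a word.
[FIG. 9] A flowchart showing an example of the procedure for registering the word occurrences of multiple documents in a batch.
[FIG. 10] A diagram showing an example of the procedure for registering the occurrence of a single word.
[FIG. 11] A diagram showing an example of the key structure according to the second embodiment of the present invention.
[FIG. 12] A diagram illustrating part of the contents of a B+ tree in the second embodiment of the present invention.
[FIG. 13] A diagram showing an example of the key structure according to the third embodiment of the present invention.
[FIG. 14] A diagram illustrating part of the contents of a B+ tree in the third embodiment of the present invention.
[Explanation of Reference Numerals]
1: document storage unit, 2: document sorting unit, 3: morphological analysis unit,
4: key string creation unit, 5: long-word management unit,
6: index registration unit, 7: index storage unit,
8: index selection unit, 9: query input unit,
10: search execution unit, 11: result output unit
[0001]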
[Technical Field of the Invention]
The present invention relates to a method for rapidly creating an index for, for example, full-text search over documents, and to a method for performing fast searches using that index. In particular, it relates to the structure of the keys used for search and related operations.
[0002]
[Prior art]
One method for full-text search over a large number of documents uses a data structure called a signature file. In the method disclosed in Japanese Patent Laid-Open No. 7-244671, an index is built in which the occurrence of characters in a document is represented by bits. This method allows relatively fast searches regardless of the number of stored documents.
However, because a single bit is assigned to several different words, documents containing words other than the specified one may be retrieved, so the search is not exact. In addition, the generation and search algorithms are complicated, which makes them difficult to implement on top of an existing database management system.
[0003]
To address these problems, the document "Compression and Fast Indexing for Multi-Gigabyte Text Databases" proposes a method that realizes fast full-text search using index techniques provided by general database management systems, such as hash tables and B+ trees. In this method, identification numbers are assigned to the words that serve as index keys and to the documents that serve as values, and these are stored in compressed form.
This reduces the number of disk pages that must be read during a search, so searches run quickly. Because different words are assigned different identification numbers, exact searches are also possible. The document shows that this method can search about 700,000 documents at high speed.
[0004]
However, the method described in the above document does not take into account the performance of creating a new index or updating an existing one, so high performance cannot be obtained for these operations. In particular, the check that prevents the same document from being registered twice for a given word during an update must be implemented by iterating over a collection of document identification numbers, and therefore cannot be implemented efficiently.
[0005]
In response to this problem, the present applicant proposed, in Japanese Patent Application No. 10-26691, an index creation method and a search method that make processing more efficient. In that method, the B+ tree key is a word identification number followed by a document identification number. The B+ tree is partitioned by, among other things, the hash value obtained by applying one hash function to the word identification number and the hash value obtained by applying another hash function to the document identification number, and the partitions are arranged in a two-dimensional array. When an index is newly created or updated, entries whose document identification numbers yield the same hash value are registered together, which reduces the number of pages written and improves processing efficiency.
[0006]
The method described in the above document also requires word identification numbers. To give different words different identification numbers, the words must be managed in some way. Let Tw be the time required for one word-management operation (addition or lookup), and suppose one document contains 100 different words. Word management then takes 1 Tw for a search and 100 Tw for a registration. As the number of words grows, Tw increases, and as a result the performance of index creation and update degrades significantly.
[0007]
As a concrete example, in full-text search over WWW pages on the Internet, the number of target documents reaches several million, and the number of different words appearing in them reaches tens of millions. On the Internet, the number of different words tends to grow in proportion to the number of documents because of the abundance of proper nouns, spelling mistakes, text written in multiple languages, and similar characteristics. When tens of millions of different words are managed with a B+ tree, the time Tw to add or look up one word is 0.1 to 1 second. This is because adding or looking up one word in a B+ tree requires a number of hard disk accesses equal to the height of the tree plus one, and because, once the B+ tree becomes huge, caching the disk in memory is no longer effective, so almost every access actually has to read from the hard disk.
[0008]
For example, when 10 million different words are managed with a B+ tree in which each node has 500 branches, the height of the B+ tree is log_500(10,000,000) - 1 = 1.59, so adding or looking up one word-document pair requires about 2.6 disk accesses on average. Because one hard disk access takes from several tens to several hundreds of milliseconds, adding or looking up one word-document pair takes 0.1 to 1 second. Therefore, if one document contains 100 different words, registering one document takes 100 Tw, that is, 10 to 100 seconds.
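The arithmetic in this paragraph can be checked with a short Python sketch. The tree-height formula and the "height plus one" access count come from the text; the two per-access latencies (40 ms and 400 ms) are assumed values chosen from within the "several tens to several hundreds of milliseconds" range, and the function names are hypothetical.

```python
import math

def btree_height(distinct_words: int, fanout: int) -> float:
    """Height as computed in the text: log_fanout(distinct_words) - 1."""
    return math.log(distinct_words, fanout) - 1

def accesses_per_operation(distinct_words: int, fanout: int) -> float:
    """Disk accesses for one add or lookup: tree height plus one."""
    return btree_height(distinct_words, fanout) + 1

height = btree_height(10_000_000, 500)
accesses = accesses_per_operation(10_000_000, 500)
print(f"height = {height:.2f}, disk accesses = {accesses:.1f}")

for access_seconds in (0.04, 0.4):   # assumed single-access latencies
    t_w = accesses * access_seconds
    print(f"Tw = {t_w:.2f} s, one 100-word document = {100 * t_w:.0f} s")
```

With these assumed latencies, Tw comes out near the 0.1 second and 1 second bounds quoted in the text, and registering a 100-word document lands near the 10 and 100 second bounds.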
[0009]
[Problems to be solved by the invention]
The conventional example above uses, as the B+ tree key for searching and other operations, a word identification number followed by a document identification number. Instead of assigning identification numbers to words in this way, the key could be created from the character string of the word itself. Creating the key from the word's character string removes the need to manage the correspondence between words and identification numbers, so the burden of word management should stay small even when the number of different words grows.
[0010]
However, B+ tree keys generally have a fixed length. To accommodate words of all the various lengths used in documents, the key would have to be long enough to hold any of them. A longer key makes the database larger, which increases the number of hard disk reads and lowers the hit rate of the in-memory cache, degrading overall performance.
Even if the key length were made variable, processing would become complicated and disk fragmentation would occur, so access efficiency would suffer and overall performance would again degrade.
[0011]
The present invention has been made in view of these circumstances. Its object is to provide a method that, by devising how a key is determined from the character string of a word, rapidly creates an index for full-text search over, for example, a large number of documents containing a large number of different words. A further object is to provide a method for performing fast searches using an index created in this way.
Another object of the present invention is to provide an apparatus for executing these methods and a storage medium storing a program for them.
[0012]
[Means for Solving the Problems]
To achieve the above objects, in the index creation method according to the present invention, an index creation device creates an index that associates keys with search target values, so that a search target value can be retrieved using a key determined from a specified word. When creating the index, the registration means of the index creation device registers in the storage means, if the length of a word is less than or equal to a set threshold, a key consisting of a pair of a key string containing the character string of the word and a search target value. If the length of the word exceeds the threshold, it registers in the storage means a key consisting of a pair of a search target value and a key string that contains, in place of the character string, a hash value determined by applying a predetermined hash function to the character string of the word.
Thus, in the present invention, the key for a word longer than the threshold is built from the hash value of its character string. The key length can therefore be limited to, for example, a short fixed length, which prevents the database from growing and improves processing efficiency.
[0013]
In the present invention, the registration means of the index creation device registers each word whose length exceeds the threshold in a registration table in association with a registration number that uniquely identifies the word, and adds the registration number to the key character string determined from the word. As a result, each word can be distinguished even if hash values coincide between words whose lengths exceed the threshold.
In the present invention, the registration means of the index creation device also adds a flag that distinguishes key character strings determined from words whose length is less than or equal to the threshold from key character strings determined from words whose length exceeds the threshold, so that the two kinds of key character string can be told apart.
[0014]
In the index creation method according to the present invention, when an index is created with a threshold set for the word length as described above, the index in which the keys, each consisting of a pair of a key character string and a search target value, are registered is composed of a plurality of sub-indexes. Each sub-index is stored at the position in a two-dimensional array that is referenced by two function values: one determined by applying a predetermined function to the search target value corresponding to the key to be registered, and one determined by applying a predetermined function to the word corresponding to the key to be registered.
Here, in the present invention, the search target value associated with a key is, for example, the document identification information of one document that contains the word from which the key is determined.
[0015]
As one example of configuring the sub-indexes described above, in the present invention a document identification number that uniquely identifies a document is assigned and used as the document identification information. A hash function that maps a document identification number to a hash value indicating the position in one direction of the two-dimensional array is prepared as the function applied to the document identification information, and a hash function that maps a word to a hash value indicating the position in the other direction of the two-dimensional array is prepared as the function applied to the word. The registration means of the index creation device registers the occurrence of a word in a document in the sub-index corresponding to the hash values obtained by applying these hash functions to the document identification number and to the word, respectively.
[0016]
In the present invention, when such a sub-index configuration is used and word occurrences in a plurality of documents are registered at once, the registration means of the index creation device gathers into one group the documents for which the hash value determined by applying the hash function to the document identification number is the same, and registers word occurrences group by group.
Thus, according to the present invention, when an index is newly created or updated, for example, documents whose document identification numbers have the same hash value are registered together, which reduces the number of pages written and increases processing efficiency.
[0017]
In the present invention, when the registration means of the index creation device registers the occurrences of all words in the documents gathered into one group as described above, it can further improve processing efficiency by gathering into one group the word occurrences for which the hash value determined by applying the hash function to the word is the same, and registering word occurrences group by group.
[0018]
As another example of the sub-indexes described above, in the present invention the number of occurrences of a word in a document is used as the document identification information. A hash function that maps the number of occurrences of a word in a document to a hash value indicating the position in one direction of the two-dimensional array is prepared as the function applied to the document identification information, and a hash function that maps a word to a hash value indicating the position in the other direction of the two-dimensional array is prepared as the function applied to the word. The registration means of the index creation device registers the occurrence of a word in a document in the sub-index corresponding to the hash values obtained by applying these hash functions to the number of occurrences of the word and to the word, respectively.
[0019]
In this configuration as well, when the registration means of the index creation device registers word occurrences in a plurality of documents at once, processing efficiency can be improved by gathering into one group the occurrences for which the hash value determined by applying the hash function to the number of occurrences of each word is the same, and registering word occurrences group by group. Also as above, when the registration means registers the occurrences of all words in the documents gathered into one group, processing efficiency can be further improved by gathering into one group the occurrences for which the hash value determined by applying the hash function to the word is the same, and registering word occurrences group by group.
[0020]
As yet another example of the sub-indexes described above, in the present invention the occurrence frequency of a word in a document is used as the document identification information. A hash function that maps the occurrence frequency of a word in a document to a hash value indicating the position in one direction of the two-dimensional array is prepared as the function applied to the document identification information, and a hash function that maps a word to a hash value indicating the position in the other direction of the two-dimensional array is prepared as the function applied to the word. The registration means of the index creation device registers the occurrence of a word in a document in the sub-index corresponding to the hash values obtained by applying these hash functions to the occurrence frequency of the word and to the word, respectively.
[0021]
In this configuration as well, when the registration means of the index creation device registers word occurrences in a plurality of documents at once, processing efficiency can be improved by gathering into one group the occurrences for which the hash value determined by applying the hash function to the occurrence frequency of each word is the same, and registering word occurrences group by group. Also as above, when the registration means registers the occurrences of all words in the documents gathered into one group, processing efficiency can be further improved by gathering into one group the occurrences for which the hash value determined by applying the hash function to the word is the same, and registering word occurrences group by group.
In the present invention, the above registration is sped up by using, for example, a page cache prepared in the main storage device that can hold at least one sub-index.
[0022]
In the present invention, a B+ tree structure is used for the sub-indexes described above. Adding document identification information to a key makes it possible to distinguish, document by document, keys determined from the same word in different documents. This prevents such keys from overlapping and colliding in the B+ tree structure (for example, one becoming indistinguishable from the other and overwriting it).
As the document identification information unique to a document, information such as the document identification number, the number of occurrences of a word, or the occurrence frequency of a word can be used, for example.
[0023]
In the index search method according to the present invention, an index that associates document names with keys determined from the words contained in the documents is configured as a B+ tree structure, and an index search device performs a search to obtain the corresponding document names using a key determined from a designated word. Document identification information that uniquely identifies each document name is assigned. In the storage means of the index search device, when the length of a word is less than or equal to a set threshold, a key consisting of a pair of a key character string containing the character string of the word and the document identification information is registered; when the length of a word exceeds the threshold, a key consisting of a pair of a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word and the document identification information is registered. If the length of the designated word is less than or equal to the set threshold, the search means of the index search device uses as the key the combination of a key character string containing the character string of the word and document identification information specifying the search range, and retrieves from the storage means the document identification information corresponding to the keys that match it. If the length of the designated word exceeds the threshold, the search means uses as the key the combination of a key character string containing the hash value determined by applying the predetermined hash function to the character string of the word and the document identification information, and retrieves from the storage means the document identification information corresponding to the keys that match it.
[0024]
Thus, in the present invention, the key is, for example, a key character string containing the character string of a word or its hash value, followed by document identification information specifying the search range, so the occurrence of a given word in a given document can be found with a single search of the B+ tree index.
As described above, information such as the document identification number, the number of occurrences of a word, or the occurrence frequency of a word can be used as the document identification information.
[0025]
In the above index search method according to the present invention, a registration number that uniquely identifies the word is added to the key character string determined from a word whose length exceeds the threshold, and the registration number and the word are registered in association with each other in a registration table of the index search device. At search time, the search means of the index search device performs the search using a key in which a registration number specifying the search range is added to the key character string determined from the word whose length exceeds the threshold. It then identifies, from the registration table, the registered word corresponding to the registration number added to each retrieved key character string and, based on whether the identified word matches the search target word, identifies the corresponding document names within the retrieved set of document names.
[0026]
As described above, in the present invention, when a search is performed using a key determined from a word whose length exceeds the threshold, a set of document names is first retrieved from the hash value and the document identification information, that is, from the parts of the key other than the registration number, and the corresponding document names are then identified within the retrieved set using the registration number. Therefore, even if the hash values constituting the keys coincide between different words in the same document, the word that matches the search target can be uniquely identified.
[0027]
The present invention can also be configured as a device that executes the method described above, or a storage medium that stores a program for executing the method described above.
For example, the index creation device according to the present invention creates an index that associates keys with search target values so that a search target value can be found using a key determined from a specified word. In this device, storage means stores the index, and registration means registers keys in the storage means. When the word length is less than or equal to a set threshold, the registered key consists of a pair of a key character string containing the character string of the word and the search target value. When the word length exceeds the threshold, the registered key consists of a pair of a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word and the search target value.
[0028]
The storage medium according to the present invention stores, in a computer-readable manner, a program that causes a computer to execute index creation processing that associates keys with search target values so that a search target value can be found using a key determined from a specified word. When the word length is less than or equal to a set threshold, the program causes the computer to register in an index memory a key consisting of a pair of a key character string containing the character string of the word and the search target value. When the word length exceeds the threshold, the program causes the computer to register a key consisting of a pair of a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word and the search target value.
[0029]
[Detailed Description of the Invention]
A first embodiment according to the present invention will be described with reference to the drawings.
FIG. 1 shows an example of the configuration of an apparatus for executing the method according to the present invention. This apparatus is realized by executing, on computer hardware resources, a program that implements the present invention.
[0030]
The document storage unit 1 is configured by external storage such as a hard disk device, and stores and manages the documents to be registered or searched in association with their document names or document identification numbers. The document identification number is information that uniquely identifies a document; for example, a different number is assigned to each document.
The document sort unit 2 sorts the document names of the documents to be registered in the index so that documents whose document identification numbers yield the same hash value under a predefined hash function are gathered together.
[0031]
The morpheme analysis unit 3 analyzes the full text of the designated document and extracts words.
The key character string creation unit 4 creates a key character string based on a given word.
The long word management unit 5 manages a long word table that holds a list of words longer than a predetermined threshold. In this example, a long word table is prepared for each document. A unique registration number is assigned to each word held in the long word table for each document.
Specific examples of the key character string and the registration number will be described later.
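The per-document long word table can be pictured with a minimal Python sketch. The class name LongWordTable, its method names, and the choice to hand out registration numbers sequentially from 0 are assumptions made for illustration; the description only requires that each long word in a document receive a unique registration number, which in this example fits in 15 bits.

```python
class LongWordTable:
    """Hypothetical per-document table of words longer than the threshold."""

    MAX_REG_NO = 0x7FFF  # largest 15-bit registration number (32767)

    def __init__(self):
        self._number_of = {}  # word -> registration number
        self._word_of = []    # registration number -> word

    def register(self, word: bytes) -> int:
        """Return the registration number of word, assigning one if needed."""
        if word in self._number_of:
            return self._number_of[word]
        reg_no = len(self._word_of)
        if reg_no > self.MAX_REG_NO:
            raise OverflowError("too many long words in one document")
        self._number_of[word] = reg_no
        self._word_of.append(word)
        return reg_no

    def lookup(self, reg_no: int) -> bytes:
        """Return the word registered under reg_no."""
        return self._word_of[reg_no]
```

Registering the same word twice returns the same number, so a document never holds two registration numbers for one word.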
[0032]
The index registration unit 6 obtains the document identification number and the words of the given document name, and registers the occurrence of each word in the document in the B+ tree selected by the index selection unit 8 described later, using as the key the combination of the key character string and the document identification number.
The index storage unit 7 is configured by external storage such as a hard disk device, and stores the B+ trees in a two-dimensional array of a predetermined size (D × W in this example, where D and W are integers of 1 or more). The index storage unit 7 also stores the correspondence between document names and document identification numbers.
[0033]
The index selection unit 8 applies predetermined hash functions to the given document identification number and to the character string of the word (for example, its character codes), and uses the resulting hash values to select, from the index table stored in the index storage unit 7, the identification number of the B+ tree in which the occurrence of the word is to be registered.
Here, the hash function Hd, which is applied to the document identification number in the document sort unit 2 and the index selection unit 8, and the hash function Hw, which is applied to the character string of the word, are defined so that, for a document identification number id and a word character string s, they return integer hash values satisfying 0 ≦ Hd(id) < D and 0 ≦ Hw(s) < W, respectively.
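As a rough illustration of how the two hash values select a sub-index, the following Python sketch maps a word and a document identification number to an array position. The concrete functions (a modulo for Hd, a CRC-32 reduced modulo W for Hw) and the sizes D = 4 and W = 8 are assumptions; the description only constrains the ranges 0 ≦ Hd(id) < D and 0 ≦ Hw(s) < W.

```python
import zlib

D = 4  # size of the array in the document direction (illustrative value)
W = 8  # size of the array in the word direction (illustrative value)


def hd(doc_id: int) -> int:
    """Stand-in for Hd: returns an integer in the range 0 <= value < D."""
    return doc_id % D


def hw(word: bytes) -> int:
    """Stand-in for Hw: returns an integer in the range 0 <= value < W."""
    return zlib.crc32(word) % W


def select_subindex(word: bytes, doc_id: int) -> tuple:
    """Return the position (Hw(s), Hd(id)) of the B+ tree to use."""
    return hw(word), hd(doc_id)
```

Any pair of functions with these ranges would serve; the particular choice mainly affects how evenly the word occurrences spread over the D × W sub-indexes.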
[0034]
The inquiry input unit 9 receives a search request from a user, and generates a search expression in which words are combined with AND or OR, for example.
The search execution unit 10 uses the index selection unit 8 to determine, from the character string of each word in the given search expression, the B+ trees to be searched, and performs the search process.
The result output unit 11 presents the search result obtained by the search execution unit 10 to the user through a display device or the like.
[0035]
FIG. 2 shows a configuration example of the B+ tree key stored in the index storage unit 7.
As shown in the figure, the B+ tree key consists of a key character string followed by a document identification number. In this example, 9 bytes are allocated to the key character string and 4 bytes to the document identification number.
[0036]
FIG. 3 shows a configuration example of the key character string.
As shown in the figure, the key character string takes one of two structures depending on whether the length of the target word is at most, or exceeds, a predetermined threshold. In this example, the word length threshold is set to 8 bytes. If the length of the target word is 8 bytes or less (that is, at or below the threshold), the character string of the word itself is placed in the key character string. If the length of the target word exceeds 8 bytes (that is, exceeds the threshold), the hash value Hl(s), obtained by applying the hash function Hl to the word character string s, is placed in the key character string instead.
[0037]
Here, the hash function Hl is defined so that, where s is the character string of a word and n is the threshold, it returns an integer hash value satisfying 0 ≦ Hl(s) < 2^{8(n-1)}. That is, the data size of the value returned by Hl (the hash value) is 1 byte less than the threshold (7 bytes in this example).
[0038]
When the length of a word exceeds 8 bytes, hash values may coincide between different words. In this example, therefore, the key character string is formed by combining the hash value with the registration number of the word in the long word table (as described later, a 2-byte registration number part containing a 15-bit registration number; see FIG. 3). As a result, even if hash values coincide between long words appearing in the same document, the key character strings do not coincide. Furthermore, in the actual B+ tree, the key is the key character string combined with the document identification number, as shown in FIG. 2, so even if key character strings coincide between words contained in different documents, the B+ tree keys do not.
[0039]
There is also a possibility that the character string of a word of 8 bytes or less coincides in value with the hash value of a word longer than 8 bytes. In this example, therefore, when the word length is 8 bytes or less, the value of the eighth byte of the word character string is moved to the ninth byte of the key character string, and the eighth byte of the key character string is set to 0. For example, when the target word is "internet", the key character string is 696E7465726E650074 (hexadecimal).
[0040]
For words longer than 8 bytes, on the other hand, the maximum registration number is 32767 (that is, 2^{15} - 1), and the most significant bit of the area that stores the registration number (the registration number part) is always set to 1. Consequently, the most significant bit of the eighth byte of the key character string is always 0 when the word length is 8 bytes or less and always 1 when it exceeds 8 bytes. Even if the character string of a word at or below the threshold coincides in value with the hash value of a word above the threshold, the key character strings therefore never coincide.
[0041]
FIG. 4 shows an example of a procedure for creating a key character string.
When the length of the given word is 8 bytes or less (step S1), the first 7 bytes of the word character string s are copied to the key character string iw (step S2), the eighth byte of iw is set to 0 (step S3), and the eighth byte of s is copied to the ninth byte of iw (step S4). This completes the key character string iw.
[0042]
If the length of the given word exceeds 8 bytes (step S1), the hash value Hl(s) of the word character string s is first copied to the key character string iw (step S5), and then the logical OR of the given registration number and 1000000000000000 (binary) is copied to the eighth and subsequent bytes of iw (step S6). This completes the key character string iw.
In FIG. 4, << indicates an operation for shifting a bit to the left.
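The procedure of FIG. 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the hash function Hl is not specified in the description, so a SHA-1 digest truncated to 7 bytes stands in for it, and the names hl and make_key_string are hypothetical. Padding short words with zero bytes and storing the registration number part in big-endian order are also assumptions, chosen so that the flag bit lands in the most significant bit of the eighth byte.

```python
import hashlib

THRESHOLD = 8               # word length threshold n, in bytes
HASH_BYTES = THRESHOLD - 1  # Hl returns a value of n - 1 = 7 bytes


def hl(word: bytes) -> bytes:
    """Stand-in for Hl (assumption): SHA-1 truncated to 7 bytes."""
    return hashlib.sha1(word).digest()[:HASH_BYTES]


def make_key_string(word: bytes, reg_no: int = 0) -> bytes:
    """Build the 9-byte key character string iw for a word."""
    if len(word) <= THRESHOLD:                       # step S1
        padded = word.ljust(THRESHOLD, b"\x00")      # zero padding (assumption)
        # Steps S2 to S4: first 7 bytes, a zero byte, then the 8th byte.
        return padded[:7] + b"\x00" + padded[7:8]
    if not 0 <= reg_no <= 0x7FFF:
        raise ValueError("registration number must fit in 15 bits")
    # Steps S5 and S6: hash value, then registration number OR-ed with
    # 1000000000000000 (binary), which sets the flag bit.
    return hl(word) + (reg_no | 0x8000).to_bytes(2, "big")


print(make_key_string(b"internet").hex().upper())  # 696E7465726E650074
```

For the word "internet" the sketch reproduces the key character string given in the text, and for any word longer than 8 bytes the eighth byte of the result has its most significant bit set.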
[0043]
Thus, in the index creation method of this example, when the word length is less than or equal to the set threshold, a pair consisting of a key character string containing the character string of the word and a value (in this example, the document identification information of one document containing the word from which the key character string is determined) is registered as a key. When the word length exceeds the threshold, a pair consisting of a key character string that contains, instead of the character string, a hash value determined by applying a predetermined hash function to the character string of the word, and the value, is registered as a key. The key length can therefore be limited to, for example, a short fixed length, which prevents the database size from increasing and raises processing efficiency.
[0044]
Any value may be set as the word length threshold. It is preferable, however, to choose a value that reduces the index size and the time required for registration and search processing, taking into account, for example, the occurrence rate of words at or below the threshold and of words above it.
[0045]
As described above, in this example, each word whose length exceeds the set threshold is registered in a registration table (the long word table in this example) in association with a registration number that uniquely identifies the word, and the registration number is added to the key character string determined from the word. Each word can thus be distinguished even if hash values coincide between words whose lengths exceed the threshold.
In this example, a flag is also added that distinguishes key character strings determined from words at or below the set threshold from key character strings determined from words above it, so that the two kinds of key character string can be told apart. As a preferred form, this example uses as the flag the most significant bit of the eighth byte from the beginning, which lies at the boundary between the hash value and the registration number in the key character string, but the configuration of the flag is not particularly limited.
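Reading the flag and the registration number back out of a 9-byte key character string might look like the following Python sketch. The function names are hypothetical, and the big-endian layout of the 2-byte registration number part is an assumption consistent with the flag sitting in the most significant bit of the eighth byte.

```python
def is_long_word_key(key_string: bytes) -> bool:
    """True if the flag (most significant bit of the 8th byte) is set."""
    return bool(key_string[7] & 0x80)


def registration_number(key_string: bytes) -> int:
    """Extract the 15-bit registration number from bytes 8 and 9."""
    return int.from_bytes(key_string[7:9], "big") & 0x7FFF


short_key = bytes.fromhex("696E7465726E650074")  # key string for "internet"
print(is_long_word_key(short_key))  # False
```

Because the eighth byte of a short-word key is always 0, the single bit test is enough to tell the two key structures apart.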
[0046]
FIG. 5 shows an example of a processing procedure for searching the B+ tree for occurrences of a word.
In a search for the documents that contain a given word, a key character string iw is first created from the word (step S11). To retrieve all occurrences of the word contained in the B+ tree, the value obtained by appending to iw the minimum document identification number of the search range (here, 0 (32 bits)) is set as the start point (step S12), and the value obtained by appending to iw the maximum document identification number of the search range (here, FFFFFFFF (hexadecimal)) is set as the end point (step S14).
[0047]
When the length of the given word exceeds the threshold, the key character string iw is the hash value of the word with a registration number added, so a search range is set for the registration number in the same way as for the document identification number. That is, in this example, the minimum registration number of the search range (here, 0) is supplied when the key character string iw for the lower bound of the search range is created (step S11), and the maximum registration number (here, 7FFF (hexadecimal)) is supplied when the key character string iw for the upper bound is created (step S13). The search thus covers the range from the minimum to the maximum registration number given to the key character string iw.
[0048]
The search is then performed from the start point to the end point (step S15). If the length of the given word is less than or equal to the threshold (8 bytes in this example) (step S16), all occurrences of the word are obtained in ascending order of document identification number (step S17).
If the length of the given word exceeds the threshold (step S16), the search returns all occurrences that fall within the registration number range given to the key character string iw, and the occurrences that really correspond to the search target word are then identified among them.
[0049]
Specifically, in this example, the registration number and the document identification number are extracted from each search result, the long word table corresponding to the document identification number is consulted (step S18) to identify the word corresponding to the registration number, and the identified word is compared with the search target word to verify whether the search target word is really contained in the retrieved document (step S19). If the search target word is contained in the document, the occurrence of the word is returned (step S17); if not, NULL, for example, is returned (step S20).
Similarly to the above, in FIG. 5, << indicates an operation for shifting a bit to the left.
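The search procedure of FIG. 5 can be sketched in Python with a sorted list of 13-byte keys standing in for the B+ tree; the range scan uses the standard bisect module. Everything here is an assumption for illustration: the stand-in hash hl (SHA-1 truncated to 7 bytes), the function names, and the representation of the long word tables as a dictionary from document identification number to a list indexed by registration number.

```python
import bisect
import hashlib


def hl(word: bytes) -> bytes:
    """Stand-in for Hl (assumption): SHA-1 truncated to 7 bytes."""
    return hashlib.sha1(word).digest()[:7]


def key_string(word: bytes, reg_no: int = 0) -> bytes:
    """9-byte key character string; reg_no is used only for long words."""
    if len(word) <= 8:
        padded = word.ljust(8, b"\x00")
        return padded[:7] + b"\x00" + padded[7:8]
    return hl(word) + (reg_no | 0x8000).to_bytes(2, "big")


def search_word(sorted_keys, long_tables, word: bytes) -> list:
    """Return the document identification numbers whose keys match word."""
    start = key_string(word, 0) + (0).to_bytes(4, "big")                # S11, S12
    end = key_string(word, 0x7FFF) + (0xFFFFFFFF).to_bytes(4, "big")    # S13, S14
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    hits = []
    for key in sorted_keys[lo:hi]:                                      # S15
        doc_id = int.from_bytes(key[9:], "big")
        if len(word) > 8:                                               # S16
            reg_no = int.from_bytes(key[7:9], "big") & 0x7FFF           # S18
            if long_tables[doc_id][reg_no] != word:                     # S19
                continue  # same hash value, different word
        hits.append(doc_id)                                             # S17
    return hits
```

For a short word the registration number arguments are ignored, so the start and end points differ only in the document identification number, and the hits come back in ascending order of that number. For a long word the scan covers every registration number, and the long word table lookup filters out words that merely share the hash value.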
[0050]
As an example, suppose that part of the B+ tree is in the state shown in FIG. 6 after word occurrences for several documents have been registered. In this state, to find the documents containing the word whose key character string is 746573740000000000 (hexadecimal), the keys in the range from 74657374000000000000000000 (hexadecimal) to 746573740000000000FFFFFFFF (hexadecimal) are searched, which yields the occurrences (O4, O5) of the target word.
[0051]
To check whether the word whose key character string is 746573740000000000 (hexadecimal) is contained in the document whose document identification number is 7, the key that matches 74657374000000000000000007 (hexadecimal) is searched for, which yields the word occurrence (O5). In this example, a word occurrence includes, for example, information such as the page and line of the document on which the word appears, and information such as the number of occurrences and the occurrence frequency of the word.
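The 13-byte key for document 7 given above can be reproduced in a few lines of Python. The bytes 74 65 73 74 are the ASCII codes of "test", which is an inference from the hexadecimal value rather than something the description states. The helper name btree_key, the zero padding of the unused bytes, and the big-endian encoding of the 4-byte document identification number are assumptions read off from the example.

```python
def btree_key(key_string: bytes, doc_id: int) -> bytes:
    """Append the 4-byte document identification number to a key string."""
    return key_string + doc_id.to_bytes(4, "big")


# 9-byte key character string for the 4-byte word "test":
# 7 bytes of the zero-padded word, a zero byte, then the (zero) 8th byte.
key_string = b"test".ljust(7, b"\x00") + b"\x00" + b"\x00"

start = btree_key(key_string, 0)           # lower bound of the range search
end = btree_key(key_string, 0xFFFFFFFF)    # upper bound of the range search
exact = btree_key(key_string, 7)           # exact key for document 7

print(exact.hex().upper())  # 74657374000000000000000007
```

The start and end keys bracket every document identification number for the word, while the exact key checks one specific document with a single lookup.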
[0052]
Thus, in the index search method of this example, an index that associates document names with keys determined from the words contained in the documents is configured as a B+ tree structure, and when a search is performed to obtain the corresponding document names using a key determined from a word, document identification information (a document identification number in this example) that uniquely identifies each document name is assigned. If the word length is less than or equal to the set threshold, the key is the key character string containing the character string of the word combined with the document identification information specifying the search range. If the word length exceeds the threshold, the key is the key character string containing the hash value determined by applying a predetermined hash function to the character string of the word, combined with the document identification information. The occurrence of a given word in a given document can therefore be found with a single search of the B+ tree index.
[0053]
In this example, a registration number that uniquely identifies the word is added to the key character string determined from a word whose length exceeds the threshold, and the registration number and the word are registered in the long word table in association with each other. In the above search, the search is performed using a value in which a registration number specifying the search range is added to the key character string determined from the word whose length exceeds the threshold. The registered word corresponding to the registration number added to each retrieved key character string is then identified, and the corresponding document names are identified within the retrieved set of document names based on whether the identified word matches the search target word. This makes it possible, as described above, to uniquely identify the word that matches the search target.
[0054]
FIG. 7 shows the storage structure of the B+ trees in the index storage unit 7.
As shown in the figure, in this example D × W sub-indexes are provided as a D × W two-dimensional array, and a B+ tree structure is used for each sub-index. For example, the occurrence of a word with character string s in the document with document identification number id is registered in the B+ tree of the sub-index at position (Hw(s), Hd(id)).
Therefore, when searching for occurrences of a word whose character string is s, the process shown in FIG. 5 is executed on each B+ tree selected by the procedure described below.
[0055]
Here, FIG. 8 shows an example of a processing procedure for searching for a document in which one specified word appears.
In this process, the hash value Hw(s), obtained by applying the hash function Hw to the character string s of the given word, is first assigned to w (step S31). The variables i and r are then initialized to 0 (step S32). While i is incremented by 1 (step S35) until it reaches D (step S36), the word is searched for in the B+ tree (w, i) (step S33) and the result is added to the array R[r, r + r′] (step S34). Here r′ is assigned the number of search results, and each time i is incremented by 1, r is replaced with r + r′.
[0056]
This performs a search over the group of B+ trees stored in the sub-indexes of one row of the two-dimensional array shown in FIG. 7. Even though the search range is limited to one row, the occurrences of the target word are not contained in any other B+ tree, so the target word is contained only in the documents found by this search. In the search processing of this example, the B+ trees to be searched are thus limited, and because each B+ tree is used in turn, the hit rate of the cache holding the sub-index being searched is high and the search is efficient. In the search of this example, at least one B+ tree sub-index is held in the cache in the main storage device used by the search execution unit 10.
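The loop of FIG. 8 can be sketched in Python as follows. To keep the example short, each sub-index is modeled as a dictionary from a word to a list of document identification numbers rather than as a real B+ tree, and Hw is a CRC-32 reduced modulo W; both are assumptions, as are the sizes D = 3 and W = 4 and the name search_row.

```python
import zlib

D, W = 3, 4  # illustrative array size


def hw(word: bytes) -> int:
    """Stand-in for Hw: returns an integer in the range 0 <= value < W."""
    return zlib.crc32(word) % W


def search_row(subindexes, word: bytes) -> list:
    """Search the D sub-indexes of row Hw(word), one after another."""
    w = hw(word)                                 # step S31
    results = []                                 # step S32 (R, r = 0)
    for i in range(D):                           # steps S35, S36
        found = subindexes[w][i].get(word, [])   # step S33
        results.extend(found)                    # step S34
    return results
```

Only D of the D × W sub-indexes are touched, and they are visited strictly in sequence, which is what lets a cache sized for a single sub-index stay effective.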
[0057]
FIG. 9 shows an example of a processing procedure for collectively registering appearances of words in a plurality of documents.
That is, in this process, first, the documents are divided into D groups according to the hash value Hd(id) obtained by applying the hash function Hd to the document identification number id of each document, and the documents of each group are stored in the arrays G(0) to G(D-1) (step S41). Subsequently, the variable d is initialized to 0 (step S42), and the following registration process is performed while d is incremented by 1 (step S49) until d reaches D (step S50).
[0058]
That is, in this registration process, for each array G(d) described above, word occurrences (pairs of a document and a word) are extracted from all documents stored in G(d) (step S43). The extracted occurrences are divided into W groups according to the hash value Hw(s) obtained by applying the hash function Hw to the character string s of each word, and the occurrences of each group are stored in the arrays O(0) to O(W-1) (step S44). Then the variable w is initialized to 0 (step S45), and, while w is incremented by 1 (step S47) until w reaches W (step S48), the word occurrences belonging to each group O(w) are registered (step S46). The procedure for registering an individual word occurrence is described below.
[0059]
With the above processing, word occurrences are stored in order into the B+ tree sub-indexes arranged from top to bottom, starting at the upper left of the array shown in FIG. 7. When storage into the bottom sub-index is complete, storage proceeds in the same top-to-bottom order through the sub-indexes of the next column to the right. Consequently, multiple B+ tree sub-indexes are never accessed alternately, and the page cache hit rate can be increased.
Furthermore, if the main memory has an area that can hold the contents of one B+ tree sub-index, all storage processing for that sub-index can be executed in main memory, so the storage processing can be executed very quickly.
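The two-level grouping of steps S41 to S50 can be illustrated with a short sketch. The hash functions, the dimensions `W` and `D`, the `docs` input format (a mapping from document identification number to a list of words), and the use of Python sets in place of B+ tree sub-indexes are assumptions made for the example. The `visit_log` argument is a hypothetical addition that records the order in which sub-indexes are written.

```python
import zlib
from collections import defaultdict

W, D = 2, 2   # hypothetical array dimensions


def Hw(s: str) -> int:
    # Stand-in word hash (assumption).
    return zlib.crc32(s.encode("utf-8")) % W


def Hd(doc_id: int) -> int:
    # Stand-in document hash (assumption).
    return doc_id % D


def batch_register(docs, index, visit_log):
    # Step S41: split the documents into D groups G(0)..G(D-1) by Hd(id).
    G = [[] for _ in range(D)]
    for doc_id in docs:
        G[Hd(doc_id)].append(doc_id)

    for d in range(D):                          # steps S42, S49, S50
        # Steps S43 and S44: extract (word, document) occurrences from the
        # documents of G(d) and split them into W groups O(0)..O(W-1).
        O = [[] for _ in range(W)]
        for doc_id in G[d]:
            for s in set(docs[doc_id]):
                O[Hw(s)].append((s, doc_id))

        for w in range(W):                      # steps S45, S47, S48
            visit_log.append((w, d))
            index[(w, d)].update(O[w])          # step S46
```

With this loop order each sub-index (w, d) is written in exactly one contiguous burst, column by column, so a cache that holds a single sub-index is never forced to evict and reload it.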
[0060]
FIG. 10 shows an example of a processing procedure for registering the appearance of a certain word in a certain document.
That is, in this process, for example, the character string s of the word in the document and the document identification number id of the document are obtained, and the hash values Hw(s) and Hd(id) obtained by applying the hash functions Hw and Hd to them, respectively, are held in the variables w and d (steps S61 and S62).
[0061]
Subsequently, a key character string iw is created from the target word. If the word length is less than or equal to the set threshold (8 bytes in this example) (step S63), a key character string iw containing the character string of the word is created (step S64). If the word length exceeds the threshold (step S63), the word is registered in the long word table corresponding to the document to obtain a registration number (step S67), and the key character string iw is created from the obtained registration number and the word (step S68).
[0062]
Next, the value of the created key character string iw is shifted left by 32 bits, the value of the document identification number id is added to it, and the result is assigned to the variable k (step S65). The occurrence of the word is then registered, with k as the key, in the B+ tree sub-index (w, d) of the array shown in FIG. 7 (step S66).
Similar to the above, in FIG. 10, << indicates an operation for shifting a bit to the left.
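As a rough illustration of steps S63 to S68, the sketch below builds a key character string and then the key k = (iw << 32) + id. The byte layout is an assumption, not something this passage specifies: a 9-byte key string made of a 1-byte flag followed by either the word padded to 8 bytes, or a 4-byte hash plus a 4-byte registration number. CRC-32 is a stand-in hash, and `long_table` is a hypothetical dictionary playing the role of the long word table.

```python
import zlib

THRESHOLD = 8   # bytes, the threshold used in this example


def make_key_string(word: bytes, long_table: dict) -> int:
    """Return the key character string iw as an integer (assumed 9 bytes)."""
    if len(word) <= THRESHOLD:                       # step S63
        # Step S64: flag 0 followed by the word itself, NUL-padded to 8 bytes.
        raw = b"\x00" + word.ljust(THRESHOLD, b"\x00")
    else:
        # Step S67: register the long word and obtain its registration number.
        reg = long_table.setdefault(word, len(long_table))
        # Step S68: flag 1, then the hash of the word, then the number.
        h = zlib.crc32(word)
        raw = b"\x01" + h.to_bytes(4, "big") + reg.to_bytes(4, "big")
    return int.from_bytes(raw, "big")


def make_key(iw: int, doc_id: int) -> int:
    """Step S65: k = (iw << 32) + id, with id assumed to fit in 32 bits."""
    if not 0 <= doc_id < 2**32:
        raise ValueError("document identification number must fit in 32 bits")
    return (iw << 32) + doc_id
```

Because the document identification number occupies the low 32 bits, all keys for one word are adjacent in key order and sorted by document, whatever the length of the original word.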
[0063]
Thus, in this example, an index is created that associates keys with values so that a value (in this example, the document identification information of one document containing the word that determines the key character string) can be found using a key containing a key character string determined from a specified word. The index in which pairs of a key character string and a value are registered is composed of a plurality of sub-indexes, and each sub-index is stored at the position in a two-dimensional array referenced by the function value determined by applying a predetermined function to the value to be registered and the function value determined by applying a predetermined function to the word (in this example, the character string of the word). Together with the use of a B+ tree structure for each sub-index, this configuration improves processing efficiency.
[0064]
Specifically, in this example, as described above, a hash function that maps the document identification number to a hash value indicating the position in one direction of the two-dimensional array is provided as the function applied to the document, and a hash function that maps the word to a hash value indicating the position in the other direction of the two-dimensional array is provided as the function applied to the word. The occurrence of a word in a document is registered in the corresponding sub-index using the hash values obtained by applying these hash functions to the document identification number and to the word, respectively.
[0065]
In this example, when the occurrences of words in a plurality of documents are registered at once using such a sub-index structure, the documents for which the hash value determined by applying the hash function to the document identification number is the same are gathered into one group, and word occurrences are registered group by group. This reduces the number of pages written when an index is newly created or updated and improves processing efficiency.
Furthermore, in this example, when the occurrences of all words in the documents gathered into one group are registered as described above, the occurrences for which the hash value determined by applying the hash function to the word is the same are gathered into one group, and word occurrences are registered group by group, which further improves processing efficiency.
[0066]
Also, in this example, when a B+ tree structure is used as a sub-index, the unique identification information of the document corresponding to the key (document identification information in this example) is added to the key. As a result, even if the key character strings determined from the same word in different documents are identical, the keys remain distinguishable by their document identification information and do not collide in the B+ tree structure.
In this example, as described above, a page cache that is prepared in the main storage device and can hold at least one sub-index is used, so the processing speed can be increased.
[0067]
As described above, in the index creation method and index search method of this example, in the processing that creates an index for full-text search over a large number of documents containing a large number of different words, and in the processing that performs retrieval using the created index, the key is created from the character string of the word or from the hash value of that character string. This reduces the influence of the number of different words, so that, for example, an index for full-text search over millions of documents can be generated, updated, and searched at high speed.
[0068]
Further, in the method of this example, the correspondence between a word and its key can be established without managing the correspondence between each word and an identification number, for example by storing only the hash function and the registration numbers assigned to words whose length exceeds the set threshold. Therefore, even if the number of different words increases, the burden of managing words is reduced, and the time Tw required to manage one word (addition, search) can be brought close to zero.
[0069]
In addition, since the method of this example employs the above-described sub-index configuration and B+ tree structure, the number of pages updated and read in document storage processing and retrieval processing can be reduced, and registration processing and search processing can be executed at high speed. In this example, using a B+ tree as a sub-index means, for example, that a search from the root of the tree follows a short path and that a new word occurrence can be added easily.
[0070]
FIG. 11 shows, as a second embodiment of the present invention, an example of the key configuration when the function value obtained by applying a certain function to the number of occurrences of a word is used for the vertical division of the B+ tree index. As shown in the figure, the key in this example consists of a key character string followed by a value representing the number of occurrences of the word as an integer; 9 bytes are allocated to the key character string and 4 bytes to the number of occurrences.
[0071]
FIG. 12 shows an example of a state in which the appearance of a word is registered in the B + tree using the key configuration shown in FIG.
As shown in FIG. 12, the occurrences of the same word in a plurality of different documents are arranged in order of, for example, the number of occurrences of the word. This makes it easy for the search process to extract the search results in descending order of the number of occurrences.
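The effect of placing the occurrence count in the low 4 bytes of the key can be illustrated as follows. This is a sketch under assumptions: a sorted list stands in for the B+ tree, the document identification number is stored alongside the key as the value, only short words (8 bytes or fewer) are handled, and the 9-byte key string layout (a zero flag byte plus the padded word) is hypothetical.

```python
import bisect

MAX_U32 = 0xFFFFFFFF


def short_key_string(word: bytes) -> int:
    # Assumed layout: flag byte 0 followed by the word NUL-padded to 8 bytes.
    if len(word) > 8:
        raise ValueError("this sketch handles only words of 8 bytes or fewer")
    return int.from_bytes(b"\x00" + word.ljust(8, b"\x00"), "big")


def count_key(iw: int, count: int) -> int:
    # 9-byte key string followed by a 4-byte occurrence count.
    return (iw << 32) | count


def register(tree: list, word: bytes, count: int, doc_id: int) -> None:
    bisect.insort(tree, (count_key(short_key_string(word), count), doc_id))


def docs_by_descending_count(tree: list, word: bytes):
    iw = short_key_string(word)
    lo = bisect.bisect_left(tree, (count_key(iw, 0), 0))
    hi = bisect.bisect_right(tree, (count_key(iw, MAX_U32), MAX_U32))
    # Keys are stored in ascending count order, so read the range backwards.
    return [(k & MAX_U32, doc) for k, doc in reversed(tree[lo:hi])]
```

Reading the key range for one word from its high end yields documents with the largest occurrence counts first, with no separate sorting step.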
[0072]
The processing procedure for searching for occurrences of a word in the second embodiment is the same as the corresponding search processing described for the first embodiment. The processing procedure for registering word occurrences can be realized by replacing the process of grouping documents using the document identification number (step S41) in the procedure shown in FIG. 9 with a process of grouping using the number of occurrences of the word, and by replacing the process of generating a key value using the value of the document identification number (steps S62 and S65) with a process of generating a key value using the number of occurrences of the word.
[0073]
Thus, in the present invention, it is also possible to use a configuration in which a hash function that maps the number of occurrences of a word in a document to a hash value indicating the position in one direction of the two-dimensional array is provided as the function applied to the occurrence of the word in the document, a hash function that maps the word to a hash value indicating the position in the other direction of the two-dimensional array is provided as the function applied to the word, and the occurrence of a word in a document is registered in the corresponding sub-index using the hash values obtained by applying these hash functions to the number of occurrences of the word and to the word, respectively.
[0074]
Also in this configuration, as in the first embodiment, when the occurrences of words in a plurality of documents are registered at once, processing efficiency can be improved by gathering into one group the items for which the hash value determined by applying the hash function to the number of occurrences of each word is the same and registering word occurrences group by group, and further by gathering into one group the occurrences for which the hash value determined by applying the hash function to each word is the same and registering word occurrences group by group.
[0075]
FIG. 13 shows, as a third embodiment of the present invention, an example of the key configuration when the function value obtained by applying a certain function to the appearance frequency of a word is used for the vertical division of the B+ tree.
As shown in the figure, the key in this example consists of a key character string combined with a value representing the appearance frequency of the word as an integer; 9 bytes are allocated to the key character string and 1 byte to the appearance frequency.
In this example, the appearance frequency of a certain word is represented by a value obtained by dividing the number of times that word appears in the document by the total number of words in the document and multiplying by 100.
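The frequency value and the resulting 10-byte key can be sketched as below. The description does not say how the quotient is converted to an integer, so truncation is assumed here; the function names are hypothetical.

```python
def appearance_frequency(count: int, total_words: int) -> int:
    """Occurrences divided by total words, times 100, truncated (assumption)."""
    if not 0 < count <= total_words:
        raise ValueError("count must be between 1 and total_words")
    return count * 100 // total_words   # always in 0..100, so it fits in 1 byte


def frequency_key(iw: int, freq: int) -> int:
    """9-byte key character string followed by the 1-byte appearance frequency."""
    if not 0 <= freq <= 0xFF:
        raise ValueError("frequency must fit in one byte")
    return (iw << 8) | freq
```

For instance, a word that appears 5 times in a 10-word document gets the value 50, while a word that appears 3 times in a 200-word document gets 1 under the truncation assumed here.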
[0076]
FIG. 14 shows an example of a state in which word appearances are registered in the B + tree using the key configuration shown in FIG.
As shown in FIG. 14, the occurrences of the same word in a plurality of different documents are arranged in order of, for example, the appearance frequency of the word. This makes it easy for the search process to extract the search results in descending order of appearance frequency.
[0077]
The processing procedure for searching for occurrences of a word in the third embodiment is the same as the corresponding search processing described for the first embodiment. The processing procedure for registering word occurrences can be realized by replacing the process of grouping documents using the document identification number (step S41) in the procedure shown in FIG. 9 with a process of grouping using the appearance frequency of the word, and by replacing the process of generating a key value using the value of the document identification number in the procedure shown in FIG. 10 (steps S62 and S65) with a process of generating a key value using the value of the appearance frequency of the word.
[0078]
Thus, in the present invention, it is also possible to use a configuration in which a hash function that maps the appearance frequency of a word in a document to a hash value indicating the position in one direction of the two-dimensional array is provided as the function applied to the occurrence of the word in the document, a hash function that maps the word to a hash value indicating the position in the other direction of the two-dimensional array is provided as the function applied to the word, and the occurrence of a word in a document is registered in the corresponding sub-index using the hash values obtained by applying these hash functions to the appearance frequency of the word and to the word, respectively.
[0079]
Also in this configuration, as in the first and second embodiments, when the occurrences of words in a plurality of documents are registered at once, processing efficiency can be improved by gathering into one group the items for which the hash value determined by applying the hash function to the appearance frequency of each word is the same and registering word occurrences group by group, and further by gathering into one group the occurrences for which the hash value determined by applying the hash function to each word is the same and registering word occurrences group by group.
[0080]
Here, the present invention can also be grasped as an apparatus that executes the method described above.
As an example, the index creation device according to the present invention creates an index that associates keys with search target values so that a search target value can be found using a key determined from a specified word, and comprises storage means such as a hard disk serving as an index memory, and registration means. When the length of a word is less than or equal to the set threshold, the registration means registers in the index memory a key consisting of a pair of a key character string containing the character string of the word and a search target value. When the length of the word exceeds the threshold, the registration means registers in the index memory a key consisting of a pair of a search target value and a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word. Processing can thereby be made more efficient.
[0081]
Further, as described above, the processing performed by each functional unit provided in the apparatus shown in FIG. 1 is performed by the processor executing the control program in the hardware resource including the processor, the memory, and the like. In the present invention, these functional units may be configured as independent hardware circuits.
The present invention can also be grasped as a computer-readable storage medium, such as a floppy disk or a CD-ROM, storing the control program; the processing according to the present invention can be performed by loading the control program from the storage medium into a computer and having the processor execute it.
[0082]
As an example, the storage medium according to the present invention stores, in a form readable by a computer, a program that causes the computer to execute index creation processing that associates keys with search target values so that a search target value can be found using a key determined from a specified word. When the length of a word is less than or equal to the set threshold, the program causes the computer to register in the index memory a key consisting of a pair of a key character string containing the character string of the word and a search target value. When the length of the word exceeds the threshold, the program causes the computer to register in the index memory a key consisting of a pair of a search target value and a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word. The processing can thereby be made more efficient.
[0083]
Note that the above-described index memory can be provided not only as, for example, a hard disk device separate from the storage medium storing the program, but also within that storage medium. When the index memory is provided in the storage medium, the index that associates keys with search target values may, for example, be stored within the program, or it may be stored separately from the program.
[0084]
[Effects of the invention]
As explained above, according to the present invention, when creating an index that associates keys with search target values so that a search target value can be found using a key determined from a specified word, a key consisting of a pair of a key character string containing the character string of the word and a search target value is registered if the word length is less than or equal to the set threshold. If the word length exceeds the threshold, a key consisting of a pair of a search target value and a key character string that contains, in place of the character string, a hash value determined by applying a predetermined hash function to the character string of the word is registered. Therefore, even when there are a large number of different words, for example, the key length can be limited to a short fixed length, which prevents an increase in the database size and improves processing efficiency.
[0085]
In the present invention, a registration number that uniquely identifies a word whose length exceeds the threshold is added to the key character string determined from that word, or a flag that distinguishes key character strings determined from words whose length is less than or equal to the threshold from key character strings determined from words whose length exceeds the threshold is added to each key character string. Keys determined from different words can therefore be reliably distinguished at the time of registration or search.
[0086]
In the present invention, the index in which pairs of a key and a search target value are registered is composed of a plurality of sub-indexes. A hash function is provided that maps, for example, the document identification number identifying a document, the number of occurrences of a word, or the appearance frequency of a word to a hash value indicating the position in one direction of a two-dimensional array, together with a hash function that maps a word to a hash value indicating the position in the other direction. The occurrence of a word in a document is registered in the corresponding sub-index using the hash values obtained by applying these hash functions to the document identification number or the like and to the word, respectively. Consequently, by processing together, as one group, the items for which the hash value determined from, for example, the document identification number or the word is the same, the processing for registering the occurrences of words in a plurality of documents at once can be made more efficient.
[0087]
Further, in the present invention, for example, a page cache that can store at least one sub-index prepared in the main storage device is used, thereby realizing high-speed processing.
In the present invention, the B+ tree structure is used as a sub-index to speed up the index update processing and the search processing, and the unique document identification information of the document corresponding to the key is added to the key, so that the occurrence of a given word in a given document can be found with a single search of the B+ tree index.
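The single-search property can be illustrated with a minimal sketch. It assumes keys of the form (key character string << 32) | document identification number, a sorted list of integers in place of the B+ tree, short words only, and a hypothetical 9-byte key string layout (a zero flag byte plus the word padded to 8 bytes).

```python
import bisect

MAX_ID = 0xFFFFFFFF


def key_of(word: bytes, doc_id: int) -> int:
    # Assumed layout: flag byte 0, word NUL-padded to 8 bytes, 4-byte doc id.
    if len(word) > 8:
        raise ValueError("this sketch handles only words of 8 bytes or fewer")
    iw = int.from_bytes(b"\x00" + word.ljust(8, b"\x00"), "big")
    return (iw << 32) | doc_id


def occurs(keys: list, word: bytes, doc_id: int) -> bool:
    # One exact-match lookup answers "does this word occur in this document?"
    k = key_of(word, doc_id)
    pos = bisect.bisect_left(keys, k)
    return pos < len(keys) and keys[pos] == k


def docs_containing(keys: list, word: bytes) -> list:
    # One range scan over the document-id part lists every matching document.
    lo = bisect.bisect_left(keys, key_of(word, 0))
    hi = bisect.bisect_right(keys, key_of(word, MAX_ID))
    return [k & MAX_ID for k in keys[lo:hi]]
```

Checking a (word, document) pair is one point lookup, and listing all documents for a word is one contiguous range scan, because the document identification number is part of the key rather than a list hanging off it.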
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration example of an apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of a key according to the first embodiment of the present invention.
FIG. 3 is a diagram illustrating a configuration example of a key character string.
FIG. 4 is a flowchart illustrating an example of a processing procedure for creating a key character string.
FIG. 5 is a flowchart illustrating an example of a processing procedure for searching for an occurrence of a word.
FIG. 6 is a diagram illustrating a part of the contents of a B + tree.
FIG. 7 is a diagram illustrating a configuration example of an index array of a B + tree.
FIG. 8 is a flowchart illustrating an example of a processing procedure for searching for an occurrence of a word.
FIG. 9 is a flowchart illustrating an example of a processing procedure for collectively registering appearances of words in a plurality of documents.
FIG. 10 is a diagram illustrating an example of a processing procedure for registering the appearance of a certain word.
FIG. 11 is a diagram illustrating a configuration example of a key according to the second embodiment of the present invention.
FIG. 12 is a diagram illustrating a part of the contents of a B + tree in the second embodiment of the present invention.
FIG. 13 is a diagram illustrating a configuration example of a key according to a third embodiment of the present invention.
FIG. 14 is a diagram illustrating a part of the contents of a B + tree in the third embodiment of the present invention.
[Explanation of symbols]
1: Document storage unit, 2: Document sorting unit, 3: Morphological analysis unit,
4: Key character string creation unit, 5: Long word management unit,
6: Index registration unit, 7: Index storage unit,
8: Index selection unit, 9: Inquiry input unit,
10: Search execution unit, 11: Result output unit

Claims

An index creation method in which an index creation device creates an index that associates keys with search target values so that a search target value can be retrieved using a key determined from a specified word, characterized in that
registration means of the index creation device registers in storage means, when the length of a word is less than or equal to a set threshold, a key consisting of a pair of a key character string containing the character string of the word and a search target value, and registers in the storage means, when the length of the word exceeds the threshold, a key consisting of a pair of a search target value and a key character string that contains, in place of the character string, a hash value determined by applying a predetermined hash function to the character string of the word.

The index creation method according to claim 1, wherein
the registration means of the index creation device registers, in a registration table, a word whose length exceeds the threshold in association with a registration number that uniquely identifies the word, and adds the registration number to the key character string determined from the word.

The index creation method according to claim 1 or claim 2, wherein
the registration means of the index creation device adds, to the key character strings determined from words whose length is less than or equal to the threshold and to the key character strings determined from words whose length exceeds the threshold, a flag that distinguishes the two.

The index creation method according to any one of claims 1 to 3, wherein
the index in which keys each consisting of a pair of a key character string and a search target value are registered is composed of a plurality of sub-indexes, and
each sub-index is stored at a position in a two-dimensional array referenced by a function value determined by applying a predetermined function to the search target value corresponding to the key to be registered and a function value determined by applying a predetermined function to the word corresponding to the key to be registered.

The index creation method according to claim 4, wherein
document identification information of one document containing the word that determines the key is used as the search target value.

The index creation method according to claim 5, wherein
each document is given a document identification number that uniquely identifies it,
the document identification number is used as the document identification information,
a hash function that maps the document identification number to a hash value indicating a position in one direction of the two-dimensional array is provided as the function applied to the document identification information, and a hash function that maps a word to a hash value indicating a position in the other direction of the two-dimensional array is provided as the function applied to the word, and
the registration means of the index creation device registers the occurrence of a word in a document in the corresponding sub-index using the hash values obtained by applying the hash functions to the document identification number and to the word, respectively.

The index creation method according to claim 6, wherein
when registering the occurrences of words in a plurality of documents at once, the registration means of the index creation device gathers into one group the documents for which the hash value determined by applying the hash function to the document identification number is the same, and registers word occurrences group by group.

The index creation method according to claim 5, wherein
the number of occurrences of a word in a document is used as the document identification information,
a hash function that maps the number of occurrences of a word in a document to a hash value indicating a position in one direction of the two-dimensional array is provided as the function applied to the document identification information,
a hash function that maps a word to a hash value indicating a position in the other direction of the two-dimensional array is provided as the function applied to the word, and
the registration means of the index creation device registers the occurrence of a given word in a given document in the corresponding sub-index using the hash values obtained by applying the hash functions to the number of occurrences of the word and to the word, respectively.

The index creation method according to claim 8, wherein
when registering the occurrences of words in a plurality of documents at once, the registration means of the index creation device gathers into one group the occurrences for which the hash value determined by applying the hash function to the number of occurrences of each word is the same, and registers word occurrences group by group.

The index creation method according to claim 5, wherein
the appearance frequency of a word in a document is used as the document identification information,
a hash function that maps the appearance frequency of a word in a document to a hash value indicating a position in one direction of the two-dimensional array is provided as the function applied to the document identification information,
a hash function that maps a word to a hash value indicating a position in the other direction of the two-dimensional array is provided as the function applied to the word, and
the registration means of the index creation device registers the occurrence of a given word in a given document in the corresponding sub-index using the hash values obtained by applying the hash functions to the appearance frequency of the word and to the word, respectively.

The index creation method according to claim 10, wherein
when registering the occurrences of words in a plurality of documents at once, the registration means of the index creation device gathers into one group the occurrences for which the hash value determined by applying the hash function to the appearance frequency of each word is the same, and registers word occurrences group by group.

The index creation method according to claim 7, claim 9, or claim 11, wherein
when registering the occurrences of all words in the documents gathered into one group, the registration means of the index creation device gathers into one group the word occurrences for which the hash value determined by applying the hash function to the word is the same, and registers word occurrences group by group.

The index creation method according to claim 12, wherein
a page cache that is prepared in a main storage device and can hold at least one sub-index is used.

The index creation method according to any one of claims 5 to 13, wherein
a B+ tree structure is used as the sub-index.

An index search method in which an index that associates document names with keys determined from words contained in the documents is configured as a B+ tree structure, and an index search device obtains the corresponding document names using a key determined from a specified word, wherein:
document identification information that uniquely identifies each document name is assigned;
in the storage means of the index search device, when the length of a word is equal to or less than a set threshold, a key consisting of a pair of a key character string containing the character string of the word and document identification information is registered, whereas when the length of the word exceeds the threshold, a key consisting of a pair of a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word and document identification information is registered; and
when the length of the specified word is equal to or less than the set threshold, the search means of the index search device uses as a key the key character string containing the character string of the word combined with document identification information specifying a search range, and retrieves from the storage means the document identification information corresponding to keys that match that key, whereas when the length of the word exceeds the threshold, the search means uses as a key the key character string containing the hash value determined by applying the predetermined hash function to the character string of the word combined with the document identification information, and retrieves from the storage means the document identification information corresponding to keys that match that key.
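
The threshold-dependent key and the range search over document identification information can be sketched as follows. A sorted Python list searched with `bisect` stands in for the B+ tree; the threshold value, the document-ID bounds, the `#` prefix on hashed key strings, and the use of CRC-32 are assumptions for illustration only.

```python
import bisect
import zlib

THRESHOLD = 8                           # hypothetical word-length threshold
MIN_DOC_ID, MAX_DOC_ID = 0, 2**32 - 1   # hypothetical document-ID range


def key_string(word: str) -> str:
    """Short words are used as-is; long words are replaced by a hash value."""
    if len(word) <= THRESHOLD:
        return word
    # Assumed format: '#' followed by the CRC-32 of the word as 8 hex digits.
    return "#%08x" % zlib.crc32(word.encode("utf-8"))


class Index:
    def __init__(self):
        self.keys = []  # sorted list of (key_string, doc_id); stands in for a B+ tree

    def register(self, word: str, doc_id: int) -> None:
        key = (key_string(word), doc_id)
        pos = bisect.bisect_left(self.keys, key)
        if pos == len(self.keys) or self.keys[pos] != key:  # skip duplicates
            self.keys.insert(pos, key)

    def search(self, word: str):
        """Return every doc_id whose key string matches, via one range scan."""
        ks = key_string(word)
        lo = bisect.bisect_left(self.keys, (ks, MIN_DOC_ID))
        hi = bisect.bisect_right(self.keys, (ks, MAX_DOC_ID))
        return [doc_id for _, doc_id in self.keys[lo:hi]]
```

In this simplified form two different long words that share a hash value would return each other's documents, and a short word beginning with `#` could clash with a hashed key string; the sketch does not attempt to resolve either case.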

The index search method according to claim 15, wherein:
a registration number for uniquely specifying the word is appended to each key character string determined from a word whose length exceeds the threshold, and the registration number is registered, in correspondence with the word, in a registration table of the index search device; and
the search means of the index search device performs a search using a key obtained by appending registration numbers specifying a search range to the key character string determined from the word whose length exceeds the threshold, then identifies, based on the registration table, the word registered in correspondence with the registration number appended to the key character string of each retrieved key, and, based on the correspondence between the identified word and the word being searched for, identifies the relevant document names from the retrieved set of document names.
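
The role of the registration number in separating long words that share a hash value can be sketched as below. The class name `LongWordIndex`, the tuple layout `(hash_value, registration_number, doc_id)`, the numeric bounds, and the globally increasing numbering scheme are assumptions; a sorted list again stands in for the B+ tree, and the hash function is passed in so that collisions can be forced for demonstration.

```python
import bisect

MIN_ID, MAX_ID = 0, 2**32 - 1  # hypothetical bounds for numbers and document IDs


class LongWordIndex:
    def __init__(self, hash_function):
        self.hash_function = hash_function  # assumed to return an integer
        self.keys = []                # sorted (hash_value, registration_number, doc_id)
        self.registration_table = {}  # registration_number -> word
        self.numbers = {}             # word -> registration_number (helper map)

    def register(self, word: str, doc_id: int) -> None:
        if word not in self.numbers:
            number = len(self.numbers)  # next unused registration number
            self.numbers[word] = number
            self.registration_table[number] = word
        key = (self.hash_function(word), self.numbers[word], doc_id)
        pos = bisect.bisect_left(self.keys, key)
        if pos == len(self.keys) or self.keys[pos] != key:
            self.keys.insert(pos, key)

    def search(self, word: str):
        """Range-scan all registration numbers for the hash, then filter by word."""
        h = self.hash_function(word)
        lo = bisect.bisect_left(self.keys, (h, MIN_ID, MIN_ID))
        hi = bisect.bisect_right(self.keys, (h, MAX_ID, MAX_ID))
        return [doc_id for _, number, doc_id in self.keys[lo:hi]
                if self.registration_table[number] == word]
```

With a deliberately colliding hash such as `lambda w: 0`, the range scan returns the keys of every registered word, and the lookup in `registration_table` keeps only the documents that contain the word actually searched for.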

An index creation device that creates an index in which keys and search target values are associated with each other, so that a search target value can be retrieved using a key determined from a specified word, the device comprising:
storage means for storing the index; and
registration means for registering in the storage means, when the length of a word is equal to or less than a set threshold, a key consisting of a pair of a key character string containing the character string of the word and a search target value, and for registering in the storage means, when the length of the word exceeds the threshold, a key consisting of a pair of a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word and a search target value.

A storage medium storing, in a form readable by a computer, a program that causes the computer to execute processing for creating an index in which keys and search target values are associated with each other, so that a search target value can be retrieved using a key determined from a specified word,
wherein the program causes the computer to execute processing of registering in an index memory, when the length of a word is equal to or less than a set threshold, a key consisting of a pair of a key character string containing the character string of the word and a search target value, and of registering in the index memory, when the length of the word exceeds the threshold, a key consisting of a pair of a key character string containing a hash value determined by applying a predetermined hash function to the character string of the word and a search target value.