JP2007133682A - Full text retrieval system and full text retrieval method therefor

Info

Publication number: JP2007133682A
Application number: JP2005326404A
Authority: JP
Inventors: Nobuyuki Hoshiba; 信之干場
Original assignee: BUSINESS SEARCH TECHNOLOGIES C; Business Search Technologies Corp
Current assignee: BUSINESS SEARCH TECHNOLOGIES C; Business Search Technologies Corp
Priority date: 2005-11-10
Filing date: 2005-11-10
Publication date: 2007-05-31

Abstract

PROBLEM TO BE SOLVED: To provide an N-gram full-text retrieval system that eases the restriction on the number of retrieval characters, reduces processing delay, and limits growth of the index database.

SOLUTION: This full-text retrieval system has: an indexer part 2 that generates a hash table and an index database, where the hash table holds a hash key generated from a divided character string of one or more characters of text information together with a hash value giving the address, inside the index database, of the data that has the divided character string at its head, and the index database holds an additional character string, formed by appending one or more succeeding characters of the text information to the divided character string, together with the appearance position of the additional character string; a character string retrieval part that retrieves the hash value corresponding to the hash key and retrieves the additional character string present at the index database address indicated by the hash value; and a character string comparison calculation part 3 that compares the retrieval target character string with the additional character string for identity and performs appearance position matching calculation on the additional character string.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a full-text search system and a full-text search method, and more particularly to a full-text search system, and its full-text search method, that creates index information from digitized text information and uses that index information to search for a target character string at high speed.

With the spread of the Internet, the amount of information on the Internet is growing rapidly. A search system that quickly finds the desired information in such a vast amount of data has become indispensable. The amount of information stored on computers connected to intranets, local area networks (LANs), and similar networks is also increasing, so a search system is likewise required to search that information quickly and accurately.

A search system collects large amounts of information without restriction to any field, stores the collected information in a database, and lets that information be searched by character string or similar. Such systems fall broadly into two types: directory type and robot type. A directory-type search system classifies the collected information by category in a hierarchical structure, and the user searches by following the hierarchy and narrowing down. Because people select the recorded data and classify it into the hierarchy, however, the amount of searchable information is smaller than in the robot-type search system described below. Directory-type systems are therefore used only in part by Internet portal sites, and most search systems today are robot type.

Because a robot-type search system searches all Web pages on the Internet, or all electronic files (hypertext and software-readable text information) stored on computers connected to a network, it is also called a full-text search system.

The full-text search system referred to here includes all the computer hardware used to perform full-text search and all the software that runs on it. It also includes the search service software that displays a search screen on a client computer, accepts queries, and outputs search results; the index database created in advance for full-text search; and the software that creates the index database (the indexer).

A full-text search system consists broadly of two functions. One is executed before the search service starts: generation of the index database by the indexer (indexing). The other is the search service, which uses the data created in the index database, accepts a search target as input, and outputs the search results.

Furthermore, information stored on the Internet and on computers is written not only in English but also in Japanese, Chinese, and other languages, and in Japan a search system that can search Japanese text is important. Full-text search of Japanese, however, poses a problem specific to index generation: unlike English, Japanese is not written with spaces between words (wakachigaki, writing with each word separated), so index generation must first divide the text to be searched into words.

One method of dividing unsegmented text such as Japanese into words is morphological analysis. Morphological analysis splits a character string into words with the grammar in mind: it works out the sequence of grammatical elements such as nouns, verbs, and particles and divides the string element by element. Because it breaks text down into actual Japanese words, it is the method best suited to word-level search when the division succeeds. For example, morphological analysis of the sentence "台風２０号が北上中です" ("Typhoon No. 20 is moving north") yields 「台風」, 「２０」, 「号」, 「が」, 「北上中」, 「です」, with the sentence divided into words according to its grammatical elements.

However, just as kana-kanji conversion often breaks words in the wrong places when Japanese is typed on a computer, morphological analysis may fail to divide a sentence the way its author intended. When a string is not divided correctly into grammatical elements, the correct words are not registered, so a later search fails to find the target and search accuracy falls.

Another word division method that avoids this problem is the N-gram. The N-gram approach takes character sequences of, for example, one character (1-gram), two characters (2-gram), or three characters (3-gram) out of text information (sentences, words and phrases written on or associated with a particular image, or combinations of these), computes the appearance position of each extracted sequence, and thereby describes the characteristics of the language. It pays no particular attention to Japanese grammar: it simply cuts the text into pieces of a fixed number of characters and registers each piece in the index as if it were a word. N is the number of characters per piece. With N = 1 (a 1-gram), the text is divided character by character; each character is treated as a word, so even a single character can be searched. For example, 1-gram division of "台風２０号が北上中です" yields 「台」, 「風」, 「２」, 「０」, 「号」, 「が」, 「北」, 「上」, 「中」, 「で」, 「す」.
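The division described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `ngrams` and the use of 0-based positions are assumptions made for the example.

```python
def ngrams(text: str, n: int) -> list[tuple[str, int]]:
    """Return (gram, position) pairs, sliding the window one character at a time.

    Positions are 0-based offsets into `text` (an assumption of this sketch).
    """
    return [(text[i:i + n], i) for i in range(len(text) - n + 1)]


sentence = "台風２０号が北上中です"

# 1-gram division: every character becomes its own index term.
unigrams = ngrams(sentence, 1)
print([gram for gram, _ in unigrams])
```

Keeping the position next to each gram is what later allows adjacent grams to be checked for continuity.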

To search 1-gram data for "２０号", the characters 「２」, 「０」, and 「号」 are looked up; if they are adjacent to one another, that is, if the appearance positions of the individual characters line up in the same order as in the search string, a word containing "２０号" has been found exactly. N-gram search is thus a method that makes any character string searchable by managing, in a database, the divided character data together with information on character continuity.

With 2-grams, the text information is divided into two-character pieces and registered. For example, 2-gram analysis of "台風２０号が北上中です" yields 「台風」, 「風２」, 「２０」, 「０号」, 「号が」, 「が北」, 「北上」, 「上中」, 「中で」, 「です」. Searching this 2-gram data for "２０号" looks up the 2-grams 「２０」 and 「０号」; if their appearance positions are consistent, which in this example means that the positions of 「２０」 and 「０号」 in the text information differ by exactly one, "２０号" is judged to have been found.
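The position check for 2-grams can be sketched as follows. This is a hedged illustration: `build_index` and `find` are hypothetical names, positions are 0-based, and for simplicity the sketch looks up every overlapping gram of the query rather than a minimal set of grams.

```python
from collections import defaultdict


def build_index(text: str, n: int) -> dict[str, list[int]]:
    """Map each n-gram to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - n + 1):
        index[text[i:i + n]].append(i)
    return index


def find(index: dict[str, list[int]], query: str, n: int) -> list[int]:
    """Return start positions of `query`, checking that its grams are contiguous."""
    if len(query) < n:
        raise ValueError("query is shorter than N and cannot be searched")
    candidates = None
    for offset in range(len(query) - n + 1):
        gram = query[offset:offset + n]
        # Shift each hit back by the gram's offset so that hits belonging
        # to the same occurrence of the query map to the same start position.
        starts = {pos - offset for pos in index.get(gram, [])}
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates)


index = build_index("台風２０号が北上中です", 2)
print(find(index, "２０号", 2))  # positions of 「２０」 and 「０号」 differ by one
```

A query such as "２０が" returns no hit, because 「０が」 never occurs even though 「２０」 does.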

With N-grams, a character string shorter than N characters cannot be searched. For a string longer than N, where L is the length of the string, detection requires [(L + N − 1)/N] index lookups followed by a check that the appearance positions agree with the search string. Here [x] denotes the largest integer not exceeding x. As the expression [(L + N − 1)/N] shows, the larger N is, the fewer lookups are needed.
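As a small worked check of the expression [(L + N − 1)/N], the sketch below evaluates it with integer (floor) division. The function name `lookups_needed` is an assumption for illustration only.

```python
def lookups_needed(length: int, n: int) -> int:
    """Evaluate [(L + N - 1) / N], where [x] is the largest integer not exceeding x."""
    if length < n:
        raise ValueError("a string shorter than N cannot be searched")
    return (length + n - 1) // n


# A 3-character query needs two lookups against 2-grams but one against 3-grams.
for n in (2, 3):
    print(n, lookups_needed(3, n))
```

For a fixed query length the count never increases as N grows, which is the trade-off the text describes.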

The full-text search systems described above are well known (see, for example, Non-Patent Document 1), which discloses, among other things, the functional configuration of a full-text search system and the details of the word division required for processing Japanese.

Richard Seltzer et al., translated by Nobuyuki Hoshiba et al., "ALTAVISTA Complete Usage Guide" (「ＡＬＴＡＶＩＳＴＡ完全活用ガイド」), pp. 260-296, Shoeisha, published November 25, 1997.

As described above, conventional full-text search systems use N-grams for full-text search in order to raise search accuracy. Depending on the value of N, however, N-grams bring problems: a restriction on the number of search characters, slower processing, and a bloated index database.

The number of N-gram lookups can be reduced by making N larger, but a larger N makes strings shorter than N characters unsearchable, which restricts the number of search characters. The N-gram database must therefore be built so that N equals the minimum length of a searchable string.

Conversely, making N too small slows the search. In the example above, a three-character search against a 2-gram database requires two index database lookups, one for the first two characters and one for the last two (for "２０号", 「２０」 and 「０号」), followed by matching of the appearance position information of the two results. With a 3-gram database, the search string and the registered strings have the same length (both are "２０号"), so the search finishes with a single index database lookup and no appearance position matching is needed.

For this reason, attempts have been made to reduce appearance position matching by keeping several N-gram databases, for example separate 2-gram and 3-gram index databases. Two-character queries are searched in the 2-gram database and queries of three or more characters in the 3-gram database, which reduces the search time of the system as a whole. With this approach, however, the search system must hold one index database for each distinct value of N and must provide the storage capacity for each of them.

To solve the above problems, the full-text search system according to the present invention divides text information extracted from electronic files into N-grams (N: an integer with N > 1), registers N-gram data, comprising each N-gram and its appearance positions, in an N-gram data table of a database, and uses the N-gram data table to perform a full-text search of the text information for a search target character string. The system comprises: hash key registration means for registering, in a hash table, a hash key generated from each (N−n)-character portion of the N-grams (n: an integer with N > n); N-gram data registration means for registering, in the N-gram data table, the N-gram data of the N-grams containing those (N−n) characters; address registration means for registering, in the hash table and in association with the hash key, the address in the N-gram data table at which the N-gram data was registered; hash-key-related N-gram search means for dividing the search target character string into N-grams, generating a hash key from the (N−n)-gram contained in each search N-gram, obtaining the address associated with that hash key from the hash table, and retrieving N-grams from the N-gram data table on the basis of that address; and character string comparison calculation means for comparing the search N-gram with the hash-key-related N-gram for identity and performing an appearance position matching calculation on the hash-key-related N-gram.

The hash key registration means, the N-gram data registration means, and the address registration means described above are executed by the indexer.

The invention also provides a full-text search method that divides text information extracted from electronic files into N-grams (N: an integer with N > 1), registers N-gram data, comprising each N-gram and its appearance positions, in an N-gram data table, and uses the N-gram data table to perform a full-text search of the text information for a search target character string. The method comprises: a hash key registration step of registering, in a hash table, a hash key generated from each (N−n)-character portion of the N-grams (n: an integer with N > n); an N-gram data registration step of registering, in the N-gram data table, the N-gram data of the N-grams containing those (N−n) characters; an address registration step of registering, in the hash table and in association with the hash key, the address in the N-gram data table at which the N-gram data was registered; a hash-key-related N-gram search step of dividing the search target character string into N-grams, generating a hash key from the (N−n)-gram contained in each search N-gram, obtaining the address associated with that hash key from the hash table, and retrieving N-grams from the N-gram data table on the basis of that address; and a character string comparison calculation step of comparing the search N-gram with the hash-key-related N-gram for identity and performing an appearance position matching calculation on the hash-key-related N-gram.

The present invention speeds up full-text search by means of a hash table, consisting of hash keys generated from search character strings and address data pointing into an index database that holds character string data containing those search strings, and an index database that holds character string data having the search string at its head together with the appearance positions of that character string data.

An embodiment of the full-text search system according to the present invention is described in detail below with reference to the accompanying drawings. The embodiment is described as a working example of a full-text search system.

FIG. 1 shows a schematic example of the functional configuration of the full-text search system according to the present invention. The system has an information collecting unit 1, an indexer unit 2, a character string comparison calculation unit 3, a search result display unit 4, a hash table HT, an N-gram data table DT, and an index database DB that manages the hash table HT and the N-gram data table DT; it further includes a search character string setting unit a. The hash table HT may be placed outside the index database DB; for example, it may be run inside an external application and communicate with the index database DB.

The information collecting unit 1 collects digitized text information from electronic files on computers connected to a known network, whether wireless or wired (electric wire, optical cable).

The indexer unit 2 divides the text information collected by the information collecting unit 1 into N-grams, generates a hash key (a key for fast lookup of database records by hashing) from the leading (N−n) characters of each N-gram, and registers it in the hash table HT. Next, the indexer unit 2 registers the N-gram data in the N-gram data table DT and registers the address of that data in the N-gram data table DT as the hash value in the hash table HT. The hash key and hash value are thus registered as a pair, and looking up the hash key gives direct access to the record at the address indicated by the hash value. Through the hash table HT, the (N−n) characters corresponding to a hash key are therefore associated with the N-gram data in the N-gram data table DT.

The indexer unit 2 also registers the appearance position information of the N-gram data in the N-gram data table DT.

The character string comparison calculation unit 3 divides the search target character string into N-gram data, computes a hash key from the leading (N−n) characters of each divided N-gram, and uses that hash key to obtain from the hash table HT the hash value, which is the address of the N-gram data in the N-gram data table DT. Using the obtained address, it retrieves the N-gram data and its appearance positions. It then compares the N-gram data obtained by dividing the search string with the N-gram data retrieved from the N-gram data table DT for identity, judges their consistency from the appearance position information of the N-gram data, and obtains the appearance positions (file path, URL, or the like) of the matching search target character string.
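The lookup sequence described above can be sketched for N = 3 and n = 1. Everything here is an illustrative assumption rather than the patented implementation: the names `hash_key`, `build`, and `search`, the dictionary and list standing in for HT and DT, the 0-based `(path, offset)` positions, and the particular key function (low bytes of the two leading characters). Queries shorter than N are left out of the sketch.

```python
def hash_key(gram: str) -> int:
    """2-byte key from the low 8 bits of the gram's two leading characters."""
    return ((ord(gram[0]) & 0xFF) << 8) | (ord(gram[1]) & 0xFF)


def build(files: dict[str, str], n: int = 3):
    """Return (hash_table, data_table): key -> address, address -> {gram: positions}."""
    hash_table, data_table = {}, []
    for path, text in files.items():
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            addr = hash_table.setdefault(hash_key(gram), len(data_table))
            if addr == len(data_table):  # key was new: allocate its bucket
                data_table.append({})
            data_table[addr].setdefault(gram, []).append((path, i))
    return hash_table, data_table


def search(hash_table, data_table, query: str, n: int = 3):
    """Return (path, start) pairs where `query` occurs."""
    if len(query) < n:
        raise ValueError("queries shorter than N are outside this sketch")
    result = None
    for offset in range(len(query) - n + 1):
        gram = query[offset:offset + n]
        addr = hash_table.get(hash_key(gram))
        bucket = data_table[addr] if addr is not None else {}
        # Identity comparison: the bucket may hold other grams sharing the key.
        hits = {(path, pos - offset) for path, pos in bucket.get(gram, [])}
        # Position matching: every gram must point at the same start position.
        result = hits if result is None else result & hits
    return sorted(result)


files = {"/Afile/A.txt": "私は会社へ", "/Bfile/B.txt": "私の会社で", "/Cfile/C.txt": "私の会議"}
ht, dt = build(files)
print(search(ht, dt, "私の会社"))
```

The identity comparison matters because distinct grams can share a key; the position intersection is what ties successive grams of a longer query to a single occurrence.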

The search result display unit 4 displays the character strings found by the character string comparison calculation unit 3, together with their appearance positions, on a display (not shown). The display order is determined by "word importance" in addition to the appearance position of the search string. The display method is the same as in known full-text search systems.

FIG. 2 outlines, as one embodiment of the present invention, how the indexer unit 2 generates the 2-gram hash table HT and the 3-gram N-gram data table DT of the index database DB. The text information (a) shown in FIG. 2 is an example of text information from electronic files, each item having a given appearance position. The 2-gram data (b) is obtained by dividing the text information (a) into 2-grams and corresponds to the hash keys. The hash table HT is a table made up of hash keys and hash values. The N-gram data table DT (c) is a table in the index database DB that defines the relationship between 3-grams and their appearance positions.

The generation of hash keys for the hash table HT by the indexer unit 2 is described in more detail with reference to FIG. 2. The indexer unit 2 divides the digitized text information collected by the information collecting unit 1 into 3-grams, shifting one character at a time, and then uses the leading two characters of each 3-gram (4 bytes of data consisting of two Unicode characters) to generate a 2-byte hash key, which it registers in the hash table HT whose entries have been prepared in advance. The hash key should preferably be generated so that the 2-gram data is well distributed, and any known generation method can be applied. In this embodiment, the 2-byte hash key is generated from the lower 8 bits of each of the two Unicode characters of the 2-gram data.
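A minimal sketch of such a key function follows. The text specifies only that the lower 8 bits of each of the two characters are used; placing the first character in the high byte, and the name `hash_key`, are assumptions of this example.

```python
def hash_key(bigram: str) -> int:
    """Build a 2-byte key from the lower 8 bits of each of two characters.

    Assumption: the first character supplies the high byte. The source only
    states that the low 8 bits of both characters are used.
    """
    if len(bigram) != 2:
        raise ValueError("expected exactly two characters")
    high = ord(bigram[0]) & 0xFF
    low = ord(bigram[1]) & 0xFF
    return (high << 8) | low


key = hash_key("会社")
print(f"{key:#06x}")  # always fits in 16 bits, i.e. one of 65,536 entries

# Characters that share a low byte collide, e.g. "A" (U+0041) and U+0141.
print(hash_key("AB") == hash_key(chr(0x0141) + "B"))
```

Because different 2-grams can map to the same key, a lookup by key alone is not sufficient, which is why the retrieved grams still have to be compared with the search gram.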

As text information (a), FIG. 2 shows "私は会社へ" stored in the file "/Afile/A.txt" (and thus having an appearance position), "私の会社で" stored in the file "/Bfile/B.txt", and "私の会議" stored in the file "/Cfile/C.txt". The indexer unit 2 generates 3-grams from each item of text information (a), shifting one character at a time, and takes the leading two characters of each, giving the 2-gram data (b) 「私は」, 「は会」, 「会社」, 「社へ」 for the first text; 「私の」, 「の会」, 「会社」, 「社で」 for the second; and 「私の」, 「の会」, 「会議」 for the third.

The indexer unit 2 generates a 2-byte hash key from each item of 2-gram data (b). For example, the 2-gram 「私は」 produces the hash key "1000"; if the generated hash key does not yet exist in the hash table HT, it is newly registered there. The 2-gram 「会社」 appears more than once in the text information (a), but because each occurrence is the same 2-gram, only one hash key, "9881", is generated for it.

In this embodiment the hash table HT is prepared in advance and has 65,536 (2 to the 16th power) data entries. In Japanese, roughly 5,000 characters are in general use, and two-character strings come to about 100,000 kinds; these are represented by 2-byte data (2 to the 16th power = 65,536). The number of hash table entries is chosen so that many search strings do not concentrate on a single hash key, and it should therefore be adjusted as appropriate to the target language, the number of distinct characters that occur, and the number of characters in the search strings.

After generating a hash key, if that hash key is new to the hash table HT, the indexer unit 2 registers the 3-gram obtained by the division above, together with its related information, in the N-gram data table DT as a new 3-gram. It then registers the address at which the data was stored as the hash value in the hash table HT. The hash key and hash value thus form a pair in the hash table HT, so that in the full-text search described later, looking up a hash key in the hash table HT gives direct access, in the N-gram data table DT, to the 3-gram data whose leading two characters produced that hash key.

After generating a hash key, if that hash key is not new to the hash table HT, the indexer unit 2 refers to the hash value (address) paired with the existing hash key, searches the N-gram data table DT, and determines whether the 3-gram from which the hash key was created already exists in the N-gram data table DT. If the 3-gram is not yet registered, a record for it is written as a new 3-gram in the data block indicated by the referenced address. If the 3-gram is already registered, only its appearance position is registered.
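The three registration cases (new hash key, existing key with a new 3-gram, existing 3-gram) can be sketched as below. The class `Index`, the helper names, the key function, and the `path.position` string format with 1-based positions are illustrative assumptions; only complete 3-grams are indexed in this sketch.

```python
def hash_key(two_chars: str) -> int:
    """Assumed key function: low 8 bits of each of the two characters."""
    return ((ord(two_chars[0]) & 0xFF) << 8) | (ord(two_chars[1]) & 0xFF)


class Index:
    def __init__(self):
        self.hash_table = {}  # hash key -> address (position in data_table)
        self.data_table = []  # address -> {3-gram: [appearance positions]}

    def register(self, gram: str, position: str) -> str:
        key = hash_key(gram[:2])
        address = self.hash_table.get(key)
        if address is None:
            # Case 1: new hash key. Store the gram, then record its address.
            self.data_table.append({gram: [position]})
            self.hash_table[key] = len(self.data_table) - 1
            return "new key"
        bucket = self.data_table[address]
        if gram not in bucket:
            # Case 2: key exists but this 3-gram does not. Add a new record.
            bucket[gram] = [position]
            return "new gram"
        # Case 3: the 3-gram exists. Only the appearance position is added.
        bucket[gram].append(position)
        return "position only"


def index_file(index: Index, path: str, text: str, n: int = 3) -> None:
    for i in range(len(text) - n + 1):
        index.register(text[i:i + n], f"{path}.{i + 1}")


idx = Index()
index_file(idx, "/Afile/A.txt", "私は会社へ")
index_file(idx, "/Bfile/B.txt", "私の会社で")
index_file(idx, "/Cfile/C.txt", "私の会議")
bucket = idx.data_table[idx.hash_table[hash_key("私の")]]
print(bucket["私の会"])
```

With these inputs, 「会社へ」 and 「会社で」 end up as separate records reached through the single key for 「会社」, while 「私の会」 is one record carrying two appearance positions.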

For example, the 2-gram 「私は」 divided from the text information (a) "私は会社へ" produces the hash key "1000". If "1000" already exists in the hash table HT, the system checks whether the 3-gram 「私は会」 exists at the address in the N-gram data table DT of the index database DB indicated by the hash value "0x188" paired with the hash key "1000". If it is not registered, the new 3-gram 「私は会」 and its related information are registered; if it is already registered, only the appearance position is registered.

For example, the 3-gram "私の会" divided from "私の会社で" in the text information (a) of FIG. 2 and the 3-gram "私の会" divided from "私の会議" in the text information (a) are the same 3-gram, but their appearance positions differ. In the N-gram data table DT (c), the single 3-gram "私の会" therefore has multiple appearance positions, "/Bfile/B.txt.1" and "/Cfile/C.txt.1".

In FIG. 2, the 2-gram "会社" yields only one hash key, because the duplicates coincide when the hash key is generated. As 3-grams, however, "会社へ" and "会社で" differ, so each is stored as its own record in the N-gram data table DT.

Data having the same hash key is first written to the same block. If that block is already full of other 3-grams with the same hash key, the data is written to a different block. When additional data is written to a different block, the address of the block in which the additional data was stored is written into the chain column of the data that has the same hash key and resides in the other block, so that the additional data can be found.

In this way, 3-gram data having the same hash key is registered in the index database in blocks. This reduces the number of I/O operations on the hard disk and buffer memory when data having the same hash key is read block by block during a later full-text search. Looking up the address in the N-gram data table DT indicated by the hash value thus retrieves, in one operation, the blocked and buffered 3-gram data that shares that hash key. To reduce hard-disk seek time, a new block may be placed adjacent to a block having the same hash key.

The 3-grams "会社へ" and "会社で" in the N-gram data table DT (c) of FIG. 2 illustrate the case where two records have the same hash key but could not be written to the same block. The chain column of the record "会社へ" holds the address of the record "会社で", so following the chain column makes it possible to find the 3-grams that share the same hash value.
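The block-and-chain layout can be illustrated with a small Python sketch. It is a simplification, not the layout of FIG. 2: the chain pointer is kept per block rather than per record, the capacity of two records is arbitrary, and `Block`, `new_block`, `append_record`, and `read_chain` are hypothetical names.

```python
BLOCK_CAPACITY = 2  # records per block; deliberately tiny for the demo (assumption)

class Block:
    def __init__(self):
        self.records = []   # (gram, positions) pairs sharing one hash key
        self.next = None    # address of the next block in the chain, or None

blocks = []  # a list index plays the role of a block address

def new_block():
    blocks.append(Block())
    return len(blocks) - 1

def append_record(head_addr, gram, positions):
    """Store a record in the chain starting at head_addr; chain a new block when full."""
    addr = head_addr
    while True:
        block = blocks[addr]
        if len(block.records) < BLOCK_CAPACITY:
            block.records.append((gram, positions))
            return addr
        if block.next is None:
            block.next = new_block()   # write the new block's address into the chain
        addr = block.next

def read_chain(head_addr):
    """Collect every record for one hash key by following the chain."""
    out, addr = [], head_addr
    while addr is not None:
        out.extend(blocks[addr].records)
        addr = blocks[addr].next
    return out

head = new_block()
for gram in ["会社で", "会社に", "会社へ"]:
    append_record(head, gram, [])
```

The third record does not fit in the first block, so it lands in a second block whose address is reachable from the first through `next`.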

A 3-gram generally occurs in multiple electronic files, so a single 3-gram has multiple appearance positions. For example, the appearance positions of the three-character data ("私の会" in the example above) are stored as information in the N-gram data table DT. The appearance position column of the N-gram data table DT (c) shows that the 3-gram "私の会" occurs in the files "/Bfile/B.txt.1" and "/Cfile/C.txt.1". The "1" here indicates that the first character of "私の会" is at the first position of the text information (a).

In addition to the appearance position, the related information of a 3-gram may include items such as "information about the word", "importance of the word", "image of the word itself", "information about the file (URL)", "file name (URL)", "file creation date and time (last update date and time)", "file owner", "file size", "collection date and time", "title", and "summary", all of which can be stored in the N-gram data table DT.

The information other than the N-grams themselves is vastly larger in volume than the N-grams. For that reason, position information such as appearance positions may be managed in a table separate from the N-gram data table, with the N-gram data table holding, for each N-gram, only a position information list (the address at which the position information managed in the separate table is stored).

The character string entered through the character string setting unit a is first divided into 3-grams by the character string comparison operation unit 3. The character string comparison operation unit 3 calculates a hash key from the first two characters of each divided 3-gram and obtains from the hash table HT the hash value indicating the address of the 3-grams that begin with those two characters.

From the address indicated by the hash value, the character string comparison operation unit 3 obtains all related information (appearance positions and so on) of the 3-grams whose hash key is the one generated from the first two characters of the search target 3-gram. For example, when the search target character string is "私は会社へ", a hash key is calculated from "会社", the first two characters of "会社へ", the hash value corresponding to that hash key is obtained, and all 3-grams having the same hash key generated from "会社", namely "会社へ" and "会社で", are retrieved.

Next, the character string comparison operation unit 3 checks whether each retrieved 3-gram is identical to the 3-gram divided from the search target character string. For example, it compares "会社へ", which is part of the search target character string, with the retrieved 3-grams "会社へ" and "会社で", which share the same hash key. Here the retrieved 3-gram "会社で" is not identical to "会社へ", so the data for "会社で" is discarded.

In this embodiment, the hash key is generated from the low-order 8 bits of the Unicode code of each character, so 3-grams having the same hash key do not necessarily share the same first two characters. The comparison between the search character string and a 3-gram obtained from the DB must therefore cover all three characters.
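A short Python sketch shows why the full comparison is needed. It assumes, hypothetically, that the key packs the low-order 8 bits of each of the two leading characters; the colliding character is constructed arithmetically rather than taken from the patent.

```python
def hash_key(prefix):
    """Assumed scheme: low-order 8 bits of each of the two leading characters."""
    return ((ord(prefix[0]) & 0xFF) << 8) | (ord(prefix[1]) & 0xFF)

query_gram = "会社へ"

# Build a different character whose code point has the same low-order 8 bits as "会".
other = chr(ord("会") ^ 0x0100)
colliding_gram = other + "社へ"

# Both 3-grams fall under the same hash key ...
same_bucket = hash_key(query_gram[:2]) == hash_key(colliding_gram[:2])
# ... but only a comparison of all three characters tells them apart.
is_match = query_gram == colliding_gram
```

Here `same_bucket` is true while `is_match` is false, which is exactly the case the identity check has to reject.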

Further, the character string comparison operation unit 3 calculates the consistency of the appearance positions of the retrieved character strings. For example, it checks whether "私は会" and "会社へ" occur in the same file or URL with appearance positions that differ by exactly 2. If so, the search information obtained from those character strings is stored in memory or the like as correct search information.
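The position consistency check can be sketched in Python as below. The representation of an appearance position as a `(path, offset)` pair and the function name `consistent_hits` are assumptions made for illustration.

```python
def consistent_hits(first_positions, second_positions, offset):
    """Return the positions of the first gram that are followed, `offset`
    characters later in the same file, by the second gram."""
    second = set(second_positions)
    return [(path, pos) for (path, pos) in first_positions
            if (path, pos + offset) in second]

# "私は会" and "会社へ" must sit 2 characters apart in the same file.
first = [("a.txt", 1), ("b.txt", 4)]    # appearance positions of "私は会"
second = [("a.txt", 3), ("b.txt", 9)]   # appearance positions of "会社へ"
hits = consistent_hits(first, second, 2)
```

Only the pair in `a.txt` is kept; in `b.txt` the two 3-grams are 5 characters apart, so that candidate is dropped.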

Thus, in this embodiment, searching for "私は会社へ" would require three lookups in a 2-gram database ("私は", "会社", and "社へ"), whereas here, as with a 3-gram database, the N-gram search can be performed with just two three-character strings, "私は会" and "会社へ".
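The lookup counts in this example can be reproduced with a small Python sketch. The splitting rule used here (take a gram every n characters and align the last gram to the end of the query) is an assumption inferred from the example, and `cover_ngrams` is a hypothetical name.

```python
def cover_ngrams(query, n):
    """Split query into the fewest n-grams that cover every character.

    Grams start every n characters; the last one is aligned to the end of
    the query, so it may overlap its predecessor.
    """
    if len(query) < n:
        raise ValueError("query is shorter than n")
    starts = list(range(0, len(query) - n + 1, n))
    last = len(query) - n
    if starts[-1] != last:
        starts.append(last)
    return [(s, query[s:s + n]) for s in starts]

two_gram_lookups = cover_ngrams("私は会社へ", 2)
three_gram_lookups = cover_ngrams("私は会社へ", 3)
```

The start offsets returned with each gram (0 and 2 for the 3-gram case) are the same offsets used when checking that appearance positions are consistent.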

When a two-character search is performed in this embodiment, a hash key is generated from the two characters, and the chain of 3-gram data indicated by the hash value is followed while looking for 3-grams whose first two characters match. In an ordinary 2-gram engine this processing ends as soon as the single matching 2-gram record is found, but in this embodiment there may be multiple matching data records. The chain of 3-gram data is therefore sorted in code order so that the matching records are found consecutively. As a result, a two-character search can be performed at nearly the same processing speed as in an ordinary 2-gram engine.
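A minimal Python sketch of the two-character search follows. It assumes the chain is available as a list of `(gram, positions)` records already sorted in code-point order; `prefix_search` and the position strings are hypothetical.

```python
def prefix_search(chain, prefix):
    """Walk a code-ordered chain and return the run of records whose gram
    starts with `prefix`, stopping as soon as the run ends."""
    hits = []
    for gram, positions in chain:
        if gram.startswith(prefix):
            hits.append((gram, positions))
        elif hits:
            break   # sorted order guarantees no further matches
    return hits

# A gram with a different first character that would share the assumed
# low-8-bit hash key with "会社..." and therefore sit in the same chain.
other = chr(ord("会") ^ 0x0100) + "社へ"
chain = sorted([("会社へ", ["a.txt:3"]), ("会社で", ["c.txt:3"]), (other, ["d.txt:1"])])
found = prefix_search(chain, "会社")
```

Because the records are sorted, the two matching 3-grams are adjacent and the walk can stop at the first non-matching record after them.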

An example of the index generation process of the full-text search system according to the present invention described above will now be explained with reference to the flowchart of FIG. 3.

As shown in FIG. 3, in step S101 the information collection unit 1 first acquires text information from electronic files stored on computers on the Internet or a LAN (local area network). The process then proceeds to step S102.

In step S102, based on the information acquired by the information collection unit 1, the indexer 2 generates 3-grams and their related information while shifting through the text information one character at a time, and generates a hash key from the first two characters of each 3-gram. The process then proceeds to step S103.

In step S103, it is determined whether the generated hash key already exists in the hash table HT. If the hash key exists, the process proceeds to step S105; if it does not, the process proceeds to step S104.

In step S104, the indexer 2 adds the new hash key to the hash table HT. The process then proceeds to step S105.

In step S105, the indexer 2 determines whether the 3-gram already exists in the N-gram data table DT. A new hash key means that no 3-gram data with that hash key exists in the N-gram data table DT, so the process proceeds to step S106 to register the new 3-gram data. If the hash key is not new, the hash value paired with it is used to search the 3-gram data registered in the N-gram data table DT. The indexer 2 first examines the 3-gram at the address given by the hash value; if the 3-gram in question is not at that address, it examines the other 3-grams with the same hash key that are indicated by the chain column of the data at that address. If the search finds no such 3-gram with the same hash key in the N-gram data table DT, the process proceeds to step S106. If the 3-gram is found among the other 3-grams with the same hash key, the process proceeds to step S107.

In step S106, the indexer 2 registers the new 3-gram data in the N-gram data table DT. If the new 3-gram data is based on a new hash key, the address at which it was registered in the N-gram data table DT is registered in the hash table HT as the hash value of that new hash key. The process then proceeds to step S107.

In step S107, the indexer 2 registers position information, such as the appearance position associated with the 3-gram being registered, in the N-gram data table DT.

In step S108, it is determined whether the registered 3-gram data is the final character string of the text information acquired in step S101. If it is, the index generation process for that text information ends; if it is not, the process returns to step S102 and the index generation process is repeated.
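The outer loop of this flowchart (steps S102 and S108) can be sketched in Python as follows. The `register` callback stands in for steps S103 to S107 and is hypothetical, as are the function name `index_text`, the file name, and the choice of 1-based positions (taken from the "1" in the FIG. 2 example).

```python
def index_text(text, path, register, n=3):
    """Drive steps S102-S108 for one text.

    `register(prefix, gram, position)` is a hypothetical callback standing in
    for steps S103-S107 (hash table and data table updates).
    """
    count = 0
    for start in range(len(text) - n + 1):    # S102: shift one character at a time
        gram = text[start:start + n]
        prefix = gram[:n - 1]                 # characters used for the hash key
        register(prefix, gram, (path, start + 1))   # 1-based appearance position
        count += 1
    return count                              # S108: stop after the last n-gram

calls = []
total = index_text("私は会社へ", "a.txt", lambda p, g, pos: calls.append((p, g, pos)))
```

For the five-character text this produces three 3-grams, each passed on with its two-character prefix and appearance position.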

An example of the full-text search process of the full-text search system according to the present invention will be explained with reference to the flowchart of FIG. 4.

As shown in FIG. 4, in step S201 the search target character string is first set from the data that a user of the full-text search system enters in the search character string setting unit a. The process then proceeds to step S202.

In step S202, the character string comparison operation unit 3 divides the entered search character string into 3-grams, calculates a hash key from the first two characters of each divided 3-gram, and obtains from the hash table HT the address that is the hash value corresponding to that hash key. The process then proceeds to step S203.

In step S203, the character string comparison operation unit 3 looks up the address indicated by the hash value in the N-gram data table DT. From that address, or from the addresses indicated by the chain column of the retrieved data, it obtains in one operation the information (3-gram, appearance positions, and so on) for the multiple matching 3-gram records. The process then proceeds to step S204.

In step S204, the character string comparison operation unit 3 determines whether each of the 3-grams obtained together from the N-gram data table DT matches the 3-gram divided from the search target character string. If it does not match, the process proceeds to step S207; if it matches, the process proceeds to step S205. When the position information is defined in a table separate from the N-gram data table DT and the N-gram data table holds only the N-grams and the lists referring to their position information, step S204 matches only the gram (character string) information.

In step S205, based on the appearance positions of the retrieved 3-grams, it is determined whether the appearance positions of the 3-grams are continuous. For 3-grams whose appearance positions lack continuity, the process proceeds to step S207; for 3-grams whose appearance positions are continuous, the process proceeds to step S206.

In step S206, it is determined whether the divided character string of the search character string is the last divided character string. If it is, the process proceeds to step S208. If it is not, the process returns to step S202, and the above processing is repeated until step S206 determines that the divided character string is the last one.

In step S207, 3-gram data obtained from the DB that did not match and/or 3-gram data whose appearance positions lack continuity is discarded.

In step S208, the search target character string is displayed on the screen as the search result, together with the appearance position information and other data obtained from the 3-gram data whose appearance positions are continuous, and the full-text search process ends.
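Steps S201 to S208 can be put together in a compact Python sketch. Everything here is an illustrative assumption rather than the patented implementation: an in-memory dict replaces the hash table and data table, positions are 0-based, the hash key packs the low-order 8 bits of two characters, and `build`, `search`, and the file contents are hypothetical.

```python
def hash_key(prefix):
    """Assumed scheme: low-order 8 bits of each of the two leading characters."""
    return ((ord(prefix[0]) & 0xFF) << 8) | (ord(prefix[1]) & 0xFF)

def build(files, n=3):
    index = {}   # hash key -> {gram: set of (path, 0-based position)}
    for path, text in files.items():
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            index.setdefault(hash_key(gram), {}).setdefault(gram, set()).add((path, i))
    return index

def search(index, query, n=3):
    """Return the (path, position) pairs at which `query` occurs."""
    if len(query) < n:
        raise ValueError("this sketch only handles queries of at least n characters")
    starts = list(range(0, len(query) - n + 1, n))
    if starts[-1] != len(query) - n:
        starts.append(len(query) - n)              # last gram aligned to the end
    candidates = None
    for s in starts:                               # S206: loop over divided grams
        gram = query[s:s + n]                      # S202: divide and hash
        bucket = index.get(hash_key(gram), {})     # S203: all grams sharing the key
        hits = bucket.get(gram, set())             # S204: keep the identical gram only
        shifted = {(path, pos - s) for path, pos in hits}   # align to query start
        candidates = shifted if candidates is None else candidates & shifted  # S205
        if not candidates:
            return set()                           # S207: nothing consistent remains
    return candidates                              # S208: surviving positions

files = {"a.txt": "私は会社へ行く", "b.txt": "私の会社で会議"}
index = build(files)
```

With this data, `search(index, "私は会社へ")` keeps only the occurrence in `a.txt`, because the 3-grams "私は会" and "会社へ" appear two characters apart only there.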

FIG. 6 is a diagram showing an example hardware configuration of the full-text search system according to the present invention. The full-text search system of FIG. 1 is executed, for example, by a computer 10 as shown in FIG. 6. The computer 10 includes a CPU 11 that executes the necessary processing, a memory 12 that stores the processing results (for example, RAM (Random Access Memory)), a display 13, an input device 14 such as a keyboard or mouse, a hard disk 15, and so on.

The program (data) that executes the full-text search process is stored on a recording medium and loaded from the CD/DVD drive 17 or the like, or is downloaded from another computer 19 via the network 16, and is saved to the hard disk 15 of the computer 10 under the control of the CPU 11. To run the program, it is loaded into the memory 12 under the control of the CPU 11 and then passed to the CPU 11, where it is executed. The processing of the information collection unit 1, the indexer 2, the character string comparison operation unit 3, and the search result display unit 4 shown in FIG. 1 is executed by the CPU 11.

The setting screen of the search character string setting unit a and the search result screen of the search result display unit 4 are shown on the display 13, and the settings for the search character string setting unit a are entered with the input device 14.

In the index generation process, the information collection unit 1 running on the CPU 11 stores in the memory 12 the text information taken from electronic files on computers connected to the Internet or a LAN.

Next, the indexer 2 running on the CPU 11 generates hash keys from the text information stored in the memory 12 and places them in the hash table HT prepared in memory. The CPU 11 further writes information such as the 3-grams and their appearance positions at the address in the N-gram data table DT, within the index database DB on the hard disk, indicated by the hash value.

In the full-text search process, the search character string entered through the input device 14 is stored in memory by the CPU 11. The stored search character string is divided into 3-grams by the character string comparison operation unit 3, and the first two characters of each 3-gram are used to obtain a hash value from the hash table HT held in memory, which is then used to access the corresponding address of the N-gram data table DT on the hard disk. When a query is issued to the N-gram data table DT stored on the hard disk, the data block read from it is kept in memory (the buffer cache). When the same query occurs again, the CPU 11 looks in the buffer cache first and accesses the N-gram data table DT on the hard disk only if the block is not there. In general, the probability that data from the N-gram data table DT is found in the buffer cache rises as the memory capacity grows relative to the size of the table. If the memory is sufficiently large relative to the N-gram data table DT, the entire table may be held in memory. Furthermore, because position information accounts for most of the N-gram data table DT, the position information may be defined in a separate table to improve search speed, leaving only the N-grams and the lists referring to their position information in the N-gram data table.
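The buffer-cache behaviour described above can be illustrated with a small Python sketch. The patent does not name an eviction policy, so the least-recently-used policy, the class name `BlockCache`, and the fake `disk` dictionary are all assumptions made for the example.

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU cache in front of a slow block reader (illustrative only)."""

    def __init__(self, read_block, capacity):
        self.read_block = read_block   # function: address -> block data (e.g. a disk read)
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.disk_reads = 0

    def get(self, address):
        if address in self.blocks:
            self.blocks.move_to_end(address)    # mark as most recently used
            return self.blocks[address]
        self.disk_reads += 1                    # cache miss: go to the "hard disk"
        data = self.read_block(address)
        self.blocks[address] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)     # evict the least recently used block
        return data

# Fake disk: address -> block contents (hypothetical data).
disk = {0x188: ["会社へ", "会社で"], 0x200: ["私は会"], 0x300: ["私の会"]}
cache = BlockCache(disk.__getitem__, capacity=2)
cache.get(0x188)
cache.get(0x188)   # served from the cache, no second disk read
```

With a capacity at least as large as the number of blocks, every block is read from disk only once, which corresponds to holding the whole table in memory.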

According to the present invention, 3-gram data having the same hash key is cached in the same block or the same buffer. When there is a very large amount of 3-gram data with the same hash key, it may span multiple blocks. Even then, the chain column in the N-gram data table DT defines the addresses of the 3-gram data having the same hash key, so the data can be reached directly rather than by linear search. High-speed query processing is therefore possible.

The character string comparison operation unit 3 determines whether each retrieved 3-gram is identical to the divided character string of the search target character string stored in the memory 12. A 3-gram that is not identical is discarded from the memory 12. The character string comparison operation unit 3 running on the CPU 11 then performs appearance position matching on the retrieved 3-gram data that was found to be identical. Data whose appearance positions do not match is discarded from the memory 12, and data whose appearance positions match is output through the display 13 together with the appearance positions of the character string.

The full-text search system according to the present invention may be started or accessed not only from the input device 14 but also from another program running on the CPU 11, from another computer 19 connected to the line 16, or from a program running on the other computer 19. Further, the hash table HT may be kept outside the management of the index database DB and run by an external application on the same computer 10 or on another computer 19.

As described above, the full-text search system of the present invention accepts search character strings as short as two characters (2-grams) while providing search processing speed comparable to that of a full-text search system dedicated to 3-grams. This is achieved by holding 2-gram index information in the hash table HT and holding the 3-gram appearance positions and related data in the N-gram data table DT.

The principle is the same, however, when the hash table holds 3-gram index information and the N-gram data table DT holds 4-gram appearance positions; only the number of characters managed by each differs. In that case a hash table covering the distinct three-character combinations is needed, so the difference is that the hash table has more entries than a two-character hash table. Here, combining a 2-gram hash table with a 2-gram index database and a 3-gram hash table with a 4-gram database gives the same processing speed as having 2-, 3-, and 4-gram databases, with the storage capacity of only the 2- and 4-gram databases.

Likewise, combining a 2-gram hash table with a 3-gram index database and a 1-gram hash table with a 1-gram database gives the same processing speed as having 1-, 2-, and 3-gram databases, with the storage capacity of only the 1- and 3-gram databases.

The embodiment described above also applies when the hash table and the index database are built from N-grams of the same length. For example, the hash table may be built from hash keys generated from the low-order 8 bits of the Unicode code of a single character. In the full-text search process, 1-grams in the N-gram data table DT are looked up through the hash table HT, which speeds up the search. In the character string comparison processing, the search target character string and the character string retrieved from the index database have the same number of characters, but because the two are related only through the hash key, their identity must still be verified as in the embodiment above.

A diagram showing a schematic example of the functional configuration of the full-text search system according to the present invention.
A diagram showing an outline of how the indexer unit 2 generates the hash table and the N-gram data table DT.
A diagram showing an example of the index generation process in the full-text search system according to the present invention.
A diagram showing an example of the full-text search process of the full-text search system according to the present invention.
A diagram showing an example hardware configuration of the full-text search system according to the present invention.

Explanation of symbols

1 Information collection unit
2 Indexer unit
3 Character string comparison operation unit
4 Search result display unit
10 Computer
11 CPU
12 Memory
13 Display
14 Input device
15 Hard disk
16 Line
17 CD/DVD drive
19 Other computer

Claims

A full-text search system that divides text information extracted from an electronic file into N-grams (N: an integer with N > 1), registers N-gram data including each N-gram and the appearance position of the N-gram in a data table of a database, and uses the N-gram data table to perform a full-text search of the text information for a search target character string, the system comprising:
hash key registration means for registering, in a hash table, a hash key generated from each (N−n)-character portion of the N-grams (n: an integer with N > n);
N-gram data registration means for registering, in the N-gram data table, the N-gram data of the N-grams containing the (N−n) characters;
address registration means for registering, in the hash table and in association with the hash key, the address in the N-gram data table at which the N-gram data was registered;
hash-key-related N-gram search means for dividing the search target character string into N-grams, generating a hash key from the (N−n)-gram contained in each search N-gram, obtaining the address associated with that hash key from the hash table, and retrieving N-grams from the N-gram data table based on that address; and
character string comparison operation means for comparing the search N-gram with the hash-key-related N-grams for identity and performing an appearance position matching operation on the hash-key-related N-grams.

The full-text search system according to claim 1, wherein, in the N-gram data table, the N-gram data having the same hash key is stored in the same block or the same buffer.

The full-text search system according to claim 1 or 2, wherein, when the generated hash key is already registered in the hash table, the N-gram data is registered in the N-gram data table based on the address associated with the already registered hash key.

The full-text search system according to any one of claims 1 to 3, wherein, when the N-gram of the N-gram data is already registered in the N-gram data table, the appearance position of the N-gram data is written to, or updated in, the already registered N-gram data.

電子ファイルから取り出されたテキスト情報をＮグラム分割し（Ｎ：Ｎ＞１の整数）、該（Ｎ−１）グラムと該（Ｎ−１）グラムの出現位置を含む（Ｎ−１）グラムデータをデータベースのデータテーブルに登録し、該（Ｎ−１）グラムデータテーブルを用いて、検索対象の文字列による該テキスト情報の全文検索を行う全文検索システムにおいて、
前記（Ｎ−１）グラムの各々から生成したハッシュキーをハッシュテーブルに登録するハッシュキー登録手段、
前記（Ｎ−１）グラムデータを前記（Ｎ−１）グラムデータテーブルに登録する（Ｎ−１）グラムデータ登録手段、
前記（Ｎ−１）グラムデータを登録した前記（Ｎ−１）グラムデータテーブルのアドレスを、前記ハッシュキーと関連付けてハッシュテーブルに登録するアドレス登録手段、
前記検索対象の文字列を（Ｎ−１）グラム分割し、該検索（Ｎ−１）グラムからハッシュキーを生成し、該ハッシュキーに関連する前記アドレスを、前記ハッシュテーブルから取得し、該アドレスに基づいて前記（Ｎ−１）グラムデータテーブルから（Ｎ−１）グラムを検索するハッシュキー関連（Ｎ−１）グラム検索手段、および、
前記検索（Ｎ−１）グラムと前記ハッシュキー関連（Ｎ−１）グラムとの同一性の比較を行い、かつ、該ハッシュキー関連（Ｎ−１）グラムの出現位置整合演算を行う文字列比較演算手段、を備えることを特徴とする全文検索システムを備える請求項１〜４に記載の全文検索システム。 The full-text search system according to any one of claims 1 to 4, further comprising a full-text search system that divides text information extracted from an electronic file into N-grams (N: an integer with N > 1), registers (N−1)-gram data including each (N−1)-gram and its appearance position in a data table of a database, and performs a full-text search of the text information for a search target character string using the (N−1)-gram data table, the latter system comprising:
hash key registration means for registering, in a hash table, a hash key generated from each of the (N−1)-grams;
(N−1)-gram data registration means for registering the (N−1)-gram data in the (N−1)-gram data table;
address registration means for registering, in the hash table in association with the hash key, the address of the (N−1)-gram data table at which the (N−1)-gram data is registered;
hash key related (N−1)-gram search means for dividing the search target character string into (N−1)-grams, generating a hash key from each search (N−1)-gram, acquiring the address associated with the hash key from the hash table, and retrieving (N−1)-grams from the (N−1)-gram data table based on the address; and
character string comparison operation means for comparing the search (N−1)-gram with the hash key related (N−1)-gram for identity and performing an appearance position matching operation on the hash key related (N−1)-gram.
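
In claim 5 the hash key is generated from the whole (N−1)-gram, so distinct grams can share a key whenever the hash function collides, and the identity comparison is what separates them. The sketch below assumes CRC-32 reduced modulo a deliberately small bucket count as the hash function; the claim names no hash function, and `hash_key`, `build_short_index`, `find_gram`, and `BUCKETS` are hypothetical names. It covers only the lookup of a single gram, not the position matching across several grams.

```python
import zlib

BUCKETS = 4  # deliberately tiny so that different grams share a hash key


def hash_key(gram):
    """Hash key for a gram (assumed: CRC-32 modulo BUCKETS)."""
    return zlib.crc32(gram.encode("utf-8")) % BUCKETS


def build_short_index(text, m):
    """Index every m-gram of `text`, where m stands for N - 1."""
    hash_table = {}  # hash key -> address (index into data_table)
    data_table = []  # each block is a list of (gram, positions) pairs
    for pos in range(len(text) - m + 1):
        gram = text[pos:pos + m]
        key = hash_key(gram)
        if key not in hash_table:
            hash_table[key] = len(data_table)
            data_table.append([])
        block = data_table[hash_table[key]]
        for stored, positions in block:
            if stored == gram:
                positions.append(pos)  # gram already registered
                break
        else:
            block.append((gram, [pos]))  # first occurrence of this gram
    return hash_table, data_table


def find_gram(gram, hash_table, data_table):
    """Return the appearance positions of `gram`, or [] if it is absent."""
    address = hash_table.get(hash_key(gram))
    if address is None:
        return []
    for stored, positions in data_table[address]:
        if stored == gram:  # identity check rejects colliding grams
            return positions
    return []
```

Because the seven distinct 1-grams of `"abcdefga"` fall into at most four blocks, at least one block holds several grams, yet `find_gram` still returns only the positions of the gram asked for.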

電子ファイルから取り出されたテキスト情報をＮグラム分割し（Ｎ：Ｎ＞１の整数）、該Ｎグラムと該Ｎグラムの出現位置を含むＮグラムデータをデータベースのデータテーブルに登録し、該Ｎグラムデータテーブルを用いて、検索対象の文字列による該テキスト情報の全文検索を行う全文検索方法において、
前記Ｎグラムの（Ｎ−ｎ）文字の各々から生成したハッシュキー（ｎ：Ｎ＞ｎの整数）をハッシュテーブルに登録するハッシュキー登録ステップ、
前記（Ｎ−ｎ）文字を含むＮグラムのＮグラムデータを前記Ｎグラムデータテーブルに登録するＮグラムデータ登録ステップ、
前記Ｎグラムデータを登録した前記Ｎグラムデータテーブルのアドレスを、前記ハッシュキーと関連付けてハッシュテーブルに登録するアドレス登録ステップ、
前記検索対象の文字列をＮグラム分割し、該検索Ｎグラムに含まれる（Ｎ−ｎ）グラムからハッシュキーを生成し、該ハッシュキーに関連する前記アドレスを、前記ハッシュテーブルから取得し、該アドレスに基づいて前記ＮグラムデータテーブルからＮグラムを検索するハッシュキー関連Ｎグラム検索ステップ、および、
前記検索Ｎグラムと前記ハッシュキー関連Ｎグラムとの同一性の比較を行い、かつ、該ハッシュキー関連Ｎグラムの出現位置整合演算を行う文字列比較演算ステップ、を有することを特徴とする全文検索方法。 A full-text search method that divides text information extracted from an electronic file into N-grams (N: an integer with N > 1), registers N-gram data including each N-gram and its appearance position in a data table of a database, and performs a full-text search of the text information for a search target character string using the N-gram data table, the method comprising:
a hash key registration step of registering, in a hash table, a hash key generated from each (N−n)-character string of the N-grams (n: an integer with N > n);
an N-gram data registration step of registering, in the N-gram data table, the N-gram data of the N-gram containing the (N−n) characters;
an address registration step of registering, in the hash table in association with the hash key, the address of the N-gram data table at which the N-gram data is registered;
a hash key related N-gram search step of dividing the search target character string into N-grams, generating a hash key from the (N−n)-gram contained in each search N-gram, acquiring the address associated with the hash key from the hash table, and retrieving N-grams from the N-gram data table based on the address; and
a character string comparison operation step of comparing the search N-gram with the hash key related N-gram for identity and performing an appearance position matching operation on the hash key related N-gram.

電子ファイルから取り出されたテキスト情報をＮグラム分割し（Ｎ：Ｎ＞１の整数）、該Ｎグラムと該Ｎグラムの出現位置を含むＮグラムデータをデータベースのデータテーブルに登録し、該Ｎグラムデータテーブルを用いて、検索対象の文字列による該テキスト情報の全文検索するため、コンピュータに、
前記Ｎグラムの（Ｎ−ｎ）文字の各々から生成したハッシュキー（ｎ：Ｎ＞ｎの整数）をハッシュテーブルに登録するハッシュキー登録手順、
前記（Ｎ−ｎ）文字を含むＮグラムのＮグラムデータを前記Ｎグラムデータテーブルに登録するＮグラムデータ登録手順、
前記Ｎグラムデータを登録した前記Ｎグラムデータテーブルのアドレスを、前記ハッシュキーと関連付けてハッシュテーブルに登録するアドレス登録手順、
前記検索対象の文字列をＮグラム分割し、該検索Ｎグラムに含まれる（Ｎ−ｎ）グラムからハッシュキーを生成し、該ハッシュキーに関連する前記アドレスを、前記ハッシュテーブルから取得し、該アドレスに基づいて前記ＮグラムデータテーブルからＮグラムを検索するハッシュキー関連Ｎグラム検索手順、および、
前記検索Ｎグラムと前記ハッシュキー関連Ｎグラムとの同一性の比較を行い、かつ、該ハッシュキー関連Ｎグラムの出現位置整合演算を行う文字列比較演算手順、を実行させるためのプログラム。 A program that, in order to divide text information extracted from an electronic file into N-grams (N: an integer with N > 1), register N-gram data including each N-gram and its appearance position in a data table of a database, and perform a full-text search of the text information for a search target character string using the N-gram data table, causes a computer to execute:
a hash key registration procedure for registering, in a hash table, a hash key generated from each (N−n)-character string of the N-grams (n: an integer with N > n);
an N-gram data registration procedure for registering, in the N-gram data table, the N-gram data of the N-gram containing the (N−n) characters;
an address registration procedure for registering, in the hash table in association with the hash key, the address of the N-gram data table at which the N-gram data is registered;
a hash key related N-gram search procedure for dividing the search target character string into N-grams, generating a hash key from the (N−n)-gram contained in each search N-gram, acquiring the address associated with the hash key from the hash table, and retrieving N-grams from the N-gram data table based on the address; and
a character string comparison operation procedure for comparing the search N-gram with the hash key related N-gram for identity and performing an appearance position matching operation on the hash key related N-gram.

前記Ｎグラムデータテーブルにおいて、同一のハッシュキーを有する前記Ｎグラムデータは、同一のブロック、または、同一のバッファに保存する手順を実行させる請求項７に記載のプログラム。 The program according to claim 7, further causing the computer to execute a procedure for storing N-gram data having the same hash key in the same block or the same buffer of the N-gram data table.

前記ハッシュテーブル内に、前記生成したハッシュキーが既に登録されている場合、該既登録ハッシュキーと関連付けられるアドレスに基づいて、前記Ｎグラムデータテーブルに前記Ｎグラムデータを登録する手順を実行させる請求項７および８に記載のプログラム。 The program according to claim 7 or 8, further causing the computer to execute a procedure for registering, when the generated hash key is already registered in the hash table, the N-gram data in the N-gram data table based on the address associated with the already-registered hash key.

前記ＮグラムデータのＮグラムが、既に前記Ｎグラムデータテーブル内に登録されている場合、該既登録Ｎグラムデータに前記Ｎグラムデータの出現位置を書き込み、または、更新する手順を実行させる請求項７〜９に記載のプログラム。 The program according to any one of claims 7 to 9, further causing the computer to execute a procedure for writing the appearance position of the N-gram data to, or updating it in, the already-registered N-gram data when the N-gram of the N-gram data is already registered in the N-gram data table.

電子ファイルから取り出されたテキスト情報をＮグラム分割し（Ｎ：Ｎ＞１の整数）、該（Ｎ−１）グラムと該（Ｎ−１）グラムの出現位置を含む（Ｎ−１）グラムデータをデータテーブルに登録し、該（Ｎ−１）グラムデータテーブルを用いて、検索対象の文字列による該テキスト情報の全文検索を行うために、コンピュータに、
前記（Ｎ−１）グラムの各々から生成したハッシュキーをハッシュテーブルに登録するハッシュキー登録手順、
前記（Ｎ−１）グラムデータを前記（Ｎ−１）グラムデータテーブルに登録する（Ｎ−１）グラムデータ登録手順、
前記（Ｎ−１）グラムデータを登録した前記（Ｎ−１）グラムデータテーブルのアドレスを、前記ハッシュキーと関連付けてハッシュテーブルに登録するアドレス登録手順、
前記検索対象の文字列を（Ｎ−１）グラム分割し、該検索（Ｎ−１）グラムからハッシュキーを生成し、該生成ハッシュキーに関連する前記アドレスを、前記ハッシュテーブルの中から検索し、該アドレスに基づいて前記（Ｎ−１）グラムデータテーブルから（Ｎ−１）グラムを検索するハッシュキー関連（Ｎ−１）グラム検索手順、および、
前記検索（Ｎ−１）グラムと前記ハッシュキー関連（Ｎ−１）グラムとの同一性の比較を行い、かつ、該ハッシュキー関連（Ｎ−１）グラムの出現位置整合演算を行う文字列比較演算手順、を実行させる請求項７〜１０に記載のプログラム。 The program according to any one of claims 7 to 10, which, in order to divide text information extracted from an electronic file into N-grams (N: an integer with N > 1), register (N−1)-gram data including each (N−1)-gram and its appearance position in a data table, and perform a full-text search of the text information for a search target character string using the (N−1)-gram data table, further causes the computer to execute:
a hash key registration procedure for registering, in a hash table, a hash key generated from each of the (N−1)-grams;
an (N−1)-gram data registration procedure for registering the (N−1)-gram data in the (N−1)-gram data table;
an address registration procedure for registering, in the hash table in association with the hash key, the address of the (N−1)-gram data table at which the (N−1)-gram data is registered;
a hash key related (N−1)-gram search procedure for dividing the search target character string into (N−1)-grams, generating a hash key from each search (N−1)-gram, retrieving the address associated with the generated hash key from the hash table, and retrieving (N−1)-grams from the (N−1)-gram data table based on the address; and
a character string comparison operation procedure for comparing the search (N−1)-gram with the hash key related (N−1)-gram for identity and performing an appearance position matching operation on the hash key related (N−1)-gram.