JP2006106896A

JP2006106896A - Database registration system, database retrieval system, vocabulary index registration method and different notation identification retrieval method

Info

Publication number: JP2006106896A
Application number: JP2004289280A
Authority: JP
Inventors: Kanji Nakamura; 寛爾中村
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2004-09-30
Filing date: 2004-09-30
Publication date: 2006-04-20

Abstract

PROBLEM TO BE SOLVED: To efficiently realize searches both when different notations are identified with one another and when they are not.

SOLUTION: When text data is registered in a database 10, an N-gram dividing part 121 divides a character string included in the data into N-grams. A hash value converting part 122 converts each of the divided grams into a hash value that is the same for every one of a plurality of notations, on the assumption that the notation of the character string constituting the gram has a plurality of notations that may be subject to different-notation identification. A vocabulary index registering part 123 registers the vocabulary index of the character string constituting each divided gram in the entry of a hash table 103 specified by the hash value converted by the hash value converting part 122, or in a collision chain 104 linked to that entry.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a database registration system for registering text data in a database and to a database search system for searching for a character string in the text data stored in the database. In particular, it relates to a database registration system, a database search system, a vocabulary index registration method, and a different-notation identification search method that are suited to registering vocabulary indexes appropriate for different-notation identification search and to performing such searches using the registered vocabulary indexes.

Various database search systems have been developed that search a database storing a plurality of documents (document text data) for documents matching a search condition specified by a user. When a document is stored in such a database, the text data (character strings) in the document is generally indexed to speed up searching. For example, an indexing technique that divides text data into short vocabulary items and uses them as vocabulary indexes is well known. Assume that a vocabulary index set containing the vocabulary indexes "invention", "patent", and "proposer" exists in the database search system. To search for the character string "invention", the system first looks up the vocabulary index "invention" in the vocabulary index set. Pointer information indicating where the corresponding vocabulary index information is stored in the database is attached to this vocabulary index. The vocabulary index information designated by that pointer information includes storage position information indicating where the vocabulary item "invention" is stored in the database, so the search can be executed by referring to this vocabulary index information. To search for a long character string such as "patent proposer", the string is divided into short strings such as "patent" and "proposer", and the vocabulary index information of each is consulted. For this reason, when text data containing "patent proposer" is stored in the database, "patent proposer" is generally divided into "patent" and "proposer" and a vocabulary index is generated for each. In this example, the vocabulary indexes "patent" and "proposer" are looked up in the vocabulary index set, their vocabulary index information is obtained, and the results are merged to complete the search for the character string "patent proposer".

Two techniques for dividing a long character string are well known: morphological analysis, which divides the string into short strings (words) by checking it against a word dictionary prepared in advance, and the N-gram technique, which divides the string into pieces of a fixed number of characters.

The following describes how vocabulary indexes are created with the N-gram technique, which requires no word dictionary and never fails to divide a string even when a new word appears, using as an example the indexing performed when the character string "patent" is stored in the database. First, the character string is divided into N-grams. With N = 3, the division yields the grams "pat", "ate", "ten", "ent", "nt", and "t". A hash value is calculated for each gram, and the corresponding vocabulary index is registered in the hash table entry specified by that hash value. Attached to the vocabulary index is a pointer to vocabulary index information, which includes position information indicating where in the database the corresponding vocabulary item is stored. If a vocabulary index for a different gram (character string) is already registered in the hash table entry specified by the calculated hash value, that is, if the hash values collide, the colliding vocabulary indexes are managed by connecting them in a list. This list is called a collision chain. This processing is performed for every gram, and each vocabulary index is registered in the hash table or in a collision chain. In the N-gram division of "patent" above, assume that the hash values of the grams "ten" and "ent" collide.

Next, search processing using the set of vocabulary indexes (the hash table and collision chains) is described. To search for the character string "patent", the string is first divided and a hash value is calculated for each part. At registration, "patent" was divided into N-grams (N = 3), giving the six grams "pat", "ate", "ten", "ent", "nt", and "t"; at search time, however, it is divided into just two parts, "pat" and "ent".

Next, the hash table is consulted using the obtained hash values. Assume here that the vocabulary index of "pat" is registered in the hash table, while the vocabulary index of "ent", whose hash value collided with that of "ten", is connected to a collision chain. In this case, the vocabulary index of "pat" can be obtained directly from the hash table, whereas obtaining the vocabulary index of "ent" requires scanning the collision chain. Once the vocabulary indexes of both "pat" and "ent" have been obtained, merging the corresponding vocabulary index information reveals where the character string "patent" is stored in the database.
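The registration and lookup just described for the first prior art can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the tiny table size, the use of CRC-32 as the hash function, and the representation of a hash table entry plus its collision chain as a single Python list are all hypothetical and do not come from the patent.

```python
import zlib

TABLE_SIZE = 8  # assumed; kept tiny so that collisions between grams are likely

def split_ngrams(text, n=3):
    """Return the gram starting at every position; trailing grams are shorter than n."""
    return [text[i:i + n] for i in range(len(text))]

def gram_hash(gram):
    """Case-sensitive hash of a gram (CRC-32 is a stand-in, not the patent's function)."""
    return zlib.crc32(gram.encode("utf-8")) % TABLE_SIZE

def register(table, text, start=0):
    """Register a vocabulary index for every gram of `text`.

    `table` maps a hash value to a list of (gram, positions) pairs. The first
    pair plays the role of the hash table entry; later pairs form the collision chain.
    """
    for offset, gram in enumerate(split_ngrams(text)):
        chain = table.setdefault(gram_hash(gram), [])
        for stored_gram, positions in chain:
            if stored_gram == gram:
                positions.append(start + offset)
                break
        else:
            chain.append((gram, [start + offset]))

def lookup(table, gram):
    """Scan the entry and its collision chain for the index of exactly this gram."""
    for stored_gram, positions in table.get(gram_hash(gram), []):
        if stored_gram == gram:
            return positions
    return []

def find(table, query, n=3):
    """Merge the positions of the query's non-overlapping grams.

    Assumes len(query) is a multiple of n, as in the "pat" / "ent" example.
    """
    hits = None
    for offset in range(0, len(query), n):
        starts = {p - offset for p in lookup(table, query[offset:offset + n])}
        hits = starts if hits is None else hits & starts
    return sorted(hits or [])

table = {}
register(table, "patent")
print(find(table, "patent"))  # [0]
```

Because the hash is computed on the gram exactly as written, "pat" and "PAT" generally land in unrelated entries, which is the starting point of the problem discussed next.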

Thus, in the prior art described above (hereinafter, the first prior art), searching for the vocabulary index of a gram (vocabulary item) whose hash value collided requires scanning a collision chain. Consequently, when the first prior art is used to search while treating differently notated vocabulary items as identical, the hash table lookup and collision chain scan must be performed many times, as described below. The case considered here is a search that treats uppercase and lowercase as identical, but the same problem arises whenever differently notated vocabulary items are to be identified with one another, for example full-width/half-width or hiragana/katakana.

An uppercase/lowercase (different-notation) identification search for the character string "patent" in the first prior art is described with reference to FIG. 7. First, as indicated by arrow A1 in FIG. 7, "patent" is divided into the two grams "pat" and "ent". Next, as indicated by arrow A2, the hash value of "pat" is calculated. The hash table 71 and collision chain 72 are then consulted using the obtained hash value to acquire the vocabulary index information. Up to this point the processing is the same as described above. For an uppercase/lowercase identification search, however, the hash value must also be calculated, and the hash table 71 and collision chain 72 consulted, for "paT", "PAt", "PAT", and so on, in the same way as for "pat", as indicated by arrow A3 in FIG. 7.

A three-letter alphabetic string has eight variants that differ only in uppercase/lowercase. Identifying uppercase with lowercase therefore requires performing the sequence of hash value calculation, hash table lookup, and collision chain scan eight times. Moreover, although only uppercase/lowercase identification is considered here, combining it with full-width/half-width identification and the like increases the number of repetitions dramatically.
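The count of eight follows from each of the three letters independently being uppercase or lowercase (2 × 2 × 2). A small sketch that enumerates the variants is shown below; the function name `case_variants` is hypothetical and is used only for illustration.

```python
from itertools import product

def case_variants(gram):
    """All strings that differ from `gram` only in upper/lower case."""
    # Characters without a case distinction (digits, for example) contribute one choice.
    choices = [sorted({ch.lower(), ch.upper()}) for ch in gram]
    return ["".join(combo) for combo in product(*choices)]

variants = case_variants("pat")
print(len(variants))  # 8
```

Under the first prior art, each of these eight strings would need its own hash calculation, table lookup, and chain scan.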

Patent Document 1, on the other hand, describes a technique (hereinafter, the second prior art) for creating and searching a vocabulary index (index) suited to identifying differently notated vocabulary items such as uppercase/lowercase, full-width/half-width, and hiragana/katakana. In the second prior art, an index (vocabulary index) common to the plurality of notations that may be identified with one another is created, together with a set of leaves (vocabulary index information) linked to that index. Each leaf consists of location information indicating where the corresponding character string (vocabulary item) is stored in the database, and character information with a special structure for distinguishing differences in the notation of that string (uppercase/lowercase, full-width/half-width, and so on).

In the second prior art, a search that identifies different notations only needs to acquire the information of all leaves linked to the index corresponding to the search target character string. An uppercase/lowercase identification search can therefore be executed efficiently. A search that does not identify different notations, on the other hand, must refer to the information of all leaves linked to that index, that is, the leaf information (vocabulary index information) of every character string in the database that can be identified with the search target character string, compare the character information in each leaf with the search target character string, and acquire only the leaves whose character information matches the notation of the search target. Performing this comparison across the leaf information of every identifiable character string in the database is extremely inefficient. In addition, because leaf information containing the character information and storage position information is needed for every character string in the database, the amount of information becomes enormous.
JP-A-8-77188 (paragraphs 0008 to 0010)

In the first prior art described above, a different-notation identification search is inefficient because a series of processes comprising hash value calculation, hash table lookup, and collision chain scanning must be performed for each of the different notations to be identified.

The second prior art, on the other hand, can improve the efficiency of different-notation identification searches, but the amount of information becomes enormous because leaf information containing the character information and storage position information is needed for every character string in the database. Furthermore, a search that does not identify different notations must refer to the leaf information of every character string in the database that can be identified with the search target character string and compare the character information in each leaf with the search target, so search efficiency drops markedly.

The present invention has been made in view of the above circumstances, and its object is to provide a database registration system, a database search system, a vocabulary index registration method, and a different-notation identification search method that make it possible to realize searches effectively both when different notations are identified with one another and when they are not.

According to a first aspect of the present invention, a database registration system for registering text data in a database is provided. The database registration system comprises: dividing means for dividing, when text data is registered in the database, a character string included in the text data into N-grams as a registration target character string; hash value conversion means for converting each gram produced by the dividing means into a hash value that is the same for every one of a plurality of notations, on the assumption that the notation of the character string constituting the gram has a plurality of notations that may be subject to different-notation identification; and vocabulary index registration means for registering, for each gram produced by the dividing means, the vocabulary index of the character string constituting the gram in the hash table entry specified by the hash value converted by the hash value conversion means, or in a list linked to that entry.

In this configuration, the hash value conversion means converts a vocabulary item (character string) to be registered in the vocabulary index into the same hash value regardless of which of the plurality of notations subject to different-notation identification it is written in. This hash value conversion aligns the hash values of all differently notated vocabulary items that may be identified with one another. As a result, the vocabulary indexes of all such vocabulary items are registered in the hash table entry specified by the aligned hash value or in the list linked to that entry. A different-notation identification search therefore only needs to scan the list (collision chain) linked to the hash table entry specified by the unified hash value. Compared with the prior art, which individually scans as many lists as there are hash table entries involved, fewer lists need to be scanned and search performance improves. In particular, when the list is stored in the database, the vocabulary indexes of vocabulary items that may be identified with one another are concentrated in a local region of the database, which makes scanning even more efficient than in the prior art and improves search performance further. When different notations are not identified, it suffices to search the hash table entry specified by the unified hash value, or the list linked to that entry, for only the vocabulary index of the target notation, so this kind of search is also simple to execute.

The hash value conversion by the hash value conversion means is preferably realized by: notation unification conversion means for performing, for each gram produced by the dividing means, a notation conversion that unifies the notation of the character string constituting the gram into a predetermined one of the plurality of notations, on the assumption that the notation of that character string has a plurality of notations that may be subject to different-notation identification; and hash value calculation means for calculating the hash value of the character string converted by the notation unification conversion means.
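A minimal sketch of this two-stage conversion is given below. It assumes uppercase as the predetermined notation (the choice used as an example in the embodiment) and CRC-32 as the hash function; the function names, the table size, and the choice of CRC-32 are illustrative assumptions and are not specified by the patent.

```python
import zlib

TABLE_SIZE = 1024  # assumed number of hash table entries

def unify_notation(gram):
    """Notation unification: map every case variant to one predetermined form."""
    return gram.upper()  # uppercase is the example choice; lowercase would work equally well

def calculate_hash(text):
    """Hash value calculation on the already unified string (CRC-32 as a stand-in)."""
    return zlib.crc32(text.encode("utf-8")) % TABLE_SIZE

def convert_to_hash(gram):
    """Hash value conversion = notation unification followed by hash calculation."""
    return calculate_hash(unify_notation(gram))

# Every case variant of "pat" yields the same hash value.
values = {convert_to_hash(g) for g in ["pat", "paT", "PAt", "PAT"]}
print(len(values))  # 1
```

Splitting the conversion into two steps keeps the hash function itself unchanged; only the string fed into it is normalized.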

According to a second aspect of the present invention, there is provided a database search system that searches for a character string in the text data stored in the database by using the vocabulary indexes registered by the database registration system configured as above. The database search system comprises: dividing means for dividing a search target character string into N-grams; hash value conversion means for converting each gram produced by the dividing means into a hash value that is the same for every one of a plurality of notations, on the assumption that the notation of the character string constituting the gram has a plurality of notations that may be subject to different-notation identification; vocabulary index search means for searching, for each gram produced by the dividing means, for the vocabulary index of the character string constituting the gram by scanning the hash table entry specified by the hash value converted by the hash value conversion means or the list linked to that entry; and search result processing means for acquiring, based on the result of the vocabulary index search, either only character strings that exactly match the search target character string or all character strings that can be identified with the search target character string.

In this configuration, a search examines the vocabulary indexes by scanning the hash table entry and list specified by the hash value aligned by the hash value conversion means, so both searches that identify different notations and searches that do not can be executed easily.
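The two search modes can be illustrated for a single gram with the sketch below. It is a simplification under stated assumptions: uppercase unification, CRC-32 hashing, the helper names, and the flattening of a hash table entry and its linked list into one Python list are all hypothetical, and the merging of per-gram results into whole-string matches is omitted.

```python
import zlib

TABLE_SIZE = 1024  # assumed

def unified_hash(gram):
    """Hash of the gram after unifying its notation to uppercase (assumed choice)."""
    return zlib.crc32(gram.upper().encode("utf-8")) % TABLE_SIZE

def build_table(grams_with_positions):
    """Build {hash value: [(gram as written, positions), ...]} for the demo."""
    table = {}
    for gram, position in grams_with_positions:
        chain = table.setdefault(unified_hash(gram), [])
        for stored, positions in chain:
            if stored == gram:
                positions.append(position)
                break
        else:
            chain.append((gram, [position]))
    return table

def search_gram(table, gram, identify_notations):
    """Scan one entry and its list; keep exact matches or all identifiable variants."""
    results = {}
    for stored, positions in table.get(unified_hash(gram), []):
        if identify_notations:
            # Compare unified forms so that unrelated grams sharing the hash are skipped.
            matched = stored.upper() == gram.upper()
        else:
            matched = stored == gram
        if matched:
            results[stored] = positions
    return results

table = build_table([("pat", 0), ("PAT", 10), ("Pat", 20), ("ent", 3)])
print(search_gram(table, "pat", identify_notations=True))   # {'pat': [0], 'PAT': [10], 'Pat': [20]}
print(search_gram(table, "pat", identify_notations=False))  # {'pat': [0]}
```

In both modes only one hash value is computed and only one list is scanned; the mode merely changes the comparison applied to each vocabulary index found there.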

According to the present invention, aligning the hash values of vocabulary items that may be identified with one another when vocabulary indexes are registered allows the vocabulary indexes of all such vocabulary items to be registered in the hash table entry specified by the aligned hash value or in the list linked to that entry. This reduces the number of lists that must be scanned in a different-notation identification search and improves search performance. Moreover, while the hash values are aligned, a separate vocabulary index is still prepared for each differently notated vocabulary item, so a search that does not identify different notations only needs to look up the vocabulary index of the target notation and is likewise simple to execute.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a database search system having a database registration function according to an embodiment of the present invention. The database search system of FIG. 1 includes a database 10, a character string input unit 11 for database registration, a database registration unit 12, a character string input unit 13 for database search, and a database search unit 14.

The database 10 includes a data storage area 101 and a vocabulary index information storage area 102. The data storage area 101 stores text data, for example the data of structured documents having a logical structure, typified by XML (Extensible Markup Language). The vocabulary index information storage area 102 stores vocabulary index information. The vocabulary index information includes information on every storage position in the database 10 at which the corresponding vocabulary item is stored.

The database 10 also stores a hash table 103 and a group of collision chains 104. The hash table 103 has a group of entries for holding vocabulary indexes. Each entry in the hash table 103 is specified by its own unique hash value. Unlike in the prior art, the hash value that specifies an entry of the hash table 103 does not necessarily match the hash value of the vocabulary item registered in that entry. In this embodiment, when the vocabulary item indicated by the vocabulary index registered in an entry has a plurality of notations, the hash value specifying that entry is the hash value calculated from a predetermined one of those notations. A collision chain 104 is linked to an entry of the hash table 103 and is used to register the colliding vocabulary index when a hash value collides with that of a vocabulary index already registered in the entry.

The character string input unit 11 receives the text data to be registered in the database 10 in accordance with a database registration request issued by an application in response to a user's input operation, and extracts from the text data the character strings to which vocabulary indexes are to be assigned (registration target character strings).

The database registration unit 12 has a function of registering the text data input by the character string input unit 11 in the database 10. When registering this text data, the database registration unit 12 also has a function (vocabulary index registration function) of registering the group of vocabulary indexes of the character strings extracted from the text data by the character string input unit 11 in the hash table 103 or a collision chain 104. The database registration unit 12 includes an N-gram division unit 121, a hash value conversion unit 122, and a vocabulary index registration unit 123.

The N-gram division unit 121 divides a registration target character string into N-grams. The hash value conversion unit 122 converts each gram produced by the N-gram division unit 121 into a hash value that is the same for every one of a plurality of notations, on the assumption that the notation of the character string constituting the gram has a plurality of notations that may be subject to different-notation identification. The hash value conversion unit 122 includes a notation unification conversion unit 122a and a hash value calculation unit 122b.

For each gram produced by the N-gram division unit 121, the notation unification conversion unit 122a performs a notation conversion that unifies the notation of the character string constituting the gram into a predetermined one of the plurality of notations, on the assumption that the notation of that character string has a plurality of notations that may be subject to different-notation identification. In this embodiment, to simplify the description, the character strings to be registered or searched are assumed to consist only of alphabetic characters and, of full-width and half-width, only of half-width characters. In this case, the notation unification conversion unit 122a converts every character of the string to either uppercase or lowercase, for example uppercase. FIG. 2 shows the relationship between the character type of the character string subject to notation conversion by the notation unification conversion unit 122a and the notation after conversion. In FIG. 2, the "-" entered under both the "character type" and "full-width/half-width" items for the unified notation indicates that the character type and the full-width/half-width notation are not subject to conversion.

明らかなように、文字列を構成する全ての文字（英字）が大文字の場合には表記変換は不要である。しかし本実施形態における表記統一変換部１２２ａは、表記変換の対象となる文字の大文字／小文字に無関係に、その文字を大文字に形式的に変換するように構成されている。この例では、変換前の文字が大文字の場合、その変換前の文字と変換後の文字とは、結果的に同一表記となる。勿論、表記統一変換部１２２ａが、変換前の文字が大文字であるか小文字であるかを識別し、大文字である場合には、その文字をそのまま表記変換結果として出力することも可能である。 Clearly, no notation conversion is needed when every (alphabetic) character of the string is already uppercase. However, the notation unified conversion unit 122a of the present embodiment is configured to convert each character to uppercase as a matter of form, regardless of whether the character is uppercase or lowercase. In this example, when the character before conversion is uppercase, the characters before and after conversion end up with the same notation. Of course, the notation unified conversion unit 122a may instead identify whether the character before conversion is uppercase or lowercase and, if it is uppercase, output that character unchanged as the conversion result.

ハッシュ値計算部１２２ｂは、表記統一変換部１２２ａによって表記が統一された文字列（各グラム）のハッシュ値を計算する。 The hash value calculation unit 122b calculates the hash value of the character string (each gram) whose notation has been unified by the notation unified conversion unit 122a.
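
The two-stage conversion described above (unify the notation, then hash the unified string) can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the use of CRC-32 from Python's `zlib` module as the hash function, and the table size of 1024 are all assumptions, since the text does not specify a hash function.

```python
import zlib


def unify_notation(gram: str) -> str:
    """Map every case variant to one canonical form (uppercase, as in the text)."""
    return gram.upper()


def hash_value(gram: str, table_size: int = 1024) -> int:
    """Hash the unified gram. CRC-32 and table_size are illustrative assumptions."""
    return zlib.crc32(unify_notation(gram).encode("ascii")) % table_size


# Every case variant of the same gram selects the same hash table entry.
assert hash_value("pat") == hash_value("Pat") == hash_value("PAT")
```

Because the conversion is applied unconditionally, an input that is already uppercase simply passes through unchanged, which mirrors the "formal" conversion described for the unit 122a.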

語彙索引登録部１２３は、表記統一変換部１２２ａによる表記変換前の文字列の語彙索引を、ハッシュテーブル１０３のエントリに登録する。この語彙索引が登録される、ハッシュテーブル１０３のエントリは、ハッシュ値計算部１２２ｂによって算出された、表記統一変換部１２２ａによる表記変換後の文字列のハッシュ値で特定される。但し、上記特定されるハッシュテーブル１０３のエントリに、上記表記変換前の文字列とは表記（ここでは大文字／小文字）が異なる語彙索引が既に登録されている場合、つまりハッシュ値が衝突した場合には、語彙索引登録部１２３は対応する語彙索引を、当該エントリにリンクしたコリジョンチェーン（リスト）１０４に登録する。 The vocabulary index registration unit 123 registers the vocabulary index of the character string before notation conversion by the notation unified conversion unit 122a in an entry of the hash table 103. The entry in which this vocabulary index is registered is specified by the hash value, calculated by the hash value calculation unit 122b, of the character string after notation conversion by the notation unified conversion unit 122a. However, when the specified entry of the hash table 103 already holds a vocabulary index whose notation (here, uppercase/lowercase) differs from that of the character string before notation conversion, that is, when the hash values collide, the vocabulary index registration unit 123 registers the corresponding vocabulary index in the collision chain (list) 104 linked to that entry.

文字列入力部１３は、ユーザの入力操作に応じてアプリケーションから与えられるデータベース検索要求に従い、検索の対象となる文字列（検索対象文字列）を入力する。 The character string input unit 13 inputs a character string to be searched (search target character string) in accordance with a database search request given from an application in accordance with a user input operation.

データベース検索部１４は、文字列入力部１３により入力された検索対象文字列をデータベース１０から検索する機能を有する。データベース検索部１４は、データベース登録部１２内のＮグラム分割部１２１及びハッシュ値変換部１２２にそれぞれ相当する、Ｎグラム分割部１４１及びハッシュ値変換部１４２を含むと共に、語彙索引検索部１４３及び検索結果処理部１４４を含む。 The database search unit 14 has a function of searching the database 10 for the search target character string input by the character string input unit 13. The database search unit 14 includes an N-gram dividing unit 141 and a hash value conversion unit 142, which correspond to the N-gram dividing unit 121 and the hash value conversion unit 122 in the database registration unit 12, respectively, and also includes a vocabulary index search unit 143 and a search result processing unit 144.

Ｎグラム分割部１４１は、検索対象文字列をＮグラムに分割する。ハッシュ値変換部１４２は、Ｎグラム分割部１４１によって分割された各グラムについて、そのグラムを構成する文字列の表記に異表記同一視の対象となり得る複数の表記が存在するものとして、当該複数の表記のいずれの場合にも同一の値となるハッシュ値に変換する。ハッシュ値変換部１４２は、ハッシュ値変換部１２２内の表記統一変換部１２２ａ及びハッシュ値計算部１２２ｂにそれぞれ相当する、表記統一変換部１４２ａ及びハッシュ値計算部１４２ｂを含む。 The N-gram dividing unit 141 divides the search target character string into N-grams. For each gram produced by the N-gram dividing unit 141, the hash value conversion unit 142 assumes that the character string constituting the gram may be written in a plurality of notations that are candidates for different-notation equating, and converts the gram into a hash value that is the same for every one of those notations. The hash value conversion unit 142 includes a notation unified conversion unit 142a and a hash value calculation unit 142b, which correspond to the notation unified conversion unit 122a and the hash value calculation unit 122b in the hash value conversion unit 122, respectively.

語彙索引検索部１４３は、ハッシュ値計算部１４２ｂによって算出された、表記統一変換部１４２ａによる表記変換後の文字列のハッシュ値で特定されるハッシュテーブル１０３のエントリを参照することによって対応する語彙索引を検索する。また語彙索引検索部１４３は、ハッシュテーブル１０３のエントリの参照時には、ハッシュ値衝突の有無を判定する。語彙索引検索部１４３は、ハッシュ値衝突を判定した場合、当該ハッシュ値で特定されるハッシュテーブル１０３のエントリにリンクしたコリジョンチェーン１０４を走査することによって、対応する語彙索引を検索する。 The vocabulary index search unit 143 searches for the corresponding vocabulary index by referring to the entry of the hash table 103 specified by the hash value, calculated by the hash value calculation unit 142b, of the character string after notation conversion by the notation unified conversion unit 142a. When referring to an entry of the hash table 103, the vocabulary index search unit 143 also determines whether a hash value collision has occurred. When it determines that a collision has occurred, the vocabulary index search unit 143 searches for the corresponding vocabulary index by scanning the collision chain 104 linked to the entry of the hash table 103 specified by that hash value.

検索結果処理部１４４は、語彙索引検索部１４３による語彙索引検索結果に基づいて、検索対象文字列と完全に一致する文字列のみ、または検索対象文字列と同一視可能な全ての文字列を取得する。 Based on the vocabulary index search result from the vocabulary index search unit 143, the search result processing unit 144 acquires either only the character strings that exactly match the search target character string or all character strings that can be equated with it.

データベース登録部１２及びデータベース検索部１４は、計算機システムにインストールされた特定のソフトウェアプログラムを当該計算機システム（内のＣＰＵ）が読み取って実行することにより実現可能である。このプログラムは、コンピュータで読み取り可能な記憶媒体（フロッピー（登録商標）ディスクに代表される磁気ディスク、ＣＤ−ＲＯＭ、ＤＶＤに代表される光ディスク、フラッシュメモリに代表される半導体メモリ等）に予め格納して頒布可能である。また、このプログラムが、ネットワークを介してダウンロード（頒布）されても構わない。 The database registration unit 12 and the database search unit 14 can be realized by having a computer system (its CPU) read and execute a specific software program installed on that computer system. This program can be stored in advance on a computer-readable storage medium (a magnetic disk such as a floppy (registered trademark) disk, an optical disk such as a CD-ROM or DVD, a semiconductor memory such as a flash memory, and so on) and distributed. The program may also be downloaded (distributed) via a network.

次に、図１のデータベース検索システムにおける動作を、語彙索引の登録処理を例に、図３及び図４を参照して説明する。なお、図３は語彙索引の登録処理の手順を示すフローチャート、図４は異表記の文字列"patent"，"Patent"または"PATENT"を対象とする語彙索引の登録処理を説明するための図である。 Next, the operation of the database search system of FIG. 1 will be described with reference to FIGS. 3 and 4, taking the vocabulary index registration process as an example. FIG. 3 is a flowchart showing the procedure of the vocabulary index registration process, and FIG. 4 is a diagram for explaining the vocabulary index registration process for the differently notated character strings “patent”, “Patent”, and “PATENT”.

まず、ユーザの入力操作に応じてアプリケーションから与えられるデータベース登録要求に従い、文字列入力部１１がデータベース１０への登録対象となるテキストデータを入力したものとする。このテキストデータはデータベース登録部１２によってデータベース１０のデータ格納領域１０１に格納される。このとき文字列入力部１１は、データ格納領域１０１に格納されるテキストデータから、語彙索引を付与すべき登録対象文字列を順次抽出する。今、登録対象文字列として"patent"が抽出されたものとする。文字列入力部１１は、この登録対象文字列"patent"をデータベース登録部１２内のＮグラム分割部１２１に渡す。 First, it is assumed that the character string input unit 11 inputs text data to be registered in the database 10 in accordance with a database registration request given from an application in accordance with a user input operation. This text data is stored in the data storage area 101 of the database 10 by the database registration unit 12. At this time, the character string input unit 11 sequentially extracts registration target character strings to which a vocabulary index should be assigned from the text data stored in the data storage area 101. Now, it is assumed that “patent” has been extracted as a registration target character string. The character string input unit 11 passes this registration target character string “patent” to the N-gram dividing unit 121 in the database registration unit 12.

Ｎグラム分割部１２１は、文字列入力部１１から渡された登録対象文字列"patent"を、図４において矢印Ｂ１で示すように、Ｎグラムに分割する（ステップＳ１）。ここでは、Ｎは例えば３である。Ｎグラム分割部１２１によって分割されたグラム列は、ハッシュ値変換部１２２内の表記統一変換部１２２ａに渡される。このグラム列は、登録対象文字列が"patent"である本実施形態では、グラム（文字列）"pat"を含む。 The N-gram dividing unit 121 divides the registration target character string “patent” passed from the character string input unit 11 into N-grams as indicated by an arrow B1 in FIG. 4 (step S1). Here, N is 3, for example. The gram string divided by the N-gram dividing unit 121 is passed to the notation unified conversion unit 122 a in the hash value conversion unit 122. In the present embodiment in which the registration target character string is “patent”, the gram string includes a gram (character string) “pat”.
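
As a rough sketch of step S1, assuming that the division uses an overlapping sliding window of width N (a common choice for N-gram indexes; the text itself only states that the result for “patent” includes “pat”), the split could look like this. The function name `split_ngrams` is hypothetical.

```python
from typing import List


def split_ngrams(text: str, n: int = 3) -> List[str]:
    """Return every run of n consecutive characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(split_ngrams("patent"))  # ['pat', 'ate', 'ten', 'ent']
```

Note that the split itself preserves the original notation: “Patent” yields “Pat” and “PATENT” yields “PAT”. Unification happens only afterwards, in step S3.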

表記統一変換部１２２ａは、Ｎグラム分割部１２１から渡されたグラム列の中から未処理のグラムを１つ選択する（ステップＳ２）。そして表記統一変換部１２２ａは、選択されたグラムについて、そのグラムを構成する文字列（英文字列）の表記を予め定められた表記、例えば大文字表記に統一するための表記変換を行う（ステップＳ３）。これにより、変換対象グラムが上述の"pat"の場合には、当該"pat"は大文字表記の"PAT"に変換される。つまり、英字の文字列の表記が大文字表記に統一される。この表記統一変換部１２２ａによる表記の統一のための表記変換は、同一視（ここでは大文字／小文字同一視）の対象となり得る異表記のグラムについて、表記変換後のグラムのハッシュ値を同一値に揃えるために行われる。 The notation unified conversion unit 122a selects one unprocessed gram from the gram string passed from the N-gram dividing unit 121 (step S2). For the selected gram, the notation unified conversion unit 122a then converts the notation of the (alphabetic) character string constituting the gram into a predetermined notation, for example uppercase (step S3). As a result, when the gram to be converted is the above-mentioned “pat”, it is converted to the uppercase “PAT”. In other words, the notation of alphabetic character strings is unified to uppercase. This notation conversion by the notation unified conversion unit 122a is performed so that differently notated grams that can be equated (here, equating uppercase and lowercase) yield the same hash value after conversion.

表記統一変換部１２２ａによる表記変換後（大文字表記への変換後）のグラム（文字列）はハッシュ値計算部１２２ｂに渡される。このとき、表記統一変換部１２２ａによる表記変換前のグラムが語彙索引登録部１２３に渡される。 The gram (character string) after the notation conversion (after conversion to upper case notation) by the notation unified conversion unit 122a is passed to the hash value calculation unit 122b. At this time, the gram before the notation conversion by the notation unified conversion unit 122 a is passed to the vocabulary index registration unit 123.

ハッシュ値計算部１２２ｂは、表記変換後のグラム（文字列）のハッシュ値を計算する（ステップＳ４）。ハッシュ値計算部１２２ｂによるハッシュ値計算結果は語彙索引登録部１２３に渡される。語彙索引登録部１２３は、ハッシュ値計算部１２２ｂから渡されたハッシュ値で特定される、ハッシュテーブル１０３のエントリを参照することにより、ハッシュ値の衝突の有無を判定する（ステップＳ５）。即ち語彙索引登録部１２３は、参照されたハッシュテーブル１０３のエントリに、表記統一変換部１２２ａによる表記変換前のグラム（文字列）とは異なる表記の文字列の語彙索引が既に登録されているならば、ハッシュ値の衝突があったと判定する。これに対し、参照されたハッシュテーブル１０３のエントリに語彙索引が登録されていないか、或は語彙索引が登録されていても、その語彙索引が表記統一変換部１２２ａによる表記変換前のグラム（文字列）と同一表記の文字列の語彙索引であるならば、ハッシュ値の衝突がなかったと判定する。 The hash value calculation unit 122b calculates the hash value of the gram (character string) after notation conversion (step S4). The result is passed to the vocabulary index registration unit 123. The vocabulary index registration unit 123 determines whether a hash value collision has occurred by referring to the entry of the hash table 103 specified by the hash value passed from the hash value calculation unit 122b (step S5). Specifically, if the referenced entry already holds the vocabulary index of a character string whose notation differs from that of the gram (character string) before notation conversion by the notation unified conversion unit 122a, the vocabulary index registration unit 123 determines that a hash value collision has occurred. Conversely, if no vocabulary index is registered in the referenced entry, or if one is registered but it is the vocabulary index of a character string with the same notation as the gram before notation conversion, the vocabulary index registration unit 123 determines that no hash value collision has occurred.
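
The three-way decision of step S5 reduces to a small predicate. The sketch below is illustrative only; the function name `collides` and the use of `None` to represent an empty entry are assumptions.

```python
from typing import Optional


def collides(registered: Optional[str], gram: str) -> bool:
    """Step S5 as described: a collision means the entry already holds the
    index of a string written differently from `gram` (before conversion)."""
    return registered is not None and registered != gram


assert collides(None, "pat") is False    # empty entry: no collision
assert collides("pat", "pat") is False   # same notation already registered
assert collides("pat", "Pat") is True    # different notation, same hash
```

The comparison uses the notation before conversion on both sides, so “pat” and “Pat” are treated as distinct indexes even though they share a hash value.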

今、ハッシュ値計算部１２２ｂから語彙索引登録部１２３に渡されたハッシュ値が、表記統一変換部１２２ａによって"pat"から変換された"PAT"のハッシュ値ＨＰであるものとする。また、このハッシュ値ＨＰで特定される、ハッシュテーブル１０３のエントリが、図４に示すようにエントリ１０３ａであり、当該エントリ１０３ａには語彙索引が登録されていないものとする。この場合、語彙索引登録部１２３はハッシュ値の衝突がなかったと判定し、ハッシュテーブル１０３のエントリ１０３ａに、表記統一変換部１２２ａによる表記変換前のグラム"pat"の語彙索引を登録する（ステップＳ６）。このステップＳ６では、データベース１０のデータ格納領域１０１における"pat"の格納位置の情報を含む語彙索引情報が、データベース１０の語彙索引情報格納領域１０２に登録される。この"pat"の格納位置は、データベース１０にテキストデータが格納される際に、当該テキストデータから抽出された登録対象文字列"patent"中の"pat"の格納位置である。ハッシュテーブル１０３のエントリ１０３ａに登録された、"pat"の語彙索引（つまり表記統一変換部１２２ａによる表記変換前の文字列"pat"）には、この登録対象文字列"patent"中の"pat"の格納位置の情報を含む語彙索引情報を指し示すポインタ情報が付加される。 Assume now that the hash value passed from the hash value calculation unit 122b to the vocabulary index registration unit 123 is the hash value HP of “PAT”, converted from “pat” by the notation unified conversion unit 122a. Assume also that the entry of the hash table 103 specified by this hash value HP is the entry 103a shown in FIG. 4, and that no vocabulary index is registered in the entry 103a. In this case, the vocabulary index registration unit 123 determines that no hash value collision has occurred and registers the vocabulary index of the gram “pat”, as written before notation conversion by the notation unified conversion unit 122a, in the entry 103a of the hash table 103 (step S6). In step S6, vocabulary index information including the storage position of “pat” in the data storage area 101 of the database 10 is registered in the vocabulary index information storage area 102 of the database 10. This storage position is the position of “pat” within the registration target character string “patent” extracted from the text data when that text data was stored in the database 10. Pointer information pointing to the vocabulary index information that includes this storage position is attached to the vocabulary index of “pat” registered in the entry 103a of the hash table 103 (that is, to the character string “pat” before notation conversion by the notation unified conversion unit 122a).

なお、"pat"の語彙索引が既にハッシュテーブル１０３のエントリ１０３ａに登録され、したがって当該語彙索引により指し示される語彙索引情報がデータベース１０の語彙索引情報格納領域１０２に既に登録されている場合には、当該語彙索引情報に、上記"pat"の格納位置の情報が語彙索引登録部１２３によって追加される。ここでは、この"pat"の格納位置の情報が、ハッシュテーブル１０３のエントリ１０３ａに登録された語彙索引により指し示される語彙索引情報に追加されるだけの場合も、当該"pat"の語彙索引が等価的にハッシュテーブル１０３のエントリ１０３ａに登録されたものとして扱う。 If the vocabulary index of “pat” is already registered in the entry 103a of the hash table 103, and the vocabulary index information it points to is therefore already registered in the vocabulary index information storage area 102 of the database 10, the vocabulary index registration unit 123 adds the storage position of this “pat” to that vocabulary index information. Here, even when the storage position of “pat” is merely added to the vocabulary index information pointed to by the vocabulary index registered in the entry 103a, the vocabulary index of “pat” is treated as having been equivalently registered in the entry 103a of the hash table 103.

さて、語彙索引登録部１２３によってステップＳ６または後述するＳ７が実行されると、表記統一変換部１２２ａは、Ｎグラム分割部１２１によって分割されたグラム列中に未処理のグラムが存在するかを判定する（ステップＳ８）。もし、未処理のグラムが存在するならば、表記統一変換部１２２ａは未処理のグラムを１つ選択して（ステップＳ２）、そのグラムを構成する文字列の表記を大文字表記に統一するための表記変換を行う（ステップＳ３）。以下、上述した"pat"の場合と同様の動作が行われる。この動作の繰り返しによって、Ｎグラム分割部１２１によって分割されたグラム列中に未処理のグラムが存在しなくなったならば、つまりＮグラム分割部１２１によって分割された全てのグラムについて、対応する語彙索引を登録する処理が行われたならば、指定された登録対象文字列"patent"に関する一連の語彙索引の登録処理は終了となる。 When step S6 or S7 described later is executed by the vocabulary index registration unit 123, the notation unified conversion unit 122a determines whether an unprocessed gram exists in the gram sequence divided by the N-gram division unit 121. (Step S8). If there is an unprocessed gram, the notation conversion unit 122a selects one unprocessed gram (step S2), and unifies the notation of the character string constituting the gram into uppercase notation. Notation conversion is performed (step S3). Thereafter, the same operation as in the case of “pat” described above is performed. If there is no unprocessed gram in the gram string divided by the N-gram dividing unit 121 by repeating this operation, that is, for all the grams divided by the N-gram dividing unit 121, the corresponding lexical index If the process of registering is performed, a series of lexical index registration processes related to the designated registration target character string “patent” is completed.

その後、文字列入力部１１によって"Patent"または"PATENT"が登録対象文字列として抽出されたものとする。この場合、Ｎグラム分割部１２１は、登録対象文字列"Patent"または"PATENT"を、それぞれ図４において矢印Ｂ２またはＢ３で示すように、Ｎグラム（Ｎ＝３）に分割する（ステップＳ１）。Ｎグラム分割部１２１によって分割されたグラム列は表記統一変換部１２２ａに渡される。このグラム列は、登録対象文字列が"Patent"または"PATENT"の場合、それぞれグラム（文字列）"Pat"または"PAT"を含む。 Assume that “Patent” or “PATENT” is subsequently extracted as a registration target character string by the character string input unit 11. In this case, the N-gram dividing unit 121 divides the registration target character string “Patent” or “PATENT” into N-grams (N = 3), as indicated by arrow B2 or B3 in FIG. 4, respectively (step S1). The gram string produced by the N-gram dividing unit 121 is passed to the notation unified conversion unit 122a. When the registration target character string is “Patent” or “PATENT”, this gram string includes the gram (character string) “Pat” or “PAT”, respectively.

表記統一変換部１２２ａは、Ｎグラム分割部１２１から渡されたグラム列の中から未処理のグラムを１つ選択する（ステップＳ２）。表記統一変換部１２２ａは、選択されたグラムについて、そのグラムを構成する文字列（英文字列）の表記を大文字表記に統一するための表記変換を行う（ステップＳ３）。これにより、変換対象グラムが"Pat"の場合には、当該"Pat"は、上述の"pat"の場合と同様に大文字表記"PAT"に変換される。また、変換対象グラムが"PAT"の場合には、その表記は既に大文字表記であることから、表記統一変換部１２２ａによる変換結果は、表記変換が行われない場合と同一の表記となる。この表記統一変換部１２２ａによる変換結果、つまり大文字表記に統一されたグラムが、ハッシュ値計算部１２２ｂによるハッシュ値計算の対象となる。したがって、表記統一変換部１２２ａによる表記変換前のグラムが"Pat"または"PAT"の場合、ステップＳ４においてハッシュ値計算部１２２ｂによって算出されるハッシュ値は、上述の"pat"の場合と同一値ＨＰとなる。 The notation unified conversion unit 122a selects one unprocessed gram from the gram string passed from the N-gram dividing unit 121 (step S2). For the selected gram, it converts the notation of the (alphabetic) character string constituting the gram to uppercase (step S3). As a result, when the gram to be converted is “Pat”, it is converted to the uppercase “PAT”, just as “pat” was. When the gram to be converted is “PAT”, its notation is already uppercase, so the output of the notation unified conversion unit 122a is identical to what it would be without conversion. This conversion result, that is, the gram unified to uppercase, is the input to the hash value calculation by the hash value calculation unit 122b. Therefore, when the gram before notation conversion is “Pat” or “PAT”, the hash value calculated by the hash value calculation unit 122b in step S4 is the same value HP as for “pat”.

ハッシュ値計算部１２２ｂによって算出されたハッシュ値は語彙索引登録部１２３に渡される。このハッシュ値が、表記統一変換部１２２ａによって"Pat"または"PAT"から変換された"PAT"のハッシュ値ＨＰであるものとする。語彙索引登録部１２３は、このハッシュ値ＨＰで特定される、ハッシュテーブル１０３のエントリを参照することにより、ハッシュ値の衝突の有無を判定する（ステップＳ５）。ハッシュ値ＨＰで特定される、ハッシュテーブル１０３のエントリは、上述したように"pat"の語彙索引が既に登録されているエントリ１０３ａである。この場合、語彙索引登録部１２３はハッシュ値の衝突があったと判定する。すると語彙索引登録部１２３は、図４において矢印Ｃ１で示すように、ハッシュテーブル１０３のエントリ１０３ａにリンクした、データベース１０内のコリジョンチェーン１０４を辿って、当該チェーン１０４に、"Pat"または"PAT"の語彙索引を登録する（ステップＳ７）。このステップＳ７では、データベース１０のデータ格納領域１０１における登録対象文字列"Patent"または"PATENT"中の"Pat"または"PAT"の格納位置の情報を含む語彙索引情報が、データベース１０の語彙索引情報格納領域１０２に登録される。コリジョンチェーン１０４に登録された、"Pat"または"PAT"の語彙索引（つまり表記統一変換部１２２ａによる表記変換前の文字列"Pat"または"PAT"）には、この登録対象文字列"Patent"または"PATENT"中の"Pat"または"PAT"の格納位置の情報を含む語彙索引情報を指し示すポインタ情報が付加される。 The hash value calculated by the hash value calculation unit 122b is passed to the vocabulary index registration unit 123. Assume that this hash value is the hash value HP of “PAT”, converted from “Pat” or “PAT” by the notation unified conversion unit 122a. The vocabulary index registration unit 123 determines whether a hash value collision has occurred by referring to the entry of the hash table 103 specified by this hash value HP (step S5). As described above, the entry specified by the hash value HP is the entry 103a, in which the vocabulary index of “pat” is already registered. In this case, the vocabulary index registration unit 123 determines that a hash value collision has occurred. The vocabulary index registration unit 123 then follows the collision chain 104 in the database 10 linked to the entry 103a of the hash table 103, as indicated by arrow C1 in FIG. 4, and registers the vocabulary index of “Pat” or “PAT” in the chain 104 (step S7). In step S7, vocabulary index information including the storage position of “Pat” or “PAT” within the registration target character string “Patent” or “PATENT” in the data storage area 101 of the database 10 is registered in the vocabulary index information storage area 102 of the database 10. Pointer information pointing to the vocabulary index information that includes this storage position is attached to the vocabulary index of “Pat” or “PAT” registered in the collision chain 104 (that is, to the character string “Pat” or “PAT” before notation conversion by the notation unified conversion unit 122a).

なお、"Pat"または"PAT"の語彙索引が既にコリジョンチェーン１０４に登録され、したがって当該語彙索引により指し示される語彙索引情報がデータベース１０の語彙索引情報格納領域１０２に既に登録されている場合には、当該語彙索引情報に、上記"Pat"または"PAT"の格納位置の情報が追加される。ここでは、この"Pat"または"PAT"の格納位置の情報が、コリジョンチェーン１０４に登録された語彙索引により指し示される語彙索引情報に追加されるだけの場合も、当該"Pat"または"PAT"の語彙索引が等価的にコリジョンチェーン１０４に登録されたものとして扱う。また、ハッシュテーブル１０３のエントリ１０３ａにリンクしたコリジョンチェーン１０４が存在しない場合には、語彙索引登録部１２３は、当該エントリ１０３ａにリンクしたコリジョンチェーン１０４をデータベース１０内に新たに生成し、当該チェーン１０４に、"Pat"または"PAT"の語彙索引を登録する。このとき語彙索引登録部１２３は、生成されたコリジョンチェーン１０４を指し示すポインタ情報を、ハッシュテーブル１０３のエントリ１０３ａに付加する。また語彙索引登録部１２３は、データベース１０のデータ格納領域１０１における登録対象文字列"Patent"または"PATENT"中の"Pat"または"PAT"の格納位置の情報を含む語彙索引情報を、データベース１０の語彙索引情報格納領域１０２に登録する。 If the vocabulary index of “Pat” or “PAT” is already registered in the collision chain 104, and the vocabulary index information it points to is therefore already registered in the vocabulary index information storage area 102 of the database 10, the storage position of this “Pat” or “PAT” is added to that vocabulary index information. Here, even when the storage position of “Pat” or “PAT” is merely added to the vocabulary index information pointed to by the vocabulary index registered in the collision chain 104, the vocabulary index of “Pat” or “PAT” is treated as having been equivalently registered in the collision chain 104. If no collision chain 104 linked to the entry 103a of the hash table 103 exists, the vocabulary index registration unit 123 newly creates a collision chain 104 linked to the entry 103a in the database 10 and registers the vocabulary index of “Pat” or “PAT” in that chain 104. At this time, the vocabulary index registration unit 123 attaches pointer information pointing to the created collision chain 104 to the entry 103a of the hash table 103. The vocabulary index registration unit 123 also registers, in the vocabulary index information storage area 102 of the database 10, vocabulary index information including the storage position of “Pat” or “PAT” within the registration target character string “Patent” or “PATENT” in the data storage area 101 of the database 10.
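
The registration procedure of steps S1 to S8 can be sketched end to end as follows. This is a simplified in-memory model under stated assumptions, not the embodiment itself: the `Bucket` class, the function `register_string`, the CRC-32 hash, the table size, and the use of integer offsets as "storage positions" are all hypothetical, and the on-disk layout of the areas 101 and 102 is not modeled. The sketch also places unrelated grams that happen to share a hash value in the same chain, an ordinary collision case that the text does not discuss.

```python
import zlib
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, List[int]]  # (gram before conversion, storage positions)


class Bucket:
    def __init__(self) -> None:
        self.entry: Optional[Record] = None  # plays the role of a table entry
        self.chain: List[Record] = []        # plays the role of the collision chain


def register_string(table: Dict[int, Bucket], text: str, base: int,
                    n: int = 3, size: int = 1024) -> None:
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]   # S1
    for offset, gram in enumerate(grams):                       # S2 / S8 loop
        unified = gram.upper()                                  # S3
        h = zlib.crc32(unified.encode("ascii")) % size          # S4
        bucket = table.setdefault(h, Bucket())
        position = base + offset
        if bucket.entry is None:                                # S5: empty entry
            bucket.entry = (gram, [position])                   # S6
        elif bucket.entry[0] == gram:                           # S5: same notation
            bucket.entry[1].append(position)                    # S6: add position
        else:                                                   # S5: collision
            for written, positions in bucket.chain:             # S7: follow chain
                if written == gram:
                    positions.append(position)
                    break
            else:
                bucket.chain.append((gram, [position]))


table: Dict[int, Bucket] = {}
for start, word in [(0, "patent"), (100, "Patent"), (200, "PATENT")]:
    register_string(table, word, start)
```

Registering the three words in this order leaves “pat” in the entry selected by the hash of “PAT”, with “Pat” and “PAT” in that entry's chain, as in the walkthrough above. Each notation keeps its own record and its own position list.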

このように本実施形態においては、表記が異なるグラム（文字列）"pat"，"Pat"及び"PAT"の表記を全てハッシュ値変換部１２２内の表記統一変換部１２２ａによって大文字"PAT"に統一し、その統一された表記"PAT"を対象にハッシュ値変換部１２２内のハッシュ値計算部１２２ｂによるハッシュ値計算が行われるようにした。これにより、表記が異なる"pat"，"Pat"及び"PAT"に対応するハッシュ値を、大文字表記"PAT"のハッシュ値ＨＰに揃えることができる。この結果、表記が異なる"pat"，"Pat"及び"PAT"に対応するハッシュ値をＨＰに揃えたにも拘わらずに、"pat"，"Pat"及び"PAT"に共通の語彙索引を作成せずに、"pat"，"Pat"及び"PAT"個々の語彙索引を作成していながら、これら共通のハッシュ値ＨＰのグラムの集合（"pat"，"Pat"及び"PAT"）に対応する語彙索引の集合を、当該ハッシュ値ＨＰで特定される、ハッシュテーブル１０３のエントリ１０３ａ及び当該エントリ１０３ａにリンクされたコリジョンチェーン１０４に集中して登録することができる。 As described above, in the present embodiment the differently notated grams (character strings) “pat”, “Pat”, and “PAT” are all unified to the uppercase “PAT” by the notation unified conversion unit 122a in the hash value conversion unit 122, and the hash value calculation unit 122b in the hash value conversion unit 122 calculates the hash value of the unified notation “PAT”. The hash values corresponding to the differently notated “pat”, “Pat”, and “PAT” can thereby be aligned to the hash value HP of the uppercase “PAT”. As a result, even though the hash values for “pat”, “Pat”, and “PAT” are all aligned to HP, no common vocabulary index is created for them. Instead, an individual vocabulary index is created for each of “pat”, “Pat”, and “PAT”, and the set of vocabulary indexes corresponding to the set of grams sharing the hash value HP (“pat”, “Pat”, and “PAT”) can be registered in a concentrated manner in the entry 103a of the hash table 103 specified by the hash value HP and in the collision chain 104 linked to the entry 103a.

つまり本実施形態によれば、異表記されたグラム（"pat"，"Pat"及び"PAT"）に対応するハッシュ値をハッシュ値変換部１２２によって揃えることで、当該異表記されたグラムの語彙索引の集合を、データベース１０内の局所領域に集中して配置することができる。これにより、後述するように、異表記された語彙を同一視して検索する場合に、ハッシュテーブル１０３のエントリ１０３ａにリンクされた１つのコリジョンチェーン１０４を辿る（走査する）だけで、対応する全ての語彙索引を高速に検索することが可能となる。また、異表記を同一視しないで検索する場合には、上記揃えられたハッシュ値で特定されるハッシュテーブル１０３のエントリまたは当該エントリにリンクしたコリジョンチェーン１０４の中から、目的の表記の語彙索引だけを検索するだけで良く、異表記を同一視しない場合の検索も簡単に実行できる。 In other words, according to the present embodiment, because the hash value conversion unit 122 aligns the hash values corresponding to the differently notated grams (“pat”, “Pat”, and “PAT”), the set of vocabulary indexes of those grams can be placed together in a local area of the database 10. As described later, when differently notated vocabulary is equated in a search, all the corresponding vocabulary indexes can therefore be found at high speed simply by following (scanning) the single collision chain 104 linked to the entry 103a of the hash table 103. When a search is performed without equating different notations, it suffices to look up only the vocabulary index of the desired notation in the entry of the hash table 103 specified by the aligned hash value or in the collision chain 104 linked to that entry, so a search that does not equate different notations is also easy to perform.

通常、データベース１０が置かれるディスクドライブからのデータ読み出しは、ページ或はブロックと呼ばれる、一定のサイズのデータ単位で行われる。ディスクドライブ（データベース１０）から読み出された一定サイズのデータは、キャッシュメモリに保持されるのが一般的である。したがって、異表記されたグラムの語彙索引の集合に含まれている語彙索引をディスクドライブ（データベース１０）から読み出す際には、当該語彙索引の集合がまとめて読み出されてキャッシュメモリに保持される可能性が高い。この場合、上記集合中の他の語彙索引のキャッシュヒット率が高くなるため、当該他の語彙索引の一層の高速検索が可能となる。 Normally, data is read from the disk drive on which the database 10 resides in fixed-size units called pages or blocks. The fixed-size data read from the disk drive (database 10) is generally held in a cache memory. Therefore, when a vocabulary index belonging to the set of vocabulary indexes of differently notated grams is read from the disk drive (database 10), it is highly likely that the whole set is read together and held in the cache memory. In that case, the cache hit rate for the other vocabulary indexes in the set increases, so those other vocabulary indexes can be searched even faster.

次に、データベース１０を対象に大文字／小文字同一視検索を行う文字列検索処理について、図５及び図６を参照して説明する。なお、図５は文字列検索処理の手順を示すフローチャート、図６は検索対象文字列が"patent"の場合の文字列検索処理を説明するための図である。 Next, a character string search process for performing an uppercase / lowercase equality search for the database 10 will be described with reference to FIGS. 5 and 6. FIG. 5 is a flowchart showing the procedure of the character string search process, and FIG. 6 is a diagram for explaining the character string search process when the search target character string is “patent”.

まず、ユーザの入力操作に応じてアプリケーションから与えられるデータベース検索要求に従い、文字列入力部１３によって検索対象文字列"patent"が入力されたのとする。文字列入力部１３は、この検索対象文字列"patent"をデータベース検索部１４内のＮグラム分割部１４１に渡す。 First, it is assumed that a search target character string “patent” is input by the character string input unit 13 in accordance with a database search request given by an application in accordance with a user input operation. The character string input unit 13 passes this search target character string “patent” to the N-gram dividing unit 141 in the database search unit 14.

Ｎグラム分割部１４１は、文字列入力部１３から渡された検索対象文字列"patent"を、図６において矢印Ｄ１で示すように、Ｎグラム（Ｎ＝３）に分割する（ステップＳ１１）。Ｎグラム分割部１４１によって分割されたグラム列は表記統一変換部１４２ａに渡される。このグラム列は、検索対象文字列が"patent"である本実施形態では、グラム（文字列）"pat"を含む。 The N-gram dividing unit 141 divides the search target character string “patent” passed from the character string input unit 13 into N-grams (N = 3) as indicated by an arrow D1 in FIG. 6 (step S11). The gram string divided by the N-gram dividing unit 141 is passed to the notation unified conversion unit 142a. In the present embodiment in which the search target character string is “patent”, the gram string includes a gram (character string) “pat”.

表記統一変換部１４２ａは、Ｎグラム分割部１４１から渡されたグラム列の中から未処理のグラムを１つ選択する（ステップＳ１２）。表記統一変換部１４２ａは、選択されたグラムについて、そのグラムを構成する文字列（英文字列）の表記を大文字表記に統一するための表記変換を行う（ステップＳ１３）。これにより、変換対象グラムが"pat"の場合には、当該"pat"は大文字表記"PAT"に変換される。明らかにように、変換対象グラムが、"pat"とは表記の異なる"paT"，"Pat"，"PAt"などである場合にも、大文字表記"PAT"に変換される。この表記統一変換部１４２ａによる変換結果、つまり大文字表記に統一されたグラムが、ハッシュ値計算部１４２ｂによるハッシュ値計算の対象となる。 The notation unified conversion unit 142a selects one unprocessed gram from the gram string passed from the N-gram dividing unit 141 (step S12). The notation unified conversion unit 142a performs notation conversion for unifying the notation of the character string (English character string) constituting the gram into upper case notation for the selected gram (step S13). As a result, when the conversion target gram is “pat”, the “pat” is converted to uppercase notation “PAT”. Obviously, even when the conversion target gram is “paT”, “Pat”, “PAt” or the like having a different notation from “pat”, it is converted to uppercase notation “PAT”. The conversion result by the notation unified conversion unit 142a, that is, the gram unified in uppercase notation, is the target of hash value calculation by the hash value calculation unit 142b.

ハッシュ値計算部１４２ｂは、表記統一変換部１４２ａによって"pat"から変換された"PAT"のハッシュ値を計算する。このハッシュ値は、上述した語彙索引登録時の動作から明らかにように、ＨＰとなる。この"PAT"のハッシュ値ＨＰで特定される、ハッシュテーブル１０３のエントリ１０３ａと、当該エントリ１０３ａにリンクされたコリジョンチェーン１０４とには、図６に示すように、"PAT"だけでなく、"pat"，"paT"，"Pat"及び"PAt"のように、大文字／小文字のみ異なる異表記のグラムが、全て同じハッシュ値ＨＰで登録されている。 The hash value calculation unit 142b calculates the hash value of “PAT” converted from “pat” by the unified notation conversion unit 142a. This hash value becomes HP, as is apparent from the operation at the time of lexical index registration described above. As shown in FIG. 6, not only “PAT”, but also “PAT” as well as “PAT” are included in the entry 103a of the hash table 103 specified by the hash value HP of “PAT” and the collision chain 104 linked to the entry 103a. Different notation grams such as “pat”, “paT”, “Pat”, and “PAt” that are different only in uppercase / lowercase are registered with the same hash value HP.

そこで語彙索引検索部１４３は、ハッシュ値計算部１４２ｂによって算出されたハッシュ値ＨＰで特定される、ハッシュテーブル１０３のエントリ１０３ａに、"pat"の語彙索引または"pat"とは大文字／小文字のみ異なるグラムの語彙索引が登録されているかを判定する（ステップＳ１５）。もし、登録されているならば、語彙索引検索部１４３はハッシュテーブル１０３のエントリ１０３ａに登録されている語彙索引に従って、対応する語彙索引情報を取得する（ステップＳ１６）。ここでは、ハッシュテーブル１０３のエントリ１０３ａには、"pat"の語彙索引が登録されている。これにより語彙索引検索部１４３は、この"pat"の語彙索引から、"pat"が格納されているデータベース１０内の全ての格納位置を示す語彙索引情報を取得する。 The vocabulary index search unit 143 therefore determines whether the vocabulary index of “pat”, or of a gram that differs from “pat” only in uppercase/lowercase, is registered in the entry 103a of the hash table 103 specified by the hash value HP calculated by the hash value calculation unit 142b (step S15). If one is registered, the vocabulary index search unit 143 acquires the corresponding vocabulary index information according to the vocabulary index registered in the entry 103a of the hash table 103 (step S16). Here, the vocabulary index of “pat” is registered in the entry 103a of the hash table 103. From this vocabulary index of “pat”, the vocabulary index search unit 143 thus acquires vocabulary index information indicating every position in the database 10 at which “pat” is stored.

語彙索引検索部１４３は、ステップＳ１６を実行すると、ステップＳ１７に進む。このステップＳ１７において、語彙索引検索部１４３は、ハッシュテーブル１０３のエントリ１０３ａにリンクしたコリジョンチェーン１０４に、"pat"または"pat"とは大文字／小文字のみ異なるグラムの語彙索引が登録されているかを判定する。もし、登録されているならば、語彙索引検索部１４３は上記コリジョンチェーン１０４に登録されている語彙索引に従って、対応する語彙索引情報を取得する（ステップＳ１８）。ハッシュテーブル１０３のエントリ１０３ａにリンクしたコリジョンチェーン１０４には、図６に示すように、"pat"とは大文字／小文字のみ異なるグラム（"paT"，"Pat"，"PAt"，"PAT"など）の語彙索引が登録されている。これにより語彙索引検索部１４３は、この"pat"とは大文字／小文字のみ異なるグラムに対応する各語彙索引から、対応する表記の文字列（"paT"，"Pat"，"PAt"，"PAT"など）が格納されているデータベース１０内の全ての格納位置を示す語彙索引情報を取得する。一方、ハッシュテーブル１０３のエントリ１０３ａに目的とする語彙索引が登録されていない場合には（ステップＳ１５）、語彙索引検索部１４３はそのままステップＳ１７に進む。 After executing step S16, the vocabulary index search unit 143 proceeds to step S17. In step S17, the vocabulary index search unit 143 determines whether the vocabulary index of “pat”, or of a gram that differs from “pat” only in uppercase/lowercase, is registered in the collision chain 104 linked to the entry 103a of the hash table 103. If one is registered, the vocabulary index search unit 143 acquires the corresponding vocabulary index information according to the vocabulary index registered in the collision chain 104 (step S18). As shown in FIG. 6, the collision chain 104 linked to the entry 103a of the hash table 103 holds the vocabulary indexes of grams that differ from “pat” only in uppercase/lowercase (“paT”, “Pat”, “PAt”, “PAT”, and so on). From each of these vocabulary indexes, the vocabulary index search unit 143 thus acquires vocabulary index information indicating every position in the database 10 at which the character string of the corresponding notation (“paT”, “Pat”, “PAt”, “PAT”, and so on) is stored. If, on the other hand, the target vocabulary index is not registered in the entry 103a of the hash table 103 (step S15), the vocabulary index search unit 143 proceeds directly to step S17.
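
The lookup of steps S15 to S18 for a single gram can be sketched as below. The data layout (a dictionary with `"entry"` and `"chain"` keys), the function names, the CRC-32 hash, and the sample positions are all assumptions made for illustration. The `equate` flag is likewise hypothetical; it models the choice, described for the search result processing unit 144, between exact matches only and all equatable notations.

```python
import zlib
from typing import Dict, List


def bucket_of(gram: str, size: int = 1024) -> int:
    """Unify to uppercase, then hash (CRC-32 is an illustrative choice)."""
    return zlib.crc32(gram.upper().encode("ascii")) % size


def lookup(table: Dict[int, dict], gram: str, equate: bool = True) -> Dict[str, List[int]]:
    """Return {notation: positions} found in the entry and its collision chain."""
    bucket = table.get(bucket_of(gram))
    if bucket is None:
        return {}
    candidates = []
    if bucket["entry"] is not None:        # S15 / S16: the table entry itself
        candidates.append(bucket["entry"])
    candidates.extend(bucket["chain"])     # S17 / S18: one pass over the chain
    found = {}
    for written, positions in candidates:
        same_gram = written.upper() == gram.upper()  # skip unrelated grams
        if same_gram and (equate or written == gram):
            found[written] = positions
    return found


# Hand-built sample state with made-up positions.
table = {
    bucket_of("pat"): {
        "entry": ("pat", [0, 42]),
        "chain": [("Pat", [7]), ("PAT", [19])],
    }
}
print(lookup(table, "pat"))                # {'pat': [0, 42], 'Pat': [7], 'PAT': [19]}
print(lookup(table, "pat", equate=False))  # {'pat': [0, 42]}
```

Both modes visit the same entry and chain once; they differ only in which records are kept.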

As described above, in this embodiment all grams that differ only in letter case are registered under the same hash value. Consequently, for the gram "pat" in the gram sequence obtained by dividing the search target character string "patent" into N-grams, once the hash value HP common to "pat" and to grams that differ from it only in case, such as "paT", "Pat" and "PAt", has been calculated, a single scan of the entry 103a of the hash table 103 corresponding to HP and of the collision chain 104 linked to that entry yields the vocabulary index information not only for "pat" but also for "paT", "Pat", "PAt" and the like. Conventionally, a search for "pat" that identifies uppercase and lowercase requires eight scans. It will thus be understood that this embodiment improves search performance when uppercase and lowercase are identified. In other words, according to this embodiment, when the vocabulary index of a gram containing notations to be identified is registered in the hash table 103 or the collision chain 104, the notation of the gram is unified so that the hash values coincide. As a result, the sequence of referring to the hash table 103 and then scanning the collision chain 104 needs to be performed only once, which improves search performance. Moreover, within the database 10 (the disk drive on which it resides), the vocabulary indexes of "paT", "PAt" and "PAT", which differ from "pat" only in case, can be placed close together, as in the collision chain 104 of FIG. 6, so that searching these vocabulary indexes and acquiring the corresponding information can also be performed at high speed.
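The figure of eight scans is consistent with a three-letter alphabetic gram having 2^3 = 8 case spellings. The short sketch below counts the lookups under two assumptions made only for illustration: that a conventional index hashes each spelling separately (one lookup per spelling), and that the unified scheme hashes the uppercase form. The helper name `case_variants` is hypothetical.

```python
from itertools import product


def case_variants(gram):
    """Return every upper/lower case spelling of an alphabetic gram."""
    choices = [(c.lower(), c.upper()) for c in gram]
    return ["".join(combo) for combo in product(*choices)]


variants = case_variants("pat")

# Assumed conventional scheme: each spelling has its own hash key.
conventional_lookups = len(set(variants))

# Unified scheme: every spelling maps to the same uppercase key.
unified_lookups = len({v.upper() for v in variants})
```

Here `conventional_lookups` is 8 and `unified_lookups` is 1; for a gram of length N consisting only of letters the conventional count grows as 2^N, while the unified count stays at one.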

When step S18 has been executed by the vocabulary index search unit 143, or when the vocabulary index search unit 143 determines that neither the vocabulary index of "pat" nor that of any gram differing from "pat" only in letter case is registered in the collision chain 104 linked to the entry 103a of the hash table 103 (steps S15, S17), the notation unified conversion unit 142a determines whether an unprocessed gram remains in the gram sequence produced by the N-gram dividing unit 141 (step S19). If an unprocessed gram remains, the notation unified conversion unit 142a selects one unprocessed gram (step S12) and performs notation conversion to unify the notation of the character string constituting that gram to uppercase (step S13). Thereafter, the same operation as described for "pat" is performed. When, through repetition of this operation, no unprocessed gram remains in the gram sequence, the search result processing unit 144 in the database search unit 14 merges the vocabulary index information acquired by the vocabulary index search unit 143 for all the grams (step S20). The search result processing unit 144 can thereby carry out the search for the character string "patent" in the database 10.
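The per-gram loop (steps S12 to S19) and the final merge (step S20) can be sketched as below. Several details are assumptions made for this illustration and are not taken from the specification: grams are overlapping trigrams, the merge is an intersection of candidate start positions, and a Python dict keyed by the uppercase gram stands in for the hash table 103 together with its collision chains. The function names are hypothetical.

```python
N = 3  # gram length; assumed for this sketch


def ngrams(text):
    """Overlapping N-grams of `text`, in order of appearance."""
    return [text[i:i + N] for i in range(len(text) - N + 1)]


def build_index(text):
    """Map unified (uppercase) gram -> {original notation: [start positions]}."""
    index = {}
    for pos, gram in enumerate(ngrams(text)):
        index.setdefault(gram.upper(), {}).setdefault(gram, []).append(pos)
    return index


def search_identified(index, query):
    """Return start positions of `query`, identifying upper and lower case."""
    candidates = None
    for offset, gram in enumerate(ngrams(query)):        # loop of S12 / S19
        group = index.get(gram.upper(), {})              # S13 to S18 in one lookup
        starts = {p - offset for plist in group.values() for p in plist}
        # S20: merge by keeping only starts supported by every gram so far.
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates or [])


text = "A Patent is not a PATENT nor a patent."
index = build_index(text)
hits = search_identified(index, "patent")
```

In this example `hits` holds the start offsets of "Patent", "PATENT" and "patent" in `text`, each found with one lookup per gram of the query.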

Next, a search that does not identify uppercase and lowercase (different notations) is described, taking as an example the search for the vocabulary index corresponding to the character string "pat" divided from the search target character string "patent". First, the notation unified conversion unit 142a converts "pat" to "PAT", and the hash value calculation unit 142b calculates the hash value HP for "PAT". The vocabulary index search unit 143 then only has to look, in the entry 103a of the hash table 103 specified by the hash value HP and in the collision chain 104 linked to that entry, for the single vocabulary index corresponding to the character string "pat" as it was before conversion by the notation unified conversion unit 142a, and acquire only the vocabulary index information pointed to by that vocabulary index. Unlike the second prior art, which must refer to the leaf information of every character string in the database that can be identified with "pat" and compare the character information in each leaf with "pat", this embodiment can therefore perform a search that does not identify different notations efficiently. Furthermore, in this embodiment, the storage position information of every occurrence of "pat" in the database 10 only needs to be included in the vocabulary index information pointed to by the vocabulary index of "pat", of which exactly one exists in the system. Unlike the second prior art, which must create leaf information for each occurrence of "pat" and include both the character information and the storage position information of "pat" in that leaf information, this markedly reduces the amount of information.
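The non-identifying search reuses the same bucket and differs only in the comparison it applies. The sketch below assumes a bucket laid out as a list of (notation, positions) pairs; the positions are made-up illustrative data, and the names `unified_key`, `lookup_exact` and `lookup_identified` are hypothetical. A dict keyed by the unified notation stands in for the hash table 103 and the collision chain 104.

```python
def unified_key(gram):
    """Notation unification used for hashing (uppercase, as in the embodiment)."""
    return gram.upper()


# One bucket shared by all case spellings of "pat"; positions are illustrative.
index = {
    unified_key("pat"): [
        ("pat", [0, 40]),   # entry
        ("paT", [12]),      # chain elements follow
        ("Pat", [7]),
        ("PAT", [25]),
    ]
}


def lookup_exact(index, gram):
    """Non-identifying search: match the notation as it was before conversion."""
    for stored, positions in index.get(unified_key(gram), []):
        if stored == gram:
            return positions
    return []


def lookup_identified(index, gram):
    """Identifying search: every notation in the bucket is a hit."""
    return dict(index.get(unified_key(gram), []))
```

For example, `lookup_exact(index, "pat")` returns only the positions recorded for "pat", while `lookup_identified(index, "pat")` returns the position lists of all four notations, so one index serves both kinds of search.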

Incidentally, to speed up identification search by aligning hash values, it is also possible to unify the multiple vocabulary items to be identified into a single item outside the database search system of FIG. 1 (on the application side) before registering them. With that approach, however, all the vocabulary indexes are registered as the same item, so a search that does not identify them becomes impossible. This embodiment has no such problem: both kinds of search, with and without identification, are possible.

The above embodiment assumes that the characters of a string to be registered or searched are alphabetic and limited to half-width characters (as opposed to full-width characters), and applies a configuration in which the notation of the gram used for hash calculation is unified to uppercase (that is, a configuration that aligns the hash values of vocabulary differing only in letter case). The notation of the gram used for hash calculation may instead be unified to lowercase. If both full-width and half-width alphabetic characters may occur in a string to be registered or searched, the notation of the gram used for hash calculation may be unified to either uppercase or lowercase and to either full-width or half-width. This configuration is also applicable to identifying differences between large and small katakana (such as ソフトウェア and ソフトウエア). If both hiragana and katakana may occur in a string to be registered or searched, the notation of the gram used for hash calculation may be unified to either hiragana or katakana. Likewise, to identify differences in kanji glyph form such as 斉, 斎, 齊 and 齋, the glyph may be unified to one of them, for example 斉. For this purpose, different notation unified dictionaries 122c and 142c (see FIG. 1), which map 斉, 斎, 齊 and 齋 to the unified notation 斉, may be provided in the hash value conversion units 122 and 142 of FIG. 1. The notation unified conversion units 122a and 142a then look up 斉, 斎, 齊 or 齋 in the different notation unified dictionaries 122c and 142c and output the associated 斉 as the notation conversion result. The different notation unified dictionaries 122c and 142c may be stored in the database 10.
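The notation unifications listed above can be combined into one conversion function, sketched below. This is an illustration under stated assumptions, not the conversion defined by the specification: Unicode NFKC normalization from Python's standard `unicodedata` module is used here to fold full-width letters to half-width, the unified forms (uppercase, katakana, large kana, 斉) are arbitrary choices among those the text permits, and `VARIANT_DICT` is a tiny hypothetical stand-in for the different notation unified dictionaries 122c and 142c.

```python
import unicodedata

# Hypothetical stand-in for the different notation unified dictionary.
VARIANT_DICT = {"斎": "斉", "齊": "斉", "齋": "斉"}

# Small katakana mapped to their large counterparts.
SMALL_TO_LARGE_KANA = str.maketrans("ァィゥェォッャュョ", "アイウエオツヤユヨ")


def hiragana_to_katakana(text):
    """Shift hiragana (U+3041..U+3096) to the corresponding katakana."""
    return "".join(chr(ord(c) + 0x60) if "ぁ" <= c <= "ゖ" else c for c in text)


def unify_notation(gram):
    """Convert a gram to the single notation used for hash calculation."""
    text = unicodedata.normalize("NFKC", gram)   # full-width letters -> half-width
    text = text.upper()                          # lowercase -> uppercase
    text = hiragana_to_katakana(text)            # hiragana -> katakana
    text = text.translate(SMALL_TO_LARGE_KANA)   # small kana -> large kana
    return "".join(VARIANT_DICT.get(c, c) for c in text)  # glyph variants


examples = {
    "ｐａｔ": unify_notation("ｐａｔ"),
    "ソフトウェア": unify_notation("ソフトウェア"),
    "そふと": unify_notation("そふと"),
    "齋藤": unify_notation("齋藤"),
}
```

Hashing the output of `unify_notation` rather than the raw gram places every notation that maps to the same unified form in the same hash table entry or collision chain, while the original notation can still be stored in the vocabulary index for non-identifying searches.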

Thus, when it is expected that several vocabulary items are likely to be searched for together with a given item, search performance can be improved by aligning their hash values and grouping them.

The above embodiment also assumes that the database search system of FIG. 1 has a database registration function that includes registration of text data and registration of vocabulary indexes. However, the database search system need not have a database registration function, in which case the character string input unit 11 and the database registration unit 12 are unnecessary. In other words, a database registration system having the database registration function and a database search system having the database search function may be separate. In that case, the database registration system need only include at least the database registration unit 12.

The present invention is not limited to the above embodiment as it stands; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment.

FIG. 1 is a block diagram showing the configuration of a database search system having a database registration function according to an embodiment of the present invention.
FIG. 2 is a diagram showing the relationship between the character type of a character string subject to notation conversion and the notation after conversion.
FIG. 3 is a flowchart showing the procedure of the vocabulary index registration process in the embodiment.
FIG. 4 is a diagram for explaining the vocabulary index registration process for the differently notated character strings "patent", "Patent" and "PATENT" in the embodiment.
FIG. 5 is a flowchart showing the procedure of the character string search process in the embodiment.
FIG. 6 is a diagram for explaining the character string search process in the embodiment when the search target character string is "patent".
FIG. 7 is a diagram for explaining the conventional character string search process when the search target character string is "patent".

Explanation of symbols

10: database; 11, 13: character string input unit; 12: database registration unit; 14: database search unit; 101: data storage area; 102: vocabulary index information storage area; 103: hash table; 104: collision chain (list); 121, 141: N-gram dividing unit; 122, 142: hash value conversion unit; 122a, 142a: notation unified conversion unit; 122b, 142b: hash value calculation unit; 122c, 142c: different notation unified dictionary; 123: vocabulary index registration unit; 143: vocabulary index search unit; 144: search result processing unit.

Claims

A database registration system for registering text data in a database, comprising:
dividing means for dividing, when text data is registered in the database, a character string contained in the text data into N-grams as a registration target character string;
hash value conversion means for converting each gram divided by the dividing means into a hash value that, on the assumption that the notation of the character string constituting the gram has a plurality of notations that can be subject to different notation identification, takes the same value for every one of the plurality of notations; and
vocabulary index registration means for registering, for each gram divided by the dividing means, a vocabulary index of the character string constituting the gram in the entry of a hash table specified by the hash value converted by the hash value conversion means or in a list linked to that entry.

The database registration system according to claim 1, wherein the hash value conversion means includes:
notation unified conversion means for performing, for each gram divided by the dividing means and on the assumption that the notation of the character string constituting the gram has a plurality of notations that can be subject to different notation identification, notation conversion that unifies the notation of the character string constituting the gram to a predetermined one of the plurality of notations; and
hash value calculation means for calculating a hash value of the character string whose notation has been converted by the notation unified conversion means.

The database registration system according to claim 2, further comprising a notation conversion dictionary that holds, for each vocabulary item, a representative notation of that vocabulary item in association with the item,
wherein the notation unified conversion means performs, for each gram divided by the dividing means, notation conversion of the character string constituting the gram by referring to the notation conversion dictionary with that character string.

The database registration system according to claim 1, wherein the vocabulary index registration means includes:
collision determination means for determining whether a hash value collision exists, according to whether a vocabulary index of a character string whose notation differs from that of the character string converted into the hash value by the hash value conversion means is already registered in the entry of the hash table specified by that hash value;
first registration means for registering, when the collision determination means determines that there is no hash value collision, the vocabulary index of the character string converted into the hash value by the hash value conversion means in the entry of the hash table specified by that hash value; and
second registration means for registering, when the collision determination means determines that there is a hash value collision, the vocabulary index of the character string converted into the hash value by the hash value conversion means in the list linked to the entry of the hash table specified by that hash value.

A database search system for searching for a character string in text data stored in the database by using the vocabulary indexes registered by the database registration system according to claim 1, comprising:
dividing means for dividing a search target character string into N-grams;
hash value conversion means for converting each gram divided by the dividing means into a hash value that, on the assumption that the notation of the character string constituting the gram has a plurality of notations that can be subject to different notation identification, takes the same value for every one of the plurality of notations;
vocabulary index search means for searching, for each gram divided by the dividing means, for the vocabulary index of the character string constituting the gram by scanning the entry of the hash table specified by the hash value converted by the hash value conversion means or the list linked to that entry; and
search result processing means for acquiring, based on the vocabulary index search result of the vocabulary index search means, either only character strings that exactly match the search target character string or all character strings that can be identified with the search target character string.

A vocabulary index registration method for different notation identification search, comprising the steps of:
dividing, when text data is registered in a database, a character string contained in the text data into N-grams as a registration target character string;
converting each of the divided grams into a hash value that, on the assumption that the notation of the character string constituting the gram has a plurality of notations that can be subject to different notation identification, takes the same value for every one of the plurality of notations; and
registering, for each of the divided grams, a vocabulary index of the character string constituting the gram in the entry of a hash table specified by the hash value converted from that character string or in a list linked to that entry.

A different notation identification search method for performing a different notation identification search by using the vocabulary indexes registered by the vocabulary index registration method according to claim 6, comprising the steps of:
dividing a search target character string into N-grams;
converting each of the divided grams into a hash value that, on the assumption that the notation of the character string constituting the gram has a plurality of notations that can be subject to different notation identification, takes the same value for every one of the plurality of notations;
searching for the corresponding vocabulary index by scanning the entry of the hash table specified by the converted hash value or the list linked to that entry; and
acquiring, based on the search result for the vocabulary index, either only character strings that exactly match the search target character string or all character strings that can be identified with the search target character string.