JP2008026963A

JP2008026963A - Retrieval processor and program

Info

Publication number: JP2008026963A
Application number: JP2006195773A
Authority: JP
Inventors: Atsuko Eguchi; 敦子江口
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2006-07-18
Filing date: 2006-07-18
Publication date: 2008-02-07
Anticipated expiration: 2026-07-18
Also published as: JP4439496B2

Abstract

PROBLEM TO BE SOLVED: To automatically create indexes for words and phrases that are frequently used in retrieval.

SOLUTION: When a given retrieval expression includes a character string function designating a character string, a partial matching type retrieval part 55 decides whether a phrase index that uses the designated character string as its phrase (a target phrase index) is present. When no target phrase index is present, the partial matching type retrieval part 55 acquires the positional information of the character string by using an N-gram index stored in an index part 422, generates a phrase index that uses the character string as its phrase, and stores the generated phrase index in a dictionary table 110 in association with the acquired positional information. When a target phrase index is present, the partial matching type retrieval part 55 acquires the positional information of the character string by using the target phrase index. A document retrieval part 56 retrieves a structured document including the character string from a document DB 42 on the basis of the acquired positional information.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a search processing apparatus and a program suitable for using an index to search a document database, in which a plurality of structured documents are stored, for data that matches a search condition.

Search processing apparatuses that use an index to search a document database storing a plurality of structured documents for data matching a search condition have long been developed. When a document containing text data is registered in the database of such an apparatus, the data to be registered is generally indexed. The N-gram method is a known indexing technique. As described as background art in Patent Document 1, for example, the N-gram method treats all the characters in a document as contiguous character strings of a fixed length N (N-grams) and performs index registration and search on them.

Index registration in the N-gram method (N-gram index registration) proceeds as follows. First, character strings of length N (N-grams) are cut out in sequence, mechanically shifting one character at a time from the beginning of the document to be registered in the database. For convenience, each such string of length N is called a "vocabulary item". Unlike vocabulary in the ordinary sense, the items cut out by the N-gram method include strings that carry no meaning. Because strings of length N are cut out while shifting one character at a time, the partial character strings contained in the document are extracted exhaustively. Every vocabulary item cut out in this way is subject to index registration. Next, position information comprising the position of the document in the database and the position at which each vocabulary item appears in that document is registered in association with the item. A value of N appropriate to the language and the type of characters is chosen. At search time, the search phrase (character string) given as a search condition is divided into vocabulary items, and the N-gram index is searched for each item. This yields the position information (document position and item appearance position) registered in association with each matching index.
[Patent Document 1] JP-A-2005-234930 (paragraph 0002)
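The registration and search steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the cited document: the function names, the `(doc_id, offset)` form of the position information, and the choice N = 2 are all assumptions made for the example.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """Map each N-gram to the list of (doc_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Slide a window of length n over the text, one character at a time.
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].append((doc_id, offset))
    return index

def search(index, phrase, n=2):
    """Return the sorted (doc_id, offset) start positions of every occurrence of phrase."""
    if len(phrase) < n:
        raise ValueError("phrase must contain at least n characters")
    grams = [phrase[i:i + n] for i in range(len(phrase) - n + 1)]
    # Start from the positions of the first N-gram, then keep only those
    # where each following N-gram appears at the expected shifted offset.
    candidates = set(index.get(grams[0], []))
    for shift, gram in enumerate(grams[1:], start=1):
        positions = set(index.get(gram, []))
        candidates = {(d, o) for (d, o) in candidates if (d, o + shift) in positions}
    return sorted(candidates)
```

Every search has to look up and intersect one position list per N-gram of the phrase, which is the per-item retrieval cost that the description goes on to discuss.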

As described above, a search processing apparatus that applies the N-gram method uses simple index registration and search algorithms, and therefore has the advantage that the phrases contained in the documents registered in the database can be searched completely, without omission. On the other hand, compared with a search processing apparatus that has a dictionary-based word index (phrase index), such an apparatus must retrieve an index for each vocabulary item, which increases the load and makes search processing take longer. The same problem arises in a search processing apparatus whose database holds structured documents (that is, hierarchical data) such as documents in XML (Extensible Markup Language) format (XML documents).

The present invention has been made in view of the above circumstances, and its object is to provide a search processing apparatus and a program that can speed up search processing by automatically generating indexes for phrases that are frequently used in searches.

According to one aspect of the present invention, there is provided a search processing apparatus that searches a document database storing a plurality of structured documents for a structured document matching the search condition indicated by a given search expression. The search processing apparatus comprises: N-gram index storage means for storing an N-gram index that is generated by dividing each structured document stored in the document database into N-gram partial character strings and that is associated with position information indicating the positions of those partial character strings; phrase index storage means for storing a phrase index associated with position information indicating the position of a phrase; determination means for determining, when the search expression includes a character string function that specifies a character string, whether a phrase index having that character string as its phrase exists in the phrase index storage means; first position acquisition means for acquiring the position information of the character string by using the N-gram index when no such phrase index exists; phrase index generation means for generating, when no such phrase index exists, a phrase index having the character string as its phrase and storing the generated phrase index in the phrase index storage means in association with the position information acquired by the first position acquisition means; second position acquisition means for acquiring the position information of the character string by using the phrase index when such a phrase index exists; and document search means for searching the document database, on the basis of the position information acquired by the first or second position acquisition means, for a structured document that matches the search condition indicated by the search expression and contains the character string specified by the character string function.

According to the present invention, indexes for phrases frequently used in searches are generated automatically. Consequently, when a search expression includes a character string function that specifies a character string and a phrase index having that character string as its phrase exists, search processing can be speeded up by using that phrase index.

Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is a block diagram showing the hardware configuration of a client-server system that includes a search processing apparatus according to an embodiment of the present invention. The client-server system consists mainly of a database server (database server computer) 10 and a plurality of client terminals, which include a client terminal 20. Client software that uses the database server 10, for example a browser, runs on the client terminal 20. The client terminals, including the client terminal 20, are connected to the database server 10 via a network 30 such as a local area network (LAN). Client terminals other than the client terminal 20 are omitted from FIG. 1.

The database server 10 includes a memory 11 such as a main memory. The database server 10 is connected to an external storage device 40 such as a hard disk drive. The external storage device 40 stores a search processing program 41 that the database server 10 uses for search processing. The database server 10 and the external storage device 40 constitute a search processing device 50.

FIG. 2 is a block diagram showing mainly the functional configuration of the search processing device 50. The search processing device 50 includes an interface 51, an analysis unit 52, a structure search unit 53, an exact match type search unit 54, a partial match type search unit 55, a document search unit 56, and a result generation unit 57. In this embodiment, the units 51 to 57 are realized when the database server 10 of FIG. 1 reads the search processing program 41 stored in the external storage device 40 into the memory 11 and executes it. The program 41 can be stored in advance on a computer-readable storage medium and distributed. The program 41 may also be downloaded to the database server 10 via the network 30.

The search processing device 50 also includes the memory 11 and the external storage device 40. In addition to the search processing program 41 shown in FIG. 1, the external storage device 40 stores a document database (document DB) 42 and a dictionary file 43. The document DB 42 includes a document unit 421 and an index unit 422. The document unit 421 stores a plurality of structured documents (structured document data), for example XML documents (XML document data). For each vocabulary item (N-gram) contained in the XML documents stored in the document DB 42, the index unit 422 stores an index for that item (an N-gram index). Each N-gram index (N-gram index data) is linked to position information about the corresponding vocabulary item. This position information represents the positions within the document DB 42 of all XML documents that contain the item (document positions) and all positions at which the item appears in those XML documents (item appearance positions). For each document structure contained in the XML documents stored in the document DB 42, the index unit 422 also stores an index for that structure (a structure index). Each structure index (structure index data) is linked to information (position information) representing the positions of the nodes that have the corresponding structure.

The dictionary file 43 has entries (phrase entries) in which character strings that are contained in the XML documents stored in the document DB 42 and that have been specified by a character string function, described later, are registered as phrases. Each phrase entry stores a phrase (the character string constituting the phrase) in association with position information about that phrase. The position information indicates the position of the associated phrase within the document DB 42 (document position and phrase appearance position). Because the phrase registered in each phrase entry is a character string specified by a character string function, as noted above, it differs from the vocabulary items of the N-gram index stored in the index unit 422.

The memory 11 stores a dictionary table 110. FIG. 3 shows an example of the data structure of the dictionary table 110. Like the dictionary file 43, the dictionary table 110 has entries (phrase entries) in which character strings that are contained in the XML documents stored in the document DB 42 and that have been specified by a character string function are registered as phrases.

Each phrase entry stores a phrase, a reference count, and position information in association with one another. The reference count indicates the number of times the associated phrase has been referenced. The position information indicates the position of the associated phrase within the document DB 42 (document position and phrase appearance position). In contrast to the N-gram indexes stored in the index unit 422, each phrase entry can be regarded as a phrase index. The reference count and position information associated with a phrase need not be stored in the same phrase entry as the phrase itself, provided that they can be reached from the phrase.
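As an illustration of the two entry layouts just described, the following Python sketch models a dictionary table entry and the corresponding dictionary file record. The class name, field names, and the `(document position, appearance position)` tuple form are assumptions introduced for this example; the source specifies only which pieces of information are associated with each phrase.

```python
from dataclasses import dataclass, field

@dataclass
class PhraseEntry:
    """One in-memory dictionary table entry (hypothetical field names)."""
    phrase: str
    ref_count: int = 0
    # Each position is assumed to be (document position, phrase appearance position).
    positions: list[tuple[int, int]] = field(default_factory=list)

def to_file_record(entry):
    """Build a dictionary file record: phrase and positions, without the reference count."""
    return {"phrase": entry.phrase, "positions": list(entry.positions)}
```

Keeping the reference count out of the file record mirrors the description, in which only the in-memory table tracks how often a phrase is referenced.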

The character strings stored as phrases in the dictionary file 43 and the dictionary table 110 are limited to character strings specified in search expressions that were actually used in structured document queries from client terminals. In this embodiment, such search expressions are those that include a character string function specifying character string extraction.

Referring again to FIG. 2, the interface 51 accepts a structured document query (structured document query command) from a client terminal such as the client terminal 20. The interface 51 also returns the result of the query to the client terminal. In this embodiment, structured document queries are written in XQuery, a query language specified by the World Wide Web (WWW) Consortium. XQuery provides path expressions for narrowing down the hierarchical structure of an XML document, together with operators, functions, and the like for obtaining the target data.

The analysis unit 52 analyzes the search expression (XQuery expression) used in the structured document query accepted by the interface 51 and, according to the result of the analysis, operates the structure search unit 53, the exact match type search unit 54, or the partial match type search unit 55. The structure search unit 53 traces the hierarchy of nodes of the XML documents along the path specified in the structured document query and acquires, from the structure index in the index unit 422, position information identifying the nodes under that path. In other words, the structure search unit 53 performs a structure search that narrows down the data to be searched on the basis of the structure of the specified path.

The exact match type search unit 54 performs exact match type search processing to acquire data in which the text element or attribute value of a specific tag matches the value specified in the structured document query. In this processing, comparison is performed on data positions having a predetermined structure. For this purpose, a character string index for string match comparison is set in the index unit 422.

The partial match type search unit 55 performs partial match type search processing to acquire data in which the text element or attribute value of a specific tag contains the value specified in the structured document query (the designated character string). FIG. 4 shows the functional configuration of the partial match type search unit 55. The partial match type search unit 55 includes a determination unit 550, a position acquisition unit (first position acquisition unit) 551, a position acquisition unit (second position acquisition unit) 552, a reference count management unit 553, an entry generation unit 554, and a load unit 555.

The determination unit 550 determines whether a phrase entry corresponding to the designated character string (an entry for the designated character string) exists in the dictionary table 110 or in the dictionary file 43. According to the result, the determination unit 550 decides which of the position acquisition units 551 and 552 is to acquire the position information of the designated character string. When no entry for the designated character string exists in either the dictionary table 110 or the dictionary file 43, the determination unit 550 operates the position acquisition unit 551. When an entry for the designated character string exists in the dictionary table 110 or the dictionary file 43, the determination unit 550 operates the position acquisition unit 552.

When no entry for the designated character string exists in either the dictionary table 110 or the dictionary file 43, the position acquisition unit 551 acquires the position information of the data containing the designated character string by using the index unit 422 (N-gram index). When an entry for the designated character string exists in the dictionary table 110, the position acquisition unit 552 acquires the position information of the designated character string from the dictionary table 110. When an entry for the designated character string exists only in the dictionary file 43, the position acquisition unit 552 acquires the position information from the dictionary file 43.

The reference count management unit 553 manages the reference counts in the phrase entries of the dictionary table 110. When the position information of data containing the designated character string is acquired on the basis of the dictionary table 110, the reference count management unit 553 increments by one the reference count in the phrase entry corresponding to the designated character string (the entry for the designated character string).

When the reference count in the entry for the designated character string in the dictionary table 110 exceeds a predetermined threshold and no entry for the designated character string exists in the dictionary file 43, the entry generation unit 554 generates an entry for the designated character string and adds it to the dictionary file 43. The load unit 555 loads the information of an entry for the designated character string that exists in the dictionary file 43 into the dictionary table 110.

Next, the operation of this embodiment will be described with reference to the flowcharts of FIGS. 5A and 5B, taking the processing executed by the partial match type search unit 55 as an example.
Assume that a structured document query has been sent from the client terminal 20 to the search processing device 50 via the network 30. On accepting this query, the interface 51 in the search processing device 50 passes it to the analysis unit 52. The analysis unit 52 analyzes the search expression used in the query to decide which of the structure search unit 53, the exact match type search unit 54, and the partial match type search unit 55 to operate. Here, the search expression is assumed to be an XQuery expression. Among the functions that can appear in an XQuery expression are the string functions, that is, functions that operate on character strings (character string functions). The character string functions defined include the substring function, which extracts from a specified character string the partial character string that satisfies specified conditions, and the concat function, which concatenates specified character strings. The character string functions used for partial match type search include the contains, starts-with, and ends-with functions.

In this embodiment, assume that the search expression analyzed by the analysis unit 52 is an XQuery expression that includes a character string function used for partial match type search, for example the contains function contains(./Name/text(),"ルネッサンス"), as in
/Catalog/Book[contains(./Name/text(),"ルネッサンス")]
where "ルネッサンス" means "Renaissance". This XQuery expression specifies a search, among the nodes whose structure matches "/Catalog/Book" (Book nodes), for the nodes (books) whose title (Name) contains the character string "ルネッサンス".
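To make the meaning of this query concrete, the following Python sketch evaluates the same condition against a small in-memory XML document. The sample catalog and the function name `books_containing` are hypothetical, and the filter is written in plain Python rather than as an XPath predicate because the standard library's `xml.etree.ElementTree` supports only a limited XPath subset. It performs a direct scan and does not use any index.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: two books, one of which has "ルネッサンス" in its title.
CATALOG = """<Catalog>
  <Book><Name>ルネッサンスの美術</Name></Book>
  <Book><Name>近代建築史</Name></Book>
</Catalog>"""

def books_containing(xml_text, needle):
    """Return the /Catalog/Book elements having a Name child whose text contains needle."""
    root = ET.fromstring(xml_text)
    if root.tag != "Catalog":
        return []
    return [
        book
        for book in root.findall("Book")
        if any(needle in (name.text or "") for name in book.findall("Name"))
    ]
```

A scan like this touches every Book node on every query, which is the cost that the index-based processing described in this embodiment is meant to avoid.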

When the search expression (XQuery expression) includes a character string function used for partial match type search (here, the contains function), the analysis unit 52 judges that partial match type search processing is required and operates the structure search unit 53 and the partial match type search unit 55. When exact match type search processing is required, the analysis unit 52 operates the structure search unit 53 and the exact match type search unit 54.

Following the path "/Catalog/Book" specified by the search expression (XQuery expression) analyzed by the analysis unit 52, the structure search unit 53 acquires, from the structure index in the index unit 422, position information identifying the nodes under that path. The position information acquired by the structure search unit 53 is passed to the document search unit 56.

Meanwhile, in the partial match type search unit 55, the determination unit 550 determines whether the search expression (XQuery expression) analyzed by the analysis unit 52 includes a character string function that specifies character string extraction (step S1). If the determination in step S1 is YES, the determination unit 550 executes step S2. In step S2, the determination unit 550 looks up the dictionary table 110 using the character string specified by the character string function in the search expression (the designated character string), here "ルネッサンス" ("Renaissance"). The determination unit 550 then determines whether an entry storing a phrase that matches the designated character string (an entry for the designated character string) exists in the dictionary table 110.

If an entry for the designated character string exists in the dictionary table 110, the determination unit 550 notifies the position acquisition unit 552 and the reference count management unit 553 of the position of that entry in the dictionary table 110. The reference count management unit 553 then increments by one the reference count set in the entry for the designated character string in the dictionary table 110 (step S3). Meanwhile, the position acquisition unit 552 acquires, from the entry for the designated character string in the dictionary table 110, the position information set in that entry, that is, the position information of the designated character string (step S4).

Clearly, the processing of step S4, which acquires the position information of the designated character string from the dictionary table 110, can be executed faster than the processing of steps S12 to S14 described later, which acquires the position information of the designated character string by searching the N-gram index in the index unit 422 for every vocabulary item (N-gram) that makes up the designated character string and acquiring the position information of each item. Step S4 may be executed before step S3.

After incrementing the reference count set in the entry for the designated character string, the reference count management unit 553 compares the incremented count with a threshold (a reference number of times) to determine whether the count exceeds the threshold (step S5). If the incremented reference count does not exceed the threshold, the processing in the partial match type search unit 55 ends. At this point, the position information acquired by the position acquisition unit 552 is passed from the partial match type search unit 55 to the document search unit 56.

The document search unit 56 merges the position information passed from the structure search unit 53 and from the partial match type search unit 55, and retrieves from the document DB 42 the documents designated by the matching position information, as documents that match the search expression used in the structured document query received by the interface 51.

On the other hand, if the incremented reference count exceeds the threshold, the determination unit 550 next determines whether an entry for the designated character string exists in the dictionary file 43 (step S6). If such an entry exists in the dictionary file 43, the processing in the partial match type search unit 55 ends.

If, on the other hand, no entry for the designated character string exists in the dictionary file 43, the determination unit 550 activates the entry generation unit 554. The entry generation unit 554 then treats the entry for the designated character string in the dictionary table 110 as frequently referenced and, based on that entry, adds an entry for the designated character string to the dictionary file 43 (step S7). Here, an entry containing the information of the dictionary table 110 entry except for the reference count is generated and added to the dictionary file 43.

Step S7 makes it possible to cope with the case where the search processing device 50 is powered off and the entry information of the dictionary table 110 stored in the memory 11 is lost. That is, as will be apparent from step S10 described later, loading the entry information in the dictionary file 43 into the dictionary table 110 after the search processing device 50 is restarted allows that entry information (that is, information on frequently referenced entries) to be reused. When the process of step S7 has been executed, the processing in the partial match type search unit 55 ends.
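The promotion of a frequently referenced entry to nonvolatile storage (steps S6 and S7) can be sketched as follows. The JSON file layout, the function name `promote_to_file`, and the dict-based entry format are assumptions made for illustration; the source does not specify how the dictionary file 43 is encoded.

```python
import json
import os


def promote_to_file(table, phrase, path):
    """Steps S6 and S7: persist a hot entry unless the file already holds it."""
    file_entries = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            file_entries = json.load(f)
    if phrase in file_entries:            # step S6: already in the dictionary file
        return False
    # step S7: store everything except the reference count
    file_entries[phrase] = {"positions": table[phrase]["positions"]}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(file_entries, f, ensure_ascii=False)
    return True
```

Because the reference count is left out of the persisted entry, an entry reloaded after a restart starts counting again from its initial value, which matches the behavior described for steps S10 and S11.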

Next, the case where it is determined in step S2 that no entry for the designated character string exists in the dictionary table 110 will be described. When the determination in step S2 is NO, the determination unit 550 determines whether an entry for the designated character string exists in the dictionary file 43 (step S8).

If an entry for the designated character string exists in the dictionary file 43, the determination unit 550 notifies the position acquisition unit 552 and the load unit 555 of the position of that entry in the dictionary file 43. The position acquisition unit 552 then acquires, from the entry for the designated character string in the dictionary file 43, the position information set in that entry, that is, the position information of the designated character string (step S9). As is apparent, acquiring the position information of the designated character string by using the dictionary file 43 is, like acquiring it by using the dictionary table 110, faster than searching the N-gram index of the index unit 422 to acquire that position information.

Meanwhile, once the position information of the designated character string has been acquired from its entry in the dictionary file 43, the load unit 555 adds one entry to the dictionary table 110 and loads the information of that dictionary file entry into the new entry (step S10). The reference count management unit 553 then additionally sets a reference count with the value "1" in the entry (the entry for the designated character string) added to the dictionary table 110 by the load unit 555 (step S11). Note that acquiring the position information of the designated character string from its entry in the dictionary file 43 in step S9 is equivalent to acquiring it from the entry added to the dictionary table 110 by the load unit 555.
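Steps S9 to S11 can be sketched as follows. Here `file_entries` is a plain dict standing in for the contents of the dictionary file 43, and the function name and entry layout are hypothetical choices for the example.

```python
def load_from_file(table, file_entries, phrase):
    """Steps S9 to S11 for a phrase missing from the table but present in the file."""
    positions = file_entries[phrase]["positions"]     # step S9: read the positions
    table[phrase] = {"positions": list(positions)}    # step S10: load into memory
    table[phrase]["ref_count"] = 1                    # step S11: initial reference count
    return positions


table = {}
file_entries = {"database": {"positions": [[1, 10], [3, 42]]}}
positions = load_from_file(table, file_entries, "database")
```

The persisted entry itself is left untouched; only the in-memory copy carries a reference count.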

Next, the case where it is determined in step S8 that no entry for the designated character string exists in the dictionary file 43 will be described. In this case, the determination unit 550 notifies the position acquisition unit 551 of that fact together with the designated character string. The position acquisition unit 551 then divides the designated character string into N-grams (vocabulary items) (step S12). For each of the resulting N-grams, the position acquisition unit 551 searches the N-gram index in the index unit 422 and thereby acquires position information for each N-gram (step S13). The position acquisition unit 551 merges the position information of the individual N-grams and detects the sets of position information whose vocabulary appearance positions correspond to the relative positions of the N-grams constituting the designated character string, thereby acquiring the position information of the designated character string (step S14).
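The N-gram path of steps S12 to S14 can be sketched as follows. This is an illustrative sketch under several assumptions not stated in the source: overlapping character bigrams (N = 2), an in-memory index mapping each N-gram to a set of `(document id, offset)` pairs, and the hypothetical helper names `split_ngrams`, `build_ngram_index`, and `phrase_positions`.

```python
def split_ngrams(text, n=2):
    """Step S12: overlapping n-grams paired with their relative offsets."""
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n=2):
    """Toy N-gram index: gram -> set of (doc_id, offset)."""
    index = {}
    for doc_id, text in docs.items():
        for offset, gram in split_ngrams(text, n):
            index.setdefault(gram, set()).add((doc_id, offset))
    return index


def phrase_positions(index, phrase, n=2):
    """Steps S13 and S14: look up each n-gram and keep consistent alignments."""
    grams = split_ngrams(phrase, n)
    if not grams:
        return []  # phrase shorter than n is not handled in this sketch
    candidates = None
    for rel, gram in grams:
        # step S13: positions of this gram, shifted back to the phrase start
        starts = {(d, off - rel) for d, off in index.get(gram, set())}
        # step S14: only starts where every gram sits at its relative position survive
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates)


docs = {1: "structured document", 2: "document store"}
index = build_ngram_index(docs)
hits = phrase_positions(index, "document")
```

Each N-gram lookup is shifted by the N-gram's relative offset within the phrase, so intersecting the shifted sets leaves only start positions at which all N-grams line up. The number of lookups grows with the phrase length, which is why reusing a stored phrase entry is cheaper.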

The position information of the designated character string acquired by the position acquisition unit 551 in step S14 is passed to the entry generation unit 554 together with the designated character string. The entry generation unit 554 adds to the dictionary table 110 an entry in which the designated character string, its position information, and a reference count with the value 1 (the initial value) are set (step S15). When the process of step S15 has been executed, the processing in the partial match type search unit 55 ends.

As described above, the phrase entry information that is automatically generated in this embodiment, registered in the dictionary file 43, and loaded from the dictionary file 43 into the dictionary table 110 constitutes an index of phrases (a phrase index) that are frequently used in searches based on structured document queries from users. This raises the probability that the phrase index is used in such searches. It is also conceivable to prepare the dictionary file 43 in advance. Doing so, however, requires predicting which phrases will be used frequently in searches, and if the prediction is wrong, the probability that the phrase index is used in a search based on a structured document query drops. In this embodiment, only indexes for phrases actually used in searches based on structured document queries from users are generated automatically, so this risk is small. In other words, according to the present invention, no phrase index is generated for phrases that are never used, and the cost of generating the phrase index (the dictionary generation cost) can be reduced.

In the above embodiment, to simplify the description, no upper limit on the size of the dictionary table 110 or on the number of its entries is considered. If an upper limit on the size or on the number of entries of the dictionary table 110 is determined in advance, the dictionary table 110 may be managed by, for example, the LRU (Least Recently Used) method. That is, if adding an entry to the dictionary table 110 in step S15 would cause the size or the number of entries of the dictionary table 110 to exceed the upper limit, the entry whose last reference is the oldest at that time (the least recently referenced entry) may be deleted from the dictionary table 110. This management method is also applicable to the phrase entries in the dictionary file 43. The threshold and the upper limit on the size or the number of entries of the dictionary table 110 may also be made specifiable from the client terminal 20.
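An LRU-managed table of this kind can be sketched as follows, assuming an upper limit on the number of entries rather than on byte size. The class name `DictionaryTable` and its methods are hypothetical; the sketch relies on `collections.OrderedDict` from the Python standard library to track reference order.

```python
from collections import OrderedDict


class DictionaryTable:
    """Bounded phrase table that evicts the least recently referenced entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, phrase):
        entry = self._entries.get(phrase)
        if entry is not None:
            self._entries.move_to_end(phrase)   # mark as most recently referenced
        return entry

    def add(self, phrase, positions):
        self._entries[phrase] = {"positions": positions, "ref_count": 1}
        self._entries.move_to_end(phrase)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # drop the least recently referenced

    def __contains__(self, phrase):
        return phrase in self._entries


table = DictionaryTable(max_entries=2)
table.add("alpha", [(1, 0)])
table.add("beta", [(1, 5)])
table.get("alpha")            # "alpha" is now more recent than "beta"
table.add("gamma", [(2, 0)])  # exceeds the limit, so "beta" is evicted
```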

[Modification]
Next, a modification of the above embodiment will be described.
In the above embodiment, using the dictionary table 110 allows the position information of the designated character string to be acquired faster than when the N-gram index of the index unit 422 is used. This benefit, however, shrinks when the number of characters n constituting the designated character string is small. In this modification, therefore, the condition for deciding which entries of the dictionary table 110 should be saved to the dictionary file 43 takes into account not only the reference count Nr but also the number of characters. More specifically, a weight wn that is determined by the number of characters n and that takes a smaller value as n decreases is multiplied by the reference count Nr, and the resulting value Nr × wn (that is, the weighted reference count) is used in step S5 in place of the reference count Nr.
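The weighted comparison can be sketched as follows. The source only requires that the weight wn decrease as the number of characters n decreases; the particular weight function below (linear in n and capped at 1.0 once n reaches `full_weight_at`) is a hypothetical choice made purely for illustration.

```python
def weight(n_chars, full_weight_at=8):
    """Hypothetical weight wn: grows with the character count, capped at 1.0."""
    return min(n_chars, full_weight_at) / full_weight_at


def exceeds_threshold(ref_count, n_chars, threshold):
    """Step S5 in the modification: compare Nr * wn with the threshold."""
    return ref_count * weight(n_chars) > threshold


short_phrase = exceeds_threshold(ref_count=10, n_chars=2, threshold=3)
long_phrase = exceeds_threshold(ref_count=10, n_chars=8, threshold=3)
```

With these assumed values, a two-character phrase referenced ten times has a weighted count of 2.5 and is not saved, while an eight-character phrase with the same raw count is.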

The present invention is not limited to the above embodiment or its modification as they stand; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the plurality of constituent elements disclosed in the above embodiment or its modification. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment or its modification.

A block diagram showing the hardware configuration of a client-server system including a search processing device according to an embodiment of the present invention.
A block diagram mainly showing the functional configuration of the search processing device 50 in FIG. 1.
A diagram showing an example of the data structure of the dictionary table 110 in FIG. 2.
A diagram showing the functional configuration of the partial match type search unit 55 in FIG. 2.
A flowchart showing part of the procedure of the processing executed by the partial match type search unit 55 in the embodiment.
A flowchart showing the remainder of the procedure of the processing executed by the partial match type search unit 55 in the embodiment.

Explanation of symbols

10 ... database server, 11 ... memory, 20 ... client terminal, 30 ... network, 40 ... external storage device, 41 ... search processing program, 42 ... document database (document DB), 43 ... dictionary file (second phrase index storage means), 51 ... interface, 52 ... analysis unit, 53 ... structure search unit, 54 ... exact match type search unit, 55 ... partial match type search unit, 56 ... document search unit, 57 ... result generation unit, 110 ... dictionary table (first phrase index storage means), 421 ... document unit, 422 ... index unit (N-gram index storage means), 550 ... determination unit, 551 ... position acquisition unit (first position acquisition means), 552 ... position acquisition unit (second position acquisition means), 553 ... reference count management unit, 554 ... entry generation unit (phrase index generation means), 555 ... load unit.

Claims

A search processing apparatus for searching a document database storing a plurality of structured documents for a structured document that matches a search condition indicated by a given search expression, the apparatus comprising:
N-gram index storage means for storing an N-gram index that is generated by dividing each of the structured documents stored in the document database into N-gram partial character strings and that is associated with position information indicating the positions of the partial character strings;
phrase index storage means for storing a phrase index associated with position information indicating the position of a phrase;
determination means for determining, when the search expression includes a character string function designating a character string, whether a phrase index whose phrase is the character string designated by the character string function exists in the phrase index storage means;
first position acquisition means for acquiring position information of the character string by using the N-gram index when no phrase index whose phrase is the character string designated by the character string function exists;
phrase index generation means for, when no phrase index whose phrase is the character string designated by the character string function exists, generating a phrase index whose phrase is the character string and storing the generated phrase index in the phrase index storage means in association with the position information acquired by the first position acquisition means;
second position acquisition means for acquiring, when a phrase index whose phrase is the character string designated by the character string function exists, position information of the character string by using that phrase index; and
document search means for searching the document database, on the basis of the position information acquired by the first or second position acquisition means, for a structured document that matches the search condition indicated by the search expression and includes the character string designated by the character string function.

The search processing apparatus according to claim 1, wherein:
the phrase index storage means includes volatile first phrase index storage means for storing the phrase index generated by the phrase index generation means in association with a reference count representing the number of times the phrase index has been referenced, and nonvolatile second phrase index storage means for storing, in a reusable manner, a phrase index selected from among the phrase indexes stored in the first phrase index storage means;
the determination means determines in which of the first and second phrase index storage means a phrase index whose phrase is the character string designated by the character string function exists;
the phrase index generation means adds the phrase index to the second phrase index storage means when the phrase index whose phrase is the character string designated by the character string function exists only in the first phrase index storage means and the reference count associated with that phrase index exceeds a predetermined threshold; and
the apparatus further comprises:
reference count management means for incrementing the reference count associated with the phrase index when a phrase index whose phrase is the character string designated by the character string function exists in the first phrase index storage means; and
load means for, when a phrase index whose phrase is the character string designated by the character string function does not exist in the first phrase index storage means but exists in the second phrase index storage means, loading the phrase index from the second phrase index storage means into the first phrase index storage means and associating a reference count having an initial value with the loaded phrase index.

The search processing apparatus according to claim 2, wherein the phrase index generation means weights the reference count associated with the phrase index whose phrase is the character string designated by the character string function by a weight that is determined by the number of characters in the character string and that takes a larger value as the number of characters increases, and compares the weighted reference count with the threshold.

A program used by a computer to search a document database storing a plurality of structured documents for a structured document that matches a search condition indicated by a given search expression, the program causing the computer to execute the steps of:
determining whether the search expression includes a character string function designating a character string;
determining, when the search expression includes a character string function designating a character string, whether a phrase index whose phrase is the character string exists in phrase index storage means;
acquiring, when no phrase index whose phrase is the character string designated by the character string function exists, position information of the character string by using those N-gram indexes that correspond to the N-gram character strings constituting the character string designated by the character string function, from among N-gram indexes that are generated by dividing each of the structured documents stored in the document database into N-gram character strings, which are N-gram partial character strings, that are stored in N-gram index storage means, and that are associated with position information indicating the positions of the N-gram character strings;
generating, when no phrase index whose phrase is the character string designated by the character string function exists, a phrase index whose phrase is the character string, and storing the generated phrase index in the phrase index storage means in association with the acquired position information;
acquiring, when a phrase index whose phrase is the character string designated by the character string function exists, position information of the character string by using that phrase index; and
searching the document database, on the basis of the position information acquired by using the phrase index or the N-gram index, for a structured document that matches the search condition indicated by the search expression and includes the character string designated by the character string function.

The program according to claim 4, wherein:
the phrase index storage means includes volatile first phrase index storage means for storing the phrase index generated by the phrase index generation means in association with a reference count representing the number of times the phrase index has been referenced, and nonvolatile second phrase index storage means for storing, in a reusable manner, a phrase index selected from among the phrase indexes stored in the first phrase index storage means;
the determining step includes a first determination step of determining whether a phrase index whose phrase is the character string designated by the character string function exists in the first phrase index storage means, and a second determination step of determining, when it does not, whether the phrase index exists in the second phrase index storage means; and
the program further causes the computer to execute the steps of:
incrementing the reference count associated with the phrase index when it is determined that a phrase index whose phrase is the character string designated by the character string function exists in the first phrase index storage means;
comparing the incremented reference count with a predetermined threshold;
a third determination step of determining, when the reference count exceeds the threshold, whether a phrase index whose phrase is the character string designated by the character string function exists in the second phrase index storage means;
adding to the second phrase index storage means, when it is determined in the third determination step that no phrase index whose phrase is the character string designated by the character string function exists in the second phrase index storage means, the phrase index whose phrase is the character string and which was determined in the first determination step to exist in the first phrase index storage means; and
loading, when it is determined in the second determination step that a phrase index whose phrase is the character string designated by the character string function exists in the second phrase index storage means, the phrase index from the second phrase index storage means into the first phrase index storage means, and associating a reference count having an initial value with the loaded phrase index.

The program according to claim 5, wherein the reference count compared with the threshold is the incremented reference count weighted by a weight that is determined by the number of characters in the phrase of the phrase index associated with that reference count and that takes a larger value as the number of characters increases.