JPH05266080A

JPH05266080A - Retrieval device

Info

Publication number: JPH05266080A
Application number: JP4065618A
Authority: JP
Inventors: Kenji Hashimoto; 賢治橋本; Katsumi Murai; 克己村井
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1992-03-24
Filing date: 1992-03-24
Publication date: 1993-10-15

Abstract

PURPOSE: To achieve high-speed retrieval in a retrieval device that extracts requested document data from a large amount of data without attaching index information for retrieval. CONSTITUTION: The device has character concatenation information extraction means 2, which extracts arbitrary two-character concatenations and takes a total of up to four bits from the low-order side of the character code of either the preceding or the following character, or of both the preceding and following characters, and table creation means 3, which creates a table in which one of the two characters is the entry and whose contents are the pair of the remaining character and the corresponding bit data of up to four bits, together with approximate recording position information of the target document from which the pair was extracted. At search request time, the one character serving as the search entry and the remaining character and bit data serving as the search content are obtained from the two-character concatenations and bit data extracted from the search string; the table is searched to obtain the corresponding approximate recording position information; and the target documents, once narrowed down in this way, are searched in detail.

Description

[Detailed Description of the Invention]

[0001]

[Field of Industrial Application] The present invention relates to a retrieval device and a retrieval method, for use in an information processing apparatus, based on a full-text search scheme that retrieves requested document data from a large amount of document data without using index information for retrieval.

[0002]

[Prior Art] In recent years, advances in computer peripheral technology have made word processors and personal computers widespread, and large amounts of document data have come to be used in workplaces and homes. To organize and make effective use of this data, large-capacity data storage devices and high-speed search machines have been researched and developed. Conventional search machines, however, require index information to be attached to the original data in advance, and as the volume of data grows, manual indexing has come to demand a great deal of labor. There have also been attempts to have a computer extract the index automatically, but the natural language processing technology that underlies the extraction of index keywords from documents is still immature, and full automation has not yet been achieved. In response, full-text search schemes, which search without attaching index information, have been researched and developed. Their drawback has been that search is slower than with index-based schemes. Approaches to this drawback include, for example, a text search machine for full-text search (IEICE Technical Report, Data Engineering 89-38) and a method of speeding up full-text search by a coding that includes the attributes and positions of the constituent characters (IEICE Technical Report, Data Engineering 90-24). A method has also been proposed that speeds up search by using two-character concatenations (pairs of adjacent characters) to narrow down the documents to be searched, as preprocessing before the body text is searched.

[0003]

[Problem to Be Solved by the Invention] However, when two-character concatenations are used to narrow down the documents to be searched, the two-character pairs extracted from the input search string may each exist individually in the body text even though the search string itself does not. For example, when the search string entered as the search request is 「東京都」 (Tokyo-to), the body text of a target document may contain 「…東京…」 (Tokyo) and 「…京都…」 (Kyoto) separately without containing the search string 「東京都」 itself.
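
The false match described above can be reproduced in a few lines. The following is a minimal sketch, not the prior-art implementation itself; the helper names `bigrams` and `bigram_filter` are hypothetical, and the sample document is borrowed from the example string used later in paragraph [0011].

```python
def bigrams(text):
    """Return the set of adjacent two-character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_filter(query, document):
    """True if every two-character pair of the query occurs somewhere in the document."""
    return bigrams(query) <= bigrams(document)


doc = "日本の首都東京と古都である京都を比較すると"
query = "東京都"

# The document contains both 東京 and 京都, so the filter keeps it as a candidate,
# yet the string 東京都 itself never appears in it.
print(bigram_filter(query, doc))
print(query in doc)
```

A document like this one survives the narrowing step and is then searched in full for nothing, which is the wasted work the invention sets out to reduce.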

[0004] For this reason, when the number of documents that appear as search candidates grows to several tens of times the number of documents that actually contain the search string, documents in which the string does not exist are still searched even after the candidates have been narrowed down using the summary file. The full-text search of the body text therefore takes extra time, and, from the standpoint of the man-machine interface, the search wait imposed on the user can no longer be ignored.

[0005] Viewed as a full-text search scheme, the present invention resembles the scheme that uses two-character pairs for preprocessing. In the present invention, however, not only the two-character pair but also the characters that precede and follow it in the character string are analyzed, and the search is performed on the basis of that analysis. By bringing the number of candidate documents obtained by narrowing closer to the number of documents in which the search string actually exists, the invention aims to provide a retrieval method capable of still faster search, and a retrieval device equipped with that method.

[0006]

[Means for Solving the Problem] To solve the above problem, the retrieval device of the present invention extracts two-character concatenations from the Japanese character code string to be searched and obtains bit data of up to four bits from the low-order side of the character codes adjacent before and after each two-character concatenation. It creates a table in which one character of the two-character concatenation is the entry and whose contents are the pair of the remaining character and the bit data of up to four bits, together with approximate recording position information for the storage medium on which the target document from which the pair was extracted resides. At search request time, from the two-character concatenations and bit data of up to four bits extracted from the search string by the same means, it obtains the one character that serves as the search entry and the remaining character and bit data that serve as the search content. Using the search entry, it searches the table for the pair of search-content character and bit data, obtains the corresponding recording position information, and narrows the target documents down to candidates. It then searches those documents in detail and outputs the document data that satisfies the search request.

[0007]

[Operation] According to the present invention, as described above, concatenations consisting of two characters are extracted from the document data to be searched, bit data of up to four bits is obtained from the characters before and after each two-character concatenation, and both are compared between the character strings of the target documents and the search string given at search request time. This brings the number of candidate documents closer to the number of documents that actually contain the string, raising the efficiency of narrowing and thereby speeding up the search. Compressing the two-character concatenation data and the bit data of up to four bits reduces the capacity needed to store them in the storage device, which in turn allows a larger volume of document data to be searched.

[0008] Effective full-text search can therefore be performed.

[0009]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0010] FIG. 1 is a functional block diagram of the retrieval method of the present invention. In FIG. 1, character concatenation information extraction means 2 extracts two-character concatenations and bit data of up to four bits from the Japanese character code string of the target documents according to extraction condition 1. Table creation means 3 builds a table from one character of each two-character concatenation as the entry, the pair of the remaining character and the bit data of up to four bits, and the approximate recording position information 4 of the target document from which the pair was extracted. From the requested search string, the same extraction means 2, under the same extraction condition 1, extracts two-character concatenations and bit data of up to four bits, from which the search entry and search content 5 are obtained. Table search means 6 searches the table with the search entry and search content to obtain the approximate recording position information of the corresponding candidate documents, and detailed search means 7 searches those candidate documents for the requested search string to obtain the search result.

[0011] FIG. 2 shows a specific example of the character concatenation information extraction means. Given Japanese character string 8, for example 「…、日本の首都東京と古都である京都を比較すると、…」, and extraction condition 9, for example "select the four least significant bits of the character code that follows the two-character concatenation", character concatenation information extraction means 10 yields two-character concatenations and bit data as shown at 11. If the string is encoded in Shift JIS, then, for example, the character 「の」 that follows 「日本」 has the code 82CCh, so its four least significant bits are "1100". The other entries are selected in the same way to obtain the two-character concatenations and bit data shown at 11. Although Shift JIS is used here as the Japanese code, the same effect is obtained with other codes such as JIS or EUC. Likewise, although the bits are selected from the least significant position, the position may be set arbitrarily, with some difference in effect, as long as the bits are ones that serve to distinguish characters (in Shift JIS, for example, the most significant bit is always 1 and does not distinguish characters). Such variants are not excluded from the scope of the invention.
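
The extraction condition of FIG. 2 can be sketched as follows, assuming Python's built-in `shift_jis` codec as the Shift JIS encoder. The function names `low_bits` and `extract` are hypothetical, and skipping the final pair (which has no following character) is a simplification of this sketch, not something the specification states.

```python
def low_bits(ch, nbits=4, encoding="shift_jis"):
    """Return the nbits least significant bits of the character's code."""
    code = int.from_bytes(ch.encode(encoding), "big")
    return code & ((1 << nbits) - 1)


def extract(text, nbits=4):
    """Yield (two-character concatenation, bits of the following character).

    Pairs at the very end of the text, which have no following character,
    are skipped in this sketch.
    """
    for i in range(len(text) - 2):
        yield text[i:i + 2], low_bits(text[i + 2], nbits)


for pair, bits in extract("日本の首都"):
    print(pair, format(bits, "04b"))  # the first line printed is: 日本 1100
```

Changing `nbits` gives the one-bit to three-bit variants that the experiment in paragraph [0013] compares.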

[0012] FIG. 3 shows a specific example of the table structure. Suppose the two-character concatenation is 「あの」 and the character concatenation information extraction means has obtained "0001" as the four least significant bits of the character that follows it. The first character of the concatenation becomes search entry 12, 「あ」, and search content 13 consists of the remaining character 「の」, the bit data "0001", and the approximate recording position information "01" of the location where the target document resides. Search entries and search contents are obtained in this way for the target documents in turn; those with the same search entry are grouped together and duplicate search contents are removed, giving a table structured as shown in FIG. 3. The character 「の」 may, for example, be represented directly by its Shift JIS code 82CCh, and the approximate recording position information may be a file number that classifies the target documents or a recording position number on the recording medium.
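
One way to picture the table and its use at search time is the sketch below. It assumes an in-memory dictionary keyed by the entry character, integer file numbers as the approximate recording position information, and the "four low bits of the following character" condition; the names `low4`, `build_table`, and `candidates` are hypothetical, and queries shorter than three characters are simply not handled here.

```python
from collections import defaultdict


def low4(ch):
    """Four least significant bits of the character's Shift JIS code."""
    return ch.encode("shift_jis")[-1] & 0x0F


def build_table(docs):
    """Map entry character -> set of (second character, bit data, location).

    docs maps a location id (for example a file number) to the document text.
    Using a set removes duplicate search contents, as the text describes.
    """
    table = defaultdict(set)
    for loc, text in docs.items():
        for i in range(len(text) - 2):
            table[text[i]].add((text[i + 1], low4(text[i + 2]), loc))
    return table


def candidates(table, query):
    """Locations that match every (pair, bit data) extracted from the query."""
    result = None
    for i in range(len(query) - 2):
        wanted = (query[i + 1], low4(query[i + 2]))
        hits = {loc for (c, b, loc) in table.get(query[i], ()) if (c, b) == wanted}
        result = hits if result is None else result & hits
    return result if result is not None else set()


docs = {
    1: "日本の首都東京と古都である京都を比較すると",
    2: "東京都の人口",
}
table = build_table(docs)
print(candidates(table, "東京都"))
```

In document 1 the pair 東京 is followed by と, whose low bits differ from those of 都, so only document 2 remains for the detailed search.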

[0013] FIG. 4 is a graph of experimental data on the addition of bit data in the present invention. The target documents were 190 patent document files (average file size 52 Kbytes, total 10 Mbytes), and the requested search strings were 283 randomly chosen words, three to seven characters long, consisting of kanji, katakana, and Latin letters. For the bit data of up to four bits used in the search content, selecting bits from the low byte (lower 8 bits) of the character code was tried for several bit counts and several bit positions. For comparison, three further cases were also tried:

(1) using bit data that classifies the character following the two-character concatenation into Japanese character categories (kanji, hiragana, katakana, symbols, digits, foreign characters, and so on);
(2) using the following character as it is, giving a three-character concatenation;
(3) using the two-character concatenation alone.

Two measures were defined to evaluate the results: the narrowing rate and the compression rate. The narrowing rate is the ratio of the number of files in which the requested search string actually exists (actual files) to the number of candidate document files. The compression rate is the ratio of the number of distinct kinds of character concatenation information in the documents to the total number of items of character concatenation information in the documents.

In the graph of FIG. 4, the horizontal axis is the compression rate and the vertical axis is the narrowing rate, and data expressed with the same number of bits is enclosed in an elliptical region. Result a uses the two-character concatenation alone, as in (3) above; g uses the three-character concatenation of (2); b to f vary the number of bits of the bit data from one to four. In f, two bits each are selected from the characters before and after the two-character concatenation; in b to e, bits are selected from the character that follows it. The ○ points inside the ellipses are the results of category classification matched to each bit count. In summary:

a: two-character concatenation
b: 1-bit data. □: least significant bit of the following character. ○: categories (kanji | other)
c: 2-bit data. □: selected from the low byte of the following character. ○: categories (kanji | hiragana | katakana | other)
d: 3-bit data. □: selected from the low byte of the following character. ○: categories (kanji 3 | hiragana 3 | katakana | other)
e: 4-bit data. □: selected from the low byte of the following character. ○: categories (kanji 6 | hiragana 5 | katakana 2 | symbols | digits | foreign characters)
f: 4-bit data: two bits each selected from the low bytes of the preceding and following characters
g: three-character concatenation (the one following character appended to the two-character concatenation)

According to these results, the narrowing rate, about 82% for the two-character concatenation, exceeds 90% with the two bits of c and rises to about 97% with the four bits of f, on a par with the three-character concatenation. The narrowing effect is thus clearly present. A three-character concatenation amounts to using all 16 bits of the following character as the bit data, so the four-bit data, which achieves equivalent narrowing, is also shown to be effective in reducing the amount of data. For the same number of bits, selecting bits from the code improves the narrowing rate more markedly than category classification does. Although the data in this experiment are patent documents, similar results have been obtained for other kinds of data: absolute values such as the compression rate differ, but relative results such as the relationship between compression rate and narrowing rate and the improvement in narrowing rate are the same, confirming the effect of the invention.
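
The two measures can be written down directly from their definitions. This is a minimal sketch with hypothetical function names, and the sample numbers are made up for illustration; they are not the experimental data of FIG. 4.

```python
def narrowing_rate(candidate_files, actual_files):
    """Files that really contain the query, divided by candidate files.

    A value of 1.0 means the detailed search wastes no time on false candidates.
    """
    return len(actual_files) / len(candidate_files)


def compression_rate(items):
    """Distinct kinds of concatenation information divided by the total count."""
    return len(set(items)) / len(items)


# Illustrative values only.
print(narrowing_rate(candidate_files={1, 2, 3, 4, 5}, actual_files={1, 2, 3, 4}))
print(compression_rate(["東京", "京都", "東京", "東京"]))
```

Adding more bits per pair raises the number of distinct items, and so the compression rate, which is why FIG. 4 plots the two measures against each other.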

[0014] FIG. 5 shows a specific example of compression coding in the present invention. When the characters 14 「東: 938Ch」, 15 「レ: 838Ch」, and 16 「繻: E38Ch」, expressed in Shift JIS code, are compression-coded under compression coding condition 17, "delete the upper four bits", each becomes 18, "38Ch", expressed in 12 bits. Under this condition, two or three characters overlap and are represented by a single code. Possible compression coding conditions include, as embodiments:

(1) combining characters that appear frequently in text with characters that appear rarely, so that the converted codes appear with roughly equal frequency;
(2) mapping characters that appear extremely rarely in text, such as JIS level-2 kanji, together onto a small number of codes;
(3) deleting several bits from the character code.

If the one character that forms the search content is compression-coded under such a condition into a code expressed in 12 bits, and the table is built so that this code and the bit data of up to four bits together fit within 16 bits, the capacity of the table can be reduced and the data takes an efficient form as a unit of search.
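
Condition (3), as instantiated in FIG. 5, is a single mask operation. The sketch below uses the three codes given in the text; the function names and the packing layout in `pack` (compressed code in the upper 12 bits, bit data in the lower 4) are assumptions made for illustration.

```python
def compress12(code):
    """Delete the upper 4 bits of a 16-bit character code, leaving 12 bits."""
    return code & 0x0FFF


def pack(code, bits4):
    """Pack the 12-bit compressed code and 4 bits of bit data into 16 bits.

    Assumed layout: compressed code in bits 15..4, bit data in bits 3..0.
    """
    return (compress12(code) << 4) | (bits4 & 0x0F)


for name, code in [("東", 0x938C), ("レ", 0x838C), ("繻", 0xE38C)]:
    print(name, format(compress12(code), "03X"))  # each line ends in 38C

print(format(pack(0x938C, 0b1100), "04X"))  # 38CC
```

Because several characters share one code, the table can produce extra candidates; the detailed search that follows removes them, so the result stays correct while the table shrinks.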

[0015] FIG. 6 shows a specific example of the presence flag in the present invention. Suppose that the Japanese character code string of the target documents contains, at 19 through 24, the two-character concatenation 「東京」 together with four-bit data selected from the least significant bits of the character that follows it. Identical four-bit values are merged, and the result is expressed as a presence flag, as shown at 25, in the 16-bit flag needed to represent every combination of four bits. The presence flag makes it possible to merge identical two-character concatenations and so reduce the capacity of the table. According to the experimental results shown in FIG. 4, selecting two bits already gives a narrowing rate above 90%. If those two bits are used as the bit data, the presence flag consists of four bits, and together with the 12-bit data obtained by compression-coding one character, the table can be built in 16-bit units, giving an efficient table.
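
A presence flag is a small bitmap with one bit per possible bit-data value. The following sketch assumes that value v is recorded in bit v of the flag; that ordering, the function names, and the sample values are illustrative assumptions rather than the layout of FIG. 6.

```python
def presence_flag(values, nbits):
    """Bitmap of 2**nbits bits in which bit v is set if value v was observed."""
    flag = 0
    for v in values:
        flag |= 1 << (v & ((1 << nbits) - 1))
    return flag


def is_present(flag, value):
    """True if the given bit-data value is recorded in the flag."""
    return bool((flag >> value) & 1)


# Four-bit data needs a 16-bit flag; repeated values collapse into one bit.
flag16 = presence_flag([0b1100, 0b0001, 0b1100], nbits=4)

# Two-bit data needs only a 4-bit flag, which fits next to a 12-bit code.
flag4 = presence_flag([0b10, 0b00], nbits=2)
word = (0x38C << 4) | flag4  # 12-bit compressed code plus 4-bit presence flag

print(format(flag16, "016b"), format(flag4, "04b"), format(word, "04X"))
```

With two-bit data, every following-character variant of one two-character concatenation is held in a single 16-bit word instead of one table row per variant.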

[0016] FIG. 7 shows a specific example of the mask bit comparison means in the present invention. For example, given 7652h and 7654h as the data to be compared, and bit mask condition 26 specifying that the four least significant bits be masked, bit mask means 27 produces the masked data 765*h for each (where * denotes the masked state), and bit comparison means 28 compares the masked data to give comparison result 29, "equal". When, for example, table data consisting of the 12-bit data obtained by compression-coding one character and a four-bit presence flag is to be compared, there are cases where only one bit of the presence flag is needed for the comparison and cases where the presence flag is not used at all. The mask bit comparison means, which masks everything except the part to be compared, is therefore required.
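
The masked comparison can be sketched as below, with the mask given as the set of bits to ignore. The function name, the 16-bit word layout (12-bit code above a 4-bit presence flag), and the sample table word are assumptions for illustration; only the 7652h and 7654h pair comes from the text.

```python
def masked_equal(a, b, ignore):
    """Compare a and b after clearing every bit that is set in ignore."""
    return (a & ~ignore) == (b & ~ignore)


# The example from FIG. 7: mask the four least significant bits.
print(masked_equal(0x7652, 0x7654, ignore=0x000F))  # True

# Assumed table word: 12-bit code 38Ch with presence flag 0101b.
table_word = 0x38C5

# Test the code plus the single flag bit for bit-data value 2 (flag bit 2).
query_word = (0x38C << 4) | (1 << 2)
print(masked_equal(table_word, query_word, ignore=0x000F & ~(1 << 2)))

# Ignore the presence flag entirely and compare the 12-bit code alone.
print(masked_equal(table_word, 0x38C0, ignore=0x000F))
```

The same comparison routine thus serves both cases named in the text: checking one flag bit, and skipping the flag altogether.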

[0017]

[Effects of the Invention] The present invention provides the following effects.

(1) Adding bit data of up to four bits to the two-character concatenation data brings the number of candidate documents closer to the number of documents in which the search string actually exists. Wasted search time is cut, and search becomes faster as a result.
(2) Compression-coding the two-character concatenation data and the bit data of up to four bits, and using presence flags, reduces the capacity of the table used for search preprocessing, and comparing bit-masked data allows the converted data to be used efficiently for search.

[Brief Description of the Drawings]

FIG. 1 is a functional block diagram of the retrieval method of the present invention.

FIG. 2 is a specific example of the character concatenation information extraction means of the present invention.

FIG. 3 is a specific example of the table structure of the present invention.

FIG. 4 is a graph of an experimental example of adding bit data in the present invention.

FIG. 5 is a specific example of compression coding in the present invention.

FIG. 6 is a specific example of the presence flag in the present invention.

FIG. 7 is a specific example of the mask bit comparison means of the present invention.

[Explanation of Symbols]

1 Extraction condition
2 Character concatenation information extraction means
3 Table creation means
4 Approximate recording position information
5 Search entry and search content
6 Table search means
7 Detailed search means
8 Example of a Japanese character string
9 Extraction condition (for example, the four least significant bits of the following character)
10 Character concatenation information extraction means
11 Character concatenation information (two-character concatenation and bit data of up to four bits)
12 Search entry
13 Search content

Claims

[Claims]

[Claim 1] A retrieval device comprising: character concatenation information extraction means that extracts arbitrary two-character concatenations from a Japanese character code string of target documents in a storage medium and takes bit data of up to four bits from the low-order side of the character code of one character further adjacent either before or after the two-character concatenation, or takes bit data totaling up to four bits from the low-order side of the character codes of the single characters adjacent before and after the two-character concatenation respectively; and table creation means that creates a table in which one character of the two-character concatenation is the entry and whose contents are the pair of the remaining character and the corresponding bit data of up to four bits, and approximate recording position information of the storage medium on which the target document from which the pair was extracted resides; wherein, at search request time, the one character of the two-character concatenation serving as the search entry, and the remaining character and bit data of up to four bits serving as the search content, are obtained from the two-character concatenation and bit data of up to four bits extracted from the search string by the character concatenation information extraction means, the table is searched using the search entry for the pair of search-content character and bit data of up to four bits to obtain the corresponding approximate recording position information, and the target documents, once narrowed down, are searched in detail.

[Claim 2] A retrieval device comprising: character concatenation extraction means that extracts arbitrary two-character concatenations from a Japanese character code string of target documents in a recording medium; encoding means that, with one character of the two-character concatenation as the entry, converts the remaining character into a character code in which different characters are allowed to overlap (share one code); and table creation means that creates a table whose contents include approximate recording position information of the storage medium on which the target document from which the two-character concatenation was extracted resides; wherein, at search request time, the one character of the two-character concatenation serving as the search entry and the compressed character code encoded by the encoding means are obtained from the two-character concatenation extracted from the search string by the character concatenation extraction means, the table is searched using the search entry for the encoded compressed character code to obtain the corresponding approximate recording position information, and the target documents, once narrowed down, are searched in detail.

[Claim 3] The retrieval device according to claim 1 or claim 2, wherein the table uses data in which data compression-coded by encoding means that, with one character of the two-character concatenation as the entry, converts the remaining character into a character code in which different characters are allowed to overlap, and bit data of up to four bits selected from the characters adjacent before and after the two-character concatenation, are together expressed within 16 bits.

[Claim 4] A retrieval device comprising: character concatenation information extraction means that extracts arbitrary two-character concatenations from a Japanese character code string of target documents in a storage medium and takes bit data of up to four bits from the low-order side of the character code of one character further adjacent either before or after the two-character concatenation, or takes bit data totaling up to four bits from the low-order side of the character codes of the single characters adjacent before and after the two-character concatenation respectively; table creation means that creates a table in which one character of the two-character concatenation is the entry and whose contents are the pair of the data of the remaining character and data expressing, by presence flag bits, whether each combination with the bit data of up to four bits exists in the character code string, and approximate recording position information of the storage medium on which the target document from which the pair was extracted resides; and presence flag bit mask means that masks all but any one of the presence flag bits, or all of them; wherein, at search request time, the one character of the two-character concatenation serving as the search entry, and the remaining character and bit data of up to four bits serving as the search content, are obtained from the two-character concatenation and bit data of up to four bits extracted from the search string by the character concatenation information extraction means, the table is searched using the search entry by comparing the data pair of search-content character and bit data of up to four bits after masking by the presence flag bit mask means to obtain the corresponding approximate recording position information, and the target documents, once narrowed down, are searched in detail.

[Claim 5] The retrieval device according to claim 4, wherein the table uses data expressed in a total of 16 bits, consisting of data expressed in 12 bits by compression-coding with encoding means that, with one character of the two-character concatenation as the entry, converts the remaining character into a character code in which different characters are allowed to overlap, and data expressing, in four presence flag bits, whether each of the four combinations of two-bit data selected from the characters adjacent before and after the two-character concatenation exists in the Japanese character code string of the target documents for two-character concatenations composed of the same code as said two-character concatenation.