JP5384315B2

JP5384315B2 - SEARCH DEVICE, METHOD, AND PROGRAM

Info

Publication number: JP5384315B2
Application number: JP2009289788A
Authority: JP
Inventors: 章裕宮田; 考藤村; 寿子塩原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-08-04
Filing date: 2009-12-21
Publication date: 2014-01-08
Anticipated expiration: 2029-12-21
Also published as: JP2011054148A

Description

The present invention relates to a search device, method, and program. In particular, it relates to a search device, method, and program that create an index of documents and of each position within those documents, in order to answer search requests that take a captured image of a partial region of a document whose page breaks and line-break positions are fixed as the search query and return the document in which the region appears and its position within that document.

More specifically, it relates to a search device, method, and program applied when the aim is to identify a position uniquely, rather than to exhaustively retrieve every document, and every position within a document, that may contain the region of a document whose page breaks and line-break positions are fixed.

There are many situations in which, given a partial region of a document, it is necessary to uniquely identify which document contains that region, or at which position in which document it appears.

For example, someone holding a magazine clipping may want to find the magazine it was cut from and read the rest of the article. This requires uniquely identifying which magazine the clipping came from.

This case can be viewed as a search system that takes a partial region of a document as the query and, from an enormous collection of documents, returns the name of the document containing the region, or the document name together with the position within the document.

To build a system that answers search requests for information from a document collection, the collection must be analyzed in advance and an index created.

For Japanese, for example, one approach uses a technique such as morphological analysis to split the text of each document into words, then uses each word as an index key and, as the index value, either the names of the documents containing the word or the document names together with the positions at which the word appears in them.
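
The word-keyed index described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the texts are assumed to be already split into words (standing in for morphological analysis), and the names `build_word_index`, `doc_a`, and `doc_b` are hypothetical.

```python
from collections import defaultdict


def build_word_index(docs):
    """Map each word to the (document name, word position) pairs where it occurs.

    docs: mapping of document name -> list of words (assumed pre-segmented).
    """
    index = defaultdict(list)
    for name, words in docs.items():
        for pos, word in enumerate(words):
            index[word].append((name, pos))
    return index


# Hypothetical sample documents.
docs = {
    "doc_a": ["information", "retrieval", "systems"],
    "doc_b": ["information", "processing"],
}
index = build_word_index(docs)
print(index["information"])  # one key, two documents
print(index["retrieval"])    # one key, one document
```

A common word such as "information" ends up associated with several documents, which is the ambiguity the description goes on to discuss.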

Another approach uses a sequence of N characters (or N words) as the index key and, as the index value, either the names of the documents containing that sequence or the document names together with the positions at which the sequence appears (the character N-gram method and the word N-gram method). The N-gram method is recognized as useful in a wide range of settings, and many extensions are still being proposed. In addition to the ordinary N-gram method, methods that vary the value of N according to the situation have also been implemented (see, for example, Non-Patent Document 1).
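
A character N-gram index of the kind described above might look like the sketch below. The function name `build_char_ngram_index` and the two sample documents are assumptions made for illustration only.

```python
from collections import defaultdict


def build_char_ngram_index(docs, n):
    """Map every run of n consecutive characters to its (document, offset) pairs.

    docs: mapping of document name -> text.
    """
    index = defaultdict(list)
    for name, text in docs.items():
        for pos in range(len(text) - n + 1):
            index[text[pos:pos + n]].append((name, pos))
    return index


# Hypothetical sample documents.
docs = {"doc_a": "情報検索", "doc_b": "情報処理"}
index = build_char_ngram_index(docs, 2)
print(index["情報"])  # appears in both documents
print(index["検索"])  # appears only in doc_a
```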

Non-Patent Document 1: "Unicodeを用いたN-gram索引の一実現方式とその評価" (A Method of Implementing an N-gram Index Using Unicode and Its Evaluation), IPSJ SIG Technical Report, 2000-NL-136-17, pp. 135-142.

However, both of the conventional methods above suffer from two problems: (1) reduced index identification capability and (2) reduced search robustness.

(1) Reduced index identification capability:
With the conventional techniques above, the more documents are analyzed, the more often an index key fails to map to exactly one value. This is the problem of reduced index identification capability.

For example, when a common word such as "information" is an index key, it is very likely that multiple document names are associated with that key as index values.

Using the N-gram method improves the situation somewhat but does not solve it completely. Words in a text are not arranged at random; they are arranged so that the text makes sense. As a result, meaningful sequences of characters (or words) tend to occur in many documents. For example, when two-word sequences are used as index keys, documents containing a nonsensical sequence such as "information + camel" are rare, whereas documents containing sequences such as "information + search" or "information + processing" are countless. In other words, a key made up of a meaningful sequence of characters or words, such as "information + search" or "information + processing", is very likely to have multiple document names associated with it as index values.

These phenomena are not a problem when building a system that answers search requests for exhaustively retrieving, for instance, the names of documents containing a given character string. However, as noted in the technical field and background art sections, they become a serious problem when building a system that answers search requests for obtaining a specific position in a specific document as the sole search result.

(2) Reduced search robustness:
With the word N-gram method described above, increasing N reduces the number of cases in which an index key does not map to exactly one value. For example, as N grows from "information + search" (N = 2) to "Japanese + information + search" (N = 3) to "next-generation + Japanese + information + search" (N = 4), the index key and value become more likely to correspond one-to-one.
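
The effect of increasing N can be seen on a toy collection. The sketch below is illustrative only: the three documents and the helper `documents_containing` are hypothetical, and the word lists stand in for segmented text.

```python
def documents_containing(docs, key):
    """Return the names of documents in which the word sequence `key` occurs."""
    n = len(key)
    return [
        name
        for name, words in docs.items()
        if any(tuple(words[i:i + n]) == key for i in range(len(words) - n + 1))
    ]


# Hypothetical sample documents, already split into words.
docs = {
    "doc_a": ["next-generation", "Japanese", "information", "search"],
    "doc_b": ["fast", "Japanese", "information", "search"],
    "doc_c": ["English", "information", "search"],
}
keys = [
    ("information", "search"),                                 # N = 2
    ("Japanese", "information", "search"),                     # N = 3
    ("next-generation", "Japanese", "information", "search"),  # N = 4
]
for key in keys:
    print(len(key), documents_containing(docs, key))
```

In this toy collection, each increase in N narrows the set of matching documents until a single document remains.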

However, when index keys are searched with a query, increasing the amount of information in the key, for example by increasing its number of characters as above, gives rise to the problem of reduced search robustness: if the query contains noise with some probability, the correct search result is not obtained.

For example, when a character string written on paper is read by optical character recognition (OCR) and used as the query, or when characters written with a stylus on a dedicated display are read by handwriting recognition and used as the query, reading errors (misrecognition) can occur during character recognition. In this case, reading a long string such as "next-generation Japanese information search" is more likely to produce a reading error than scanning a short string such as "information search", and a query containing a reading error does not yield the correct search result.
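
To make the length effect concrete, the sketch below assumes that each character is misrecognized independently with the same probability. That independence model and the 5% rate are assumptions introduced here for illustration; the source does not state an error model.

```python
def error_free_probability(per_char_error_rate, length):
    """Probability that all `length` characters are read correctly,
    assuming independent, identically likely per-character errors."""
    return (1.0 - per_char_error_rate) ** length


p = 0.05  # hypothetical per-character misrecognition rate
short = error_free_probability(p, len("情報検索"))              # 4 characters
long_ = error_free_probability(p, len("次世代日本語情報検索"))  # 10 characters
print(short, long_)
```

Under this assumed model the chance of an error-free read shrinks geometrically with key length, so an exact-match lookup on a long key fails more often than one on a short key.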

The present invention has been made in view of the above points. Its object is to provide a search device, method, and program that solve the problems of reduced index identification capability and reduced search robustness, that can answer search requests for uniquely obtaining a specific position in a specific document from a document collection, and that can answer such requests without loss of accuracy even when the query contains noise.

FIG. 1 is a diagram showing the principle configuration of the present invention.

The present invention (claim 1) is a search device that creates a search index and performs searches, in order to answer search requests that take a partial region of a document whose page breaks and line-break positions are fixed as the search query and return the document in which the region appears and its position within that document, the search device comprising:
document input means 10 for accepting input of a document to be indexed;
character block extraction means 11 for extracting, from all or part of the document, character blocks each consisting of a combination of one or more characters lying within a prescribed shape that takes into account both the reading direction of the text and the direction orthogonal to it; and
index output means 12 for associating each character block with its position of appearance in the document in which it appears and outputting the result to index storage means 13.
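
One way to picture the character blocks of claim 1 is as a small two-dimensional stencil slid over a page whose line breaks are fixed. The sketch below is an illustration under stated assumptions, not the patented implementation: the 2x2 shape `SQUARE_2X2`, the function names, the page contents, and the `(document, page, row, column)` position format are all hypothetical, and shape offsets are assumed to be non-negative.

```python
from collections import defaultdict

# Hypothetical shape: a 2x2 square given as (row offset, column offset) pairs.
# Row offsets run orthogonal to the reading direction, column offsets along it.
SQUARE_2X2 = [(0, 0), (0, 1), (1, 0), (1, 1)]


def extract_blocks(lines, shape):
    """Yield (block, row, col) for every anchor where the whole shape fits."""
    for row in range(len(lines)):
        for col in range(len(lines[row])):
            chars = []
            for dr, dc in shape:
                r, c = row + dr, col + dc
                if r >= len(lines) or c >= len(lines[r]):
                    break  # shape runs off the page; skip this anchor
                chars.append(lines[r][c])
            else:
                yield "".join(chars), row, col


def build_block_index(pages, shape):
    """pages: mapping of (document name, page number) -> list of text lines."""
    index = defaultdict(list)
    for (doc, page), lines in pages.items():
        for block, row, col in extract_blocks(lines, shape):
            index[block].append((doc, page, row, col))
    return index


# Hypothetical page content with fixed line breaks.
pages = {("vegetable", 3): ["abcd", "efgh", "ijkl"]}
index = build_block_index(pages, SQUARE_2X2)
print(index["abef"])  # the block anchored at row 0, column 0
```

Because a block spans more than one line, it depends on where the lines break, which is why the approach presupposes documents with fixed page and line breaks.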

Further, the present invention (claim 2) is the search device of claim 1, further comprising character block selection means for selecting, from among the character blocks, only those containing a specific character string of one or more characters as the targets of subsequent processing.
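
The selection step of claim 2 amounts to a filter over extracted blocks. A minimal sketch follows, assuming blocks are carried as `(block, position)` pairs; the function name `select_blocks` and the sample entries are hypothetical.

```python
def select_blocks(entries, specific_strings):
    """Keep only (block, position) entries whose block contains a specific string."""
    return [
        (block, pos)
        for block, pos in entries
        if any(s in block for s in specific_strings)
    ]


# Hypothetical extracted blocks with (row, col) positions.
entries = [("▼ab▼", (0, 0)), ("abcd", (0, 1)), ("cd▼e", (1, 0))]
kept = select_blocks(entries, ["▼"])
print(kept)
```

Only the kept entries would be written to the index, which is how selection reduces index size.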

Further, the present invention (claim 3) is the search device of claim 1, wherein the specific character string is a character string of one or more characters that appears evenly across the regions of documents designated in advance for analysis.

Further, the present invention (claim 4) is the search device of claim 1, further comprising search means for treating a range of the document that contains multiple character blocks as a region, tallying the character blocks in the same region as one and the same search result candidate, and identifying as the search result the candidate regions whose tally satisfies a given criterion.
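
The region tally of claim 4 can be sketched as a vote count. In this illustration a region is taken to be a whole page and the criterion is a minimum number of votes; both choices, along with the names `rank_regions`, `region_of`, and `min_votes`, are assumptions rather than details given in the source.

```python
from collections import Counter


def rank_regions(hits, region_of, min_votes):
    """Tally index hits per region and keep regions that reach `min_votes`.

    hits: positions returned by index lookups, one per matched character block.
    region_of: function mapping a position to its region.
    """
    votes = Counter(region_of(hit) for hit in hits)
    return [(region, n) for region, n in votes.most_common() if n >= min_votes]


# Hypothetical hits as (document, page, row, col); here a region is one page.
hits = [
    ("vegetable", 3, 0, 0),
    ("vegetable", 3, 0, 1),
    ("yokohama", 2, 5, 1),
]
result = rank_regions(hits, region_of=lambda hit: hit[:2], min_votes=2)
print(result)
```

Tallying by region means that a single stray match, such as the lone hit on the other document, does not become a search result.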

Further, the present invention (claim 5) further comprises search means for, when the search result candidates are items of content associated with specific positions in a document, tallying the positions associated with the same content as one and the same search result candidate and identifying as the search result the candidates whose tally satisfies a given criterion.

Further, the present invention (claim 6) is the search device of claim 2, wherein the specific character string is obtained by referring to recognition dictionary storage means used by an optical character recognition device that extracts character information from images in which characters are captured.

Further, the present invention (claim 7) is the search device of claim 2, wherein the specific character string is a character string of one or more characters that does not appear a predetermined number of times or more in documents designated in advance for analysis.

Further, the present invention (claim 8) is the search device of claim 2, wherein the specific character string is a character string of one or more characters made up of characters of simple shape designated in advance.

Further, the present invention (claim 9) is the search device of claim 1, further comprising:
input means for accepting a partial region of a document as a search query;
query character block extraction means for extracting, from the search query, query character blocks each consisting of a combination of one or more characters; and
search means for searching the index storage means based on the query character blocks and outputting the search result,
wherein the search means searches the index storage means based on the query character blocks and outputs the search result.
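
The query side of claim 9 can be sketched end to end as follows. Everything here is a hedged illustration: the 2x2 `SHAPE`, the page contents, and in particular the offset-voting rule in `search` (each matched query block votes for the page position where the query region would start) are assumptions chosen to show the idea, not the procedure specified in the source.

```python
from collections import Counter, defaultdict

SHAPE = [(0, 0), (0, 1), (1, 0), (1, 1)]  # hypothetical 2x2 block shape


def extract_blocks(lines, shape=SHAPE):
    """Return (block, row, col) for every anchor where the whole shape fits."""
    found = []
    for row in range(len(lines)):
        for col in range(len(lines[row])):
            cells = [(row + dr, col + dc) for dr, dc in shape]
            if all(r < len(lines) and c < len(lines[r]) for r, c in cells):
                found.append(("".join(lines[r][c] for r, c in cells), row, col))
    return found


def build_index(pages):
    """pages: mapping of (document name, page number) -> list of text lines."""
    index = defaultdict(list)
    for (doc, page), lines in pages.items():
        for block, row, col in extract_blocks(lines):
            index[block].append((doc, page, row, col))
    return index


def search(index, query_lines):
    """Let each matched query block vote for where the query region starts."""
    votes = Counter()
    for block, q_row, q_col in extract_blocks(query_lines):
        for doc, page, row, col in index.get(block, []):
            votes[(doc, page, row - q_row, col - q_col)] += 1
    return votes.most_common(1)[0] if votes else None


# Hypothetical indexed pages.
pages = {
    ("vegetable", 3): ["abcde", "fghij", "klmno", "pqrst"],
    ("yokohama", 2): ["abxyz", "fgxyz", "uvwxy", "zzzzz"],
}
index = build_index(pages)

# Query region taken from row 1, column 1 of the first page, with one
# simulated OCR error: the character "n" was read as "X".
query = ["ghi", "lmX", "qrs"]
best = search(index, query)
print(best)
```

The misread character spoils only the query blocks that contain it; the remaining blocks still agree on one position, which illustrates how short blocks can tolerate noise in the query.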

Further, the present invention (claim 10) is the search device of claim 9, wherein the input means includes means for accepting a search query obtained by taking an image that captures a partial region of a document and converting the character strings shown in the image into text data with a general-purpose optical character recognition device.

Further, the present invention (claim 11) is the search device of claim 9, further comprising means for outputting the content associated with the document, and with the position in the document, that constitute the search result, either together with the search result or on its own.

Further, the present invention (claim 12) is the search device of claim 9, further comprising query character block selection means for selecting, from among the query character blocks, only those containing a specific character string of one or more characters as the targets of subsequent processing.

Further, the present invention (claim 13) is the search device of claim 12, wherein the specific character string is a character string of one or more characters registered in the recognition dictionary used by an optical character recognition device.

Further, the present invention (claim 14) is the search device of claim 12, wherein the specific character string is a character string of one or more characters that appears evenly across the regions of documents designated in advance for analysis.

Further, the present invention (claim 15) is the search device of claim 12, wherein the specific character string is a character string of one or more characters that does not appear a predetermined number of times or more in documents designated in advance for analysis.

Further, the present invention (claim 16) is the search device of claim 12, wherein the specific character string is a character string of one or more characters made up of characters of simple shape designated in advance.

FIG. 2 is a diagram for explaining the principle of the present invention.

The present invention (claim 17) is a search method in a device that creates a search index and performs searches, in order to answer search requests that take a partial region of a document whose page breaks and line-break positions are fixed as the search query and return the document in which the region appears and its position within that document, the method comprising:
a document input step (step 1) in which document input means accepts input of a document to be indexed;
a character block extraction step (step 2) in which character block extraction means extracts, from all or part of the document, character blocks each consisting of a combination of one or more characters lying within a prescribed shape that takes into account both the reading direction of the text and the direction orthogonal to it; and
an index output step (step 3) in which index output means associates each character block with its position of appearance in the document in which it appears and outputs the result to index storage means.

Further, the present invention (claim 18) is the search method of claim 17, further comprising:
an input step in which input means accepts a partial region of a document as a search query;
a query character block extraction step in which query character block extraction means extracts, from the search query, query character blocks each consisting of a combination of one or more characters; and
a search step in which search means searches the index storage means based on the query character blocks and outputs the search result.

Further, the present invention (claim 19) is the search method of claim 17, further comprising a character block selection step in which character block selection means selects, from among the character blocks, only those containing a specific character string of one or more characters as the targets of subsequent processing.

Further, the present invention (claim 20) is the search method of claim 19, wherein the specific character string is a character string of one or more characters that appears evenly across the regions of documents designated in advance for analysis.

Further, the present invention (claim 21) is the search method of claim 18, further comprising a query character block selection step in which query character block selection means selects, from among the query character blocks, only those containing a specific character string of one or more characters as the targets of subsequent processing.

The present invention (claim 22) is a search program for causing a computer to function as each of the means constituting the search device according to any one of claims 1 to 16.

As described above, the present invention makes it possible to answer search requests for uniquely obtaining a specific position in a specific document from a document collection. For example, even when only a clipping from a book is at hand, it is possible to determine which part of which book the clipping came from.

Moreover, even when the query contains noise, such search requests can be answered without a large loss of accuracy.

For example, accuracy does not drop sharply even when the query is noise-prone data, such as text obtained by photographing part of a document and applying optical character recognition.

In addition, using only the portions that contain a specific character string as character blocks reduces the index size without greatly reducing search coverage. If the specific character strings are limited to those registered in the dictionary held internally by the optical character recognition device, the effect of OCR misrecognition can be reduced further. Furthermore, if a simple character that is not frequent in ordinary text (one that does not appear a predetermined number of times or more), such as "▼", is used as the specific character string and is placed in the document at the locations to be indexed, the effect of OCR misrecognition can be reduced and the character also serves as a marker of the indexed locations for users of the client unit.

A diagram showing the principle configuration of the present invention.
A diagram for explaining the principle of the present invention.
A configuration diagram of the index creation device in the first embodiment of the present invention.
An example of a loaded PDF file (book name: begetable, file name: vegetable3.pdf, page: page 3) in the first embodiment.
A flowchart of the index creation process in the first embodiment.
An example of the data structure of the list of pages of an input document in the first embodiment.
An example of character blocks in the first embodiment.
An example of character block extraction rules in the first embodiment.
The processing result of the character block extraction unit in the first embodiment.
A system configuration diagram of the second embodiment of the present invention.
An example of a paper medium (book name: vegetable, page: page 3) in the second embodiment.
A flowchart of the index creation process in the second embodiment.
An example of text files arranged in a list in the second embodiment.
An example of character block extraction in the second embodiment.
An example of character block extraction rules in the second embodiment.
The processing result of the character block extraction unit in the second embodiment.
An example of the content DB in the second embodiment.
A flowchart of the procedure by which the client unit queries the server unit in the second embodiment.
An example of an image file in the second embodiment.
An example of text data extracted from an image file in the second embodiment.
An example of extracted character blocks in the second embodiment.
An example of character block extraction rules in the second embodiment.
The processing result of the character block extraction unit in the second embodiment.
A query result in the second embodiment.
The result of a query to the content DB in the second embodiment.
The result of a query to the content DB after duplicate counts are tallied in the second embodiment.
An example of the content display means of the content display unit in the second embodiment.
A system configuration diagram of the third embodiment of the present invention.
An example of the region DB in the third embodiment.
An example of the content DB in the third embodiment.
An example of text file extraction (with misrecognition) in the third embodiment.
The processing result of the character block extraction unit in the third embodiment.
The result of a query to the index DB in the third embodiment.
The result of a query to the content DB in the third embodiment.
The result of a query to the content DB after duplicate counts are tallied in the third embodiment.
A configuration diagram of the index creation device in the fourth embodiment of the present invention.
An example of a loaded PDF file (book name: yokohama, file name: yokohama2.pdf, page: page 2) in the fourth embodiment.
A flowchart of the processing in the fourth embodiment.
The list of input pages in the fourth embodiment.
An example of extracted character blocks in the fourth embodiment.
An example of character block extraction rules in the fourth embodiment.
The processing result of the character block extraction unit in the fourth embodiment.
The processing result of the character block selection unit in the fourth embodiment.
A system configuration diagram of the fifth embodiment of the present invention.
An example of the character string dictionary inside the optical character recognition device in the fifth embodiment.
An example of a paper medium (book name: yokohama, page: page 2) in the fifth embodiment.
A flowchart of the index creation process in the server unit in the fifth embodiment.
The processing result of the character block selection unit in the fifth embodiment.
An example of a paper medium (book name: yokohama, page: page 2) in the fifth embodiment.
An example in which a QR code is present in a document in the sixth embodiment of the present invention.
A configuration diagram of the index creation device in the seventh embodiment of the present invention.
An example of a document in the seventh embodiment.
A flowchart of the processing in the seventh embodiment.
The data structure of the document list in the seventh embodiment.
An example of character string splitting (character 2-gram method) in the seventh embodiment.
The processing result of the basic character string extraction unit in the seventh embodiment.
An example of character string extraction by the peripheral character string extraction unit in the seventh embodiment.
The processing result of the peripheral character string extraction unit in the seventh embodiment.
A configuration diagram of the search system in the eighth embodiment of the present invention.
An example of a document in the eighth embodiment.
A flowchart of the server-side processing in the eighth embodiment.
An example of text arranged in a list as a data structure in the eighth embodiment.
An example of character string splitting by the character 2-gram method in the eighth embodiment.
The processing result of the basic character string extraction unit in the eighth embodiment.
An example of peripheral character string extraction in the eighth embodiment.
The processing result of the peripheral character string extraction unit in the eighth embodiment.
The data structure of the content DB in the eighth embodiment.
A flowchart of the process by which the client unit queries the server unit in the eighth embodiment.
An example of a document page in the eighth embodiment.
An example of a created image file in the eighth embodiment.
An example of text data extracted from an image file in the eighth embodiment.
An example of text data in the eighth embodiment.
An example of extracting basic character strings by the character 2-gram method in the eighth embodiment.
An example of basic character strings extracted by the basic character string extraction unit in the eighth embodiment.
An example of the method of extracting peripheral character strings in the eighth embodiment.
An example of data in which basic character strings and peripheral character strings are associated in the eighth embodiment.
An example of a query result (part 1) in the eighth embodiment.
An example of a query result (part 2) in the eighth embodiment.
An example of a query result (part 3) in the eighth embodiment.
An example of display data in the eighth embodiment.
An example of region extraction in the ninth embodiment of the present invention.
The pairs of basic character strings and peripheral character strings extracted in the ninth embodiment.
An example of a query result that includes misrecognized data in the ninth embodiment.
A flowchart of the processing in the ninth embodiment.
An example of a query result in the ninth embodiment.
An example of a converted query result in the ninth embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

Although not specifically illustrated, each component of the search device and system in each of the following embodiments has a memory for storing input data and processing results.

[First Embodiment]
In the present embodiment, only the part of the search device that creates the index used to execute a search (the index creation device) is described.

FIG. 3 shows the configuration of the index creation device according to the first embodiment of the present invention.

The index creation device shown in the figure comprises a document input unit 10, a character block extraction unit 11, an index output unit 12, and an index DB 13.

The input document is a paper book consisting of a set of pages containing characters. In the present embodiment, each page of the book is read by a scanner (a commonly available device) and, as shown in FIG. 4, converted into an electronic file (a PDF file or the like) whose layout does not change across different viewing environments (OS, PDF viewing software, etc.). FIG. 4 is an example of such a PDF file (book name: vegetable, file name: vegetable3.pdf, page: third page).

Although FIG. 4 is an example of a page consisting only of text, a page may also contain information other than characters, such as figures and tables.

The character block extraction unit 11 has a memory (not shown) for storing the character blocks extracted from the input document.

FIG. 5 is a flowchart of the index creation process according to the first embodiment of the present invention. The process is broadly divided into an input step (steps 110 and 120) for inputting the group of documents to be indexed, a character block extraction step (step 130) for extracting character blocks from each page of each document, and an output step (step 140) for outputting the index used to execute a search.

Step 110) The document input unit 10 receives the group of documents to be indexed (analyzed).

Step 120) The document input unit 10 stores a list of the input pages, in the data structure shown in FIG. 6, in the memory (not shown) of the character block extraction unit 11. Each element of the list uniquely identifies one page. For convenience of explanation, the file name is used here as the information that uniquely identifies each page, but any other information that can uniquely identify a page, such as a hash value of the file, may be used instead.
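As a minimal sketch of the alternative mentioned in step 120, a page could be keyed by a hash of its file contents instead of its file name. The function name `page_id` and the choice of SHA-256 are assumptions made for illustration; the description only says that a hash value of the file may be used.

```python
import hashlib


def page_id(file_bytes: bytes) -> str:
    """Return a hex digest that identifies a page by its file contents."""
    # SHA-256 is an assumed choice; the description does not name a hash function.
    return hashlib.sha256(file_bytes).hexdigest()


# Identical bytes give the same identifier; different bytes give different ones.
same = page_id(b"page three of vegetable") == page_id(b"page three of vegetable")
different = page_id(b"page three of vegetable") != page_id(b"page four of vegetable")
```

Unlike a file name, such an identifier stays stable if the page file is renamed, but it changes whenever the page is rescanned with different bytes.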

Step 130) The character block extraction unit 11 extracts character blocks from each page of each input document. Specifically, it reads each page named in the list that was passed from the document input unit 10 and stored in the memory (not shown), extracts a group of character blocks from each page in accordance with the rules in the character block extraction rule storage unit 14, associates each character block with the file name of its source page and its appearance position on that page, and passes the result to the index output unit 12.

Here, a "character block" is a group of one or more characters extracted in a prescribed shape from the character strings arranged on a page. In this embodiment, the prescribed shape is a cross, as shown in FIG. 7.

The character block extraction rule stored in the character block extraction rule storage unit 14 specifies how character blocks are extracted from a page. Here, as shown in FIG. 8, blocks are extracted while shifting one character at a time from the upper left corner toward the lower right corner.

The "appearance position of a character block" is position information indicating where on the page the character block appears, at whatever granularity suits the purpose of the system. Here, it is expressed as the row and column of the character at the top end of the cross-shaped character block.

FIG. 9 shows the processing result of the character block extraction unit 11 in the first embodiment of the present invention.

This data is passed to the index output unit 12. Here, each character block is written by listing its constituent characters in order from the top and from the left.
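The extraction in step 130 can be sketched as follows. FIG. 7 is not reproduced here, so the exact cross shape is an assumption: the sketch uses a five-character plus shape (top, left, center, right, bottom). The function name `extract_cross_blocks`, the representation of a page as a list of text lines, and the 1-based row and column numbering are likewise illustrative choices, not taken from the description.

```python
def extract_cross_blocks(lines):
    """Return (block, row, col) for every plus-shaped block on a page.

    row and col are the 1-based position of the top character of the cross.
    """
    blocks = []
    for r in range(len(lines) - 2):
        top, mid, bottom = lines[r], lines[r + 1], lines[r + 2]
        # Start at column index 1 so that the left arm of the cross exists.
        for c in range(1, len(top)):
            if c + 1 >= len(mid) or c >= len(bottom):
                continue  # the right arm or the bottom character is missing
            # Characters are listed from the top and from the left.
            block = top[c] + mid[c - 1] + mid[c] + mid[c + 1] + bottom[c]
            blocks.append((block, r + 1, c + 1))
    return blocks


page = ["abcd",
        "efgh",
        "ijkl"]
blocks = extract_cross_blocks(page)
```

For this three-line page, only two positions of the top character leave room for the whole cross, so two blocks are produced.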

Step 140) The index output unit 12 outputs the index used to execute a search to the index DB 13. Specifically, it stores each character block passed from the character block extraction unit 11 in the index DB 13 in the data structure of FIG. 9. This realizes an index DB 13 that takes a character block as the inquiry key and returns the file name and appearance position as the inquiry result.
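A minimal in-memory stand-in for the index DB of step 140 is a mapping from character block to the list of places where it occurs. The names `build_index` and `records`, and the second file `vegetable7.pdf`, are hypothetical; the actual storage format is the one shown in FIG. 9, which is not reproduced here.

```python
from collections import defaultdict


def build_index(records):
    """Map each character block to every (file name, row, col) where it appears."""
    index = defaultdict(list)
    for block, file_name, row, col in records:
        index[block].append((file_name, row, col))
    return dict(index)


# Illustrative records: (character block, source file, row, column).
records = [
    ("befgj", "vegetable3.pdf", 1, 2),
    ("cfghk", "vegetable3.pdf", 1, 3),
    ("befgj", "vegetable7.pdf", 4, 5),
]
index = build_index(records)

# A block is the inquiry key; file names and positions are the inquiry result.
result = index.get("befgj", [])
```

A block that occurs in more than one place returns several candidates, which is why longer or more distinctive blocks identify a position more reliably.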

As described above, handling characters in units of blocks allows a character string pattern that tends to be unique to each position in a document to be expressed with a small number of characters, so an index with high discriminating power and high robustness can be realized.

[Second Embodiment]
In the present embodiment, a server unit and a client unit are provided. An example is described in which the index is created in the server unit, and the client unit queries the index on the server unit and displays the result.

FIG. 10 shows the system configuration in the second embodiment of the present invention.

The system shown in the figure broadly consists of a server unit 300, a client unit 400, and external devices.

The server unit 300 comprises a data input unit 310, a character block extraction unit 320, an index output unit 330, an index DB 340, a content DB 350, a server-side data transmission/reception unit 360, a DB inquiry unit 370, and a character block extraction rule storage unit 321.

The client unit 400 has a client-side device 410 and a client-side data transmission/reception unit 420, and the client-side device 410 has a document photographing unit 411 and a content display unit 412.

As external devices, a document reading device 100 that reads the document group 200 and an optical character recognition device 101 are provided.

Each document is a paper book consisting of a set of pages containing characters. FIG. 11 shows an example of the pages that make up the document group 200 input to the document reading device 100 in the present embodiment, and of the document page 201 input to the client unit 400.

The document reading device 100 is connected to the data input unit 310 of the server unit 300. The optical character recognition device 101 is connected to the server-side data transmission/reception unit 360 of the server unit 300.

In the present embodiment, the following two tasks are performed:
(1) creating an index in the server unit 300;
(2) inquiring of the server unit 300 from the client unit 400.

(1) Creating an index in the server unit 300:
This process corresponds to that of the index creation device in the first embodiment described above. FIG. 12 is a flowchart of the index creation process in the second embodiment of the present invention.

The process consists of an input step (steps 210 and 220) for inputting the documents to be indexed, a character block extraction step (step 230) for extracting character blocks from each page of each document, and an output step (step 240) for outputting the index used to execute a search.

Step 210) Each page of each document in the document group 200 is read by the document reading device 100 and passed to the data input unit 310 of the server unit 300.

Here, the document reading device 100 is a commonly available device that reads text printed on a paper medium and converts it into a text file; a scanner with an OCR function is one example. It is assumed that the text of each page is converted into a text file with the page break and line break positions of the printed paper medium preserved.

Step 220) The data input unit 310 reads the group of text files passed from the document reading device 100, lists them in the data structure shown in FIG. 13, and passes the list to the character block extraction unit 320 of the server unit 300. For convenience of explanation, the file name is used here as the information that uniquely identifies each page, but any other information that can uniquely identify a page, such as a hash value of the file, may be used instead.

Step 230) The character block extraction unit 320 of the server unit 300 reads each page named in the list passed from the data input unit 310, extracts a group of character blocks from each page in accordance with the rules in the character block extraction rule storage unit 321, associates each character block with the file name of its source page and its appearance position on that page, and passes the result to the index output unit 330.

Here, a "character block" is a group of one or more characters extracted in a prescribed shape from the character strings arranged on a page. In this embodiment, the prescribed shape is a box, as shown in FIG. 14.

The "character block extraction rule" stored in the character block extraction rule storage unit 321 specifies how character blocks are extracted from a document. Here, as shown in FIG. 15, blocks are extracted while shifting two characters at a time from the upper left corner toward the lower right corner.

The "appearance position of a character block" is position information indicating where in the document the character block appears, at whatever granularity suits the purpose of the system. Here, it is expressed as the row and column of the character at the upper left corner of the box-shaped character block.
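The box-shaped extraction of step 230 can be sketched in the same way. The box dimensions in FIG. 14 are not reproduced here, so the 2 x 2 default is an assumption, as is applying the two-character shift to both rows and columns. The function name `extract_box_blocks` and the 1-based positions are illustrative.

```python
def extract_box_blocks(lines, height=2, width=2, step=2):
    """Return (block, row, col) for box-shaped blocks taken every `step` characters.

    row and col are the 1-based position of the upper left character of the box.
    """
    blocks = []
    for r in range(0, len(lines) - height + 1, step):
        rows = lines[r:r + height]
        # Only columns present in every row of the box can be used.
        usable = min(len(row) for row in rows)
        for c in range(0, usable - width + 1, step):
            # Characters are listed from the top and from the left.
            block = "".join(row[c:c + width] for row in rows)
            blocks.append((block, r + 1, c + 1))
    return blocks


page = ["abcd",
        "efgh",
        "ijkl",
        "mnop"]
blocks = extract_box_blocks(page)
```

With a step of 2 the boxes tile the page without overlap, which keeps the index small; a step of 1 would produce nine overlapping boxes for the same page.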

FIG. 16 shows the processing result of the character block extraction unit in the second embodiment of the present invention.

The figure shows the result of the above processing by the character block extraction unit 320; this data is passed to the index output unit 330. Here, each character block is written by listing its constituent characters in order from the top and from the left.

Step 240) The index output unit 330 stores each character block passed from the character block extraction unit 320 in the index DB 340 in the data structure of FIG. 16. This realizes an index DB 340 that, at search time, takes a character block as the inquiry key and returns the file name and appearance position as the inquiry result.

The content DB 350 of the server unit 300 stores content, created by book publishers or general users, that relates to individual positions (page, row, column, etc.) in each book, together with information about that content. A dedicated terminal may be provided for storing content and its related information in the content DB 350, or a Web application for this purpose may be provided so that an unspecified number of users can freely store content and its related information through a Web browser. Examples of content include accounts by travelers who visited each location, in the case of a guide to tourist attractions, and videos of chemistry experiments, in the case of a chemistry textbook. Here, the content itself is stored in a data storage area on the server unit 300, and the relationship between content and positions in each book is stored in the data structure shown in FIG. 17.

(2) Inquiring of the server unit 300 from the client unit 400:
FIG. 18 is a flowchart of the process of inquiring of the server unit from the client unit in the second embodiment of the present invention.

The process consists of an input step (steps 310 and 320) for inputting part of a page that is subject to indexing, a character block extraction step (step 330) for extracting character blocks from the input part of the page, an inquiry step (steps 340, 350, and 360) for querying the index DB 340 and identifying the content associated with the input part of the page, and an output step (step 370) for displaying the content obtained from the inquiry on the client unit 400.

Step 310) In the client-side device 410 of the client unit 400, the document photographing unit 411 optically photographs all or part of the document page 201, saves the photograph as an image file, and passes it to the client-side data transmission/reception unit 420. The client-side data transmission/reception unit 420 passes the image file received from the document photographing unit 411 to the server-side data transmission/reception unit 360 of the server unit 300 over the network.

The document page 201 is one page of one book included in the document group 200. Here, it is assumed that part of the page shown in FIG. 11 is photographed and the image file shown in FIG. 19 is created.

Step 320) The server-side data transmission/reception unit 360 uses the optical character recognition device 101 to extract the text data shown in FIG. 20 from the image file passed from the client-side data transmission/reception unit 420, and passes it to the character block extraction unit 320. The optical character recognition device 101 is ordinary OCR software or the like: a commonly available device that extracts character information from an image in which characters were photographed and converts it into text data in a form a computer can use.

Step 330) The character block extraction unit 320 reads the text data passed from the server-side data transmission/reception unit 360, extracts a group of character blocks from the text data in accordance with the rules in the character block extraction rule storage unit 321, and passes them to the DB inquiry unit 370.

Here, a "character block" is extracted in a box shape, as in step 230, as shown in FIG. 21.

As the "character block extraction rule", blocks are extracted while shifting one character at a time from the upper left corner toward the lower right corner, as shown in FIG. 22.

FIG. 23 shows the result of the processing by the character block extraction unit 320; this data is passed to the DB inquiry unit 370. Here, each character block is written by listing its constituent characters in order from the top and from the left.

Step 340) The DB inquiry unit 370 queries the index DB 340 using the list of FIG. 23.

First, the DB inquiry unit 370 asks for the file name and appearance position corresponding to each character block in FIG. 23, and obtains the inquiry results in the format shown in FIG. 24.
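The description does not state why the index is built with a two-character shift while the query uses a one-character shift. One plausible reading, sketched below under that assumption, is that the finer shift on the query side lets a photographed fragment match indexed blocks even when the fragment does not start on the indexed grid. The 2 x 2 box size, the helper `box_blocks`, the sample page, and the file name `page.txt` are all hypothetical.

```python
def box_blocks(lines, step, size=2):
    """Return (block, row, col) tuples for size x size boxes; positions are 1-based."""
    out = []
    for r in range(0, len(lines) - size + 1, step):
        rows = lines[r:r + size]
        usable = min(len(row) for row in rows)
        for c in range(0, usable - size + 1, step):
            out.append(("".join(row[c:c + size] for row in rows), r + 1, c + 1))
    return out


page = ["abcdef",
        "ghijkl",
        "mnopqr",
        "stuvwx"]

# Index side: shift two characters at a time (assumed for both rows and columns).
index = {}
for block, row, col in box_blocks(page, step=2):
    index.setdefault(block, []).append(("page.txt", row, col))

# Query side: a fragment that starts one row down and one column in,
# so it is misaligned with the grid used when indexing.
fragment = [line[1:] for line in page[1:]]

# Shifting one character at a time still yields blocks that exist in the index.
hits = [(block, where)
        for block, _, _ in box_blocks(fragment, step=1)
        for where in index.get(block, [])]
```

In this example two of the eight query blocks hit the index, and both hits report positions in the original page rather than in the fragment.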

Step 350) Next, the DB inquiry unit 370 queries the content DB 350 using the inquiry results of FIG. 24. As described above, the content DB 350 stores data in the structure shown in FIG. 17. For each pair of file name and appearance position in FIG. 24 (for example, "vegetable3.txt" and "row 1, column 3"), the DB inquiry unit 370 asks the content DB 350 for the content and content type corresponding to that file name and appearance position, obtains the inquiry results in the format shown in FIG. 25, counts how many times each pair of content and content type occurs, and converts the results into the format shown in FIG. 26.

Step 360) Of the above inquiry results, the DB inquiry unit 370 passes those that satisfy a certain condition to the server-side data transmission/reception unit 360 as the search result. In the present embodiment, the result with the highest count ("tomato_1.mp4", "movie file" in FIG. 26) is passed to the server-side data transmission/reception unit 360 as the search result.
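Steps 350 and 360 amount to a lookup followed by a majority vote. The sketch below assumes the content DB can be modeled as a mapping from (file name, row, column) to a list of (content, content type) pairs; that model, the function name `pick_content`, and the sample rows are illustrative and do not reproduce FIG. 17 or FIG. 26.

```python
from collections import Counter


def pick_content(content_db, index_hits):
    """Return the (content, content type) pair that the most index hits point to."""
    counts = Counter()
    for position in index_hits:
        for content in content_db.get(position, []):
            counts[content] += 1
    if not counts:
        return None  # no hit had any content attached
    return counts.most_common(1)[0][0]


# Hypothetical content DB rows keyed by (file name, row, column).
content_db = {
    ("vegetable3.txt", 1, 3): [("tomato_1.mp4", "movie file")],
    ("vegetable3.txt", 1, 5): [("tomato_1.mp4", "movie file")],
    ("vegetable3.txt", 9, 1): [("cucumber.txt", "text file")],
}
index_hits = [("vegetable3.txt", 1, 3), ("vegetable3.txt", 1, 5), ("vegetable3.txt", 9, 1)]
best = pick_content(content_db, index_hits)
```

Taking the highest count is the condition used in this embodiment; the description leaves room for other conditions, such as a minimum count threshold.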

The server-side data transmission/reception unit 360 passes the search result received from the DB inquiry unit 370 to the client-side data transmission/reception unit 420 over the network.

Step 370) The client-side data transmission/reception unit 420 passes the search result received from the server-side data transmission/reception unit 360 to the content display unit 412.

The content display unit 412 displays the search result received from the client-side data transmission/reception unit 420 using the content display means (FIG. 27) set in advance inside the content display unit 412. As shown in FIG. 27, the content display means are set so that a text viewer is used for text files, a sound player for sound files, and a movie player for movie files.
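The preset content display means can be thought of as a small lookup table from content type to viewer. The table below mirrors the three pairings named in the text; the dictionary name, the string labels, and the `choose_viewer` helper are assumptions for illustration and do not reproduce FIG. 27.

```python
# Content type -> display means, following the three pairings given in the text.
CONTENT_DISPLAY_MEANS = {
    "text file": "text viewer",
    "sound file": "sound player",
    "movie file": "movie player",
}


def choose_viewer(content_type):
    """Return the display means for a content type, or None if none is configured."""
    return CONTENT_DISPLAY_MEANS.get(content_type)


viewer = choose_viewer("movie file")
```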

[Third Embodiment]
In step 310 of the second embodiment described above, if the shooting conditions are poor (insufficient light, camera shake, etc.) and the quality of the image taken by the document photographing unit 411 is low, the optical character recognition device 101 may be unable to extract characters accurately from the photographed image in step 320. Moreover, because optical character recognition is not 100% accurate even at the current state of the art, characters are misrecognized with a certain probability even when the quality of the photographed image is good.

The present embodiment addresses the case in which, as described above, the optical character recognition device 101 does not recognize characters correctly.

FIG. 28 is a system configuration diagram of the third embodiment of the present invention. In the system shown in the figure, a region DB 380 is added to the server unit 300. The processing flow is the same as in the second embodiment.

First, as advance preparation, regions with an arbitrary extent that covers multiple character blocks are defined within each book. Here, as in the second embodiment, each page of each book is converted into a text file, regions spanning about 10 lines are defined within each text file as shown in FIG. 29, and the regions are stored in the region DB 380. Regions may be defined to be mutually exclusive, like "region 1" and "region 2", or to partially overlap, like "region 3" and "region 4".

Next, as shown in FIG. 30, content is associated with each region and stored in the content DB 350. The same content may be associated with multiple regions, as with "cucumber.txt". Different content may also be associated with the same region, as with "tomato_1.mp4" and "tomato_2.mp4".

For example, suppose that in steps 310 to 320 the document photographing unit 411 photographed the same area as in FIG. 19, but because the image quality was poor, the optical character recognition device 101 extracted the text as in FIG. 31 where it should have extracted it as in FIG. 20. In FIG. 31, underlined parts indicate misrecognized characters. If the processing of the character block extraction unit 320 in step 330 is performed in this situation, a list like that of FIG. 32 is obtained. In FIG. 32, underlined parts likewise indicate misrecognized characters.

Next, when the same processing as step 340 (the index DB inquiry) is performed using the list of FIG. 32, the DB inquiry unit 370 obtains the inquiry results in the format shown in FIG. 33. In the figure, "not applicable" indicates that the character block in question is not contained in the index DB 340.

Here, the DB inquiry unit 370 queries the region DB 380 using the list of FIG. 33 (excluding entries whose file name is "not applicable"), obtains the relationship between each block and the region in which it appears in the format of FIG. 34, and then counts the number of appearances per region to convert the result into the format of FIG. 35. The region with the most appearances in FIG. 35 is identified as the region contained in the area photographed by the client unit 400. Here, that is "region 5".
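The region vote can be sketched as follows. The sketch assumes a region is a range of rows within one file and that a block belongs to a region when the row of its appearance position falls inside that range; the description leaves the exact region representation to FIG. 29. The function name `find_region`, the row ranges, and the sample hits are hypothetical.

```python
from collections import Counter


def find_region(regions, index_hits):
    """Return the name of the region that contains the most index hits, or None."""
    votes = Counter()
    for file_name, row, _col in index_hits:
        for name, (region_file, first_row, last_row) in regions.items():
            # Overlapping regions are allowed, so one hit may vote for several regions.
            if region_file == file_name and first_row <= row <= last_row:
                votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical regions: name -> (file name, first row, last row).
regions = {
    "region 4": ("vegetable3.txt", 1, 10),
    "region 5": ("vegetable3.txt", 11, 20),
}
# Hits that survived the index DB inquiry; "not applicable" blocks are already dropped.
# The hit in row 2 stands for a misrecognized block that matched the wrong place.
index_hits = [("vegetable3.txt", 12, 3), ("vegetable3.txt", 13, 5), ("vegetable3.txt", 2, 1)]
region = find_region(regions, index_hits)
```

Because the decision is made by count, a few stray or missing hits caused by misrecognized characters do not change the winning region as long as most blocks are read correctly.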

Finally, the DB inquiry unit 370 asks the content DB 350 for the content and content type associated with "region 5", and when the same processing as step 370 is then performed, the correct content (cucumber.txt) is displayed on the content display unit 412.

In this way, an extent covering multiple character blocks is treated as one region, the region in which the character blocks extracted from the photographed image appear most often is identified, and the content associated with that region is taken as the search result. With this method, the correct search result can be obtained even when optical character recognition accuracy is poor and misrecognized characters are mixed into the character blocks.

[Fourth Embodiment]
The present embodiment describes a process for creating an index that is smaller than that of the first embodiment without significantly reducing the coverage of the searchable area.

Here, the index is created using only those character blocks that contain a specific character string consisting of one or more characters. The detailed procedure is as follows.

FIG. 36 shows the configuration of the index creation device in the fourth embodiment of the present invention.

The index creation device shown in the figure comprises a document input unit 40, a character block extraction unit 41, a character block selection unit 42, an index output unit 43, an index DB 44, a character block extraction rule storage unit 45, and a specific character string DB 46, which is an external device.

入力されるドキュメント群の各ドキュメントは、文字を含むページの集合からなる紙媒体の書籍とする。本実施の形態では、この書籍の各ページをスキャナ（一般装置）で読み取り、図３７に示すように、異なる閲覧環境（ＯＳ，ＰＤＦ閲覧ソフト等）においてもレイアウトが変化しない電子ファイル(ＰＤＦファイル等)に変換する。 Each document in the input document group is assumed to be a paper medium book composed of a set of pages including characters. In this embodiment, each page of the book is read by a scanner (general apparatus), and as shown in FIG. 37, an electronic file (PDF file or the like) whose layout does not change even in different browsing environments (OS, PDF browsing software, etc.). ).

なお、図３７は、文書のみからなるページの例であるが、ページには図や表などの文字以外の情報が含まれていてもよい。 Note that FIG. 37 shows an example of a page consisting only of text, but a page may also contain non-character information such as figures and tables.

また、文字ブロック抽出部４１、文字ブロック選別部４２は、抽出した文字ブロックを格納するメモリ（図示せず）を有するものとする。 The character block extraction unit 41 and the character block selection unit 42 have a memory (not shown) for storing the extracted character blocks.

外部装置である特定文字列ＤＢ４６には、事前に１つ以上の文字からなる特定文字列が１つ以上登録されているものとする。検索可能領域の網羅性を大幅に低減させないためには、ドキュメント中の各領域に満遍なく出現する文字列が登録されていることが望ましく、日本語ドキュメントの場合は「の」、「は」、「が」、「。」、「、」等の助詞や句読点がこれにあたる。以降、本実施の形態では、「の」の１語が特定文字列ＤＢ４６に登録されているものとして説明を行うが、その他の文字が特定文字列ＤＢ４６に登録されていても構わない。 It is assumed that one or more specific character strings, each consisting of one or more characters, are registered in advance in the specific character string DB 46, which is an external device. To avoid significantly reducing the coverage of the searchable area, the registered character strings should appear evenly throughout every region of a document; for Japanese documents, particles and punctuation marks such as 「の」, 「は」, 「が」, 「。」, and 「、」 meet this condition. In the following description of this embodiment, the single word 「の」 is assumed to be registered in the specific character string DB 46, but other characters may be registered instead.

図３８は、本発明の第４の実施の形態における処理のフローチャートである。 FIG. 38 is a flowchart of processing in the fourth embodiment of the present invention.

本実施の形態における処理は、インデックス作成対象となるドキュメント群を入力する入力ステップ（ステップ４１０，４２０）、各ドキュメントの各ページから特定文字列を含む文字ブロックを抽出する文字ブロック抽出ステップ（ステップ４３０）、検索を実行するためのインデックスを出力する出力ステップ（ステップ４４０）に分けられる。 The processing in this embodiment is divided into an input step (steps 410 and 420) for inputting the document group to be indexed, a character block extraction step (step 430) for extracting character blocks containing a specific character string from each page of each document, and an output step (step 440) for outputting an index used to execute searches.

ステップ４１０）ドキュメント入力部４０は、分析対象の各ドキュメントの各ページの入力を受け付ける。 Step 410) The document input unit 40 receives input of each page of each document to be analyzed.

ステップ４２０）ドキュメント入力部４０は、入力された各ページのリストを図３９に示すデータ構造で文字ブロック抽出部４１に渡す。リスト内の各要素は、各ページを一意に示すものとする。なお、説明の便宜上、ここでは各ページを一意に示す情報としてファイル名を用いているが、ファイルのハッシュ値等、ページを一意に識別できる情報であれば他の情報を利用しても構わない。 Step 420) The document input unit 40 passes a list of the input pages to the character block extraction unit 41 in the data structure shown in FIG. 39. Each element of the list uniquely identifies one page. For convenience of explanation, a file name is used here as the information that uniquely identifies each page, but any other information that uniquely identifies a page, such as a hash value of the file, may be used.

ステップ４３０）文字ブロック抽出部４１は、ドキュメント入力部４０から渡されたリストに記載されている各ページを読み込み、各ページから文字ブロック群を文字ブロック抽出ルール記憶部４５のルールに則って抽出し、各文字ブロックと各文字ブロックの抽出元のページにおける該文字ブロックの出現位置を関連付けて、文字ブロック選択部４２に渡す。 Step 430) The character block extraction unit 41 reads each page described in the list passed from the document input unit 40, and extracts a character block group from each page according to the rules of the character block extraction rule storage unit 45. Each character block is associated with the appearance position of the character block in the page from which each character block is extracted, and passed to the character block selection unit 42.

ここで「文字ブロック」とは、ステップ１３０と同じく、十字型の形状で図４０のように抽出する。 Here, as in step 130, a “character block” is extracted in a cross shape, as shown in FIG. 40.

また、文字ブロック抽出ルール記憶部４５に格納されているルールは図４１のように、左上隅から右下隅方向へ１文字ずつずらしながら抽出することとする。 The rule stored in the character block extraction rule storage unit 45 is to extract blocks while shifting one character at a time from the upper left corner toward the lower right corner, as shown in FIG. 41.

また、「文字ブロックの出現位置」とは、文字ブロックがページのどの位置に出現しているか、システムの目的に応じて任意の粒度で示す位置情報である。ここでは、十文字型文字ブロックの上端の文字の行、列の粒度で表現することとする。 The “appearance position of a character block” is position information indicating where on the page the character block appears, at any granularity suited to the purpose of the system. Here, it is expressed as the row and column of the character at the top of the cross-shaped character block.

図４２は、本発明の第４の実施の形態における文字ブロック抽出部の処理結果を示す。同図では、文字ブロック抽出部４１でステップ４３０の処理を行った結果を示しており、このデータが文字ブロック選別部４２に渡される。なお、ここでは、文字ブロックを、文字ブロックを構成する文字を上方、左方から順番に並べて表現している。 FIG. 42 shows the processing result of the character block extraction unit in the fourth embodiment of the present invention. The figure shows the result of the processing of step 430 performed by the character block extraction unit 41, and this data is transferred to the character block selection unit 42. Here, the character block is expressed by arranging the characters constituting the character block in order from the top and the left.
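The extraction performed in step 430 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: a page is assumed to be available as a list of text lines (one string per printed line), the cross-shaped block is assumed to consist of five characters (a center character plus one character above, left, right, and below), and rows and columns are counted from 1. The function name `extract_cross_blocks` is hypothetical.

```python
from typing import Iterator, List, Tuple


def extract_cross_blocks(page_lines: List[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (block, row, col) for every cross-shaped block on one page.

    Assumption: a block is 5 characters (top, left, center, right, bottom),
    written in top-to-bottom, left-to-right order. (row, col) is the 1-based
    position of the top character of the cross.
    """
    # Slide the center of the cross one character at a time,
    # from the upper left toward the lower right of the page.
    for r in range(1, len(page_lines) - 1):
        above, here, below = page_lines[r - 1], page_lines[r], page_lines[r + 1]
        for c in range(1, len(here) - 1):
            # Skip positions where a shorter neighboring line has no character.
            if c >= len(above) or c >= len(below):
                continue
            block = above[c] + here[c - 1] + here[c] + here[c + 1] + below[c]
            # The top character sits at 0-based (r - 1, c), i.e. 1-based (r, c + 1).
            yield block, r, c + 1


# A tiny three-line page used only for illustration.
page = ["あいうえ", "かきくけ", "さしすせ"]
blocks = list(extract_cross_blocks(page))
```

Each tuple pairs a block with the position of its top character, which is the granularity chosen in this embodiment; the file name of the page would be attached by the caller.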

ステップ４４０）文字ブロック選別部４２は、文字ブロック抽出部４１から渡された各文字ブロックについて、特定文字列ＤＢ４６に問い合わせを行い、特定文字列ＤＢ４６に登録されている語（本実施の形態では「の」）を含む文字ブロックのみを選別する。 Step 440) For each character block passed from the character block extraction unit 41, the character block selection unit 42 queries the specific character string DB 46 and selects only the character blocks that contain a word registered in the specific character string DB 46 (「の」 in this embodiment).

図４３は、本発明の第４の実施の形態における文字ブロック選別部の処理結果を示す。同図では、文字ブロック選別部４２でステップ４４０の処理を行った結果を示しており、このデータがインデックス出力部４３に渡される。 FIG. 43 shows the processing result of the character block selection unit in the fourth embodiment of the present invention. The figure shows the result of the processing of step 440 performed by the character block selection unit 42, and this data is passed to the index output unit 43.

ステップ４５０）インデックス出力部４３は、文字ブロック選別部４２から渡された各文字ブロックを、図４３のデータ構造でインデックスＤＢ４４に格納する。これにより、文字ブロックを問い合わせキーとし、ファイル名及び出現位置を問い合わせ結果として返すインデックスＤＢ４４を実現する。 Step 450) The index output unit 43 stores each character block passed from the character block selection unit 42 in the index DB 44 in the data structure of FIG. 43. This realizes an index DB 44 that takes a character block as the query key and returns a file name and an appearance position as the query result.

上記のように、「の」のような網羅的に出現する文字列を含む文字ブロックのみを用いてインデックスを作成することで、インデックスのサイズを小さくし、かつ検索可能領域の網羅性を大幅に低減させずにインデックスを作成できる。 As described above, creating the index using only character blocks that contain a character string appearing throughout the document, such as 「の」, makes the index smaller without significantly reducing the coverage of the searchable area.
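The selection and output steps (steps 440 and 450) can be sketched together as follows. The sketch assumes that each extracted character block arrives as a tuple of (block, file name, row, column) and that the specific character string DB is a plain set of strings; both the tuple layout and the name `build_block_index` are hypothetical. The substring test is exact for single-character strings such as 「の」; a multi-character string would need a test that respects the two-dimensional shape of the block.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Block = Tuple[str, str, int, int]   # (block, file name, row, column)
Posting = Tuple[str, int, int]      # (file name, row, column)


def build_block_index(blocks: Iterable[Block],
                      specific_strings: Set[str]) -> Dict[str, List[Posting]]:
    """Keep only blocks containing a registered string and index them by block."""
    index: Dict[str, List[Posting]] = defaultdict(list)
    for block, file_name, row, col in blocks:
        # Selection step: drop blocks that contain no registered string.
        if any(s in block for s in specific_strings):
            # Output step: the block is the query key, the position the result.
            index[block].append((file_name, row, col))
    return dict(index)


# Two illustrative blocks; only the first contains 「の」.
extracted = [
    ("いかのくし", "page1.pdf", 1, 2),
    ("うきくけす", "page1.pdf", 1, 3),
]
index = build_block_index(extracted, {"の"})
```

Because blocks without a registered string are never stored, the index shrinks roughly in proportion to how rarely the registered strings occur, which is the trade-off this embodiment describes.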

［第５の実施の形態］ [Fifth Embodiment]
本実施の形態では、第２の実施の形態よりもインデックスサイズを小さくし、かつ光学文字認識処理の誤認識の影響を軽減できるように、インデックスを作成する方法について言及する。 This embodiment describes a method for creating an index that is smaller than the index of the second embodiment and that reduces the influence of misrecognition in optical character recognition processing.

ここでは、文字ブロックが、光学文字認識装置１０１が内部に保有している文字列辞書に登録されている文字列を含む場合のみ、該文字ブロックを用いてインデックスを作成する。以下、詳細な手順を示す。 Here, only when the character block includes a character string registered in the character string dictionary held in the optical character recognition apparatus 101, an index is created using the character block. The detailed procedure is shown below.

図４４は、本発明の第５の実施の形態における検索システムの構成を示す。 FIG. 44 shows the configuration of the search system according to the fifth embodiment of the present invention.

同図に示すシステムにおいて、第２の実施の形態と同様の構成要素には同一符号を付し、その説明を省略する。 In the system shown in the figure, the same components as those of the second embodiment are denoted by the same reference numerals, and the description thereof is omitted.

なお、クライアント部４００、外部装置（特定文字列ＤＢ１０３以外）の構成・動作は第２の実施の形態と同様であるので、以降の説明では詳細を省略する。 Note that the configurations and operations of the client unit 400 and the external device (other than the specific character string DB 103) are the same as those in the second embodiment, and thus the details are omitted in the following description.

サーバ部５００は、図１０の構成に文字ブロック選別部５１０を付加した構成である。 The server unit 500 has the configuration of FIG. 10 with a character block selection unit 510 added.

クライアント部６００は、第２の実施の形態と同様である。 The client unit 600 is the same as that in the second embodiment.

外部装置は、第２の実施の形態に加え、特定文字列ＤＢ１０３がある。特定文字列ＤＢ１０３には、事前に特定文字列が登録されているものとする。光学文字認識装置１０１の誤認識の影響を軽減させるためには、光学文字認識装置が精度良く認識できる文字列を含む文字ブロックのみを利用することが望ましい。一般に、ＯＣＲソフトウェア等の光学文字認識装置１０１は内部に文字列辞書を保有しており、当該文字列辞書に登録されている語はそうでない語よりも精度良く認識できる。そこで、本実施の形態では、光学文字認識装置１０１が内部に図４５のような文字列辞書を保有しており、当該辞書と同一内容が特定文字列ＤＢ１０３にも登録されているものとする。 In addition to the external devices of the second embodiment, there is a specific character string DB 103. Specific character strings are assumed to be registered in advance in the specific character string DB 103. To reduce the influence of misrecognition by the optical character recognition device 101, it is desirable to use only character blocks containing character strings that the optical character recognition device can recognize accurately. In general, an optical character recognition device 101 such as OCR software holds an internal character string dictionary, and words registered in that dictionary are recognized more accurately than words that are not. In this embodiment, therefore, the optical character recognition device 101 is assumed to hold a character string dictionary such as that of FIG. 45, and the same contents are assumed to be registered in the specific character string DB 103.

ドキュメントは、文字を含むページの集合からなる紙媒体の書籍とする。本実施の形態でドキュメント読み取り装置１０１に入力されるドキュメント群２００を構成している各ページ及びクライアント部４００に入力されるドキュメントページ２０１の例を図４６に示す。 The document is a paper medium book composed of a set of pages including characters. FIG. 46 shows an example of each page constituting the document group 200 input to the document reading apparatus 101 and the document page 201 input to the client unit 400 in this embodiment.

ドキュメント読み取り装置１０１は、サーバ部５００のデータ入力部５１０に接続されている。光学文字認識装置１０１は、サーバ側データ送受信部３６０に接続されている。 The document reading apparatus 101 is connected to the data input unit 510 of the server unit 500. The optical character recognition device 101 is connected to the server-side data transmission / reception unit 360.

本実施の形態では、
（１）サーバ部５００においてインデックスを作成する作業；
（２）クライアント部４００からサーバ部５００に問い合わせる作業；
を行う。
In this embodiment, the following two tasks are performed:
(1) creating an index in the server unit 500;
(2) querying the server unit 500 from the client unit 400.

（１）サーバ部５００においてインデックスを作成する作業： (1) Creating an index in the server unit 500:
図４７は、本発明の第５の実施の形態におけるサーバ部でインデックスを作成する処理のフローチャートである。以下では、インデックス作成対象となるドキュメントを入力する入力ステップ（ステップ５１０，５２０）、各ドキュメントの各ページ内から文字ブロックを抽出する文字ブロック抽出ステップ（ステップ５３０）、特定文字列ＤＢに登録されている語を含む文字ブロックを選択する文字ブロック選別ステップ（ステップ５４０）、検索を実行するためのインデックスを出力する出力ステップ（ステップ５５０）を行う。 FIG. 47 is a flowchart of the process for creating an index in the server unit in the fifth embodiment of the present invention. The process consists of an input step (steps 510 and 520) for inputting the documents to be indexed, a character block extraction step (step 530) for extracting character blocks from each page of each document, a character block selection step (step 540) for selecting the character blocks that contain a word registered in the specific character string DB, and an output step (step 550) for outputting an index used to execute searches.

ステップ５１０）ドキュメント群２００の各ドキュメントの各ページを、ドキュメント読み取り装置１０１で読み取り、データ入力部３１０に渡す。ここで、ドキュメント読み取り装置１０１は、紙媒体に印刷されたテキストを読み取ってＰＤＦファイルに変換する一般装置であり、ＯＣＲ機能付スキャナ等がこれに該当する。ここでは、各ページ中のテキストは紙媒体に印刷された状態における改ページ位置、改行位置が保持されたままＰＤＦファイルに変換されるものとする。 Step 510) Each page of each document in the document group 200 is read by the document reading device 101 and passed to the data input unit 310. Here, the document reading device 101 is a general-purpose device that reads text printed on a paper medium and converts it into a PDF file; a scanner with an OCR function is one example. The text on each page is assumed to be converted into a PDF file with the page break and line break positions of the printed paper medium preserved.

ステップ５２０）データ入力部５１０は、ドキュメント読み取り装置１０１から渡されたＰＤＦファイル群を読み込み、図３９に示すデータ構造でリスト化して、サーバ部５００の文字ブロック抽出部３２０に渡す。なお、説明の便宜上、ここでは、各ページを一意に示す情報としてファイル名を用いているが、ファイルのハッシュ値等、ページを一意に識別できる情報であれば他の情報を利用しても構わない。 Step 520) The data input unit 510 reads the PDF file group passed from the document reading device 101, lists the files in the data structure shown in FIG. 39, and passes the list to the character block extraction unit 320 of the server unit 500. For convenience of explanation, a file name is used here as the information that uniquely identifies each page, but any other information that uniquely identifies a page, such as a hash value of the file, may be used.

ステップ５３０）サーバ部５００の文字ブロック抽出部３２０は、データ入力部３１０から渡されたリストに記載されている各ページを読み込み、各ページから文字ブロック群を文字ブロック抽出ルール記憶部３２１のルールに則って抽出し、各文字ブロックと各文字ブロックの抽出元ページのファイル名と該ページにおける該文字ブロックの出現位置を関連付けて、文字ブロック選別部５１０に渡す。 Step 530) The character block extraction unit 320 of the server unit 500 reads each page in the list passed from the data input unit 310 and extracts a character block group from each page according to the rule in the character block extraction rule storage unit 321. It associates each character block with the file name of the page from which the block was extracted and with the appearance position of the block on that page, and passes the result to the character block selection unit 510.

ここで「文字ブロック」は、ステップ１３０と同様に十字型の形状で図４０のように抽出する。 Here, as in step 130, a “character block” is extracted in a cross shape, as shown in FIG. 40.

また、「文字ブロック抽出ルール」は、図４１のように、左上隅から右下隅方向へ１文字ずつずらしながら抽出することとする。 The “character block extraction rule” is to extract blocks while shifting one character at a time from the upper left corner toward the lower right corner, as shown in FIG. 41.

図４２に文字ブロック抽出部３２０の処理結果を示す。同図では、文字ブロック抽出部３２０でステップ５３０の処理を行った結果を示しており、このデータが文字ブロック選別部５１０に渡される。なお、ここでは、文字ブロックを、文字ブロックを構成する文字を上方、左方から順番に並べて表現している。 FIG. 42 shows the processing result of the character block extraction unit 320. The figure shows the result of the processing at step 530 performed by the character block extraction unit 320, and this data is passed to the character block selection unit 510. Here, the character block is expressed by arranging the characters constituting the character block in order from the top and the left.

ステップ５４０）文字ブロック選別部５１０は、文字ブロック抽出部３２０から渡された各文字ブロックについて、特定文字列ＤＢ１０３に問い合わせを行い、特定文字列ＤＢ１０３に登録されている語を含む文字ブロックのみを選別する。 Step 540) For each character block passed from the character block extraction unit 320, the character block selection unit 510 queries the specific character string DB 103 and selects only the character blocks that contain a word registered in the specific character string DB 103.

図４８は、本発明の第５の実施の形態における文字ブロック選別部の処理結果を示す。同図では、文字ブロック選別部５１０でステップ５４０の処理を行った結果を示しており、このデータがインデックス出力部３３０に渡される。 FIG. 48 shows the processing result of the character block selection unit in the fifth embodiment of the present invention. The figure shows the result of the processing of step 540 performed by the character block selection unit 510, and this data is passed to the index output unit 330.
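The dictionary-based selection of step 540 can be sketched as follows. Since dictionary words may be longer than one character, the sketch tests each word against the horizontal arm and the vertical arm of the cross separately. This is an assumption made for illustration: the source does not state how a multi-character word is matched against a cross-shaped block. The five-character layout (top, left, center, right, bottom) and the names `arms` and `select_by_dictionary` are likewise hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Block = Tuple[str, str, int, int]  # (block, file name, row, column)


def arms(block: str) -> Tuple[str, str]:
    """Split a 5-character cross into its horizontal and vertical arms."""
    top, left, center, right, bottom = block
    return left + center + right, top + center + bottom


def select_by_dictionary(blocks: Iterable[Block], dictionary: Set[str]) -> List[Block]:
    """Keep only blocks in which some dictionary word runs along one arm."""
    selected = []
    for entry in blocks:
        horizontal, vertical = arms(entry[0])
        if any(w in horizontal or w in vertical for w in dictionary):
            selected.append(entry)
    return selected


# Illustrative dictionary and blocks (not taken from FIG. 45 or FIG. 48).
dictionary = {"検索"}
candidates = [
    ("文字検索列", "page1.pdf", 1, 2),  # horizontal arm 字検索 contains 検索
    ("検あ索いう", "page1.pdf", 1, 3),  # vertical arm 検索う contains 検索
    ("検索あいう", "page1.pdf", 2, 2),  # 検 and 索 are not adjacent on the page
]
kept = select_by_dictionary(candidates, dictionary)
```

The third block shows why the arms are checked separately: its flattened form contains 検索, but the two characters are not adjacent on the page, so the sketch rejects it.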

ステップ５５０）インデックス出力部３３０は、文字ブロック選別部５１０から渡された各文字ブロックを、図４８のデータ構造でインデックスＤＢ３４０に格納する。これにより、文字ブロックを問い合わせキーとし、ファイル名及び出現位置を問い合わせ結果として返すインデックスＤＢ３４０を実現する。 Step 550) The index output unit 330 stores each character block passed from the character block selection unit 510 in the index DB 340 in the data structure of FIG. 48. This realizes an index DB 340 that takes a character block as the query key and returns a file name and an appearance position as the query result.

なお、サーバ部５００のコンテンツＤＢ３５０及び、（２）クライアント部４００からサーバ部５００に問い合わせる作業は、第２の実施の形態と同様であるので、その説明を省略する。 Note that the contents DB 350 of the server unit 500 and (2) work for inquiring of the server unit 500 from the client unit 400 are the same as those in the second embodiment, and thus the description thereof is omitted.

［第６の実施の形態］ [Sixth Embodiment]
本実施の形態では、第２の実施の形態よりもインデックスサイズを小さくし、かつ、光学文字認識処理の誤認識の影響を軽減し、かつ、ドキュメント中のどの位置にインデックスが作成されているかクライアント部を利用するユーザに分かりやすいように、ドキュメント及びインデックスを作成する方法について説明する。 This embodiment describes a method for creating a document and an index such that the index is smaller than in the second embodiment, the influence of misrecognition in optical character recognition processing is reduced, and a user of the client unit can easily tell at which positions in the document an index has been created.

具体的には、第５の実施の形態において、特定文字列ＤＢ１０３及びドキュメントを以下のように変更する。 Specifically, in the fifth embodiment, the specific character string DB 103 and the document are changed as follows.

特定文字列ＤＢ１０３には、事前に特定文字列が登録されているものとする。光学文字認識装置１０１の誤認識の影響を軽減させるためには、光学文字認識装置が精度良く認識できる文字列のみを含む文字ブロックのみを利用することが望ましい。一般に、ＯＣＲソフトウェア等の光学文字認識装置は、「▼（逆三角形）」のようなシンプルな形状であり、かつ類似する文字が少ない文字ほど精度良く認識できる。ここでは、特定文字列ＤＢ１０３に「▼」が登録されているとする。なお、本実施の形態では、「▼」のみが登録されているとして以降の説明を行うが、「■」、「●」等の文字が登録されていてもよい。また、所定の出現頻度以下（例えば文書中の出現頻度が２回以下）文字を特定文字列としてもよい。 Specific character strings are assumed to be registered in advance in the specific character string DB 103. To reduce the influence of misrecognition by the optical character recognition device 101, it is desirable to use only character blocks that contain only character strings the optical character recognition device can recognize accurately. In general, an optical character recognition device such as OCR software recognizes a character more accurately when, like 「▼」 (an inverted triangle), it has a simple shape and few similar characters. Here, 「▼」 is assumed to be registered in the specific character string DB 103. The following description assumes that only 「▼」 is registered, but characters such as 「■」 and 「●」 may also be registered. A character whose appearance frequency is at or below a predetermined value (for example, a character that appears at most twice in the document) may also be used as a specific character string.

ドキュメントは図４９に示すように、複数のページからなり、各ページに複数行の文字列を含む紙媒体とし、特定文字列をインデックスを作成したい書籍位置に記載して作成する。もしくは、既存のドキュメント中の各書籍位置に初めから記載されていた文字を特定文字列とみなしてもよい。ここでは、特定文字列として、「▼」を用いる。この文字は通常の文章中に頻出する文字ではないので、クライアント部４００を利用するユーザに対して、インデックス作成箇所の目印になる。また、図５０のように、ドキュメント内に複数のＱＲコード（二次元コード）が存在する場合と比べ、１文字で表現できる「▼」は、ドキュメント内で占有する面積が少なくて済む。なお、本実施の形態では、「▼」のみを特定文字列とするが、「■」「●」等を特定文字列としてもよい。 As shown in FIG. 49, the document is a paper medium consisting of a plurality of pages, each containing multiple lines of character strings, and is created by writing a specific character string at each position in the book where an index is to be created. Alternatively, a character already present at each such position in an existing document may be regarded as the specific character string. Here, 「▼」 is used as the specific character string. Because this character does not appear frequently in ordinary text, it serves as a marker of indexed locations for a user of the client unit 400. In addition, compared with a document containing multiple QR codes (two-dimensional codes) as in FIG. 50, 「▼」, which is a single character, occupies less area in the document. In this embodiment only 「▼」 is used as the specific character string, but 「■」, 「●」, and the like may also be used.

以降の処理は、第５の実施の形態と同様であるので、その説明を省略する。 Since the subsequent processing is the same as that of the fifth embodiment, the description thereof is omitted.

[第７の実施の形態] [Seventh Embodiment]
本実施の形態では、検索装置に含まれる、検索を実行するためのインデックスを作成する部分（インデックス作成装置）についてのみ言及する。 This embodiment describes only the part of the search device that creates the index used to execute searches (the index creation device).

図５１は、本発明の第７の実施の形態におけるインデックス作成装置の構成を示す。同図に示すインデックス作成装置は、ドキュメント入力部１０１０、基本文字列抽出部１０１１、周辺文字列抽出部１０１２、インデックス出力部１０１３、インデックスＤＢ１０１４から構成される。 FIG. 51 shows the configuration of the index creation device in the seventh embodiment of the present invention. The index creation device shown in the figure comprises a document input unit 1010, a basic character string extraction unit 1011, a peripheral character string extraction unit 1012, an index output unit 1013, and an index DB 1014.

ドキュメントは、図５２に示すように、複数のページからなり、各ページに複数行の文字列を含み、異なる閲覧環境（ＯＳ，ＰＤＦ閲覧ソフト等）においても文章の改行位置が変化しない電子ファイル（ＰＤＦファイル等）とする。なお、図５２は文章のみからなるドキュメントの例であるが、ドキュメントには図や表などの文字以外の情報が含まれていてもよい。 As shown in FIG. 52, the document is an electronic file (a PDF file or the like) that consists of a plurality of pages, contains multiple lines of character strings on each page, and whose line break positions do not change across viewing environments (OS, PDF viewer software, etc.). Note that FIG. 52 shows an example of a document consisting only of text, but a document may also contain non-character information such as figures and tables.

また、基本文字列抽出部１０１１と周辺文字列抽出部１０１２は抽出した文字列を格納するメモリ（図示せず）を有するものとする。 The basic character string extraction unit 1011 and the surrounding character string extraction unit 1012 have a memory (not shown) for storing the extracted character string.

以下に、本実施の形態における処理フローを示す。 The processing flow in the present embodiment is shown below.

図５３は、本発明の第７の実施の形態における処理のフローチャートである。 FIG. 53 is a flowchart of processing in the seventh embodiment of the present invention.

ステップ１００１）ドキュメント入力部１０１０は、分析対象のドキュメント群の入力を受け付ける。 Step 1001) The document input unit 1010 receives an input of a document group to be analyzed.

ステップ１００２）ドキュメント入力部１０１０は、該ドキュメント群に含まれるドキュメントのリストを図５４に示すデータ構造で基本文字列抽出部１０１１に渡す。リスト内の各要素は、各ドキュメントを一意に示すものとする。なお、説明の便宜上、ここでは各ドキュメントを一意に示す情報としてファイル名を用いているが、ファイルのハッシュ値等、ドキュメントを一意に識別できる情報であれば他の情報を利用しても構わない。 Step 1002) The document input unit 1010 passes a list of the documents in the document group to the basic character string extraction unit 1011 in the data structure shown in FIG. 54. Each element of the list uniquely identifies one document. For convenience of explanation, a file name is used here as the information that uniquely identifies each document, but any other information that uniquely identifies a document, such as a hash value of the file, may be used.

ステップ１００３）基本文字列抽出部１０１１は、ドキュメント入力部１０１０から渡されたリストに記載されている各ドキュメントを読み込み、各ドキュメントから基本文字列群を抽出し、各基本文字列の抽出元ドキュメントのファイル名と該ドキュメントにおける該基本文字列の出現位置を関連付けて、周辺文字列抽出部１０１２に渡す。ここで、「基本文字列」とは、文字列分割手法を用いて文章を特定の単位に分割したものである。例えば、形態素解析を用いて文章を単語単位に分割したもの、あるいは、N-gram法を用いて文章をＮ文字（あるいはＮ単語）の連なりに分割したものがあげられる。ここでは、図５５のように、文字の２−gram方式で分割を行うものとする。また、「基本文字列の出現位置」とは、基本文字列がドキュメントのどの位置に出現しているか、システムの目的に応じて任意の粒度で示す位置情報である。ここでは、ページ、行、列の粒度で表現することとする。 Step 1003) The basic character string extraction unit 1011 reads each document in the list passed from the document input unit 1010, extracts a basic character string group from each document, associates each basic character string with the file name of the document from which it was extracted and with its appearance position in that document, and passes the result to the peripheral character string extraction unit 1012. Here, a “basic character string” is a unit obtained by dividing text with a character string segmentation method. Examples include words obtained by morphological analysis and sequences of N characters (or N words) obtained by the N-gram method. Here, the text is divided into character 2-grams, as shown in FIG. 55. The “appearance position of a basic character string” is position information indicating where in the document the basic character string appears, at any granularity suited to the purpose of the system. Here, it is expressed at the granularity of page, row, and column.

図５６は、本発明の第７の実施の形態における基本文字列抽出部の処理結果を示す。同図では、基本文字列抽出部１０１１でステップ１００３の処理を行った結果を示しており、このデータが周辺文字列抽出部１０１２に渡される。 FIG. 56 shows the processing result of the basic character string extraction unit in the seventh embodiment of the present invention. The figure shows the result of the processing in step 1003 performed by the basic character string extraction unit 1011, and this data is passed to the surrounding character string extraction unit 1012.
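The character 2-gram extraction of step 1003 can be sketched as follows. The sketch assumes a document is given as a list of pages, each page being a list of text lines, that 2-grams do not span line breaks, and that page, row, and column are counted from 1. These choices and the name `extract_bigrams` are assumptions for illustration; the actual layout of FIG. 56 may differ.

```python
from typing import Iterator, List, Tuple

# (basic string, file name, page, row, column)
BasicString = Tuple[str, str, int, int, int]


def extract_bigrams(file_name: str, pages: List[List[str]]) -> Iterator[BasicString]:
    """Yield every character 2-gram with the position of its first character."""
    for p, lines in enumerate(pages, start=1):
        for r, line in enumerate(lines, start=1):
            # Slide a two-character window along the line.
            for c in range(len(line) - 1):
                yield line[c:c + 2], file_name, p, r, c + 1


# A one-page, one-line document used only for illustration.
bigrams = list(extract_bigrams("doc1.txt", [["検索装置"]]))
```

A line of n characters yields n - 1 basic character strings, so a single-character line contributes none under this sketch.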

ステップ１００４）周辺文字列抽出部１０１２は、基本文字列抽出部１０１１から渡された各基本文字列をメモリ（図示せず）に格納し、当該基本文字列の周辺文字列群を抽出し、該基本文字列と該周辺文字列群を関連付けてメモリ（図示せず）に格納し、インデックス出力部１０１３に渡す。ここでは、図５７に示すように、各基本文字列の１文字目の上・左・下の各１文字を周辺文字列とする。なお、基本文字列の上・左・下だけでなく、上・左上・左・左下・下・右下・右・右上等、基本文字列の周辺に位置する他の文字列を利用しても構わない。 Step 1004) The peripheral character string extraction unit 1012 stores each basic character string passed from the basic character string extraction unit 1011 in a memory (not shown), extracts the peripheral character string group of that basic character string, associates the basic character string with the peripheral character string group, stores the result in the memory (not shown), and passes it to the index output unit 1013. Here, as shown in FIG. 57, the single characters above, to the left of, and below the first character of each basic character string are used as the peripheral character strings. Other character strings located around the basic character string may also be used, such as those above, upper left, left, lower left, below, lower right, right, and upper right.

図５８は、本発明の第７の実施の形態における周辺文字列抽出部の処理結果を示す。同図では、周辺文字列抽出部１０１２でステップ１００４の処理を行った結果を示しており、このデータがインデックス出力部１０１３に渡される。 FIG. 58 shows the processing result of the peripheral character string extraction unit in the seventh embodiment of the present invention. The figure shows the result of the processing of step 1004 performed by the peripheral character string extraction unit 1012, and this data is passed to the index output unit 1013.

ステップ１００５）インデックス出力部１０１３は、周辺文字列抽出部１０１２から渡された各基本文字列と周辺文字列が関連付けられたものを、図５８のデータ構造でインデックスＤＢ１０１４に格納する。これにより、基本文字列及び周辺文字列の組み合わせを問い合わせキーとし、ファイル名及び出現位置を問い合わせ結果として返すインデックスＤＢ１０１４を実現する。 Step 1005) The index output unit 1013 stores each associated pair of basic character string and peripheral character strings passed from the peripheral character string extraction unit 1012 in the index DB 1014 in the data structure of FIG. 58. This realizes an index DB 1014 that takes the combination of a basic character string and its peripheral character strings as the query key and returns a file name and an appearance position as the query result.

上記のように、ドキュメントのインデックスを作成する際に、文字・単語の前後の連なりだけでなく、ユーザがドキュメントを閲覧する際のドキュメント（印刷物、ＰＤＦファイル等）における文字の位置関係に着目し、基本文字列とその周辺文字列を関連付けてインデックスのキーとすることにより、各ドキュメントに固有になりやすい文字列パターンを、少ない文字数で表現できるため、識別能力が高く、ロバスト性が高いインデックスを実現できる。 As described above, when a document index is created, attention is paid not only to the sequence of preceding and following characters and words but also to the positional relationship of characters in the document as the user views it (printed matter, a PDF file, etc.). Using a basic character string together with its peripheral character strings as the index key expresses, in a small number of characters, a character string pattern that tends to be unique to each document, so an index with high discriminating power and high robustness can be realized.
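Steps 1003 to 1005 can be sketched end to end as follows. The sketch assumes pages are lists of text lines, uses character 2-grams as basic character strings, takes the characters above, to the left of, and below the first character as the peripheral character strings, and represents a missing neighbor (at a page edge) by an empty string. The empty-string convention, the key layout, and the names `char_at` and `build_peripheral_index` are assumptions, not taken from FIG. 58.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, Tuple[str, str, str]]  # (basic string, (above, left, below))
Posting = Tuple[str, int, int, int]     # (file name, page, row, column)


def char_at(lines: List[str], r: int, c: int) -> str:
    """Return the character at 0-based (r, c), or "" if it does not exist."""
    if 0 <= r < len(lines) and 0 <= c < len(lines[r]):
        return lines[r][c]
    return ""


def build_peripheral_index(file_name: str,
                           pages: List[List[str]]) -> Dict[Key, List[Posting]]:
    """Index every 2-gram under the key (2-gram, neighbors of its first character)."""
    index: Dict[Key, List[Posting]] = defaultdict(list)
    for p, lines in enumerate(pages, start=1):
        for r, line in enumerate(lines):
            for c in range(len(line) - 1):
                basic = line[c:c + 2]
                peripheral = (
                    char_at(lines, r - 1, c),  # above the first character
                    char_at(lines, r, c - 1),  # left of the first character
                    char_at(lines, r + 1, c),  # below the first character
                )
                index[(basic, peripheral)].append((file_name, p, r + 1, c + 1))
    return dict(index)


# A one-page, three-line document used only for illustration.
idx = build_peripheral_index("doc1.txt", [["あいう", "かきく", "さしす"]])
hit = idx[("きく", ("い", "か", "し"))]
```

A query built from a photographed region would form the same kind of key from the recognized text and look it up directly; two occurrences of the same 2-gram with different neighbors fall under different keys, which is what gives the key its discriminating power.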

［第８の実施の形態］ [Eighth Embodiment]
図５９は、本発明の第８の実施の形態における検索システムの構成図である。 FIG. 59 is a block diagram of the search system in the eighth embodiment of the present invention.

同図に示すシステムは、大きく分けてサーバ部１、クライアント部３、外部装置からなる。 The system shown in the figure is roughly divided into a server unit 1, a client unit 3, and an external device.

サーバ部１は、データ入力部２０、基本文字列抽出部２１、周辺文字列抽出部２２、インデックス出力部２３、インデックスＤＢ２４，コンテンツＤＢ２５，サーバ側データ送受信部２６、ＤＢ問い合わせ部２７からなる。 The server unit 1 includes a data input unit 20, a basic character string extraction unit 21, a peripheral character string extraction unit 22, an index output unit 23, an index DB 24, a content DB 25, a server side data transmission / reception unit 26, and a DB inquiry unit 27.

クライアント部３は、クライアント側デバイス３０、クライアント側デバイス３０を構成するドキュメント撮影部３１、クライアント側デバイス３０を構成するコンテンツ表示部３２、クライアント側データ送受信部３３からなる。 The client unit 3 includes a client side device 30, a document photographing unit 31 constituting the client side device 30, a content display unit 32 constituting the client side device 30, and a client side data transmitting / receiving unit 33.

外部装置は、ドキュメント読み取り装置１００、光学文字認識装置１０１からなる。本実施の形態でドキュメント読み取り装置１００に入力されるドキュメント群２００、クライアント部３に入力されるドキュメントページ２０１の例を図６０に示す。ドキュメントは、同図に示すように、複数のページからなり、各ページに複数行の文字列を含む紙媒体とする。 The external devices are a document reading device 100 and an optical character recognition device 101. FIG. 60 shows an example of the document group 200 input to the document reading device 100 and of the document page 201 input to the client unit 3 in this embodiment. As shown in the figure, a document is a paper medium consisting of a plurality of pages, each containing multiple lines of character strings.

ドキュメント読み取り装置１００は、サーバ部１のデータ入力部２０に接続されている。光学文字認識装置１０１はサーバ側データ送受信部２６に接続されている。 The document reading device 100 is connected to the data input unit 20 of the server unit 1. The optical character recognition device 101 is connected to the server-side data transmission / reception unit 26.

本実施の形態では、
（１）サーバ部１においてインデックスを作成する作業；
（２）クライアント部３からサーバ部１に問い合わせる作業；
を行う。
In this embodiment, the following two tasks are performed:
(1) creating an index in the server unit 1;
(2) querying the server unit 1 from the client unit 3.

（１）サーバ部１においてインデックスを作成する作業： (1) Creating an index in the server unit 1:
当該処理は、前述の第７の実施の形態におけるインデックス作成装置に相当する。 This process corresponds to the index creation device of the seventh embodiment described above.

図６１は、本発明の第８の実施の形態におけるサーバ側の処理のフローチャートである。 FIG. 61 is a flowchart of processing on the server side according to the eighth embodiment of the present invention.

ステップ２００１）ドキュメント群２００は、図６０のように、複数のページからなり、各ページに複数行の文字列を含む紙媒体の書籍群とする。各書籍には、それぞれを一意に識別できる書名が付いているものとする。なお、説明の便宜上、ここでは各ドキュメントを一意に示す情報として書名を用いているが、書籍のISBN等、ドキュメントを一意に識別できる情報であれば他の情報を利用してもよい。 Step 2001) As shown in FIG. 60, the document group 200 is a group of paper books, each consisting of a plurality of pages with multiple lines of character strings on each page. Each book is assumed to have a title that uniquely identifies it. For convenience of explanation, the title is used here as the information that uniquely identifies each document, but any other information that uniquely identifies a document, such as the ISBN of the book, may be used.

ドキュメント読み取り装置１００は、紙媒体に印刷されたテキストを読み取ってテキストファイルに変換する一般装置であり、ＯＣＲ機能付きスキャナ等がこれに該当する。ここでは、ドキュメント中のテキストは紙媒体に印刷された状態における改ページ位置、改行位置が保持されたままテキストファイルに変換されるものとする。 The document reading device 100 is a general-purpose device that reads text printed on a paper medium and converts it into a text file; a scanner with an OCR function is one example. The text in a document is assumed to be converted into a text file with the page break and line break positions of the printed paper medium preserved.

ステップ２００２）データ入力部２０は、ドキュメント読み取り装置１００から渡されたテキストファイル群を読み込み、図６２に示すデータ構造でリスト化して、サーバ１の基本文字列抽出部２１に渡す。 Step 2002) The data input unit 20 reads the text file group passed from the document reading device 100, lists it in the data structure shown in FIG. 62, and passes it to the basic character string extraction unit 21 of the server 1.

ステップ２００３）サーバ１の基本文字列抽出部２１は、データ入力部２０から渡されたリストに記載されている各ドキュメントを読み込み、各ドキュメントから基本文字列群を抽出し、各基本文字列の抽出元ドキュメントのテキストファイル名と該ドキュメントにおける該基本文字列の出現位置を関連付けて、周辺文字列抽出部２２に渡す。 Step 2003) The basic character string extraction unit 21 of the server 1 reads each document described in the list passed from the data input unit 20, extracts a basic character string group from each document, and extracts each basic character string. The text file name of the original document and the appearance position of the basic character string in the document are associated with each other and passed to the peripheral character string extraction unit 22.

ここで、「基本文字列」とは、文字列分割手法を用いて文章を特定の単位に分割したものである。例えば、形態素解析を用いて文章を単語単位に分割したもの、あるいは、Ｎ-gram法を用いて文章をＮ文字（あるいはＮ単語）の連なりに分割したものがあげられる。ここでは、図６３のように文字の２-gram方式で分割を行うものとする。 Here, a “basic character string” is a unit obtained by dividing text with a character string segmentation method. Examples include words obtained by morphological analysis and sequences of N characters (or N words) obtained by the N-gram method. Here, the text is divided into character 2-grams, as shown in FIG. 63.

「基本文字列の出現位置」とは、基本文字列がドキュメントのどの位置に出現しているか、システムの目的に応じて任意の粒度で示す位置情報である。ここでは、ページ、行、列の粒度で表現することとする。 The “appearance position of the basic character string” is position information indicating at which position in the document the basic character string appears in an arbitrary granularity according to the purpose of the system. Here, the page, row, and column granularity are used.

図６４に示すのは、基本文字列抽出部２１にて上記の処理を行った結果であり、このデータが周辺文字列抽出部２２に渡される。 FIG. 64 shows the result of the above processing performed by the basic character string extraction unit 21; this data is passed to the peripheral character string extraction unit 22.
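
The extraction in step 2003 can be illustrated with a minimal Python sketch. The function name extract_bigrams and the input layout (a list of pages, each a list of lines) are hypothetical and do not come from the patent. The sketch also assumes that a 2-gram never spans a line break and that the recorded position is that of the 2-gram's first character; the text does not state either point.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (page, row, column), all 1-based


def extract_bigrams(pages: List[List[str]]) -> List[Tuple[str, Position]]:
    """Return (2-gram, position of its first character) pairs.

    `pages` is a list of pages; each page is a list of lines with the
    printed line breaks preserved. Assumption: 2-grams do not cross lines.
    """
    result = []
    for p, lines in enumerate(pages, start=1):
        for r, line in enumerate(lines, start=1):
            for c in range(len(line) - 1):
                result.append((line[c:c + 2], (p, r, c + 1)))
    return result


# Hypothetical one-page document with two lines.
sample_pages = [["abcd", "efg"]]
bigrams = extract_bigrams(sample_pages)
for text, pos in bigrams:
    print(text, pos)
```

Keeping page, row, and column as separate fields mirrors the granularity chosen in the text; a coarser system could drop the column.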

ステップ２００４）周辺文字列抽出部２２は、基本文字列抽出部２１から渡された基本文字列（図６４）について、該基本文字列の周辺文字列群を抽出し、該基本文字列と該周辺文字列群を関連付けて、インデックス出力部２３に渡す。ここでは、図６５に示すように、各基本文字列の１文字目の上・左・下の各１文字を周辺文字列とする。なお、基本文字列の上・左・下だけでなく、上・左上・左・左下・下・右下・右・右上等、基本文字列の周辺に位置する他の文字列を利用しても構わない。 Step 2004) For each basic character string passed from the basic character string extraction unit 21 (FIG. 64), the peripheral character string extraction unit 22 extracts the group of peripheral character strings of that basic character string, associates the basic character string with the peripheral character string group, and passes them to the index output unit 23. Here, as shown in FIG. 65, the single characters above, to the left of, and below the first character of each basic character string are used as peripheral character strings. Besides above, left, and below, other character strings located around the basic character string may also be used, such as those at the upper left, lower left, lower right, right, and upper right.

図６６は、本発明の第８の実施の形態における周辺文字列抽出部の処理結果を示す。同図に示すデータがインデックス出力部２３に渡される。 FIG. 66 shows the processing result of the peripheral character string extraction unit in the eighth embodiment of the present invention. The data shown in the figure is passed to the index output unit 23.

ステップ２００５）インデックス出力部２３は、周辺文字列抽出部２２から渡された各基本文字列と周辺文字列が関連付けられたものを、図６６のデータ構造でインデックスＤＢ２４に格納する。 Step 2005) The index output unit 23 stores the associated basic character strings and peripheral character strings passed from the peripheral character string extraction unit 22 in the index DB 24, using the data structure shown in FIG. 66.
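
Steps 2004 and 2005 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation. It assumes that characters sit on a fixed grid, so that the same column index in adjacent lines means "directly above" or "directly below", and it represents a missing neighbor as an empty string. The names char_at and build_index and the in-memory dictionary standing in for the index DB 24 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str, str, str]      # (2-gram, above, left, below)
Posting = Tuple[str, int, int, int]  # (file name, page, row, column)


def char_at(lines: List[str], r: int, c: int) -> str:
    """Character at 0-based row r, column c, or "" when outside the page."""
    if 0 <= r < len(lines) and 0 <= c < len(lines[r]):
        return lines[r][c]
    return ""


def build_index(docs: Dict[str, List[List[str]]]) -> Dict[Key, List[Posting]]:
    """Map each (2-gram, above, left, below) key to its 1-based positions."""
    index = defaultdict(list)
    for name, pages in docs.items():
        for p, lines in enumerate(pages, start=1):
            for r, line in enumerate(lines):
                for c in range(len(line) - 1):
                    key = (
                        line[c:c + 2],
                        char_at(lines, r - 1, c),  # above the first character
                        char_at(lines, r, c - 1),  # left of the first character
                        char_at(lines, r + 1, c),  # below the first character
                    )
                    index[key].append((name, p, r + 1, c + 1))
    return dict(index)


# Hypothetical document: one page of three lines.
index = build_index({"doc.txt": [["abc", "def", "ghi"]]})
print(index[("ef", "b", "d", "h")])
```

Because the key combines characters from the reading direction and the orthogonal direction, two occurrences of the same 2-gram on different pages usually map to different keys.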

なお、サーバ１のコンテンツＤＢ２５には、書籍出版社、あるいは、一般ユーザが作成した、各書籍内の各位置（ページ、行、列等）に関係したコンテンツ及び該コンテンツに関する情報を格納しておく。コンテンツ及び該コンテンツに関する情報をコンテンツＤＢ２５に格納するために格納作業専用端末を用意してもよいし、格納作業用Ｗｅｂアプリケーションを用意して不特定多数のユーザがＷｅｂブラウザを通じて自由にコンテンツ及び該コンテンツに関する情報を格納できるようにしてもよい。コンテンツの例としては、観光名所案内であれば各地を訪れた旅行者の体験談、化学教科書であれば化学実験映像等が挙げられる。ここでは、サーバ１上のデータ格納領域にコンテンツの実体を格納し、図６７に示すデータ構造でコンテンツと各書籍内の各位置の関係を格納する。 The content DB 25 of the server 1 stores, in advance, content created by book publishers or general users that relates to individual positions (page, row, column, etc.) in each book, together with information about that content. A dedicated terminal may be provided for storing content and its related information in the content DB 25, or a Web application for this work may be provided so that an unspecified number of users can freely store content and its related information through a Web browser. Examples of content include travelers' accounts of the places they visited for a sightseeing guide, and videos of chemistry experiments for a chemistry textbook. Here, the content itself is stored in a data storage area on the server 1, and the relationship between content and each position in each book is stored in the data structure shown in FIG. 67.

（２）クライアント部３からサーバ部１に問い合わせる作業： (2) Inquiry from the client unit 3 to the server unit 1:
図６８は、本発明の第８の実施の形態におけるクライアント部からサーバ部に問い合わせる処理のフローチャートである。 FIG. 68 is a flowchart of the processing for inquiring from the client unit to the server unit according to the eighth embodiment of the present invention.

ステップ３００１）ドキュメントページ２０１は、ドキュメント群２００に含まれる１件の書籍の１ページである。クライアント側デバイス３０は、ドキュメント撮影部３１、コンテンツ表示部３２からなる。 Step 3001) The document page 201 is one page of one book included in the document group 200. The client side device 30 includes a document photographing unit 31 and a content display unit 32.

ドキュメント撮影部３１は、ドキュメントページ２０１の全体、または一部分を光学的に撮影して、撮影内容を画像ファイルとして保存し、クライアント側データ送受信部３３に渡す。ここでは、図６９に示すドキュメントページの一部分が撮影され、図７０に示す画像ファイルが作成されたとする。 The document photographing unit 31 optically photographs the whole or part of the document page 201, stores the photographing content as an image file, and passes it to the client side data transmitting / receiving unit 33. Here, it is assumed that a part of the document page shown in FIG. 69 is photographed and the image file shown in FIG. 70 is created.

クライアント側データ送受信部３３は、ドキュメント撮影部３１から渡された画像ファイルをネットワークを通じてサーバ部１のデータ送受信部２６に渡す。 The client side data transmitting / receiving unit 33 transfers the image file transferred from the document photographing unit 31 to the data transmitting / receiving unit 26 of the server unit 1 through the network.

ステップ３００２）サーバ側データ送受信部２６は、光学文字認識装置１０１を利用して、クライアント側データ送受信部３３から渡された画像ファイルから図７１に示すテキストデータを抽出し、基本文字列抽出部２１に渡す。なお、光学文字認識装置１０１は、一般的なＯＣＲソフトウェア等であり、文字が撮影されたが画像から文字情報を抽出し、テキストデータとしてコンピュータが利用できる形式に変換する一般装置である。 Step 3002) The server-side data transmission/reception unit 26 uses the optical character recognition device 101 to extract the text data shown in FIG. 71 from the image file passed from the client-side data transmission/reception unit 33, and passes the text data to the basic character string extraction unit 21. The optical character recognition device 101 is general OCR software or the like: a general-purpose device that extracts character information from an image in which characters have been captured and converts it into text data in a format a computer can use.

ステップ３００３）基本文字列抽出部２１は、サーバ側データ送受信部２６から渡されたテキストデータを読み込み、テキストデータにおける最初の行、最後の行、最初の列、最後の列を除く部分（図７２の点線で囲まれた部分）から、図７３のように文字の２-gram方式で分割を行う方式で基本文字列を抽出し、図７４に示すデータ構造で周辺文字列抽出部２２に渡す。 Step 3003) The basic character string extraction unit 21 reads the text data passed from the server-side data transmission/reception unit 26, extracts basic character strings from the part of the text data that excludes the first row, last row, first column, and last column (the part enclosed by the dotted line in FIG. 72) by the character 2-gram method as shown in FIG. 73, and passes them to the peripheral character string extraction unit 22 in the data structure shown in FIG. 74.

ステップ３００４）周辺文字列抽出部２２は、基本文字列抽出部２１から渡された各基本文字列について、該基本文字列の周辺文字列群を抽出し、該基本文字列と該周辺文字列群を関連付けて、ＤＢ問い合わせ部２７に渡す。ここでは、図７５に示すように、各基本文字列の１文字目の上・左・下の各１文字を周辺文字列とする。 Step 3004) For each basic character string passed from the basic character string extraction unit 21, the peripheral character string extraction unit 22 extracts the group of peripheral character strings, associates the basic character string with the peripheral character string group, and passes them to the DB inquiry unit 27. Here, as shown in FIG. 75, the single characters above, to the left of, and below the first character of each basic character string are used as peripheral character strings.
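
The query-side processing of steps 3003 and 3004 might look like the sketch below. The function name query_keys is hypothetical. The sketch assumes that both characters of each 2-gram must lie inside the interior region (everything except the first and last rows and columns), which guarantees that the upper, left, and lower neighbors exist in the captured text; the exact boundary handling of FIG. 72 and FIG. 73 is not reproduced here.

```python
from typing import List, Tuple

Key = Tuple[str, str, str, str]  # (2-gram, above, left, below)


def query_keys(lines: List[str]) -> List[Key]:
    """Build lookup keys from OCR text, skipping the border rows and columns."""
    keys = []
    for r in range(1, len(lines) - 1):          # skip first and last rows
        line = lines[r]
        above, below = lines[r - 1], lines[r + 1]
        for c in range(1, len(line) - 2):       # 2-gram stays inside the interior
            if c >= len(above) or c >= len(below):
                continue                        # ragged capture: neighbor missing
            keys.append((line[c:c + 2], above[c], line[c - 1], below[c]))
    return keys


# Hypothetical OCR output of a 3-row, 5-column capture.
captured = ["abcde", "fghij", "klmno"]
keys = query_keys(captured)
print(keys)
```

Skipping the border is what lets a partial photograph produce keys in the same form as the index: every 2-gram that survives has all three neighbors visible.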

ステップ３００５）ＤＢ問い合わせ部２７は、図７６に示すような周辺文字列抽出部２２から渡された基本文字列と周辺文字列が関連付けられたデータを用いて、インデックスＤＢ２４及びコンテンツＤＢ２５に問い合わせを行う。 Step 3005) The DB inquiry unit 27 queries the index DB 24 and the content DB 25 using the data passed from the peripheral character string extraction unit 22, in which basic character strings are associated with peripheral character strings, as shown in FIG. 76.

まず、ＤＢ問い合わせ部２７がインデックスＤＢ２４に対して問い合わせを行う。前述のとおり、インデックスＤＢ２４には、図６６に示す構造でデータが格納されている。 First, the DB inquiry unit 27 queries the index DB 24. As described above, the index DB 24 stores data in the structure shown in FIG. 66.

ＤＢ問い合わせ部２７は、図７６の基本文字列と各周辺文字列の組（例：「北東」と「名」「ツ」「ば」）を用いて、インデックスＤＢ２４に該基本文字列と該周辺文字列の組に対応するファイル名と出現位置を問い合わせ、問い合わせ結果を図７７で示す形式で取得する。 The DB inquiry unit 27 uses each pair of a basic character string and its peripheral character strings in FIG. 76 (e.g., "北東" with "名", "ツ", and "ば") to query the index DB 24 for the file name and appearance position corresponding to that pair, and obtains the query result in the format shown in FIG. 77.
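
As a rough illustration of the index query, the sketch below looks up each query key in a dictionary standing in for the index DB 24. The function name lookup, the file name guide.txt, and the sample keys are all hypothetical; a key with no entry yields an empty list, corresponding to "no matching data".

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str, str, str]      # (2-gram, above, left, below)
Posting = Tuple[str, int, int, int]  # (file name, page, row, column)


def lookup(index: Dict[Key, List[Posting]],
           keys: List[Key]) -> List[Tuple[Key, List[Posting]]]:
    """Return each query key with its postings (empty when not indexed)."""
    return [(key, index.get(key, [])) for key in keys]


# Hypothetical index with a single entry.
index = {("gh", "b", "f", "l"): [("guide.txt", 1, 2, 2)]}
# The second key contains a misread neighbor and is therefore not indexed.
keys = [("gh", "b", "f", "l"), ("hi", "c", "9", "m")]
results = lookup(index, keys)
for key, postings in results:
    print(key, postings)
```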

ステップ３００６）次に、ＤＢ問い合わせ部２７がコンテンツＤＢ２５に対して、上述の問い合わせ結果（図７７）を用いて問い合わせを行う。前述のとおり、コンテンツＤＢ２５には、図６７に示すデータ構造でデータが格納されている。ＤＢ問い合わせ部２７は、図７７の各ファイル名と各出現位置の組（例：「Germany_1.txt」と「１ページ２行７列目」）を用いて、コンテンツＤＢ２５に該ファイル名と該出現位置に対応するコンテンツタイプを問い合わせ、問い合わせ結果を図７８に示す形式で取得し、コンテンツ・コンテンツタイプの組の重複を削除して図７９に示す形式に変換する。 Step 3006) Next, the DB inquiry unit 27 queries the content DB 25 using the above query result (FIG. 77). As described above, the content DB 25 stores data in the data structure shown in FIG. 67. The DB inquiry unit 27 uses each pair of a file name and an appearance position in FIG. 77 (e.g., "Germany_1.txt" and "page 1, row 2, column 7") to query the content DB 25 for the content type corresponding to that file name and appearance position, obtains the query result in the format shown in FIG. 78, removes duplicate content/content-type pairs, and converts the result into the format shown in FIG. 79.
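
The content query and duplicate removal of step 3006 can be sketched as below. The data structure of FIG. 67 is not reproduced in the text, so this sketch assumes, purely for illustration, that content is registered per (file name, page, row). The names find_contents, guide.txt, and story_1.txt are hypothetical.

```python
from typing import Dict, List, Tuple

Posting = Tuple[str, int, int, int]  # (file name, page, row, column)
Content = Tuple[str, str]            # (content, content type)


def find_contents(postings: List[Posting],
                  content_db: Dict[Tuple[str, int, int], Content]) -> List[Content]:
    """Look up content for each position and drop duplicate pairs, keeping order."""
    seen = set()
    result = []
    for name, page, row, _col in postings:
        entry = content_db.get((name, page, row))
        if entry is not None and entry not in seen:
            seen.add(entry)
            result.append(entry)
    return result


# Hypothetical content DB: one piece of content attached to page 1, row 2.
content_db = {("guide.txt", 1, 2): ("story_1.txt", "text")}
postings = [("guide.txt", 1, 2, 2), ("guide.txt", 1, 2, 3), ("guide.txt", 3, 1, 1)]
contents = find_contents(postings, content_db)
print(contents)
```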

ステップ３００７）ＤＢ問い合わせ部２７は、上述の問い合わせ結果（図７９）をサーバ側データ送受信部２６に渡す。 Step 3007) The DB inquiry unit 27 passes the inquiry result (FIG. 79) to the server-side data transmission / reception unit 26.

サーバ側データ送受信部２６は、ＤＢ問い合わせ部２７から渡されたデータ（コンテンツとコンテンツタイプ）（図７９）をネットワークを通じてクライアント側データ送受信部３３に渡す。 The server-side data transmission / reception unit 26 passes the data (content and content type) (FIG. 79) passed from the DB inquiry unit 27 to the client-side data transmission / reception unit 33 through the network.

ステップ３００８）クライアント側データ送受信部３３は、サーバ側データ送受信部２６から渡されたデータ（図７９）をコンテンツ表示部３２に渡す。 Step 3008) The client-side data transmission / reception unit 33 passes the data (FIG. 79) passed from the server-side data transmission / reception unit 26 to the content display unit 32.

コンテンツ表示部３２は、クライアント側データ送受信部３３から渡されたデータ（図７６）を、コンテンツ表示部３２内部で予め設定されたコンテンツ表示手段（図８０）を用いて表示する。 The content display unit 32 displays the data (FIG. 76) passed from the client-side data transmission / reception unit 33 using content display means (FIG. 80) preset in the content display unit 32.

［第９の実施の形態］ [Ninth Embodiment]
前述の第８の実施の形態のステップ３００１において、撮影条件が悪く（光量不足、手ぶれ等）ドキュメント撮影部３１が撮影した画像に品質が悪い場合に、ステップ３００２で光学文字認識装置１０１が撮影画像から正確に文字を抽出できない場合がある。また、現在の技術水準においても、光学文字認識の精度は１００％ではないため、撮影画像の品質が良い場合でも、一定確率で文字の誤認識が発生する。光学文字認識装置１０１において正しく文字認識が行われない場合、その誤った文字データに基づいてステップ３００５，３００６でＤＢ問い合わせ部２７がインデックスＤＢ２４及びコンテンツＤＢ２５に問い合わせを行っても、撮影したドキュメントの位置に関連付けられたコンテンツは得られない。 In step 3001 of the eighth embodiment described above, if the shooting conditions are poor (insufficient light, camera shake, etc.) and the image captured by the document photographing unit 31 is of poor quality, the optical character recognition device 101 may be unable to extract characters accurately from the captured image in step 3002. Moreover, even at the current state of the art, the accuracy of optical character recognition is not 100%, so characters are misrecognized with a certain probability even when the captured image is of good quality. If the optical character recognition device 101 does not recognize the characters correctly, the content associated with the photographed position in the document cannot be obtained even if the DB inquiry unit 27 queries the index DB 24 and the content DB 25 in steps 3005 and 3006 based on the erroneous character data.

本実施の形態では、このような画像品質が悪い状態でも正しい検索結果が得られるようにする例を説明する。 In the present embodiment, an example will be described in which a correct search result is obtained even in such a state where the image quality is poor.

例えば、ドキュメント撮影部３１が図７０に示す領域を撮影したが、画像品質が悪いため、光学文字認識装置１０１は、図７１のように抽出すべきところ、図８１のように抽出したとする。この状況でステップ３００１〜３００４を行い、図８２に示す基本文字列・周辺文字列の組が得られたとする。 For example, suppose that the document photographing unit 31 photographed the area shown in FIG. 70 but, because the image quality was poor, the optical character recognition device 101 extracted the text as shown in FIG. 81 when it should have extracted it as shown in FIG. 71. Suppose further that steps 3001 to 3004 are performed in this situation and the pairs of basic and peripheral character strings shown in FIG. 82 are obtained.

次に、ステップ３００５において、ＤＢ問い合わせ部２７がコンテンツＤＢ２５に対して、図８２の基本文字列と周辺文字列の組を用いて問い合わせを行う。但し、図８２の問い合わせ結果には誤認識された文字による誤ったデータが含まれているため、図３８のデータを用いてコンテンツＤＢ２５に問い合わせた結果は、図８２のように該当するデータが見つからなかったり、他のファイル名、出現位置を取得してしまったり（例えば、図８３最下行）する。 Next, in step 3005, the DB inquiry unit 27 queries the content DB 25 using the pairs of basic and peripheral character strings in FIG. 82. However, because the query result in FIG. 82 contains erroneous data caused by misrecognized characters, querying the content DB 25 with the data of FIG. 38 either finds no matching data, as in FIG. 82, or returns a different file name and appearance position (for example, the bottom row of FIG. 83).

この問題を第８の実施の形態におけるステップ３００６を図８４に示す処理を行うことで解決する。図８４に示すステップ４００６，４００７のようにすることで、ステップ３００１においてドキュメント撮影部３１の撮影画像の品質が悪い場合、あるいは、ステップ３００２において光学文字認識装置１０１の認識精度が悪い場合にも対応できる。 This problem is solved by replacing step 3006 of the eighth embodiment with the processing shown in FIG. 84. Performing steps 4006 and 4007 shown in FIG. 84 makes it possible to cope with the case where the quality of the image captured by the document photographing unit 31 in step 3001 is poor, and with the case where the recognition accuracy of the optical character recognition device 101 in step 3002 is poor.

図８４は、本発明の第９の実施の形態における処理のフローチャートである。 FIG. 84 is a flowchart of processing in the ninth embodiment of the present invention.

以下では、図６８のステップ３００６，３００７の代わりにステップ４００６，４００７のみ示し、他のステップは図６８の処理と同様であるため、その説明を省略する。 In the following, only steps 4006 and 4007, which replace steps 3006 and 3007 of FIG. 68, are described; the other steps are the same as in the processing of FIG. 68, and their description is omitted.

ステップ４００６）ＤＢ問い合わせ部２７がコンテンツＤＢ２５に対して、上述の問い合わせ結果（図８３）を用いて問い合わせを行う。ＤＢ問い合わせ部２７は図８３の各ファイル名と各出現位置の組（ただし、該等データなしのものを除く）を用いて、コンテンツＤＢ２５に該ファイル名と該出現位置に対応するコンテンツとコンテンツタイプを問い合わせ、問い合わせ結果を図８５に示す形式で取得し、コンテンツ・コンテンツタイプの組の重複数を集計して図８６に示す形式に変換する。 Step 4006) The DB inquiry unit 27 queries the content DB 25 using the above query result (FIG. 83). Using each pair of a file name and an appearance position in FIG. 83 (excluding entries with no matching data), the DB inquiry unit 27 queries the content DB 25 for the content and content type corresponding to that file name and appearance position, obtains the query result in the format shown in FIG. 85, counts the number of duplicates of each content/content-type pair, and converts the result into the format shown in FIG. 86.

ステップ４００７）ＤＢ問い合わせ部２７は、上述の問い合わせ結果（図８６）のうち、複数の異なるコンテンツが存在する場合は重複数が最大のもの（この例では重複数５件のNarrative_1.txt）をサーバ側データ送受信部２６に渡す。 Step 4007) When the above query result (FIG. 86) contains several different contents, the DB inquiry unit 27 passes the one with the largest duplicate count (in this example, Narrative_1.txt, with a duplicate count of 5) to the server-side data transmission/reception unit 26.

サーバ側データ送受信部２６は、ＤＢ問い合わせ部２７から渡されたデータ（図７９）をネットワークを通じてクライアント側データ送受信部３３に渡す。 The server-side data transmission / reception unit 26 passes the data (FIG. 79) passed from the DB inquiry unit 27 to the client-side data transmission / reception unit 33 through the network.

上記の処理を行った後、図６８のステップ３００８を行うことで、コンテンツ表示部３２にて正しいコンテンツが表示される。 After the above processing, step 3008 of FIG. 68 is performed, and the correct content is displayed on the content display unit 32.
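
The counting and selection in steps 4006 and 4007 amount to a majority vote over the contents returned for the individual query keys. A minimal sketch follows; the function name best_content, the content type "text", and the competing entry Movie_9.mpg are hypothetical, while Narrative_1.txt with a count of 5 follows the example in the text.

```python
from collections import Counter
from typing import List, Optional, Tuple

Content = Tuple[str, str]  # (content, content type)


def best_content(hits: List[Content]) -> Optional[Tuple[Content, int]]:
    """Return the most frequent content/content-type pair and its count."""
    counts = Counter(hits)
    if not counts:
        return None  # every query key missed the index
    return counts.most_common(1)[0]


# Five keys agree on one content; one misrecognized key points elsewhere.
hits = [("Narrative_1.txt", "text")] * 5 + [("Movie_9.mpg", "video")]
winner = best_content(hits)
print(winner)
```

The text does not say how ties are broken. In this sketch, most_common returns the pair that was counted first, so a real system would need an explicit rule for that case.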

上記のように、第１〜第９の実施の形態により、書籍等のレイアウトが特定されたコンテンツの全体ではなく、一部の上方の文字列配置を用いて、どのコンテンツのどの部分であるかを特定することができる。また、文字を読む方向以外の方向で、文字列を組み合わせてインデックスを作成することにより、少ない文字数で検索結果の誤り率を低く抑えることが可能となる。また、コンテンツ全体ではなく、一部の情報のみを用いて特定を行うため、コンテンツの位置特定粒度（ページ単位ではなく、行単位など）小さくすることもできるため、厳密な位置を特定することができる。 As described above, the first to ninth embodiments make it possible to identify which part of which content a query corresponds to by using the character string arrangement of only a part of content whose layout is fixed, such as a book, rather than the whole content. In addition, creating the index from combinations of character strings taken in directions other than the reading direction keeps the error rate of search results low even with a small number of characters. Furthermore, because identification uses only part of the information rather than the entire content, the granularity of position identification can be made finer (for example, per line rather than per page), so an exact position can be identified.

なお、図３，５１、に示すインデックス作成装置、図１０、４４，５９に示すサーバ部及びクライアント部の構成要素の動作をプログラムとして構築し、インデックス作成装置、サーバ部、クライアント部として利用されるコンピュータにインストールして実行させる、または、ネットワークを介して流通させることが可能である。 The operation of the components of the index creation device shown in FIGS. 3 and 51 and the server unit and client unit shown in FIGS. 10, 44 and 59 is constructed as a program and used as the index creation device, server unit, and client unit. It can be installed in a computer and executed, or distributed via a network.

また、構築されたプログラムをハードディスクや、フレキシブルディスク・ＣＤ−ＲＯＭ等の可搬記憶媒体に格納し、コンピュータにインストールする、または、配布することが可能である。 The constructed program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer or distributed.

１サーバ部
３クライアント部
１０ドキュメント入力手段、ドキュメント入力部
１１文字ブロック抽出手段、文字ブロック抽出部
１２インデックス出力手段、インデックス出力部
１３インデックス記憶手段、インデックスＤＢ
１４文字ブロック抽出ルール記憶部
２０データ入力部
２１基本文字列抽出部
２２周辺文字列抽出部
２３インデックス出力部
２４インデックスＤＢ
２５コンテンツＤＢ
２６サーバ側データ送受信部
２７ＤＢ問い合わせ部
３０クライアント側デバイス
３１ドキュメント撮影部
３２コンテンツ表示部
３３クライアント側データ送受信部
４０ドキュメント入力部
４１文字ブロック抽出部
４２文字ブロック選別
４３インデックス出力部
４４インデックスＤＢ
４６特定文字列ＤＢ
１００ドキュメント読み取り装置
１０１光学文字認識装置
１０３特定文字列ＤＢ
２００ドキュメント群
２０１ドキュメントページ
３００サーバ部
３１０データ入力部
３２０文字ブロック抽出部
３２１文字ブロック抽出ルール記憶部
３３０インデックス出力部
３４０インデックスＤＢ
３５０コンテンツＤＢ
３６０サーバ側データ送受信部
３７０ＤＢ問い合わせ部
４００クライアント部
４１０クライアント部
４１１ドキュメント撮影部
４１２コンテンツ表示部
４２０クライアント側データ送受信部
５００サーバ部
５１０文字ブロック選別部
１０１０ドキュメント入力部
１０１１基本文字列抽出部
１０１２周辺文字列抽出部
１０１３インデックス出力部
１０１４インデックスＤＢ
DESCRIPTION OF SYMBOLS
1 Server unit
3 Client unit
10 Document input means, document input unit
11 Character block extraction means, character block extraction unit
12 Index output means, index output unit
13 Index storage means, index DB
14 Character block extraction rule storage unit
20 Data input unit
21 Basic character string extraction unit
22 Peripheral character string extraction unit
23 Index output unit
24 Index DB
25 Content DB
26 Server-side data transmission/reception unit
27 DB inquiry unit
30 Client-side device
31 Document photographing unit
32 Content display unit
33 Client-side data transmission/reception unit
40 Document input unit
41 Character block extraction unit
42 Character block selection
43 Index output unit
44 Index DB
46 Specific character string DB
100 Document reading device
101 Optical character recognition device
103 Specific character string DB
200 Document group
201 Document page
300 Server unit
310 Data input unit
320 Character block extraction unit
321 Character block extraction rule storage unit
330 Index output unit
340 Index DB
350 Content DB
360 Server-side data transmission/reception unit
370 DB inquiry unit
400 Client unit
410 Client unit
411 Document photographing unit
412 Content display unit
420 Client-side data transmission/reception unit
500 Server unit
510 Character block selection unit
1010 Document input unit
1011 Basic character string extraction unit
1012 Peripheral character string extraction unit
1013 Index output unit
1014 Index DB

Claims

改ページや改行位置が確定しているドキュメント内の一部領域を検索クエリとして、該領域が出現するドキュメント及び該ドキュメント内における位置を取得する検索要求に応えるための検索インデックスを作成し、検索を行う検索装置であって、
インデックス作成対象のドキュメントの入力を受け付けるドキュメント入力手段と、
ドキュメントの全体または一部領域から、文章を読む方向とそれに直交する方向を考慮した規定の形状内にある１文字以上の文字の組み合わせからなる文字ブロックを抽出する文字ブロック抽出手段と、
前記文字ブロックと該文字ブロックが出現するドキュメントにおける出現位置を関連付けてインデックス記憶手段に出力するインデックス出力手段と、
を有することを特徴とする検索装置。 A search device that creates a search index for responding to search requests which use a partial area of a document whose page break and line break positions are fixed as a search query and which obtain the document in which the area appears and the position within that document, and that performs searches, the search device comprising:
document input means for accepting input of a document to be indexed;
character block extraction means for extracting, from the whole or a partial area of the document, a character block consisting of a combination of one or more characters lying within a prescribed shape that takes into account the direction in which the text is read and the direction orthogonal to it; and
index output means for associating the character block with its appearance position in the document in which it appears and outputting them to index storage means.

前記文字ブロックの中から、１文字以上の特定文字列を含むものだけを選別して以降の処理対象とする文字ブロック選別手段を更に有する
請求項１記載の検索装置。 The search device according to claim 1, further comprising character block selection means for selecting, from among the character blocks, only those that contain a specific character string of one or more characters as the targets of subsequent processing.

前記特定文字列は、
予め指定された分析対象のドキュメントの各領域に満遍なく出現する１文字以上の文字列とする
請求項２記載の検索装置。 The search device according to claim 2, wherein the specific character string is a character string of one or more characters that appears evenly throughout the areas of a document designated in advance for analysis.

ドキュメントに存在する複数の文字ブロックを含む範囲をリージョンとして同一の検索結果候補として集計を行い、集計結果が一定基準を満たす検索結果候補リージョン群を検索結果として特定する検索手段を更に有する
請求項１記載の検索装置。 The search device according to claim 1, further comprising search means for treating a range containing a plurality of character blocks in a document as a region, tallying hits per region as a single search result candidate, and identifying as the search result the group of candidate regions whose tally satisfies a certain criterion.

ドキュメント内の特定位置に関連付けられたコンテンツが検索結果候補である場合に、
同一コンテンツが関連付けられた位置群を同一の検索結果候補として集計を行い、集計結果が一定基準を満たす検索結果候補群を検索結果として特定する検索手段を更に有する
請求項１記載の検索装置。 The search device according to claim 1, further comprising search means for, when content associated with a specific position in a document is a search result candidate, tallying the group of positions associated with the same content as a single search result candidate and identifying as the search result the group of candidates whose tally satisfies a certain criterion.

前記特定文字列は、文字が撮影された画像から文字情報を抽出する光学文字認識装置が利用する認識辞書記憶手段を参照して取得する
請求項２記載の検索装置。 The search device according to claim 2, wherein the specific character string is obtained by referring to recognition dictionary storage means used by an optical character recognition device that extracts character information from an image in which characters have been captured.

前記特定文字列は、
予め指定された分析対象のドキュメントに所定の回数以上出現しない１文字以上の文字列とする
請求項２記載の検索装置。 The search device according to claim 2, wherein the specific character string is a character string of one or more characters that does not appear a predetermined number of times or more in a document designated in advance for analysis.

前記特定文字列は、
予め指定されたシンプルな形状の文字からなる１文字以上の文字列とする
請求項２記載の検索装置。 The search device according to claim 2, wherein the specific character string is a character string of one or more characters composed of characters of simple shape designated in advance.

あるドキュメント内の一部領域を検索クエリとして受け付ける入力手段と、
前記検索クエリから、１文字以上の組み合わせからなるクエリ文字ブロックを抽出するクエリ文字ブロック抽出手段と、
前記クエリ文字ブロックに基づいて、前記インデックス記憶手段を検索し、その検索結果を出力する検索手段と、
を更に有し、
前記検索手段は、
前記クエリ文字ブロックに基づいて、前記インデックス記憶手段を検索し、その検索結果を出力する
請求項１記載の検索装置。 The search device according to claim 1, further comprising:
input means for accepting a partial area of a document as a search query;
query character block extraction means for extracting, from the search query, a query character block consisting of a combination of one or more characters; and
search means for searching the index storage means based on the query character block and outputting the search result,
wherein the search means searches the index storage means based on the query character block and outputs the search result.

前記入力手段は、
あるドキュメント内の一部領域を撮影した画像を、一般的な光学文字認識装置を用いて該画像に写っている文字列をテキストデータに変換した検索クエリを受け付ける手段を含む
請求項９記載の検索装置。 The search device according to claim 9, wherein the input means includes means for accepting a search query obtained by using a general optical character recognition device to convert the character strings appearing in an image of a partial area of a document into text data.

前記検索結果であるドキュメント及び該ドキュメント内における位置に関連付けられたコンテンツを、検索結果と併せて、あるいは、単独で出力する手段を更に有する
請求項９記載の検索装置。 The search device according to claim 9, further comprising means for outputting the search result document and the content associated with the position in the document together with the search result or independently.

前記クエリ文字ブロックの中から、１文字以上の特定文字列を含むものだけを選別して以降の処理対象とするクエリ文字ブロック選別手段を更に有する
請求項９記載の検索装置。 The search device according to claim 9, further comprising query character block selection means for selecting, from among the query character blocks, only those that contain a specific character string of one or more characters as the targets of subsequent processing.

光学文字認識装置が利用する認識辞書に登録されている１文字以上の文字列を特定文字列とする
請求項１２記載の検索装置。 The search device according to claim 12, wherein a character string of one or more characters registered in a recognition dictionary used by the optical character recognition device is set as a specific character string.

前記特定文字列は、
予め指定された分析対象のドキュメントの各領域に満遍なく出現する１文字以上の文字列とする
請求項１２記載の検索装置。 The search device according to claim 12, wherein the specific character string is a character string of one or more characters that appears evenly throughout the areas of a document designated in advance for analysis.

前記特定文字列は、
予め指定された分析対象のドキュメントに所定の回数以上出現しない１文字以上の文字列とする
請求項１２記載の検索装置。 The search device according to claim 12, wherein the specific character string is a character string of one or more characters that does not appear a predetermined number of times or more in a document designated in advance for analysis.

前記特定文字列は、
予め指定されたシンプルな形状の文字からなる１文字以上の文字列とする
請求項１２記載の検索装置。 The search device according to claim 12, wherein the specific character string is a character string of one or more characters composed of characters of simple shape designated in advance.

改ページや改行位置が確定しているドキュメント内の一部領域を検索クエリとして、該領域が出現するドキュメント及び該ドキュメント内における位置を取得する検索要求に応えるための検索インデックスを作成し、検索を行う装置における検索方法であって、
ドキュメント入力手段が、インデックス作成対象のドキュメントの入力を受け付けるドキュメント入力ステップと、
文字ブロック抽出手段が、ドキュメントの全体または一部領域から、文章を読む方向とそれに直交する方向を考慮した規定の形状内にある１文字以上の文字の組み合わせからなる文字ブロックを抽出する文字ブロック抽出ステップと、
インデックス出力手段が、前記文字ブロックと該文字ブロックが出現するドキュメントにおける出現位置を関連付けてインデックス記憶手段に出力するインデックス出力ステップと、
を行うことを特徴とする検索方法。 A search method in a device that creates a search index for responding to search requests which use a partial area of a document whose page break and line break positions are fixed as a search query and which obtain the document in which the area appears and the position within that document, and that performs searches, the search method comprising:
a document input step in which document input means accepts input of a document to be indexed;
a character block extraction step in which character block extraction means extracts, from the whole or a partial area of the document, a character block consisting of a combination of one or more characters lying within a prescribed shape that takes into account the direction in which the text is read and the direction orthogonal to it; and
an index output step in which index output means associates the character block with its appearance position in the document in which it appears and outputs them to index storage means.

入力手段が、あるドキュメント内の一部領域を検索クエリとして受け付ける入力ステップと、
クエリ文字ブロック抽出手段が、前記検索クエリから、１文字以上の組み合わせからなるクエリ文字ブロックを抽出するクエリ文字ブロック抽出ステップと、
検索手段が、前記クエリ文字ブロックに基づいて、前記インデックス記憶手段を検索し、その検索結果を出力する検索ステップと、
を更に行う請求項１７記載の検索方法。 The search method according to claim 17, further comprising:
an input step in which input means accepts a partial area of a document as a search query;
a query character block extraction step in which query character block extraction means extracts, from the search query, a query character block consisting of a combination of one or more characters; and
a search step in which search means searches the index storage means based on the query character block and outputs the search result.

文字ブロック選別手段が、前記文字ブロックの中から、１文字以上の特定文字列を含むものだけを選別して以降の処理対象とする文字ブロック選別ステップを更に行う
請求項１７記載の検索方法。 The search method according to claim 17, further comprising a character block selection step in which character block selection means selects, from among the character blocks, only those that contain a specific character string of one or more characters as the targets of subsequent processing.

前記特定文字列を、予め指定された分析対象のドキュメントの各領域に満遍なく出現する１文字以上の文字列とする
請求項１９記載の検索方法。 The search method according to claim 19, wherein the specific character string is a character string of one or more characters that uniformly appears in each region of a document to be analyzed specified in advance.

クエリ文字ブロック選別手段が、前記クエリ文字ブロックの中から、１文字以上の特定文字列を含むものだけを選別して以降の処理対象とするクエリ文字ブロック選別ステップを更に行う
請求項１８記載の検索方法。 The search method according to claim 18, further comprising a query character block selection step in which query character block selection means selects, from among the query character blocks, only those that contain a specific character string of one or more characters as the targets of subsequent processing.

請求項１乃至１６のいずれか１項に記載の検索装置を構成する各手段としてコンピュータを機能させるための検索プログラム。 A search program for causing a computer to function as each of the means constituting the search device according to any one of claims 1 to 16.