JP7257204B2

JP7257204B2 - Character string search device, character string search method, and character string search program

Info

Publication number: JP7257204B2
Application number: JP2019053038A
Authority: JP
Inventors: 清孝宮井; 清孝粕渕; 明子吉田; 一博北村; 万理寺田; 光規梅原
Original assignee: Screen Holdings Co Ltd
Current assignee: Screen Holdings Co Ltd
Priority date: 2019-03-20
Filing date: 2019-03-20
Publication date: 2023-04-13
Anticipated expiration: 2039-03-20
Also published as: JP2020154776A

Description

The present invention relates to a misrecognized character table created by collecting characters that are likely to be misrecognized in optical character recognition (hereinafter "OCR"), to a method for creating the table, and to a character string search device and the like that use the table.

Various techniques have been proposed to improve the accuracy of OCR (Optical Character Recognition). For example, a technique is known in which an image to be read for OCR is subjected to processing such as distortion correction and removal of noise (background patterns, dust, and the like) so that the text portion can be extracted accurately (see, for example, Patent Document 3). Techniques that improve OCR accuracy by machine learning on a large amount of training data, and techniques that create a dictionary of words that OCR tends to get wrong (hereinafter "misrecognized word dictionary"), have also been considered.

In connection with the misrecognized character table and the character string search device disclosed in the present application, Patent Document 1 describes a document image retrieval system that searches the OCR results of document images with an input search character string, using a misrecognized character list that manages groups of characters likely to be misrecognized in OCR (see paragraph [0024], etc.). When no search result is obtained with the input search character string, this system replaces part of the string with a wildcard and searches again; when no result is obtained even with the search string containing the wildcard, it replaces some of the characters in the search string with other misrecognized characters based on the misrecognized character list and searches again (see paragraphs [0068], [0071], etc.). Patent Document 2 describes systems, methods, and the like for improving OCR recognition accuracy using a training set generated from OCR results (for example, for each character identified by the OCR module, the mean and variance of the imagelets identified as corresponding to that character) (see paragraphs [0014] to [0020], etc.).

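Patent Document 1: Japanese Patent Application Laid-Open No. 2004-213091
Patent Document 2: Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2013-509664
Patent Document 3: Japanese Patent Application Laid-Open No. 2005-196563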

Systems have been developed that search text data obtained by OCR using a keyword (an input character string serving as a search term). To obtain the desired search results in such a system, the usual approach is to have a person visually check the text data obtained as the OCR result, correct the recognition errors, and use the corrected text as the search target.

However, visually checking all of the text data obtained as the OCR result is extremely costly. In addition, if a recognition error in the OCR result is left uncorrected, such a system cannot return correct search results. When the misrecognized word dictionary described above is used, the dictionary must be updated every time a new word or phrase appears, so the cost of doing so, that is, the maintenance cost, is incurred continuously. Furthermore, even for the same character, the OCR recognition result may differ depending on the printer, font, and the like that were used, and it is difficult to account for such differences in a conventional misrecognized word dictionary or misrecognized character list.

It is therefore desirable to provide a method and a device that allow the above search to be performed properly and at low cost, without visual checking or continuous dictionary updates, even when the printer, font, and the like used to print the image subjected to OCR for generating the text data differ.

A first aspect of the present invention is a character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
a misrecognized character table including (i) misrecognized character association data that associates each high-probability misrecognized character, which is a character whose likelihood of being misrecognized by an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters is regarded as exceeding a predetermined allowable range, with a misrecognized character, which is the character obtained as the recognition result when the image of that character was misrecognized by the OCR device, and (ii) print form association data that associates each high-probability misrecognized character with the print form of the image of that character at the time the image was misrecognized by the OCR device;
a search term creation unit that, when any character in an externally supplied input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, creates a search term by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching the search term obtained by the search term creation unit.
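The search term creation described in the first aspect can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the table entries, the function names, and the choice to realize the wildcard as a single-character regular-expression wildcard (`.`); the source does not specify how the wildcard is implemented.

```python
import re

# Hypothetical table: each high-probability misrecognized character maps to
# the characters an OCR device returned for it (illustrative entries only).
MISRECOGNITION_TABLE = {
    "l": {"1", "I"},
    "O": {"0"},
}


def registered_characters(table):
    """Collect every character registered as a key or as a misrecognition result."""
    chars = set(table)
    for results in table.values():
        chars |= results
    return chars


def make_search_pattern(input_string, table):
    """Replace each registered character with a one-character wildcard.

    Characters that are not registered are kept literally, so an input with
    no registered characters is searched for as-is.
    """
    registered = registered_characters(table)
    parts = ["." if ch in registered else re.escape(ch) for ch in input_string]
    return re.compile("".join(parts))


def search(input_string, ocr_text, table=MISRECOGNITION_TABLE):
    """Return all substrings of the OCR text that match the search term."""
    return make_search_pattern(input_string, table).findall(ocr_text)
```

With these assumed entries, searching for `model` also finds `mode1`, because the final `l` is registered in the table and is therefore replaced by the wildcard.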

A second aspect of the present invention is the first aspect of the present invention, wherein the search term creation unit:
creates a search term by replacing the matching character in the input character string with a wildcard when any character in the input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table and the print form associated with the matching character matches the print form of the target image; and
uses the input character string as the search term when any character in the input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table but the print form associated with the matching character does not match the print form of the target image.
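A minimal sketch of the print form condition in the second aspect follows. The `PrintForm` fields, the table entries, and the per-character handling are assumptions for illustration; the source only states that a print form is associated with each registered character and compared with the print form of the target image.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PrintForm:
    printer: str  # hypothetical printer identifier
    paper: str    # hypothetical paper identifier
    font: str     # hypothetical font identifier


# Hypothetical entries:
# (high-probability character, misrecognized character, print form at misrecognition)
ENTRIES = [
    ("l", "1", PrintForm("P1", "S2", "FontA")),
    ("O", "0", PrintForm("P3", "S1", "FontB")),
]


def make_search_pattern(input_string, entries, target_form):
    """Wildcard a character only if it is registered for the target image's print form."""
    active = set()
    for likely, wrong, form in entries:
        if form == target_form:
            active.update((likely, wrong))
    # Characters registered only for other print forms are kept literally.
    parts = ["." if ch in active else re.escape(ch) for ch in input_string]
    return re.compile("".join(parts))
```

Under these assumptions, an input of `lOg` becomes `.Og` for a target image printed with `PrintForm("P1", "S2", "FontA")`, and stays `lOg` for a print form that has no registered entries, which is how unnecessary wildcard matches are avoided.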

A third aspect of the present invention is a character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
a misrecognized character table including (i) misrecognized character association data that associates each high-probability misrecognized character, which is a character whose likelihood of being misrecognized by an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters is regarded as exceeding a predetermined allowable range, with a misrecognized character, which is the character obtained as the recognition result when the image of that character was misrecognized by the OCR device, and (ii) print form association data that associates each high-probability misrecognized character with the print form of the image of that character at the time the image was misrecognized by the OCR device;
a search term creation unit that, when any character in an externally supplied input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, uses the input character string as a search term and also creates a search term by replacing the matching character in the input character string with another character that the misrecognized character table associates with the matching character, and, when no character in the input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching any of the search terms obtained by the search term creation unit,
wherein, when no character string matching any of the search terms obtained by the search term creation unit can be found in the text data obtained as the OCR result and any character in the input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, the search unit searches the text data obtained as the OCR result for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard.
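The two-stage search of the third aspect (substituted search terms first, wildcard only as a fallback) can be sketched as follows. The table entries and helper names are hypothetical, and generating every combination of substitutions with `itertools.product` is an assumption; the source only says that a matching character is replaced with another character associated with it.

```python
import re
from itertools import product

# Hypothetical table: high-probability character -> misrecognized characters.
TABLE = {"l": {"1"}, "O": {"0"}}


def confusable(ch, table):
    """Return ch together with every character the table associates with it."""
    group = {ch}
    for likely, wrongs in table.items():
        if ch == likely or ch in wrongs:
            group |= {likely} | wrongs
    return group


def candidate_terms(input_string, table):
    """The input string plus variants with registered characters substituted."""
    options = [sorted(confusable(ch, table)) for ch in input_string]
    return ["".join(combo) for combo in product(*options)]


def search(input_string, ocr_text, table=TABLE):
    # First pass: exact search for the input string and its substituted variants.
    hits = [term for term in candidate_terms(input_string, table) if term in ocr_text]
    if hits:
        return hits
    # Fallback: replace registered characters with a one-character wildcard.
    registered = set(table).union(*table.values())
    pattern = "".join("." if ch in registered else re.escape(ch) for ch in input_string)
    return re.findall(pattern, ocr_text)
```

Trying the substituted terms first keeps the result set narrow; the wildcard is used only when the table's known confusions do not explain the OCR output, for example when `l` was read as `I` and that pair is not registered.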

A fourth aspect of the present invention is a character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
a misrecognized character table including (i) misrecognized character association data that associates each high-probability misrecognized character, which is a character whose likelihood of being misrecognized by an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters is regarded as exceeding a predetermined allowable range, with a misrecognized character, which is the character obtained as the recognition result when the image of that character was misrecognized by the OCR device, and (ii) print form association data that associates each high-probability misrecognized character with the print form of the image of that character at the time the image was misrecognized by the OCR device;
a search term creation unit that, when any character in an externally supplied input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table and the print form associated with the matching character matches the print form of the target image, uses the input character string as a search term and also creates a search term by replacing the matching character in the input character string with another character that the misrecognized character table associates with the matching character, and, when any character in the input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table but the print form associated with the matching character does not match the print form of the target image, uses only the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching any of the search terms obtained by the search term creation unit,
wherein, when no character string matching any of the search terms obtained by the search term creation unit can be found in the text data obtained as the OCR result and any character in the input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, the search unit searches the text data obtained as the OCR result for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard.

Other aspects of the present invention are apparent from the above aspects and from the description of the embodiment and its modifications given below, so their description is omitted.

According to the first aspect of the present invention, when any character in an externally supplied input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, a search term is created by replacing the matching character in the input character string with a wildcard, and the text data obtained as the OCR result is searched for a character string matching that search term. Search omissions can thus be suppressed even when the text data obtained as the OCR result contains misrecognized characters. A character is registered in the misrecognized character table used to create such wildcard search terms whenever the OCR device misrecognizes that character in an image printed in any of a plurality of different print forms. Search omissions can therefore be reliably suppressed even when the print form of the OCR target image used to create the text data to be searched differs. In addition, when the misrecognized character table is used in a character string search device, the work of visually checking the text data created by the OCR device as the search target and the work of updating a misrecognized word dictionary become unnecessary, and the cost of this work is eliminated.

According to the second aspect of the present invention, when any character in an externally supplied input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table and the print form associated with the matching character matches the print form of the OCR target image, a search term is created by replacing the matching character in the input character string with a wildcard. When a character in the input character string matches any of the registered characters but the print form associated with the matching character does not match the print form of the OCR target image, the input character string is used as the search term and no wildcard is used. The second aspect thus provides the same effects as the first aspect while suppressing the output of superfluous or inappropriate search results.

According to the third aspect of the present invention, when any character in an externally supplied input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table, the input character string is used as a search term, and a further search term is created by replacing the matching character in the input character string with another character that the misrecognized character table associates with the matching character. This provides the same effects as the first aspect of the present invention; the third aspect, however, is more advantageous in suppressing the output of superfluous or inappropriate search results. Further, according to the third aspect, when no character string matching any of the search terms obtained by the search term creation unit can be found in the text data obtained as the OCR result and any character in the input character string matches any of the registered high-probability misrecognized characters and misrecognized characters, the text data obtained as the OCR result is searched for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard. Search omissions are thus suppressed while the output of superfluous or inappropriate search results is also suppressed.

According to the fourth aspect of the present invention, when any character in an externally supplied input character string matches any of the high-probability misrecognized characters and misrecognized characters registered in the misrecognized character table and the print form associated with the matching character matches the print form of the OCR target image, the input character string is used as a search term, and a further search term is created by replacing the matching character in the input character string with another character that the misrecognized character table associates with the matching character. When a character in the input character string matches any of the registered characters but the print form associated with the matching character does not match the print form of the OCR target image, only the input character string is used as the search term. Further, according to the fourth aspect, when no character string matching any of the search terms obtained by the search term creation unit can be found in the text data obtained as the OCR result and any character in the input character string matches any of the registered high-probability misrecognized characters and misrecognized characters, the text data obtained as the OCR result is searched for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard. The fourth aspect thus provides the same effects as the third aspect while further suppressing the output of superfluous or inappropriate search results.

The effects of other aspects of the present invention are apparent from the description of the effects of the above aspects and of the embodiment and its modifications given below, so their description is omitted.

FIG. 1 is a block diagram showing the configuration of a system used to create a misrecognized character table according to one embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of a computer that functions as a misrecognized character table creation device in the system used to create the misrecognized character table.
FIG. 3 is a flowchart showing a process for creating the misrecognized character table (misrecognized character table creation process).
FIG. 4 is a diagram showing an example of the misrecognized character table.
FIG. 5 is a diagram for explaining an example of a character string search device provided with the misrecognized character table.
FIG. 6 is a block diagram showing the hardware configuration of the character string search device shown in FIG. 5.
FIG. 7 is a diagram for explaining the search terms used in the character string search device shown in FIG. 5.
FIG. 8 is a flowchart showing an example of character string search processing in the character string search device shown in FIG. 5.
FIG. 9 is a flowchart showing another example of character string search processing in the character string search device shown in FIG. 5.
FIG. 10 is a flowchart showing an example of OCR adjustment amount table creation processing that uses the misrecognized character table.
FIG. 11 is a diagram showing an example of an OCR adjustment amount table created by the OCR adjustment amount table creation processing shown in FIG. 10.
FIG. 12 is a flowchart showing OCR processing (optical character recognition processing) that uses the OCR adjustment amount table created by the OCR adjustment amount table creation processing shown in FIG. 10.
FIG. 13 is a flowchart showing character string search processing in a first modification of the character string search device shown in FIG. 5.
FIG. 14 is a flowchart showing character string search processing in a second modification of the character string search device shown in FIG. 5.
FIG. 15 is a flowchart showing character string search processing in a third modification of the character string search device shown in FIG. 5.

Embodiments of the present invention are described below with reference to the accompanying drawings.

<1. Overall configuration>
FIG. 1 is a block diagram showing the configuration of a system used to create a misrecognized character table according to one embodiment of the present invention. The system comprises a computer 10, a scanner 20, and first to third printers P1 to P3. The computer 10 and the first to third printers P1 to P3 are communicably connected via a LAN (Local Area Network), and the scanner 20 is connected to the computer 10. By executing a predetermined program, the computer 10 functions as a print control device, as an OCR device, and as a misrecognized character table creation device. The first to third printers P1 to P3 have mutually different resolutions, and each printer Pi (i = 1, 2, 3) is configured so that it can selectively use three types of paper, S1 to S3, as recording media. These three types of paper differ from one another in how readily ink bleeds on them. In the example shown in FIG. 1, the computer 10 can selectively use any of the three printers P1 to P3, and each of the printers P1 to P3 can selectively use any of the three types of paper S1 to S3. However, the number of selectable printers may be two, or four or more, and the number of paper types selectable on each printer may be two, or four or more. The types of paper that can be used may also differ between printers.
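The combinations of printer and paper available in a configuration like the one in FIG. 1 can be enumerated with a short sketch. The dictionary layout and the function name are assumptions for illustration; only the identifiers P1 to P3 and S1 to S3 come from the text. A mapping per printer is used because the text allows the usable paper types to differ between printers.

```python
# Hypothetical description of the FIG. 1 setup: each printer lists the paper
# types it can use. All three printers accept S1 to S3 in the example.
PAPERS_BY_PRINTER = {
    "P1": ["S1", "S2", "S3"],
    "P2": ["S1", "S2", "S3"],
    "P3": ["S1", "S2", "S3"],
}


def print_forms(papers_by_printer):
    """List every (printer, paper) combination that can be used for test printing."""
    return [
        (printer, paper)
        for printer, papers in papers_by_printer.items()
        for paper in papers
    ]
```

For the example configuration this gives nine combinations; removing a paper type from one printer's list reduces the count accordingly.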

FIG. 2 is a block diagram showing the hardware configuration of the computer 10. The computer 10 has a main body 11, an auxiliary storage device 12, an optical disc drive 13, a display unit 14, a keyboard 15, and a mouse 16. The main body 11 includes a CPU 111, a memory 112, a first disk interface unit 113, a second disk interface unit 114, a display control unit 115, an input interface unit 116, and a network interface unit 117, which are connected to one another via a system bus. The auxiliary storage device 12 is connected to the first disk interface unit 113. The optical disc drive 13 is connected to the second disk interface unit 114. The display unit (display device) 14 is connected to the display control unit 115. The keyboard 15 and the mouse 16 are connected to the input interface unit 116. A network (LAN) 3 is connected to the network interface unit 117. The auxiliary storage device 12 is a magnetic disk device or the like. An optical disc 17, which is a computer-readable recording medium such as a DVD (Digital Versatile Disc) or a CD-ROM (Compact Disc Read Only Memory), is inserted into the optical disc drive 13. The display unit 14 is a liquid crystal display or the like and is used to display information desired by the user. The keyboard 15 and the mouse 16 are used by the user to input instructions to the computer 10.

The auxiliary storage device 12 stores a program 18 for creating an erroneously recognized character table (hereinafter referred to as the "erroneously recognized character table creation program"). As shown in FIG. 4, the erroneously recognized character table collects characters that are highly likely to be misrecognized in OCR as high-misrecognition-probability characters, and associates each of them with several pieces of information (identification information of the printer, the font, and so on) specifying the print form used to create the image to be subjected to OCR (hereinafter referred to as the "OCR target image"); details are described later. The CPU 111 reads the erroneously recognized character table creation program 18 stored in the auxiliary storage device 12 into the memory 112 and executes it, thereby realizing the various functions for creating the erroneously recognized character table. The memory 112 includes a RAM (Random Access Memory) and a ROM (Read Only Memory), and functions as a work area for the CPU 111 to execute the program 18. The program 18 is provided stored on a computer-readable recording medium (non-transitory recording medium) such as the DVD mentioned above. That is, the user, for example, purchases an optical disc 17 as the recording medium of the program 18, inserts it into the optical disc drive 13, reads the program 18 from the optical disc 17, and installs it on the auxiliary storage device 12. Alternatively, the program 18 transmitted via a network such as the LAN 3 may be received by the network interface unit 117 and installed on the auxiliary storage device 12. When the CPU 111 executes the program 18, two functions are realized: a function of controlling the printers P1 to P3 to print the OCR target character images needed to create the erroneously recognized character table, and an OCR function of optically reading the printed character images and recognizing the characters.

<2. Erroneously Recognized Character Table Creation Process>
FIG. 3 is a flowchart showing the process executed by the computer 10 on the basis of the erroneously recognized character table creation program in order to create the erroneously recognized character table according to the present embodiment (hereinafter referred to as the "erroneously recognized character table creation process"). That is, in the system shown in FIG. 1, the CPU 111 in the computer 10 operates as described below in accordance with that program. It is assumed that when the program is started, that is, at the start of the erroneously recognized character table creation process, the first to third printers P1 to P3 are all in the unused state, the papers S1 to S3 usable for printing on each printer Pi (i = 1, 2, 3) are all in the unused state, and all fonts usable on each printer Pi are also in the unused state.

First, the CPU 111 sets one of the unused printers among the first to third printers P1 to P3 in the system of FIG. 1 as the printer to be used (hereinafter referred to as the "printer in use") Ps (step S101). Immediately before step S101 is executed for the first time after the start of the erroneously recognized character table creation process, all of the printers P1 to P3 are in the unused state. Next, one of the unused fonts among the fonts usable on the printer in use Ps (normally several fonts such as "Mincho" and "Gothic" are usable) is set as the font in use Fs (step S102). Then, one of the unused paper types among the three types of paper S1 to S3 usable on the printer in use Ps is set as the paper in use Ss (step S103).

After that, the printer in use Ps prints all of the OCR target characters as character images on the paper in use Ss using the font in use Fs (step S104). Here, the OCR target characters are all the characters to be recognized from images read by the scanner 20, that is, all the characters that can be recognition targets of the OCR device 80 realized by the scanner 20, the computer 10, and the erroneously recognized character table creation program executed on it.

Next, the image of the OCR target characters printed in step S104 is read by the scanner 20 as the target image (step S106). Each character in the target image is then identified by pattern recognition, a character code is generated for it, and the codes are output as OCR result characters (step S108). The paper on which the target image to be read in step S106 is printed is assumed to be moved manually to the reading position of the scanner 20. Alternatively, a mechanism that moves the printed paper from the printer in use Ps to the scanner 20 may be provided and controlled by the computer 10.

After that, the following processing is performed to determine the characters to be registered in the erroneously recognized character table.
First, attention is focused on one of the OCR target characters and on the one OCR result character corresponding to it (step S110). Immediately after step S108 is executed, all of the OCR target characters and all of the OCR result characters are in the unfocused state.

Next, it is determined whether the two focused characters match each other, that is, whether the code of the focused OCR target character (hereinafter referred to as the "focused OCR target character") matches the code of the focused OCR result character (hereinafter referred to as the "focused OCR result character") (step S112). If the two characters match, it is assumed that no misrecognition has occurred and the process proceeds to step S116. If they differ, it is assumed that misrecognition has occurred and the process proceeds to step S114.

When the process proceeds to step S114, the following items are registered in the erroneously recognized character table (hereinafter also simply referred to as the "table") in association with one another, and the process then proceeds to step S116.
(1) The code of the focused OCR target character is stored in the table, thereby registering that character as a "high-misrecognition-probability character".
(2) The code of the focused OCR result character is stored in the table, thereby registering that character as an "erroneously recognized character".
(3) Data indicating the printer Ps, font Fs, and paper Ss used to print the target image is stored, thereby registering the print form specified by that printer Ps, font Fs, and paper Ss.
For example, as shown in FIG. 4, suppose that "ソ" (the character corresponding to "SO" in Roman letters), one of the OCR target characters, is recognized as "ン" (the character corresponding to "N" in Roman letters), that is, the focused OCR result character for the focused OCR target character "ソ" is "ン". In that case "ソ" is registered as a high-misrecognition-probability character (character 1), "ン" is registered in association with it as an erroneously recognized character (character 2), and the printer Ps, font Fs, and paper Ss specifying the print form of the target image containing the OCR target character "ソ" are also registered. In the example shown in FIG. 4, "Printer P1" is registered as the printer Ps, "Mincho" as the font Fs, and "Paper S1" as the paper Ss.
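The registration in steps S112 and S114 can be sketched in Python as follows. This is only an illustrative sketch: the names `PrintForm`, `TableEntry`, and `register_if_misrecognized` are hypothetical and do not appear in the specification, and the sketch stores one row per observed mismatch rather than reproducing the exact column layout of FIG. 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrintForm:
    """Print form of the target image: item (3) of step S114."""
    printer: str
    font: str
    paper: str


@dataclass(frozen=True)
class TableEntry:
    """One registered row (hypothetical layout, not the layout of FIG. 4)."""
    target_char: str      # item (1): high-misrecognition-probability character
    recognized_char: str  # item (2): erroneously recognized character
    form: PrintForm       # item (3): printer, font, paper


def register_if_misrecognized(table, target_char, result_char, form):
    """Step S112: compare character codes. Step S114: register on mismatch."""
    if ord(target_char) != ord(result_char):
        table.append(TableEntry(target_char, result_char, form))
        return True
    return False


table = []
form = PrintForm(printer="P1", font="Mincho", paper="S1")
register_if_misrecognized(table, "ソ", "ン", form)  # mismatch: registered
register_if_misrecognized(table, "ア", "ア", form)  # match: nothing registered
```

After these two calls the table holds a single entry pairing "ソ" with "ン" under printer P1, Mincho, and paper S1, which mirrors the FIG. 4 example described above.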

In step S116, it is determined whether any unfocused OCR target character remains. If one remains, the process returns to step S110. Steps S110 to S116 are then repeated until no unfocused OCR target character remains, whereupon the process proceeds to step S118. At this point, it has been determined for every OCR target character in the target image of one print form whether it was misrecognized, and registrations (1) to (3) above have been made in the erroneously recognized character table for each misrecognized character.

In step S118, it is determined whether all types of paper usable on the printer in use Ps (the three types of paper S1 to S3 in the system of FIG. 1) have been used. If not all types of paper have been used, all OCR target characters are returned to the unfocused state (step S119) and the process returns to step S103. Steps S103 to S119 are then repeated until all types of paper S1 to S3 have been used, whereupon the process proceeds to step S120.

In step S120, it is determined whether all fonts usable on the printer in use Ps have been used. If not all fonts have been used, all OCR target characters are returned to the unfocused state and all types of paper usable on the printer in use Ps are returned to the unused state (step S121), and the process returns to step S102. Steps S102 to S121 are then repeated until all fonts have been used, whereupon the process proceeds to step S122.

In step S122, it is determined whether all printers usable for creating the erroneously recognized character table (the three printers P1 to P3 in the system of FIG. 1) have been used. If not all printers have been used (that is, if any printer is still unused), all OCR target characters are returned to the unfocused state and all types of paper and all fonts usable on the printer in use Ps are returned to the unused state (step S123), and the process returns to step S101. Steps S101 to S123 are then repeated until all printers have been used, whereupon the erroneously recognized character table creation process ends.

When the erroneously recognized character table creation process ends in this way, all of the OCR target characters have been printed in every print form specified by the combination of printer, font, and paper type. Each character in the target images obtained by that printing has been identified by pattern recognition (OCR) and output as an OCR result, the occurrence of misrecognition has been determined from those OCR results, and the erroneously recognized character table has been created on the basis of those determinations. FIG. 4 shows an example of an erroneously recognized character table created in this way.
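The overall flow of FIG. 3 amounts to three nested loops over printers, fonts, and paper types, with a character-by-character comparison inside. The Python sketch below illustrates this under stated assumptions: `build_table`, `fake_ocr`, and the tuple layout of a table row are hypothetical names chosen for illustration, the printing, scanning, and OCR of steps S104 to S108 are replaced by a caller-supplied stand-in, and the OCR results are assumed to come back one per target character in the same order.

```python
def build_table(printers, fonts, papers, target_chars, print_scan_ocr):
    """Run every print form once and collect misrecognized characters.

    fonts[p] and papers[p] list what printer p can use. print_scan_ocr is a
    hypothetical callable standing in for steps S104 to S108.
    """
    table = []
    for printer in printers:                # S101, loop closed by S122
        for font in fonts[printer]:         # S102, loop closed by S120
            for paper in papers[printer]:   # S103, loop closed by S118
                results = print_scan_ocr(printer, font, paper, target_chars)
                # S110 to S116: examine each target/result pair in turn
                for target, result in zip(target_chars, results):
                    if target != result:    # S112: character codes differ
                        # S114: register both characters and the print form
                        table.append((target, result, printer, font, paper))
    return table


def fake_ocr(printer, font, paper, chars):
    """Toy stand-in: misreads 'ソ' as 'ン' only for (P1, Mincho, S1)."""
    if (printer, font, paper) == ("P1", "Mincho", "S1"):
        return ["ン" if c == "ソ" else c for c in chars]
    return list(chars)


printers = ["P1", "P2"]
fonts = {"P1": ["Mincho", "Gothic"], "P2": ["Mincho"]}
papers = {"P1": ["S1", "S2"], "P2": ["S1"]}
table = build_table(printers, fonts, papers, ["ソ", "ア"], fake_ocr)
print(table)  # [('ソ', 'ン', 'P1', 'Mincho', 'S1')]
```

The explicit resets to the unfocused and unused states in steps S119, S121, and S123 have no counterpart in the sketch because each nested `for` loop starts afresh on every pass of the enclosing loop.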

<3. Character String Search Device>
The erroneously recognized character table according to the present embodiment described above (see FIG. 4) can be used in a character string search device whose search target is text data consisting of characters recognized by an OCR device from an image containing text, and which looks in that text data for character strings matching a character string input from outside.

<3.1 Configuration>
FIG. 5 is a schematic diagram showing the configuration of a search system including such a character string search device. The search system comprises a search terminal device 30 with which a user inputs a character string to be searched for, a character string search device 40 that looks in the text data serving as the search target for character strings matching the character string input as a search term at the search terminal device 30 (hereinafter referred to as the "input character string"), and an OCR device 80 for creating that text data. The search terminal device 30 and the character string search device 40 are connected to the Internet 5 and can communicate with each other via the Internet 5. The character string search device 40 and the OCR device 80 are communicably connected to each other by a LAN. Although the search system of FIG. 5 includes one search terminal device 30, it may include a plurality of search terminal devices.

The search terminal device 30 is realized by executing a predetermined program on a personal computer (hereinafter abbreviated as "PC"). On the basis of that program, when the search terminal device 30 receives, through the user's input operation, a character string to be searched for in the text data Dtx in the character string search device 40 as the input character string, it sends the input character string to the character string search device 40 via the Internet 5, and then receives the search results based on that input character string from the character string search device 40 and displays them.

As shown in FIG. 5, the character string search device 40 comprises a search processing device 45, an auxiliary storage device 46 connected to it, and a network attached storage (hereinafter referred to as "NAS") 48 configured to be accessible from the search processing device 45 and the OCR device 80 via the LAN 3. The search processing device 45 is realized by executing a character string search program Spg, described later, on a PC. The auxiliary storage device 46 and the NAS 48 are configured using magnetic disks or the like. The auxiliary storage device 46 stores the erroneously recognized character table Etbl created by the erroneously recognized character table creation process described above and the character string search program Spg described later, and the NAS 48 stores the text data Dtx serving as the search target.

The OCR device 80 comprises an OCR processing device 85 and a scanner 86 connected to it. The OCR processing device 85 is realized using a PC. On the basis of an OCR program, it reads an image containing text with the scanner 86 and identifies the characters contained in the image by pattern recognition, thereby generating text data as the OCR result. This text data is sent to the NAS 48 via the LAN 3 and stored there as the text data Dtx to be searched by the character string search device 40. The text data Dtx also contains information indicating the output conditions under which the OCR target image read by the scanner 86 was printed, that is, the print form specified by the printer, font, and paper type used to print that OCR target image. The OCR program used to realize the OCR processing device 85 is not particularly limited, and a known program can be used.

<3.2 Search Processing Device and Character String Search Process>
FIG. 6 is a block diagram showing the configuration of the PC serving as the hardware of the search processing device 45 in the character string search device 40 shown in FIG. 5 (the hardware configuration of the search processing device 45). This hardware configuration differs from that of the computer 10 in FIG. 2 in that it has an external auxiliary storage device 46 in place of the built-in auxiliary storage device 12. In other respects it is the same as the hardware configuration of the computer 10 in FIG. 2, so identical parts are given the same reference numerals and detailed description is omitted. In the character string search device 40 shown in FIG. 5, the auxiliary storage device 46 is external to the search processing device 45, but it may instead be built into the search processing device 45.

As described above, the auxiliary storage device 46 stores the erroneously recognized character table Etbl and the character string search program Spg. The CPU 111 in the search processing device 45 reads the character string search program Spg stored in the auxiliary storage device 46 into the memory 112 and executes it, thereby realizing the character string search process described later. The character string search program Spg is provided stored on a computer-readable recording medium (non-transitory recording medium) such as a DVD. That is, the user, for example, purchases an optical disc 17 as the recording medium of the character string search program Spg, inserts it into the optical disc drive 13, reads the character string search program Spg from the optical disc 17, and installs it on the auxiliary storage device 46. Alternatively, the character string search program Spg transmitted via the Internet 5 may be received by the network interface unit 117 and installed on the auxiliary storage device 46.

The search processing device 45 receives the input character string from the search terminal device 30 via the Internet 5 and, on the basis of it, searches the text data Dtx stored on the NAS 48 as the OCR result. In doing so, it searches not only with the input character string itself as a search term but also, as shown in FIG. 7, with new search terms created from the input character string using the erroneously recognized character table Etbl (details are described later). Two examples of the character string search process executed by the search processing device 45, shown in FIG. 8 and FIG. 9, are described below. The search processing device 45 may be configured so that it can selectively execute either the character string search process shown in FIG. 8 or the one shown in FIG. 9, with the user specifying by an input operation which of the two is executed.

<3.2.1 Example of Character String Search Process>
FIG. 8 is a flowchart showing an example of the character string search process executed in the search processing device 45. When this character string search process is executed, the CPU 111 of the PC serving as the hardware of the search processing device 45 operates as described below on the basis of the character string search program Spg.

When the character string search program Spg corresponding to the character string search process of FIG. 8 is started, the CPU 111 waits until it receives an input character string for the search from the search terminal device 30. On receiving the input character string (step S201), it focuses on one of the unfocused characters in the input character string (step S203). At the time the input character string is received, all of its characters are in the unfocused state.

Next, it is determined whether the focused character is registered in the erroneously recognized character table Etbl (step S204). When the table Etbl shown in FIG. 4 is used, it is determined whether the focused character is any of the characters registered in the table as character 1 ("ソ", "タ", "高", ..., "リ", ...), as character 2 ("ン", "タ", "▲高▼ (hashigo-daka)", ..., "ソ", ...), or as character 3 ("ク", ..., "ン", ...). Here, the variant form of "高" known as "hashigo-daka" (the character assigned Unicode "9AD9") is written "▲高▼" for convenience (the same applies below). If the determination in step S204 finds that the focused character is registered in the table Etbl, the process proceeds to step S206. If the focused character is not registered in the table Etbl, the process proceeds to step S208.

When the process proceeds to step S206, the probability that the focused character is misrecognized by the OCR device 80 is regarded as exceeding the allowable range, the focused character in the input character string is replaced with a wildcard ("?") (step S206), and the process then proceeds to step S208.

In step S208, it is determined whether any unfocused character remains in the input character string. If one remains, the process returns to step S203. Steps S203 to S208 are then repeated until no unfocused character remains in the input character string, whereupon the process proceeds to step S210. At this point, every character of the input character string that is registered in the erroneously recognized character table Etbl has been regarded as having a probability of misrecognition by the OCR device 80 that exceeds the allowable range, and has been replaced with a wildcard.

In step S210, the input character string in this state is used as the search term, and the text data Dtx stored on the NAS 48 as the search target is searched for character strings matching that search term.

Thereafter, data indicating the search results is sent to the search terminal device 30 via the Internet 5 so that the results of the above search are displayed on the search terminal device 30 (step S212). As a result, the search terminal device 30 displays, for example, the sentences or paragraphs of the text data Dtx that contain a character string matching the input character string as of step S210 (for example, the character string of search term 2 shown in FIG. 7), with that character string highlighted.

According to the character string search process described above, if, for example, the input character string (search term 1) received from the search terminal device 30 is "ベンチャー" ("venture") as shown in FIG. 7 (step S201), then, because the character "ン" is registered in the erroneously recognized character table as shown in FIG. 4, step S210 searches the text data Dtx for character strings matching the input character string "ベ?チャー" as search term 2. As a result, even if the character "ン" in the word "ベンチャー" was misrecognized so that the text data Dtx obtained as the OCR result contains "ベソチャー", sentences or paragraphs containing "ベソチャー" are displayed as search results, because that string matches the wildcard search term "ベ?チャー". A "?" in a search term represents any single character (the same applies below).

Similarly, if the input character string (search term 1) received from the search terminal device 30 is "高島" (Takashima) as shown in FIG. 7 (step S201), then, because the character "高" is registered in the erroneously recognized character table as shown in FIG. 4, step S210 searches the text data Dtx for character strings matching the input character string "?島" as search term 2. As a result, even if the character "高" in the word "高島" was misrecognized so that the text data Dtx obtained as the OCR result contains "▲高▼島", sentences or paragraphs containing "▲高▼島" are displayed as search results, because that string matches the wildcard search term "?島".

In this way, search omissions can be suppressed even when the search target is text data Dtx obtained as an OCR result and contains characters misrecognized by the OCR device 80. Moreover, a character is registered in the erroneously recognized character table Etbl, which is used to create the wildcard search terms in the character string search process of FIG. 8, whenever the OCR device misrecognizes it in an image printed in any of various print forms differing in printer, font, or paper type. Search omissions can therefore be reliably suppressed even when the OCR target images from which the text data Dtx is created were printed in different print forms.
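The wildcard substitution of steps S203 to S208 and the matching of step S210 can be illustrated with the following Python sketch. It is not the implementation of the character string search program Spg: the names `TABLE_CHARS`, `to_wildcard_query`, and `search` are hypothetical, the character set is limited to the characters of FIG. 4 that are quoted in the text, and a regular expression is used here simply as one convenient way to let "?" stand for any single character.

```python
import re

# Characters quoted in the text as registered in the table of FIG. 4
# (character 1, 2, or 3). "\u9ad9" is the hashigo-daka variant of 高.
TABLE_CHARS = {"ソ", "ン", "タ", "高", "\u9ad9", "リ", "ク"}


def to_wildcard_query(query, table_chars):
    """Steps S203 to S208: replace every registered character with '?'."""
    return "".join("?" if ch in table_chars else ch for ch in query)


def search(query, text, table_chars):
    """Step S210: find substrings of text matching the wildcard query.

    The pattern is built from the original query so that a literal '?'
    typed by the user is not mistaken for a wildcard.
    """
    pattern = re.compile(
        "".join("." if ch in table_chars else re.escape(ch) for ch in query),
        re.DOTALL,  # let the wildcard match any single character
    )
    return [m.group(0) for m in pattern.finditer(text)]


print(to_wildcard_query("ベンチャー", TABLE_CHARS))  # ベ?チャー
print(search("ベンチャー", "新しいベソチャー企業", TABLE_CHARS))  # ['ベソチャー']
print(search("高島", "\u9ad9島屋", TABLE_CHARS) == ["\u9ad9島"])  # True
```

Both examples of FIG. 7 are covered: the misrecognized "ベソチャー" is found for the query "ベンチャー", and the hashigo-daka spelling is found for the query "高島". A correctly recognized occurrence also matches, since the wildcard accepts the original character as well.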

<3.2.2 Another Example of Character String Search Process>
FIG. 9 is a flowchart showing another example of the character string search process executed in the search processing device 45. When this character string search process is executed, the CPU 111 of the PC serving as the hardware of the search processing device 45 operates as described below on the basis of the character string search program Spg. Steps of this example that are identical to those of the character string search process shown in FIG. 8 are given the same step numbers, and detailed description of them is omitted.

図９の文字列検索処理に対応する文字列検索プログラムＳｐｇが起動されると、ＣＰＵ１１１は、検索用端末装置３０から検索のための入力文字列を受け取るまで待機する状態となり、入力文字列を受け取ると（ステップＳ２０１）、当該入力文字列を１つの検索語として、検索に使用すべき検索語群に含める（ステップＳ２０２）。なお、文字列検索プログラムが起動された後、ステップＳ２０２が実行される直前では、当該検索語群にはいずれの検索語も含まれていない。 When the character string search program Spg corresponding to the character string search process of FIG. 9 is activated, the CPU 111 waits until it receives an input character string for searching from the search terminal device 30, and receives the input character string. (Step S201), the input character string is included as one search term in the search term group to be used for searching (Step S202). After the character string search program is activated and immediately before step S202 is executed, the search term group does not include any search term.

次に、当該入力文字列における未着目の文字のいずれかに着目し（ステップＳ２０３）、着目文字が誤認識文字テーブルＥｔｂｌに登録されているか否かを判定する（ステップＳ２０４）。この判定の結果、着目文字が誤認識文字テーブルＥｔｂｌに登録されている場合にはステップＳ２２０へ進み、着目文字が誤認識文字テーブルＥｔｂｌに登録されていない場合にはステップＳ２２２へ進む。 Next, any unfocused character in the input character string is focused (step S203), and it is determined whether or not the focused character is registered in the erroneously recognized character table Etbl (step S204). As a result of this determination, if the character of interest is registered in the erroneously recognized character table Etbl, the process proceeds to step S220, and if the character of interest is not registered in the erroneously recognized character table Etbl, the process proceeds to step S222.

ステップＳ２２０へ進んだ場合、着目文字はＯＣＲ装置８０により誤認識される可能性が許容範囲を超えるとみなし、検索語群に含まれる各検索語における着目文字を、誤認識文字テーブルにより当該着目文字に対応付けられる他の文字に置き換えることにより、検索語を新たに作成して検索語群に含める。その後、ステップＳ２２２へ進む。 When the process proceeds to step S220, the character of interest is regarded as having a likelihood of being erroneously recognized by the OCR device 80 that exceeds the allowable range. For each search term in the search term group, a new search term is created by replacing the character of interest with another character associated with it in the erroneously recognized character table, and the new search term is added to the search term group. The process then proceeds to step S222.

ステップＳ２２２では、入力文字列に未着目の文字があるか否かを判定する。この判定の結果、入力文字列に未着目の文字がある場合には、ステップＳ２０３へ戻る。以降、入力文字列において未着目の文字がなくなるまでステップＳ２０３～Ｓ２２２を繰り返し実行し、未着目の文字がなくなると、ステップＳ２２４へ進む。この時点では、入力文字列に含まれる文字のうち誤認識文字テーブルＥｔｂｌに登録されている文字は、いずれもＯＣＲ装置８０により誤認識される可能性が許容範囲を超えるとみなされ、入力文字列において当該登録されている文字のすくなくとも１つを誤認識文字テーブルＥｔｂｌにより対応付けられる他の文字にそれぞれ置き換えることにより得られる検索語の全てが、新たに作成されて検索語群に含められている。 In step S222, it is determined whether or not there is an unfocused character in the input character string. As a result of this determination, if there is an unfocused character in the input character string, the process returns to step S203. After that, steps S203 to S222 are repeatedly executed until there are no unfocused characters in the input character string, and when there are no unfocused characters, the process proceeds to step S224. At this point, the possibility of erroneous recognition by the OCR device 80 of any of the characters included in the input character string and registered in the erroneously recognized character table Etbl is deemed to exceed the allowable range, and the input character string All of the search words obtained by replacing at least one of the registered characters with other characters associated by the misrecognized character table Etbl are newly created and included in the search word group. .

ステップＳ２２４では、検索語群におけるいずれかの検索語に一致する文字列を、ＮＡＳ４８に格納された検索対象としてのテキストデータＤｔｘの中から検索する。 In step S224, text data Dtx as a search target stored in NAS 48 is searched for a character string that matches any search term in the search term group.

その後、上記検索による検索結果が検索用端末装置３０で表示されるように、当該検索結果を示すデータをインターネット５を介して検索用端末装置３０に送る（ステップＳ２２６）、例えば、検索語群における各検索語につき、テキストデータＤｔｘにおいて当該検索語（例えば図７に示す検索語３の文字列）に一致する文字列を含む文または段落等が当該文字列をハイライト状態にして表示される。 After that, data indicating the search results is sent to the search terminal device 30 via the Internet 5 so that the search results of the search are displayed on the search terminal device 30 (step S226). For each search term, sentences or paragraphs including a character string matching the search term (for example, the character string of search term 3 shown in FIG. 7) in the text data Dtx are displayed with the character string highlighted.

上記のような文字列検索処理によれば、例えば図７に示すように、検索用端末装置３０から受け取る入力文字列（検索語１）が「ベンチャー」であるとすると（ステップＳ２０１）、図４に示すように、文字「ン」が誤認識文字テーブルＥｔｂｌにおいて登録されているとともに、文字「ン」が誤認識文字テーブルＥｔｂｌにより文字「ソ」と対応付けられているので、ステップＳ２２４では、図７において“べ（ン｜ソ）チャー”（検索語３）として表記されている２つの文字列「ベンチャー」および「ベソチャー」のそれぞれにつき、一致する文字列がテキストデータＤｔｘの中から検索される。これにより、ＯＣＲ結果としてのテキストデータＤｔｘにおいて、「ベンチャー」という語における文字「ン」が誤認識されて「ベソチャー」として含まれている場合であっても、検索語群に含まれる１つの検索語に一致する文字列として「ベソチャー」を含む文または段落等が検索結果として表示される。 According to the character string search process described above, suppose, for example as shown in FIG. 7, that the input character string (search term 1) received from the search terminal device 30 is "ベンチャー" (step S201). As shown in FIG. 4, the character "ン" is registered in the erroneously recognized character table Etbl and is associated there with the character "ソ". In step S224, therefore, the text data Dtx is searched for character strings matching each of the two strings "ベンチャー" and "ベソチャー", written as "ベ(ン|ソ)チャー" (search term 3) in FIG. 7. As a result, even if the character "ン" in the word "ベンチャー" was erroneously recognized and the text data Dtx obtained as the OCR result contains "ベソチャー" instead, sentences or paragraphs containing "ベソチャー" are displayed as search results, because that string matches one of the search terms in the search term group.

また、例えば図７に示すように、検索用端末装置３０から受け取る入力文字列（検索語１）が「高島」であるとすると（ステップＳ２０１）、図４に示すように、文字「高」が誤認識文字テーブルＥｔｂｌにおいて登録されているとともに、文字「高」が誤認識文字テーブルＥｔｂｌにより文字「▲高▼」と対応付けられているので（図４のＩＤ＝３の行を参照されたい）、ステップＳ２２４では、図７において“（高｜▲高▼）島”（検索語３）として表記されている２つの文字列「高島」および「▲高▼島」のそれぞれにつき、一致する文字列がテキストデータＤｔｘの中から検索される。 Likewise, as shown in FIG. 7, suppose the input character string (search term 1) received from the search terminal device 30 is "高島" (step S201). As shown in FIG. 4, the character "高" is registered in the erroneously recognized character table Etbl and is associated there with the character "▲高▼" (see the row with ID=3 in FIG. 4). In step S224, therefore, the text data Dtx is searched for character strings matching each of the two strings "高島" and "▲高▼島", written as "(高|▲高▼)島" (search term 3) in FIG. 7.
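The construction of the search term group in FIG. 9 (steps S202 to S224) can be sketched as follows. This is only an illustration under stated assumptions: the dictionary `ETBL`, the function names, and the plain substring search are hypothetical, each associated character is assumed to be a single character, and "髙" again stands in for the variant written "▲高▼".

```python
# Hypothetical stand-in for Etbl: registered character -> the other
# characters associated with it (what OCR tends to output instead).
ETBL = {"ン": ["ソ"], "高": ["髙"]}


def expand_search_terms(query, table=ETBL):
    """Build the search term group (steps S202 to S222 of FIG. 9)."""
    terms = [query]  # S202: the input string itself is the first search term
    for i, ch in enumerate(query):  # S203: focus on each character in turn
        if ch not in table:  # S204: not registered, so nothing is added
            continue
        # S220: for every term collected so far, add a copy in which the
        # focused character is replaced by each associated character.
        new_terms = [t[:i] + alt + t[i + 1:]
                     for t in terms
                     for alt in table[ch]]
        terms.extend(new_terms)
    return terms


def search(query, text, table=ETBL):
    """S224: return the search terms of the group that occur in the text."""
    return [t for t in expand_search_terms(query, table) if t in text]


if __name__ == "__main__":
    print(expand_search_terms("ベンチャー"))
    print(search("ベンチャー", "ベソチャー企業への投資"))
```

Because each registered character multiplies the number of terms, a query with several registered characters yields every combination of original and substituted characters, which matches the description of the search term group at the end of the loop.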

このように、図９の文字列検索処理によれば、ＯＣＲ結果としてのテキストデータＤｔｘが検索対象であって、その中にＯＣＲ装置８０により誤認識された文字が含まれる場合であっても検索漏れを抑制することができ、図８の文字列検索処理と同様の効果が得られる。ただし、図９の文字列検索処理では、入力文字列における文字のうち誤認識文字テーブルＥｔｂｌに登録されている文字がワイルドカードに置き換えられるのではなく、当該文字が誤認識文字テーブルＥｔｂｌによりそれに対応付けられる他の文字に置き換えられることから、不適切または余分な検索結果の出力を抑えるという点で、図９の文字列検索処理は図８の文字列検索処理よりも有利である。一方、検索漏れを抑制するという点では、図８の文字列検索処理は図９の文字列検索処理よりも有利である。 Thus, with the character string search process of FIG. 9, search omissions can be suppressed even when the search target is text data Dtx obtained as an OCR result and it contains characters erroneously recognized by the OCR device 80, so the same effect as the process of FIG. 8 is obtained. In the process of FIG. 9, however, a character of the input character string that is registered in the erroneously recognized character table Etbl is replaced not with a wildcard but with the other characters associated with it in the table. The process of FIG. 9 is therefore more advantageous than that of FIG. 8 in suppressing inappropriate or superfluous search results. Conversely, the process of FIG. 8 is more advantageous than that of FIG. 9 in suppressing search omissions.

＜４．ＯＣＲ調整量テーブルとＯＣＲ処理＞ <4. OCR Adjustment Amount Table and OCR Processing>
本実施形態に係る誤認識文字テーブル（図４参照）は、ＯＣＲ装置による文字認識の精度（以下「ＯＣＲ精度」という）を向上させるためのＯＣＲ調整量テーブルを作成するために使用することができる。このＯＣＲ調整量テーブルは、ＯＣＲ対象画像の印刷に使用されるプリンタや、フォント、用紙の種類等により特定される印刷形態に対し、ＯＣＲ対象画像を読み取って文字を認識する処理（以下「ＯＣＲ処理」という）における適切な調整量を対応付けるテーブルであり、例えば図１１に示すように構成されている。 The erroneously recognized character table according to this embodiment (see FIG. 4) can be used to create an OCR adjustment amount table for improving the accuracy of character recognition by an OCR device (hereinafter "OCR accuracy"). The OCR adjustment amount table associates each print form, specified by the printer, font, paper type, and the like used to print an OCR target image, with an appropriate adjustment amount for the process of reading the OCR target image and recognizing characters (hereinafter "OCR processing"). It is configured, for example, as shown in FIG. 11.

このようなＯＣＲ調整量テーブルは、図２に示すように構成されたコンピュータ１０で所定プログラムに基づき図１０に示すようなＯＣＲ調整量テーブル作成処理を実行することにより作成することができる。以下、図１０を参照して、このＯＣＲ調整量テーブル作成処理につき説明する。なお、このＯＣＲ調整量テーブル作成処理の開始時において、使用可能なプリンタはいずれも未使用状態であり、使用可能な各プリンタでの印刷に使用可能な用紙はいずれも未使用状態であり、各プリンタで使用可能なフォントも全て未使用状態であるものとする。また、図１０において「ＯＣＲ調整量」とは、ＯＣＲ装置においてスキャナで読み取った画像に対して文字認識のために施される画像処理におけるいずれか１つの調整量であり、例えば、読み取った画像に含まれる文字の画像に対する縦方向調整量（縦方向の太らせまたは細らせの量）または横方向調整量（横方向の太らせまたは細らせの量）等である（図１１参照）。 Such an OCR adjustment amount table can be created by executing an OCR adjustment amount table creating process as shown in FIG. 10 based on a predetermined program in the computer 10 configured as shown in FIG. The OCR adjustment amount table creating process will be described below with reference to FIG. At the start of this OCR adjustment amount table creation process, all available printers are in an unused state, and all available paper for printing in each available printer is in an unused state. It is assumed that all fonts available in the printer are unused. Further, in FIG. 10, the "OCR adjustment amount" is any one adjustment amount in the image processing performed for character recognition on the image read by the scanner in the OCR device. The amount of adjustment in the vertical direction (the amount of vertical thickening or thinning) or the amount of horizontal adjustment (the amount of horizontal thickening or thinning) for the included character image (see FIG. 11).

図１０のＯＣＲ調整量テーブル作成処理では、まず、ＯＣＲ対象画像（テキストを含む画像）の印刷に使用可能なプリンタのうち未使用のいずれかのプリンタを使用プリンタＰｓとして設定する（ステップＳ３０１）。次に、使用プリンタＰｓで使用可能なフォントのうち未使用のいずれかのフォントを使用フォントＦｓとして設定する（ステップＳ３０２）。続いて、使用プリンタＰｓで使用可能な種類の用紙のうち未使用のいずれかの種類の用紙を使用用紙Ｓｓとして設定する（ステップＳ３０３）。 In the OCR adjustment amount table creation process of FIG. 10, first, any unused printer among printers that can be used to print an OCR target image (image including text) is set as the printer to be used Ps (step S301). Next, any unused font among the fonts that can be used in the used printer Ps is set as the used font Fs (step S302). Subsequently, one of the types of paper that can be used by the printer Ps that is not used is set as the paper to be used Ss (step S303).

次に、ＯＣＲ調整量を予め決められた最小値に設定する（ステップＳ３０４）。 Next, the OCR adjustment amount is set to a predetermined minimum value (step S304).

その後、ＯＣＲ対象文字（ＯＣＲ装置の認識対処となり得る全ての文字）のうち誤認識文字テーブルに登録されている文字をＯＣＲ調整対象文字とし、ＯＣＲ調整対象文字のうち未着目のいずれかの文字に着目する（ステップＳ３０６）。なお、このＯＣＲ調整量テーブル作成処理の開始後、最初にステップＳ３０６が実行される直前では、ＯＣＲ調整対象文字は全て未着目状態である。 After that, among the OCR target characters (all characters that can be recognized by the OCR device), the characters registered in the erroneously recognized character table are set as OCR adjustment target characters, and one of the OCR adjustment target characters that is not focused is selected. Pay attention (step S306). Note that immediately before step S306 is executed for the first time after the start of the OCR adjustment amount table creation processing, all the OCR adjustment target characters are in an unfocused state.

次に、着目文字を使用プリンタＰｓにより使用フォントＦｓを使用して印刷し、ＯＣＲ装置（例えば図５に示すＯＣＲ装置８０）によりその印刷された着目文字を画像として読み取ってパターン認識で文字を決定し、当該文字（のコードデータ）をＯＣＲ結果文字として出力する（ステップＳ３０８）。 Next, the character of interest is printed by the printer Ps to be used using the font Fs to be used, the printed character of interest is read as an image by an OCR device (for example, the OCR device 80 shown in FIG. 5), and the character is determined by pattern recognition. Then, the character (code data thereof) is output as an OCR result character (step S308).

その後、着目文字（着目したＯＣＲ対象文字）とそれに対応するＯＣＲ結果文字とを比較し、両文字が一致しているか否かを示すデータを比較結果として保存する（ステップＳ３１０）。 Thereafter, the target character (the focused OCR target character) is compared with the corresponding OCR result character, and data indicating whether or not the two characters match is stored as the comparison result (step S310).

次に、未着目のＯＣＲ調整対象文字があるか否かを判定する（ステップＳ３１２）。この判定の結果、未着目のＯＣＲ調整対象文字がある場合にはステップＳ３０６へ戻る。以降、未着目のＯＣＲ調整対象文字がなくなるまでステップＳ３０６～Ｓ３１２を繰り返し実行し、未着目のＯＣＲ調整対象文字がなくなると、ステップＳ３１４へ進む。 Next, it is determined whether or not there is an unfocused OCR adjustment target character (step S312). As a result of this determination, if there is an unfocused OCR adjustment target character, the process returns to step S306. Thereafter, steps S306 to S312 are repeatedly executed until there are no more unfocused OCR adjustment target characters, and when there are no more unfocused OCR adjustment target characters, the process proceeds to step S314.

ステップＳ３１４では、ＯＣＲ調整量を予め決められた調整単位量だけ増大させ、その後、ＯＣＲ調整量が予め決められた最大値を超えたか否かを判定する（ステップＳ３１６）。この判定の結果、ＯＣＲ調整量が当該最大値を超えていない場合には、全てのＯＣＲ調整対象文字を未着目状態とし（ステップＳ３１７）、ステップＳ３０６へ戻る。以降、ＯＣＲ調整量が当該最大値を超えるまでステップＳ３０６～Ｓ３１７を繰り返し実行し、ＯＣＲ調整量が当該最大値を超えるとステップＳ３１８へ進む。 In step S314, the OCR adjustment amount is increased by a predetermined adjustment unit amount, and then it is determined whether or not the OCR adjustment amount exceeds a predetermined maximum value (step S316). As a result of this determination, if the OCR adjustment amount does not exceed the maximum value, all the OCR adjustment target characters are set to the unfocused state (step S317), and the process returns to step S306. After that, steps S306 to S317 are repeatedly executed until the OCR adjustment amount exceeds the maximum value, and when the OCR adjustment amount exceeds the maximum value, the process proceeds to step S318.

ステップＳ３１８へ進んだ時点では、上記最小値から上記最大値までの範囲における上記調整単位量間隔での各ＯＣＲ調整量につき、各ＯＣＲ調整対象文字とそれに対応するＯＣＲ結果文字との比較結果が保存されている。そこで、これらの比較結果に基づき最良のＯＣＲ調整量を求め、当該最良のＯＣＲ調整量を、使用プリンタＰｓ、使用フォントＦｓ、および、使用用紙Ｓｓにより特定される印刷形態と対応付けてＯＣＲ調整量テーブルに登録する（ステップＳ３１８）。ここで、上記最小値から上記最大値までの範囲における上記調整単位量間隔でのＯＣＲ調整量のうち、各ＯＣＲ調整対象文字とそれに対応するＯＣＲ結果文字とからなる文字対のうち互いに一致する文字対の数が最も多いＯＣＲ調整量を、最良のＯＣＲ調整量とみなすものとする。 When proceeding to step S318, for each OCR adjustment amount in the adjustment unit amount interval in the range from the minimum value to the maximum value, the comparison result between each OCR adjustment target character and the corresponding OCR result character is stored. It is Therefore, the best OCR adjustment amount is obtained based on these comparison results, and the OCR adjustment amount is associated with the printing form specified by the used printer Ps, the used font Fs, and the used paper Ss. Register in the table (step S318). Here, in the OCR adjustment amount at the adjustment unit amount interval in the range from the minimum value to the maximum value, characters that match each other in a character pair consisting of each OCR adjustment target character and the corresponding OCR result character The OCR adjustment with the highest number of pairs shall be considered the best OCR adjustment.

その後、使用プリンタＰｓで使用可能な全ての種類の用紙が使用されたか否かを判定する（ステップＳ３２０）。この判定の結果、全ての種類の用紙が使用されてはいない場合には、全てのＯＣＲ調整対象文字を未着目状態とし（ステップＳ３２１）、ステップＳ３０３へ戻る。以降、全ての種類の用紙が使用されるまでステップＳ３０３～Ｓ３２１を繰り返し実行し、全ての種類の用紙が使用されると、ステップＳ３２２へ進む。 After that, it is determined whether or not all types of paper that can be used by the used printer Ps have been used (step S320). As a result of this determination, if not all types of paper have been used, all the OCR adjustment target characters are put in an unfocused state (step S321), and the process returns to step S303. Thereafter, steps S303 to S321 are repeatedly executed until all types of paper are used, and when all types of paper are used, the process advances to step S322.

ステップＳ３２２では、使用プリンタＰｓで使用可能な全てのフォントが使用されたか否かを判定する。この判定の結果、全てのフォントが使用されてはいない場合には、全てのＯＣＲ調整対象文字を未着目状態とし、使用プリンタＰｓで使用可能な全ての種類の用紙を未使用状態として（ステップＳ３２３）、ステップＳ３０２へ戻る。以降、全てのフォントが使用されるまでステップＳ３０２～Ｓ３２３を繰り返し実行し、全てのフォントが使用されると、ステップＳ３２４へ進む。 In step S322, it is determined whether or not all fonts available for the printer Ps used have been used. As a result of this determination, if all the fonts are not used, all the OCR adjustment target characters are put in an unfocused state, and all types of paper that can be used in the used printer Ps are put in an unused state (step S323). ), and the process returns to step S302. After that, steps S302 to S323 are repeatedly executed until all fonts are used, and when all fonts are used, the process proceeds to step S324.

ステップＳ３２４では、ＯＣＲ対象画像の印刷に使用可能な全てのプリンタが使用されたか否かを判定する。この判定の結果、全てのプリンタが使用されてはいない場合には、全てのＯＣＲ調整対象文字を未着目状態とし、使用プリンタＰｓで使用可能な全ての種類の用紙および全てのフォントを未使用状態として（ステップＳ３２５）、ステップＳ３０１へ戻る。以降、全てのプリンタが使用されるまでステップＳ３０１～Ｓ３２５を繰り返し実行し、全てのプリンタが使用されると、ＯＣＲ調整量テーブル作成処理を終了する。 In step S324, it is determined whether or not all printers available for printing the OCR target image have been used. As a result of this determination, if not all the printers are being used, all the OCR adjustment target characters are put in the unfocused state, and all types of paper and all fonts that can be used in the used printer Ps are put in the unused state. (step S325) and returns to step S301. After that, steps S301 to S325 are repeatedly executed until all printers are used, and when all printers are used, the OCR adjustment amount table creation processing ends.
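The nested loops of FIG. 10 amount to an exhaustive sweep over print forms and adjustment amounts. The following Python sketch illustrates that sweep under several assumptions that are not in the source: `recognize` is a hypothetical callback standing in for "print the character, scan it, and run OCR with the given adjustment amount", every printer is assumed to offer the same fonts and paper types, and ties between adjustment amounts are resolved in favor of the smallest amount (the text does not say how ties are handled).

```python
from itertools import product


def build_adjustment_table(printers, fonts, papers, target_chars,
                           recognize, min_adj, max_adj, step):
    """Return {(printer, font, paper): best adjustment amount}.

    recognize(ch, printer, font, paper, adj) is a hypothetical callback
    that returns the character the OCR device outputs (step S308).
    """
    table = {}
    # S301 to S303: visit every combination of printer, font, and paper.
    for printer, font, paper in product(printers, fonts, papers):
        best_adj, best_hits = None, -1
        adj = min_adj  # S304: start from the predetermined minimum
        while adj <= max_adj:  # S316: stop once the maximum is exceeded
            # S306 to S312: count the adjustment target characters whose
            # OCR result character matches the printed character (S310).
            hits = sum(recognize(ch, printer, font, paper, adj) == ch
                       for ch in target_chars)
            if hits > best_hits:  # S318: keep the amount with most matches
                best_adj, best_hits = adj, hits
            adj += step  # S314: increase by the adjustment unit amount
        table[(printer, font, paper)] = best_adj
    return table
```

Restricting `target_chars` to the characters registered in the erroneously recognized character table, as the text describes, keeps the number of print-and-scan trials far below what a sweep over all OCR target characters would need.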

上記では、１つのＯＣＲ調整量についてのＯＣＲ調整量テーブル作成処理（図１０）を説明したが、他のＯＣＲ調整量についても同様の処理によりＯＣＲ調整テーブルを作成することができる。例えば、既述の縦調整量と横調整量とのそれぞれにつきＯＣＲ調整量テーブル作成処理を実行し、その実行結果を１つのテーブルにまとめると、図１１に示すようなＯＣＲ調整量テーブルが得られる。 Although the OCR adjustment amount table creation processing (FIG. 10) for one OCR adjustment amount has been described above, OCR adjustment tables can be created for other OCR adjustment amounts by similar processing. For example, when the OCR adjustment amount table creation process is executed for each of the vertical adjustment amount and the horizontal adjustment amount described above and the execution results are compiled into one table, an OCR adjustment amount table as shown in FIG. 11 is obtained. .

図１１に示すように、このＯＣＲ調整量テーブルでは、ＯＣＲ対象画像の印刷時の出力条件毎に、すなわちＯＣＲ対象画像の印刷に使用されるプリンタ、フォント、および用紙の種類により特定される印刷形態毎に、ＯＣＲ装置により高精度に文字を認識できるＯＣＲ調整量（図１１に示す例では文字の画像に対する縦調整量および横調整量）が示されている。 As shown in FIG. 11, in this OCR adjustment amount table, the print mode specified for each output condition when printing the OCR target image, that is, the printer, font, and paper type used for printing the OCR target image. For each, the OCR adjustment amount (the vertical adjustment amount and the horizontal adjustment amount for the character image in the example shown in FIG. 11) that enables the OCR device to recognize the character with high accuracy is shown.

図１２は、このようなＯＣＲ調整量テーブルを用いてＯＣＲ装置により画像から文字を認識してテキストデータを生成するためのＯＣＲ処理を示すフローチャートである。 FIG. 12 is a flow chart showing OCR processing for generating text data by recognizing characters from an image with an OCR device using such an OCR adjustment amount table.

このＯＣＲ処理では、まず、ＯＣＲ対象画像の印刷時の出力条件、具体的には、ＯＣＲ対象画像の印刷に使用されたプリンタ、フォント、および用紙の種類により特定される印刷形態を取得する（ステップＳ４０２）。次に、ＯＣＲ調整量テーブルからこの出力条件（印刷形態）に対応するＯＣＲ調整量を取得する（ステップＳ４０４）。その後、当該ＯＣＲ調整量をＯＣＲ装置において設定して、当該ＯＣＲ装置によりＯＣＲを実行する（ステップＳ４０６）。すなわち、当該ＯＣＲ装置により、ＯＣＲ対象画像を読み取ってパターン認識で当該ＯＣＲ対象画像から文字を特定してテキストデータを生成する。 In this OCR processing, first, the output conditions for printing the OCR target image, specifically, the print form specified by the printer, font, and paper type used to print the OCR target image are acquired (step S402). Next, the OCR adjustment amount corresponding to this output condition (printing form) is obtained from the OCR adjustment amount table (step S404). Thereafter, the OCR adjustment amount is set in the OCR device, and OCR is performed by the OCR device (step S406). That is, the OCR device reads an OCR target image, identifies characters from the OCR target image by pattern recognition, and generates text data.

図１１のＯＣＲ調整量テーブルを使用するものとすると、このようなＯＣＲ処理によれば、ステップＳ４０２で取得される出力条件がプリンタＰ１、ゴシックのフォント、および、用紙Ｓ１で特定される印刷形態に相当する場合、縦調整量として“－１［ｐｉｘ］”という細らせ量が、横調整量として“＋１［ｐｉｘ］”という太らせ量が、ＯＣＲ装置にそれぞれ設定されてＯＣＲが実行される。このようにして、ＯＣＲ対象画像の印刷時の出力条件に応じて適切なＯＣＲ調整量がＯＣＲ装置に設定されるので、当該ＯＣＲ装置により高い精度で文字を認識することができる。 Assuming that the OCR adjustment amount table of FIG. 11 is used, according to such OCR processing, the output conditions acquired in step S402 are the printer P1, the Gothic font, and the printing form specified by the paper S1. If so, a thinning amount of "-1 [pix]" as the vertical adjustment amount and a thickening amount of "+1 [pix]" as the horizontal adjustment amount are set in the OCR device and OCR is executed. . In this manner, an appropriate OCR adjustment amount is set in the OCR device according to the output conditions when printing the OCR target image, so that the OCR device can recognize characters with high accuracy.
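The OCR processing of FIG. 12 reduces to a table lookup followed by an OCR call. The sketch below is illustrative only: `ocr_engine` is a hypothetical callable, the keyword names `vertical` and `horizontal` are assumptions, and the single table row reproduces the example values given in the text for printer P1, the Gothic font, and paper S1.

```python
# Print form (printer, font, paper) -> adjustment amounts in pixels.
# Only this one row is taken from the example in the text.
OCR_ADJUSTMENT_TABLE = {
    ("P1", "Gothic", "S1"): {"vertical": -1, "horizontal": +1},
}


def run_ocr(image, print_form, ocr_engine, table=OCR_ADJUSTMENT_TABLE):
    """Look up the adjustment amounts for a print form and run OCR.

    print_form corresponds to the output condition acquired in step S402.
    ocr_engine(image, vertical=..., horizontal=...) is a hypothetical
    callable that returns the recognized text.
    """
    adjustment = table[print_form]  # S404: raises KeyError if not registered
    return ocr_engine(image, **adjustment)  # S406: OCR with those amounts
```

How an unregistered print form should be handled is not stated in the text; the sketch simply lets the lookup fail.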

＜５．効果＞ <5. Effects>
以上のように、本実施形態に係る誤認識文字テーブルは、文字列検索装置に使用することができ、ＯＣＲ装置におけるＯＣＲ調整量を決定するためのＯＣＲ調整量テーブルの作成にも使用することができる。これにより、以下のような効果が得られる。 As described above, the erroneously recognized character table according to this embodiment can be used in a character string search device, and can also be used to create an OCR adjustment amount table for determining the OCR adjustment amount in an OCR device. This provides the following effects.

上記誤認識テーブルを使用した文字列検索装置では、ＯＣＲ結果を目視でチェックしなくとも、ＯＣＲ結果としてのテキストデータから文字列を高い精度で検索し検索漏れを抑制することができる。このため、ＯＣＲ結果としてのテキストデータの全てを目視で確認する必要がなくなり、このような確認作業によるコストが削減される。 In the character string search device using the misrecognition table, even if the OCR result is not visually checked, the character string can be searched with high accuracy from the text data as the OCR result, and the omission of search can be suppressed. Therefore, it is not necessary to visually confirm all the text data as the OCR result, and the cost for such confirmation work can be reduced.

また従来、既述の誤認識単語辞書に載っていない未知の単語を検索することは困難であったが、上記誤認識テーブルを使用した文字列検索装置では、未知の単語の検索も可能となる。すなわち、ＯＣＲ結果としてのテキストデータから文字列を検索する場合であっても、入力文字列のうちＯＣＲで誤認識され易い文字をワイルドカードまたは誤認識文字テーブルで当該文字に対応付けられる他の文字に置き換えることにより検索語が作成され（図８のステップＳ２０６、図９のステップＳ２２０）、これにより未知の単語も検索することができる。 Conventionally, it has been difficult to search for unknown words that are not listed in the misrecognition word dictionary. However, the character string search device using the misrecognition table makes it possible to search for unknown words. . That is, even when searching for a character string from text data as an OCR result, characters that are likely to be erroneously recognized by OCR in the input character string are replaced with wildcards or other characters associated with the characters in the erroneously recognized character table. (step S206 in FIG. 8, step S220 in FIG. 9), and thus unknown words can be retrieved.

また、従来において使用していた誤認識単語辞書が不要になることから、辞書の継続的更新も不要であり、ＯＣＲ装置におけるメンテナンスのコストが低減される。 In addition, since the conventionally used erroneously recognized word dictionary is no longer necessary, continuous updating of the dictionary is not required, and the maintenance cost of the OCR apparatus is reduced.

また、誤認識文字テーブルを使用することで、ＯＣＲ装置による文字認識の精度が高くなくても、文字列検索装置の検索精度を向上させることができる。 Further, by using the erroneously recognized character table, it is possible to improve the search accuracy of the character string search device even if the accuracy of character recognition by the OCR device is not high.

また、ＯＣＲ対象画像に含まれるテキストに異体字（例えば「高」と「▲高▼」）が含まれる場合であっても、誤認識文字テーブルを使用することで、ＯＣＲ結果としてのテキストデータを通常と同様に扱うことができる。 Furthermore, even when the text in an OCR target image contains variant characters (for example, "高" and "▲高▼"), using the erroneously recognized character table allows the text data obtained as the OCR result to be handled in the usual way.

さらに、誤認識文字テーブルを用いて作成されたＯＣＲ調整量テーブルをＯＣＲ装置において使用することで、ＯＣＲ装置による文字認識の精度を向上させることができる。 Furthermore, by using the OCR adjustment amount table created using the erroneously recognized character table in the OCR device, the accuracy of character recognition by the OCR device can be improved.

＜６．変形例＞ <6. Modifications>
本発明は上記実施形態に限定されるものではなく、本発明の範囲を逸脱しない限りにおいてさらに種々の変形を施すことができる。以下、上記実施形態に係る誤認識文字テーブルを使用して既述の文字列検索装置の変形例について説明する。 The present invention is not limited to the above embodiment, and various further modifications can be made without departing from the scope of the invention. Modifications of the character string search device described above, which uses the erroneously recognized character table according to the above embodiment, are described below.

＜６．１第１の変形例＞ <6.1 First Modification>
上記のように、図５に示す文字列検索装置において図８に示す文字列検索処理または図９に示す文字列検索処理が行われるが、これらの検索処理を組み合わせた文字列検索処理を行うようにしてもよい。すなわち、入力文字列のうち誤認識文字テーブルＥｔｂｌに登録されている文字をワイルドに置き換えて検索を行う処理と、入力文字列のうち誤認識文字テーブルＥｔｂｌに登録されている文字を誤認識文字テーブルＥｔｂｌにより当該文字に対応付けられる他の文字に置き換えて検索を行う処理とを組み合わせた文字列検索処理を行うようにしてもよい。 As described above, the character string search device shown in FIG. 5 performs the character string search process of FIG. 8 or that of FIG. 9, but it may instead perform a process that combines the two. That is, it may combine the process of searching after replacing characters of the input character string that are registered in the erroneously recognized character table Etbl with wildcards, and the process of searching after replacing such characters with the other characters associated with them in the table Etbl.

図１３は、このような変形例における文字列検索処理の一例を示すフローチャートである。この図１３の文字列検索処理のうちステップＳ２０１～Ｓ２２４は、図９の文字列検索処理におけるステップＳ２０１～Ｓ２２４とそれぞれ同一であるので、それらの説明を省略する。図１３の文字列検索処理では、入力文字列のうち誤認識文字テーブルＥｔｂｌに登録されている文字を誤認識文字テーブルＥｔｂｌにより着目文字に対応付けられる他の文字に置き換えて検索を行う処理（図９参照）によっては検索結果が得られない場合に、入力文字列のうち誤認識文字テーブルＥｔｂｌに登録されている文字をワイルドに置き換えて検索を行う。 FIG. 13 is a flowchart showing an example of the character string search process in this modification. Steps S201 to S224 in FIG. 13 are identical to steps S201 to S224 of the character string search process of FIG. 9, so their description is omitted. In the process of FIG. 13, when the search performed after replacing registered characters of the input character string with the other characters associated with them in the erroneously recognized character table Etbl (see FIG. 9) yields no results, a search is performed after replacing the registered characters with wildcards.

すなわち、ステップＳ２３０において、ステップＳ２０２～Ｓ２２２において作成される検索語群における少なくとも１つの検索語に一致する文字列が対象テキストデータＤｔｘにおいて見出せたか否かを判定する。この判定の結果、検索語群におけるいずれの検索語についてもそれに一致する文字列が対象テキストデータＤｔｘにおいて見出せない場合にはステップＳ２３２へ進み、検索語群における少なくとも１つの検索語に一致する文字列が見出せた場合にはステップＳ２３６へ進む。 That is, in step S230, it is determined whether or not a character string matching at least one search term in the search term group created in steps S202 to S222 has been found in the target text data Dtx. As a result of this determination, if no character string matching any of the search words in the search word group can be found in the target text data Dtx, the process proceeds to step S232, and a character string matching at least one search word in the search word group is found. is found, the process proceeds to step S236.

ステップＳ２３２では、ステップＳ２０１で受け取った入力文字列のうち誤認識文字テーブルＥｔｂｌの登録されている文字を全てワイルドカードに置き換えることにより、ワイルドカード検索語を作成する。 In step S232, a wildcard search term is created by replacing all the characters registered in the misrecognized character table Etbl in the input character string received in step S201 with wildcards.

次に、このワイルドカード検索語に一致する文字列を、ＮＡＳ４８に格納された検索対象のテキストデータＤｔｘの中から検索し（ステップＳ２３４）、その後、ステップＳ２３６へ進む。 Next, the search target text data Dtx stored in the NAS 48 is searched for a character string that matches this wildcard search term (step S234), and then the process proceeds to step S236.

ステップＳ２３６では、上記検索による検索結果が検索用端末装置３０で表示されるように、当該検索結果を示すデータをインターネット５を介して検索用端末装置３０に送る（ステップＳ２３６）。これにより、検索用端末装置３０において、例えば、検索対象としてのテキストデータＤｔｘのうち上記のいずれかの検索語に一致する文字列を含む文または段落等が当該文字列をハイライト状態にして表示される。 In step S236, data indicating the search results is sent to the search terminal device 30 via the Internet 5 so that the search results of the above search are displayed on the search terminal device 30 (step S236). As a result, in the search terminal device 30, for example, sentences or paragraphs including a character string matching any of the above search words in the text data Dtx to be searched are displayed with the character string highlighted. be done.

上記文字列検索処理では、上記検索語群における少なくとも１つの検索語に一致する文字列が対象テキストデータＤｔｘにおいて見出せた場合には、上記ワイルドカード検索語による検索は行われない（ステップＳ２３０）。一方、上記検索語群におけるいずれの検索語についてもそれに一致する文字列が対象テキストデータＤｔｘにおいて見出せない場合には、上記ワイルドカード検索語による検索が行われる。したがって、このような文字列検索処理によれば、不適切または余分な検索結果の出力を抑えつつ、検索漏れを確実に抑制することができる。 In the character string search process, if a character string matching at least one search term in the search term group is found in the target text data Dtx, the wildcard search term search is not performed (step S230). On the other hand, if no character string matching any of the search words in the search word group is found in the target text data Dtx, a search using the wildcard search words is performed. Therefore, according to such character string search processing, it is possible to reliably suppress search omissions while suppressing the output of inappropriate or redundant search results.
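The fallback logic of FIG. 13 can be sketched as follows. As before, this is a hedged illustration rather than the actual implementation: the table contents and function names are hypothetical, the wildcard is assumed to match exactly one character, and "髙" stands in for the variant written "▲高▼".

```python
import re

# Hypothetical table contents: registered character -> associated characters.
ETBL = {"ン": ["ソ"], "高": ["髙"]}


def substitution_terms(query, table):
    """Search term group of FIG. 9 (steps S202 to S222)."""
    terms = [query]
    for i, ch in enumerate(query):
        terms += [t[:i] + alt + t[i + 1:]
                  for t in terms
                  for alt in table.get(ch, [])]
    return terms


def wildcard_pattern(query, table):
    """S232: replace every registered character with a wildcard."""
    return re.compile(
        "".join("." if ch in table else re.escape(ch) for ch in query))


def combined_search(query, text, table=ETBL):
    # S224 and S230: search with the substituted terms first.
    hits = [t for t in substitution_terms(query, table) if t in text]
    if hits:
        return hits  # the wildcard search is skipped
    # S232 and S234: fall back to the wildcard term only if nothing matched.
    return wildcard_pattern(query, table).findall(text)


if __name__ == "__main__":
    print(combined_search("ベンチャー", "ベソチャー企業"))
    print(combined_search("ベンチャー", "ベシチャー企業"))
```

In the second call the OCR error is one that the table does not associate with "ン", so only the wildcard fallback finds it.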

＜６．２第２の変形例＞ <6.2 Second Modification>
上記実施形態に係る誤認識文字テーブルＥｔｂｌでは、そこに登録された文字に対し、当該文字を含む画像の印刷に使用されたプリンタ、フォント、および、用紙の種類により特定される印刷形態（出力条件）が対応付けられている。一方、図８、図９、および、図１３にそれぞれ示す文字列検索処理では、ＯＣＲ対象画像の印刷形態に関連する処理は含まれていない。しかし、ＯＣＲ検索結果としてのテキストデータＤｔｘを検索対象とする文字列検索において不適切または余分な検索結果の出力を抑えるべく、図５に示す文字列検索装置４０において実行すべき文字列検索処理においてＯＣＲ対象画像の印刷形態に関連する処理を含めることも考えられる。 In the erroneously recognized character table Etbl according to the above embodiment, each registered character is associated with the print form (output condition) specified by the printer, font, and paper type used to print the image containing that character. The character string search processes shown in FIGS. 8, 9, and 13, on the other hand, include no processing related to the print form of the OCR target image. However, to suppress inappropriate or superfluous search results in a character string search whose target is the text data Dtx obtained as an OCR result, the character string search process executed by the character string search device 40 shown in FIG. 5 may include processing related to the print form of the OCR target image.

図１４は、図５に示す文字列検索装置の第２の変形例における文字列検索処理、すなわち図８の文字列検索処理においてＯＣＲ対象画像の印刷形態に関連する処理を含めた構成の文字列検索処理を示すフローチャートである。 FIG. 14 is a flowchart showing the character string search process in the second modification of the character string search device shown in FIG. 5, that is, a process configured by adding processing related to the print form of the OCR target image to the character string search process of FIG. 8.

この文字列検索処理は、図８の文字列検索処理に対し、ステップＳ２０６の直前にステップＳ２０５が挿入されている点が異なり、その他のステップは、図８の文字列検索処理のステップと同様であり、対応するステップには同一のステップ番号が付されている。 This character string search process differs from the character string search process of FIG. 8 in that step S205 is inserted immediately before step S206, and the other steps are the same as those of the character string search process of FIG. corresponding steps are given the same step numbers.

この文字列検索処理では、ステップＳ２０４において、着目文字が誤認識文字テーブルＥｔｂｌに登録されていると判定されると、ステップＳ２０６の実行前に、誤認識文字テーブルＥｔｂｌにより着目文字に対応付けられたプリンタ、フォント、および、用紙の種類により特定される印刷形態（出力条件）が検索対象としてのテキストデータ（対象テキストデータ）Ｄｔｘの元画像の印刷形態に一致するか否かを判定する（ステップＳ２０５）。ここで、対象テキストデータＤｔｘの元画像とは、ＯＣＲ装置８０によって対象テキストデータＤｔｘを生成するためのＯＣＲ対象画像である。 In this character string search process, if it is determined in step S204 that the target character is registered in the erroneously recognized character table Etbl, before step S206 is executed, the target character is associated with the target character by the erroneously recognized character table Etbl. It is determined whether or not the print form (output condition) specified by the printer, font, and paper type matches the print form of the original image of the text data (target text data) Dtx to be searched (step S205). ). Here, the original image of the target text data Dtx is an OCR target image for generating the target text data Dtx by the OCR device 80 .

If the determination in step S205 finds that the print form associated with the character of interest by the erroneously recognized character table Etbl matches the print form of the OCR target image (original image), the process proceeds to step S206. If the print form associated with the character of interest does not match the print form of the OCR target image, the process proceeds to step S208 without executing step S206. Consequently, even when the character of interest is registered in the erroneously recognized character table Etbl, it is not replaced with a wildcard in the input character string if the print form associated with it does not match the print form of the OCR target image. The remaining processing is the same as in the character string search process of FIG. 8, and its description is omitted.
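The gating performed by step S205 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the `PrintForm` record, the contents of `ETBL`, the function names, and the use of the regular-expression `.` as a single-character wildcard are all hypothetical, and the table is simplified to one print form per registered character.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PrintForm:
    """Hypothetical print form (output condition): printer, font, paper type."""
    printer: str
    font: str
    paper: str


# Hypothetical table contents: registered character -> associated print form.
ETBL = {
    "l": PrintForm("printer-A", "Gothic", "plain"),
    "1": PrintForm("printer-A", "Gothic", "plain"),
    "O": PrintForm("printer-B", "Mincho", "coated"),
}


def build_search_pattern(input_string, etbl, target_form):
    """Build a regex from the input string, one character at a time."""
    parts = []
    for ch in input_string:
        form = etbl.get(ch)  # S204: is the character registered in the table?
        if form is not None and form == target_form:  # S205: print forms match?
            parts.append(".")  # S206: replace the character with a wildcard
        else:
            parts.append(re.escape(ch))  # S206 skipped: keep the character
    return "".join(parts)


def search(input_string, etbl, target_form, text):
    """Return every substring of the OCR text that matches the search pattern."""
    pattern = build_search_pattern(input_string, etbl, target_form)
    return [m.group(0) for m in re.finditer(pattern, text)]
```

With these assumed table entries, an OCR target image printed under the `printer-A` form causes `l` and `1` to become wildcards, whereas under any other print form the input string is searched unchanged.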

<6.3 Third Modification>
FIG. 15 is a flowchart showing the character string search process in a third modification of the character string search device shown in FIG. 5, that is, a character string search process in which processing related to the print form of the OCR target image is added to the character string search process of FIG. 9.

This character string search process differs from the character string search process of FIG. 9 in that step S205 is inserted immediately before step S220. The other steps are the same as those of the character string search process of FIG. 9, and corresponding steps are given the same step numbers.

In this character string search process as well, when it is determined in step S204 that the character of interest is registered in the erroneously recognized character table Etbl, it is determined, before step S220 is executed, whether the print form (output condition) specified by the printer, font, and paper type that the erroneously recognized character table Etbl associates with the character of interest matches the print form of the original image of the target text data Dtx, that is, the OCR target image (step S205).

If the determination in step S205 finds that the print form associated with the character of interest by the erroneously recognized character table Etbl matches the print form of the OCR target image, the process proceeds to step S220. If the print form associated with the character of interest does not match the print form of the OCR target image, the process proceeds to step S222 without executing step S220. Consequently, even when the character of interest is registered in the erroneously recognized character table Etbl, no new search term is created by replacing it in the input character string with another character that the table associates with it if the print form associated with it does not match the print form of the OCR target image. The remaining processing is the same as in the character string search process of FIG. 9, and its description is omitted.
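The effect of step S205 on the substitution-based search can be sketched in Python as follows. This is an illustration under stated assumptions rather than the method of the patent: the `Entry` record, the contents of `ETBL`, and the function names are hypothetical, and the sketch assumes that each new search term replaces exactly one character position (combinations of several replacements are not generated).

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical table row for one registered character."""
    other_char: str    # character associated with the registered character
    print_form: tuple  # (printer, font, paper type) associated with it


# Hypothetical table contents: registered character -> associated entries.
ETBL = {
    "l": [Entry("1", ("printer-A", "Gothic", "plain"))],
    "1": [Entry("l", ("printer-A", "Gothic", "plain"))],
}


def build_search_terms(input_string, etbl, target_form):
    """Return the input string plus substituted variants allowed by S205."""
    terms = [input_string]  # the input string itself is always a search term
    for i, ch in enumerate(input_string):
        for entry in etbl.get(ch, []):  # S204: is the character registered?
            if entry.print_form != target_form:
                continue  # S205 fails: skip S220, create no new search term
            # S220: replace the character with its associated character
            variant = input_string[:i] + entry.other_char + input_string[i + 1:]
            if variant not in terms:
                terms.append(variant)
    return terms


def search(terms, text):
    """Return the search terms that occur in the OCR text."""
    return [t for t in terms if t in text]
```

Under the assumed `printer-A` form, `build_search_terms("file1", ...)` adds one variant per registered character; under any other print form only the input string itself is searched.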

DESCRIPTION OF SYMBOLS
10 … Computer
18 … Erroneously recognized character table creation program
20 … Scanner
30 … Search terminal device
40 … Character string search device
45 … Search processing device
80 … OCR device
85 … OCR processing device
86 … Scanner
Etbl … Erroneously recognized character table
Dtx … Text data (search target, OCR result)
Spg … Character string search program

Claims

A character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
a misrecognized character table including misrecognized character association data and print form association data, wherein the misrecognized character association data associates, with each high-probability misrecognition character, which is a character whose likelihood of being misrecognized is deemed to exceed a predetermined allowable range in an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters, a misrecognized character, which is the character obtained as the recognition result when an image of that character is misrecognized by the OCR device, and the print form association data associates, with each high-probability misrecognition character, the print form of the image of that character when the image was misrecognized by the OCR device;
a search term creation unit that, when any character in an externally supplied input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, creates a search term by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching the search term obtained by the search term creation unit.

The character string search device according to claim 1, wherein the search term creation unit:
creates a search term by replacing the matching character in the input character string with a wildcard when any character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table and the print form associated with the matching character matches the print form of the target image; and
uses the input character string as the search term when any character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table but the print form associated with the matching character does not match the print form of the target image.

A character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
a misrecognized character table including misrecognized character association data and print form association data, wherein the misrecognized character association data associates, with each high-probability misrecognition character, which is a character whose likelihood of being misrecognized is deemed to exceed a predetermined allowable range in an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters, a misrecognized character, which is the character obtained as the recognition result when an image of that character is misrecognized by the OCR device, and the print form association data associates, with each high-probability misrecognition character, the print form of the image of that character when the image was misrecognized by the OCR device;
a search term creation unit that, when any character in an externally supplied input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, uses the input character string as a search term and also creates a search term by replacing the matching character in the input character string with another character that the misrecognized character table associates with the matching character, and, when no character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, uses the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching any of the search terms obtained by the search term creation unit,
wherein, when no character string matching any of the search terms obtained by the search term creation unit is found in the text data obtained as the OCR result and any character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, the search unit searches the text data obtained as the OCR result for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard.

A character string search device that searches text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the device comprising:
a misrecognized character table including misrecognized character association data and print form association data, wherein the misrecognized character association data associates, with each high-probability misrecognition character, which is a character whose likelihood of being misrecognized is deemed to exceed a predetermined allowable range in an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters, a misrecognized character, which is the character obtained as the recognition result when an image of that character is misrecognized by the OCR device, and the print form association data associates, with each high-probability misrecognition character, the print form of the image of that character when the image was misrecognized by the OCR device;
a search term creation unit that, when any character in an externally supplied input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table and the print form associated with the matching character matches the print form of the target image, uses the input character string as a search term and also creates a search term by replacing the matching character in the input character string with another character that the misrecognized character table associates with the matching character, and, when any character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table but the print form associated with the matching character does not match the print form of the target image, uses only the input character string as the search term; and
a search unit that searches the text data obtained as the OCR result for a character string matching any of the search terms obtained by the search term creation unit,
wherein, when no character string matching any of the search terms obtained by the search term creation unit is found in the text data obtained as the OCR result and any character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, the search unit searches the text data obtained as the OCR result for a character string matching a search term obtained by replacing the matching character in the input character string with a wildcard.

The character string search device according to any one of claims 1 to 4, wherein the print form association data includes information specifying at least one of the printing device, the font, and the recording medium used in printing the image of the character misrecognized by the OCR device.

The character string search device according to claim 5, wherein the print form association data includes information specifying the type of paper used as the recording medium in printing the image of the character misrecognized by the OCR device, and
the information specifying the type of paper includes information from which the susceptibility to bleeding of the ink used in printing the target image can be identified.

A character string search method for searching text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the method comprising:
a search term creation step of creating a search term from an externally supplied input character string and a misrecognized character table created in advance; and
a search step of searching the text data obtained as the OCR result for a character string matching the search term obtained in the search term creation step,
wherein the misrecognized character table includes:
misrecognized character association data that associates, with each high-probability misrecognition character, which is a character whose likelihood of being misrecognized is deemed to exceed a predetermined allowable range in an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters, a misrecognized character, which is the character obtained as the recognition result when an image of that character is misrecognized by the OCR device; and
print form association data that associates, with each high-probability misrecognition character, the print form of the image of that character when the image was misrecognized by the OCR device, and
wherein, in the search term creation step, when any character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, a search term is created by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, the input character string is used as the search term.

The character string search method according to claim 7, wherein the print form association data includes information specifying at least one of the printing device, the font, and the recording medium used in printing the image of the character misrecognized by the OCR device.

The character string search method according to claim 8, wherein the print form association data includes information specifying the type of paper used as the recording medium in printing the image of the character misrecognized by the OCR device, and
the information specifying the type of paper includes information from which the susceptibility to bleeding of the ink used in printing the target image can be identified.

A character string search program for searching text data obtained as an OCR result by reading a target image containing text from a recording medium on which the target image is printed and recognizing characters, the program causing a CPU of a computer to execute, using a memory:
a search term creation step of creating a search term from an externally supplied input character string and a misrecognized character table created in advance; and
a search step of searching the text data obtained as the OCR result for a character string matching any of the search terms created in the search term creation step,
wherein the misrecognized character table includes:
misrecognized character association data that associates, with each high-probability misrecognition character, which is a character whose likelihood of being misrecognized is deemed to exceed a predetermined allowable range in an OCR device that reads a target image containing text from a recording medium on which the target image is printed and recognizes characters, a misrecognized character, which is the character obtained as the recognition result when an image of that character is misrecognized by the OCR device; and
print form association data that associates, with each high-probability misrecognition character, the print form of the image of that character when the image was misrecognized by the OCR device, and
wherein, in the search term creation step, when any character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, a search term is created by replacing the matching character in the input character string with a wildcard, and, when no character in the input character string matches any of the high-probability misrecognition characters and misrecognized characters registered in the misrecognized character table, the input character string is used as the search term.

The character string search program according to claim 10, wherein the print form association data includes information specifying at least one of the printing device, the font, and the recording medium used in printing the image of the character misrecognized by the OCR device.

The character string search program according to claim 11, wherein the print form association data includes information specifying the type of paper used as the recording medium in printing the image of the character misrecognized by the OCR device, and
the information specifying the type of paper includes information from which the susceptibility to bleeding of the ink used in printing the target image can be identified.