JPH09185674A

JPH09185674A - Device and method for detecting and correcting erroneously recognized character

Info

Publication number: JPH09185674A
Application number: JP7343450A
Authority: JP
Inventors: Toshihiro Fujinami; 稔弘藤並; Tomoyuki Tada; 多田　　智之; Hidenobu Kaneoka; 秀信金岡
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 1995-12-28
Filing date: 1995-12-28
Publication date: 1997-07-15

Abstract

PROBLEM TO BE SOLVED: To make it easy to correct erroneously recognized characters in a character string recognized through image processing such as pattern matching, and to make the work of correcting such characters unnecessary. SOLUTION: A character string written on a document is captured as image data and recognized through pattern matching or the like (n1 and n2), and the resulting character string data is divided into bunsetsu (clauses) by morphological analysis (n3 and n4). The following positions are then detected as positions of erroneous recognition (n5): positions where bunsetsu each consisting of one character other than a delimiter such as a punctuation mark are adjacent; positions of a bunsetsu consisting of a single kanji; and positions where a bunsetsu consisting of a single kanji is adjacent to a bunsetsu containing a single-kanji independent word. The characters at these positions are then replaced with other characters and thereby corrected (n7).

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an erroneously recognized character detection device and an erroneously recognized character detection method for detecting erroneously recognized characters in a character string recognized by image processing such as pattern matching, and to an erroneously recognized character correction device and an erroneously recognized character correction method for correcting the detected characters.

[0002]

[Prior Art] Systems are becoming widespread in which text written on a document or the like is captured as image data using an image scanner such as an OCR (Optical Character Reader), image processing such as pattern matching is applied to the image data, and the text is recognized character by character, so that the text can easily be input as electronic information.

[0003] Such systems have a major problem: the character recognition accuracy of image processing such as pattern matching is not 100 percent. In other words, text written on a document or the like cannot be input accurately as electronic information. Moreover, erroneously recognized characters appear with no regularity. The resulting character string must therefore be checked character by character and any erroneously recognized character corrected, and this correction work is a heavy burden.

[0004] Accordingly, the following methods have been proposed for automatically detecting erroneously recognized characters in a recognized character string. (1) A method in which, when a character is recognized, a confidence level for the recognition result is obtained, and characters with low confidence are treated as possibly erroneously recognized.

[0005] (2) A method in which morphological, syntactic, and semantic analysis is performed on the recognized character string to detect places that cannot be connected grammatically or semantically, and the characters at those places are treated as erroneously recognized.

[0006] (3) A method in which a plurality of character candidates is detected for each character and, from among all combinations of these candidates, the combination that is optimal in terms of grammatical connection conditions, character concatenation probability, and the like is taken as the recognition result.

[0007]

[Problems to Be Solved by the Invention] However, with method (1) above, which relies on recognition confidence, the character recognition rate is insufficient for handwritten characters or for character strings whose quality has been degraded by copying or the like, and the confidence of the recognition result cannot be evaluated correctly.

[0008] Method (2), which relies on semantic analysis, requires a large-scale semantic dictionary in which every word is finely classified by meaning, so the system becomes large.

[0009] Method (3) evaluates grammatical connection conditions, character concatenation probabilities, and the like over all combinations of character candidates. For example, a system that detects 10 character candidates per character for a 10-character string must evaluate 10 to the 10th power combinations. It therefore requires an enormous amount of processing time.
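The arithmetic behind the 10-to-the-10th-power figure can be checked with a short sketch. The function name `combination_count` is an illustrative assumption and does not appear in the specification.

```python
def combination_count(candidates_per_character: int, string_length: int) -> int:
    """Number of candidate strings an exhaustive search over all combinations must score."""
    return candidates_per_character ** string_length


# 10 candidates for each of 10 characters, as in the example in the text.
print(combination_count(10, 10))  # 10000000000
```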

[0010] An object of the present invention is to provide an erroneously recognized character detection device and an erroneously recognized character detection method that make it easy to correct erroneously recognized characters in a character string recognized by image processing such as pattern matching.

[0011] A further object of the present invention is to provide an erroneously recognized character correction device and an erroneously recognized character correction method that make the work of correcting erroneously recognized characters in a character string recognized by image processing such as pattern matching unnecessary.

[0012]

[Means for Solving the Problems] The erroneously recognized character detection device of the present invention according to claim 1 comprises: a dictionary file in which words, which are character strings, and data indicating the attributes of those words are registered; morphological analysis means for performing, using the dictionary file, morphological analysis including bunsetsu segmentation on an input character string; and erroneous recognition detection means for detecting, in the character string segmented by the morphological analysis means, a place where bunsetsu each consisting of one character other than a delimiter such as a punctuation mark are adjacent, as a character string in which a character has been erroneously recognized.

[0013] In this configuration, morphological analysis is performed on the input character string using a dictionary file in which words and data indicating their attributes are registered. When the string is segmented into bunsetsu by this analysis, any place where bunsetsu each consisting of one character other than a delimiter such as a punctuation mark are adjacent is detected as a character string in which a character has been erroneously recognized.
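The detection rule of claim 1 can be sketched as follows, assuming the morphological analysis has already produced a list of bunsetsu strings. The delimiter set, the function names, and the sample segmentation are illustrative assumptions, not taken from the specification or its figures.

```python
# Assumed set of delimiter symbols; the specification only says "punctuation marks and the like".
DELIMITERS = set("、。，．「」〔〕（）")


def is_single_non_delimiter(bunsetsu: str) -> bool:
    """True for a bunsetsu made of exactly one character that is not a delimiter."""
    return len(bunsetsu) == 1 and bunsetsu not in DELIMITERS


def adjacent_single_character_pairs(bunsetsu_list):
    """Return index pairs (i, i + 1) where two one-character, non-delimiter bunsetsu are adjacent."""
    return [
        (i, i + 1)
        for i in range(len(bunsetsu_list) - 1)
        if is_single_non_delimiter(bunsetsu_list[i])
        and is_single_non_delimiter(bunsetsu_list[i + 1])
    ]


# Hypothetical segmentation: "は" and "や" are adjacent one-character bunsetsu,
# while "、" is a delimiter and therefore does not form a flagged pair with "も".
print(adjacent_single_character_pairs(["この", "は", "や", "支給の", "、", "も"]))  # [(1, 2)]
```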

[0014] The erroneously recognized character detection device of the present invention according to claim 2 comprises: a dictionary file in which words, which are character strings, and data indicating the attributes of those words are registered; morphological analysis means for performing, using the dictionary file, morphological analysis including bunsetsu segmentation on an input character string; and erroneous recognition detection means for detecting, in the character string segmented by the morphological analysis means, a bunsetsu consisting of a single kanji as a character string in which a character has been erroneously recognized.

[0015] In this configuration, morphological analysis is performed on the input character string using a dictionary file in which words and data indicating their attributes are registered. When the string is segmented into bunsetsu by this analysis, any bunsetsu consisting of a single kanji is detected as a character string in which a character has been erroneously recognized.
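A minimal sketch of the claim 2 rule is given below. Testing for kanji by the CJK Unified Ideographs range U+4E00 to U+9FFF is an assumption made for this example; the specification does not say how kanji are identified. The function names and the sample segmentation are likewise hypothetical.

```python
def is_kanji(ch: str) -> bool:
    """Assumed kanji test: the character lies in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def single_kanji_bunsetsu(bunsetsu_list):
    """Return the indices of bunsetsu that consist of exactly one kanji."""
    return [
        i for i, b in enumerate(bunsetsu_list)
        if len(b) == 1 and is_kanji(b)
    ]


# Hypothetical segmentation: only "隼" is a one-kanji bunsetsu; "は" is hiragana.
print(single_kanji_bunsetsu(["この", "隼", "金支給の", "は"]))  # [1]
```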

[0016] The erroneously recognized character detection device of the present invention according to claim 3 comprises: a dictionary file in which words, which are character strings, and data indicating the attributes of those words are registered; morphological analysis means for performing, using the dictionary file, morphological analysis including bunsetsu segmentation on an input character string; and erroneous recognition detection means for detecting, in the character string segmented by the morphological analysis means, a place where a bunsetsu consisting of a single kanji is adjacent to a bunsetsu containing a single-kanji independent word, as a character string in which a character has been erroneously recognized.

[0017] In this configuration, morphological analysis is performed on the input character string using a dictionary file in which words and data indicating their attributes are registered. When the string is segmented into bunsetsu by this analysis, any place where a bunsetsu consisting of a single kanji is adjacent to a bunsetsu containing a single-kanji independent word is detected as a character string in which a character has been erroneously recognized.
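The claim 3 rule needs to know which part of each bunsetsu is the independent word, so the sketch below assumes the analysis result is available as a small record per bunsetsu. The `Bunsetsu` class, the kanji test by Unicode range, and the decision to check adjacency in both directions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bunsetsu:
    text: str              # surface string of the whole bunsetsu
    independent_word: str  # surface string of its independent word


def is_kanji(ch: str) -> bool:
    """Assumed kanji test: the character lies in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_single_kanji(b: Bunsetsu) -> bool:
    return len(b.text) == 1 and is_kanji(b.text)


def has_single_kanji_independent_word(b: Bunsetsu) -> bool:
    return len(b.independent_word) == 1 and is_kanji(b.independent_word)


def flagged_pairs(bunsetsu_list):
    """Index pairs where a one-kanji bunsetsu neighbours a bunsetsu whose independent word is one kanji."""
    pairs = []
    for i in range(len(bunsetsu_list) - 1):
        left, right = bunsetsu_list[i], bunsetsu_list[i + 1]
        if (is_single_kanji(left) and has_single_kanji_independent_word(right)) or (
            is_single_kanji(right) and has_single_kanji_independent_word(left)
        ):
            pairs.append((i, i + 1))
    return pairs


# Hypothetical analysis result, not the segmentation shown in the figures.
sample = [Bunsetsu("隼", "隼"), Bunsetsu("金の", "金"), Bunsetsu("問題に", "問題")]
print(flagged_pairs(sample))  # [(0, 1)]
```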

[0018] The erroneously recognized character detection device of the present invention according to claim 4 comprises display means for displaying the input character string, wherein the display means displays the character strings that the erroneous recognition detection means has detected as erroneously recognized and the other character strings in different display formats.

[0019] In this configuration, the character strings detected as erroneously recognized and the other character strings are displayed in different display formats.
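One simple way to realise "different display formats" on a plain-text display is to wrap the flagged bunsetsu in marker characters. The sketch below assumes square brackets as the marker and takes the flagged positions as a set of indices; neither choice comes from the specification, which leaves the display format open.

```python
def render(bunsetsu_list, flagged, open_mark="[", close_mark="]"):
    """Join the bunsetsu into one string, wrapping the flagged ones in marker characters."""
    return "".join(
        f"{open_mark}{b}{close_mark}" if i in flagged else b
        for i, b in enumerate(bunsetsu_list)
    )


# Hypothetical segmentation with the second and third bunsetsu flagged.
print(render(["この", "隼", "金"], {1, 2}))  # この[隼][金]
```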

[0020] The erroneously recognized character detection device of the present invention according to claim 5 comprises: character recognition means for capturing text written on a document or the like as image data and performing image processing such as pattern matching on the image data to recognize each character making up the text; and input means for inputting a character string consisting of the recognized characters.

[0021] In this configuration, text written on a document or the like is captured as image data, and image processing such as pattern matching is performed to recognize each character making up the text. A character string consisting of the recognized characters is then input.

[0022] The erroneously recognized character correction device of the present invention according to claim 6 is the erroneously recognized character detection device according to claim 5, wherein the character recognition means includes means for detecting, for each character, a plurality of characters other than the recognized character as character candidates; the device comprises replacement means for replacing, when the erroneous recognition detection means detects a character string in which a character has been erroneously recognized, a character of that string with a character detected as a character candidate; and the morphological analysis and the erroneous recognition detection are executed again after the replacement by the replacement means.

[0023] In this configuration, a plurality of characters other than the recognized character are detected as character candidates for each character. When the erroneous recognition detection means detects a character string in which a character has been erroneously recognized, a character of that string is replaced with one of its character candidates, and the morphological analysis and the erroneous recognition detection are then executed again to determine whether any erroneously recognized character remains.

[0024] In the erroneously recognized character correction device of the present invention according to claim 7, the characters detected as character candidates are assigned priorities, and the replacement means includes means for extracting the replacement character from the character candidates on the basis of those priorities.

[0025] In this configuration, the replacement character is extracted from the character candidates on the basis of the priorities assigned to them.
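Priority-based extraction can be sketched as picking the highest-priority candidate that has not yet been tried. The representation of candidates as (character, rank) pairs with rank 1 as the highest priority, the `tried` set, and the function name are assumptions for illustration; the section does not describe how the device tracks which candidates have been used.

```python
def next_replacement(candidates, tried):
    """Return the highest-priority candidate character not yet tried, or None if none is left.

    candidates: list of (character, rank) pairs, where rank 1 is the highest priority.
    tried: set of characters that have already been used at this position.
    """
    remaining = [(rank, ch) for ch, rank in candidates if ch not in tried]
    return min(remaining)[1] if remaining else None


# Hypothetical candidates for one character position; "隼" (rank 1) was already used.
print(next_replacement([("年", 2), ("隼", 1), ("午", 3)], {"隼"}))  # 年
```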

[0026] The erroneously recognized character correction device of the present invention according to claim 8 is the erroneously recognized character detection device according to claim 5, further comprising a similarity dictionary in which, for each character, characters of similar shape are registered, and replacement means which, when the erroneous recognition detection means detects a character string in which a character has been erroneously recognized, looks up in the similarity dictionary a character whose shape is similar to a character of that string and replaces the erroneously recognized character with the similar character thus found; the morphological analysis and the erroneous recognition detection are executed again after the replacement by the replacement means.

[0027] In this configuration, when the erroneous recognition detection means detects a character string in which a character has been erroneously recognized, a character whose shape is similar to a character of that string is looked up in the similarity dictionary. The erroneously recognized character is replaced with the similar character thus found, and the morphological analysis and the erroneous recognition detection are then executed again to determine whether any erroneously recognized character remains.
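The similarity-dictionary lookup can be sketched as a mapping from a character to characters of similar shape. The entries below are hypothetical and chosen only for illustration; the specification does not list the dictionary's contents, and the function name is likewise assumed.

```python
# Hypothetical similarity dictionary: character -> characters of similar shape.
SIMILAR = {
    "隼": ["年"],
    "間": ["問"],
    "黙": ["然"],
}


def similar_replacements(text: str, position: int):
    """Return the strings obtained by replacing text[position] with each similar character."""
    return [
        text[:position] + alternative + text[position + 1:]
        for alternative in SIMILAR.get(text[position], [])
    ]


# Each returned string would then be segmented and checked again.
print(similar_replacements("隼金", 0))  # ['年金']
```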

[0028] The erroneously recognized character correction device of the present invention according to claim 9 comprises means for displaying the character strings whose characters have been replaced by the replacement means and the other character strings in different display formats.

[0029] In this configuration, the character strings containing replaced characters and the other character strings are displayed in different display formats.

[0030] The erroneously recognized character detection methods of the present invention according to claims 10 to 18 express the configurations of the devices according to claims 1 to 9, respectively, as methods.

[0031]

[Embodiments of the Invention] FIG. 1 is a block diagram showing the functions of an erroneously recognized character correction device according to an embodiment of the present invention. The erroneously recognized character correction device 1 comprises an input unit 2, an image data storage unit 3, a character recognition unit 4, a character candidate storage unit 5, a character string data creation unit 6, a word dictionary 7, a grammar dictionary 8, a morphological analysis unit 9, a determination unit 10, a correction unit 11, and a display unit 12. The input unit 2, the image data storage unit 3, the character recognition unit 4, the character string data creation unit 6, the word dictionary 7, the grammar dictionary 8, the morphological analysis unit 9, the determination unit 10, and the display unit 12 together constitute an erroneously recognized character detection device 1a according to an embodiment of the present invention.

[0032] The input unit 2 captures text or the like written on a document as image data using an OCR or the like. The image data storage unit 3 stores the image data captured by the input unit 2. The character recognition unit 4 cuts out characters one by one from the image data of the text stored in the image data storage unit 3 and, by pattern matching or the like, detects a plurality of prioritized character candidates for each character. The character candidate storage unit 5 stores the character candidates detected by the character recognition unit 4 together with their priorities. The character string data creation unit 6 creates character string data consisting of characters extracted from the character candidates according to the priorities. The word dictionary 7 stores the character strings of words in association with the attributes of those words. The grammar dictionary 8 stores grammar rules. The morphological analysis unit 9 segments the character string data created by the character string data creation unit 6 into bunsetsu using the word dictionary 7 and the grammar dictionary 8. The determination unit 10 detects erroneously recognized character strings on the basis of the segmentation result. The correction unit 11 executes correction processing that replaces a character of a detected erroneously recognized character string with another character. The display unit 12 displays, among other things, the character string created by the character string data creation unit 6.

[0033] FIG. 2 is a flowchart showing the sequence of processing of the erroneously recognized character correction device according to the embodiment of the present invention. The operation of the erroneously recognized character correction device 1 is first described briefly. The device captures text written on a document or the like as image data at the input unit 2 and stores it in the image data storage unit 3 (n1). The character recognition unit 4 cuts out characters one by one from the image data captured at n1 and detects character candidates for each character (n2). The character candidate storage unit 5 stores the character candidates detected for each character. The character string data creation unit 6 creates character string data from the character candidates detected at n2 (n3), and the morphological analysis unit 9 performs morphological analysis on this character string data (n4). The determination unit 10 then performs erroneous recognition detection processing, which locates character strings containing erroneously recognized characters from the result of the morphological analysis (n5). If no such character string is detected, the character string data is displayed on the display unit 12 and processing ends (n6 → n8). If such a character string is detected at n5, correction processing is performed in which the character detected as erroneously recognized is replaced with another character candidate (n6 → n7), and the processing from n3 onward is repeated. The correction processing at n7 corresponds to the replacement means of the present invention.
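The loop from n3 to n8 can be sketched as below. This is an illustration under stated assumptions rather than the device's actual procedure: the replacement policy (advance every flagged position to its next candidate), the stopping rule when candidates run out, the `max_rounds` limit, and the toy `toy_segment` and `toy_detect` stand-ins for the morphological analysis unit and the determination unit are all hypothetical.

```python
WORDS = {"年金"}  # toy dictionary, assumed for this example only


def toy_segment(text):
    """Stand-in for n4: greedy longest match; unknown characters become one-character units."""
    units, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in WORDS or j == i + 1:
                units.append(text[i:j])
                i = j
                break
    return units


def toy_detect(units):
    """Stand-in for n5: character positions of adjacent one-character units."""
    starts, pos = [], 0
    for u in units:
        starts.append(pos)
        pos += len(u)
    bad = set()
    for i in range(len(units) - 1):
        if len(units[i]) == 1 and len(units[i + 1]) == 1:
            bad.update({starts[i], starts[i + 1]})
    return bad


def correct(lattice, segment, detect, max_rounds=20):
    """lattice: one candidate list per character position, highest priority first."""
    chosen = [0] * len(lattice)  # index of the candidate currently used at each position
    text = ""
    for _ in range(max_rounds):
        text = "".join(cands[k] for cands, k in zip(lattice, chosen))  # n3
        bad = detect(segment(text))                                    # n4, n5
        advanced = False
        for pos in bad:                                                # n7 (assumed policy)
            if chosen[pos] + 1 < len(lattice[pos]):
                chosen[pos] += 1
                advanced = True
        if not bad or not advanced:                                    # n6 -> n8
            return text
    return text


# Hypothetical two-character lattice: the first position has a second candidate "年".
print(correct([["隼", "年"], ["金"]], toy_segment, toy_detect))  # 年金
```

Passing the segmenter and detector in as functions keeps the loop independent of how the analysis is implemented, which mirrors the separation between units 9, 10, and 11 in FIG. 1.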

[0034] The operation of the erroneously recognized character correction device 1 is described in detail below. At n1, the input unit 2 captures the text written on a document as image data. The captured image data is stored in the image data storage unit 3.

[0035] At n2, image processing is performed on the image data captured at n1. In this image processing, the characters of the document, which is image data, are cut out one by one, and each character is recognized by pattern matching or the like. In this recognition, a predetermined number of character candidates is detected for each character, and the detected candidates are assigned priorities. The character candidates for each character are stored in the character candidate storage unit 5 as a character lattice.

[0036] FIG. 3 shows the character lattice created when the sentence 「この年金支給の問題についても、当然でしょ。」 (roughly, "This pension payment issue, too, is only natural.") is captured as image data. Ten character candidates are detected for each character, and the candidates are assigned priorities. In the figure the candidates are shown in descending order of priority (the further left a character is, the higher its priority).

[0037] At n3, the character string data creation unit 6 creates character string data consisting of the first-candidate characters (the characters with the highest priority). In the example above, the character string data 「この隼金支給の間題についても、当黙でしよ。」 is created (年, 問, 然, and ょ have been recognized as 隼, 間, 黙, and よ).
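The character lattice and the first-candidate string of n3 can be sketched with nested lists. The lattice below is a shortened, hypothetical stand-in: the actual lattice of FIG. 3 holds ten candidates per character and is not reproduced here, and the function name is an assumption.

```python
# Hypothetical lattice for the first four characters of the example sentence.
# Each inner list holds the candidates for one character, highest priority first.
lattice = [
    ["こ"],
    ["の"],
    ["隼", "年"],
    ["金", "全"],
]


def first_candidate_string(lattice):
    """Concatenate the highest-priority candidate of every character position."""
    return "".join(candidates[0] for candidates in lattice)


print(first_candidate_string(lattice))  # この隼金
```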

[0038] At n4, the morphological analysis unit 9 performs morphological analysis on the character string data created at n3 and segments the character string into bunsetsu. Morphological analysis is briefly explained here. It is processing that finds the words making up an input character string and clarifies the connection relationships between those words. FIG. 4 shows the configuration of a typical morphological analysis system. The morphological analysis system 20 comprises a sentence buffer 21 that stores character string data, a dictionary search unit 22 that searches a dictionary, a morpheme dictionary 23 in which words are registered, a connection verification unit 24 that determines whether a connection between morphemes holds, and a connection rule storage unit 25 that stores the connection rules between morphemes. The morpheme dictionary 23 corresponds to the word dictionary 7, the connection rule storage unit 25 corresponds to the grammar dictionary 8, and the sentence buffer 21, the dictionary search unit 22, and the connection verification unit 24 together constitute the morphological analysis unit 9. In other words, the word dictionary 7, the grammar dictionary 8, and the morphological analysis unit 9 shown in FIG. 1 constitute the morphological analysis system 20.

[0039] The sentence buffer 21 stores the character string data created at n3. The dictionary search unit 22 uses the morpheme dictionary 23 to divide the character string data stored in the sentence buffer 21 into words according to a predetermined word segmentation method (for example, the longest-match method, the two-bunsetsu longest-match method, or the minimum-bunsetsu-count method). For each word candidate obtained in this way, the connection verification unit 24 checks its connection with the adjacent word (the word cut out immediately before it). This connection check confirms whether the connection rule between the morphemes of the word candidate and the previously cut-out adjacent word holds. If the rule holds, the word is judged to have been cut out correctly; if not, the word is judged to have been cut out incorrectly. In the latter case the morpheme dictionary 23 is consulted again to find another word candidate, and the same processing is performed. If no other word candidate exists, the adjacent word candidate is assumed to have been cut out incorrectly, and the word cutting is redone.
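The interplay of dictionary lookup, connection check, and re-cutting described above can be sketched as a longest-match search with backtracking. This is a simplified word-level illustration, not the system of FIG. 4: the dictionary entries, the part-of-speech labels, the `"BOS"` start marker, and the connection rule table are all hypothetical, and grouping words into bunsetsu is omitted.

```python
def segment(text, dictionary, rules, previous="BOS"):
    """Split text into (word, label) pairs, longest match first, honouring connection rules.

    dictionary: word -> list of labels; rules: set of allowed (previous_label, label) pairs.
    Returns None when no segmentation satisfies the rules, which makes the caller
    try its next candidate (the "redo the word cutting" step).
    """
    if not text:
        return []
    for end in range(len(text), 0, -1):        # longest candidate first
        word = text[:end]
        for label in dictionary.get(word, []):
            if (previous, label) not in rules:
                continue                        # connection rule does not hold
            rest = segment(text[end:], dictionary, rules, label)
            if rest is not None:
                return [(word, label)] + rest
    return None


# Hypothetical dictionary and connection rules.
dictionary = {"この": ["adnominal"], "年金": ["noun"], "年": ["noun"], "の": ["particle"]}
rules = {("BOS", "adnominal"), ("adnominal", "noun"), ("noun", "particle")}
print(segment("この年金の", dictionary, rules))
# [('この', 'adnominal'), ('年金', 'noun'), ('の', 'particle')]
```

In this sketch a string containing a character sequence that is missing from the dictionary simply yields `None`; how the actual system turns such sequences into one-character bunsetsu is not modelled here.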

[0040] In this way, the words making up the character string are found, the connection relationships between the words are clarified, and the character string data is divided into bunsetsu units.

[0041] For example, the character string data 「この隼金支給の間題についても、当黙でしよ。」 consisting of the first-candidate characters in the example above is divided by morphological analysis into the bunsetsu units shown in FIG. 5.

[0042] At n5, the determination unit 10 uses the bunsetsu segmentation result of n4 to locate the places where characters were erroneously recognized at n2. In this embodiment, a place that satisfies any of the following conditions (1) to (3) is detected as a place where a character has been erroneously recognized.
(1) A place where bunsetsu each consisting of one character (kanji, hiragana, katakana, alphanumeric character, etc.) other than a delimiter such as a punctuation mark (、。〔〕, etc.) are adjacent.
(2) A place that is a bunsetsu consisting of a single kanji.
(3) A place where a bunsetsu consisting of a single kanji is adjacent to a bunsetsu containing a single-kanji independent word.
These conditions were chosen because, grammatically, a bunsetsu in Japanese is not formed by a single character. A bunsetsu consisting of one hiragana character that stands alone is nevertheless not detected as an erroneously recognized place, for the following reason. In character recognition, hiragana are almost never erroneously recognized as other characters, and two consecutive characters are almost never both erroneously recognized. When a single hiragana forms a bunsetsu by itself, it is therefore considered that a kanji in the adjacent bunsetsu was erroneously recognized, so that the hiragana, which was originally an attached word, could no longer connect to it.

【００４３】図６は、上記したｎ５における誤認識箇所
を検出する処理を示すフローチャートである。この処理
は判定部１０で行われる。ｎ４における形態素解析処理
で文節切りされた文字列のデータを取り込む（ｎ１
１）。そして、初期設定としてｍａｅと言う変数を０に
セットするとともに、先頭の文節を注目文節に設定する
（ｎ１２、ｎ１３）。注目文節とは、以下の処理を行う
対象とする文節である。また、ｍａｅと言う変数は以下
のようにして設定される。注目文節が１文字の漢字から
なる文節であれば２に設定する。注目文節が句読点等の
区切り記号および漢字以外の１文字からなる文節であれ
ば１に設定する。注目文節が上記以外であれば０に設定
する。FIG. 6 is a flowchart showing the process of detecting misrecognized portions at n5. This process is performed by the determination unit 10. First, the character string data segmented into bunsetsu by the morphological analysis at n4 is read in (n11). Then, as initialization, a variable called mae is set to 0 and the first bunsetsu is set as the bunsetsu of interest (n12, n13). The bunsetsu of interest is the bunsetsu on which the following processing is performed. The variable mae is set as follows: 2 if the bunsetsu of interest consists of one kanji character; 1 if it consists of one character that is neither a delimiter such as a punctuation mark nor a kanji; 0 otherwise.

【００４４】初期設定が終了すると、注目文節が１文字
の文節であるかどうかを判定する（ｎ１４）。ｎ１４
で、１文字の文節であると判定すると、この文節の１文
字が、句読点等の区切り記号であるかどうかを判定する
（ｎ１５）。ｎ１５で区切り記号であると、ｎ２５に進
み変数ｍａｅを０に設定する。そして、注目文節の後ろ
に文節があるかないかを判定する（ｎ２６）。後ろに文
節があれば１つ後ろの文節を注目文節に設定し（ｎ２
７）、ｎ１４に戻る。ｎ２６で後ろに文節がないと判定
すると処理を完了する。When initialization is complete, it is determined whether the bunsetsu of interest is a one-character bunsetsu (n14). If so, it is determined whether that character is a delimiter such as a punctuation mark (n15). If it is a delimiter, the process proceeds to n25 and the variable mae is set to 0. It is then determined whether another bunsetsu follows the bunsetsu of interest (n26). If one follows, the next bunsetsu is set as the bunsetsu of interest (n27) and the process returns to n14. If n26 determines that no bunsetsu follows, the process ends.

【００４５】ｎ１５で区切り記号以外の１文字からなる
文節であると判定すると、変数ｍａｅが０かどうかを判
定する（ｎ１６）。前回の注目文節が句読点等の区切り
記号以外の漢字、ひらがな、カタカナ、英数字等の１文
字からなる文節であったきに、変数ｍａｅが１または２
に設定されている。すなわち、この変数ｍａｅは現在の
注目文節より１つ前の文節の形態を示している。したが
って、ｎ１６の判定では、現在の注目文節より１つ前の
文節が句読点等の区切り記号以外の漢字、ひらがな、カ
タカナ、英数字等１文字からなる文節であったかどうか
を判定している。When n15 determines that the bunsetsu consists of one character other than a delimiter, it is determined whether the variable mae is 0 (n16). The variable mae has been set to 1 or 2 when the previous bunsetsu of interest consisted of one character such as a kanji, hiragana, katakana, or alphanumeric character, other than a delimiter such as a punctuation mark. That is, mae indicates the form of the bunsetsu immediately before the current bunsetsu of interest. The determination at n16 therefore checks whether the bunsetsu immediately before the current bunsetsu of interest was such a one-character bunsetsu.

【００４６】そして、現在の注目文節より１つ前の文節
が句読点等の区切り記号以外の１文字からなる文節であ
ったときには、区切り記号以外の１文字からなる文節が
連接している箇所であるので、現在の注目文節より１つ
前の文節と現在の注目文節と、を文字が誤認識されてい
る箇所と判定し（ｎ１７）、ｎ１８に進む。一方、現在
の注目文節より１つ前の文節が区切り記号以外の１文字
からなる文節でなくｍａｅが０に設定されていれば、ｎ
１７の処理を行うことなく、ｎ１８に進む。ｎ１８で
は、現在の注目文節が漢字１文字からなる文節であるか
どうかが判定される。ここで、現在の注目文節が漢字１
文字からなる文節ではない時（ひらがな、または、カタ
カナ、英数字等の１文字文節である時）には、ｎ２１に
進んでｍａｅを１に設定し、ｎ２６に進む。When the bunsetsu immediately before the current bunsetsu of interest consisted of one character other than a delimiter such as a punctuation mark, one-character bunsetsu other than delimiters are adjacent here, so both the preceding bunsetsu and the current bunsetsu of interest are judged to be portions where characters were misrecognized (n17), and the process proceeds to n18. If instead the preceding bunsetsu was not such a one-character bunsetsu and mae is 0, the process proceeds to n18 without performing n17. At n18, it is determined whether the current bunsetsu of interest consists of one kanji character. If it does not (that is, it is a one-character bunsetsu of hiragana, katakana, an alphanumeric character, or the like), the process proceeds to n21, sets mae to 1, and proceeds to n26.

【００４７】ｎ１８で、現在の注目文節が漢字１文字か
らなる文節であると判定した時には、この漢字１文字か
らなる現在の注目文節を文字が誤認識されている箇所と
判定し（ｎ１９）、ｎ２０に進んでｍａｅを２に設定
し、ｎ２６に進む。このｎ１９の処理で、１文字の文節
と連接していない漢字１文字からなる文節も誤認識の箇
所として検出される。When n18 determines that the current bunsetsu of interest consists of one kanji character, that bunsetsu is judged to be a portion where a character was misrecognized (n19), and the process proceeds to n20, sets mae to 2, and proceeds to n26. Through n19, a bunsetsu consisting of one kanji character is detected as a misrecognized portion even when it is not adjacent to another one-character bunsetsu.

【００４８】また、ｎ１４で注目文節が１文字からなる
文節でないと判定されたときには、ｎ２２に進み変数ｍ
ａｅが２かどうかを判定する。ここで、１つ前の文節が
漢字１文字からなる文節であったかどうかを確認してい
る。そして、ｍａｅが２でなければｎ２５に進んでｍａ
ｅを０に設定し、ｎ２６以降の処理を行う。ｎ２２でｍ
ａｅが２であると（１つ前の文節が漢字１文字からなる
文節であった場合）、注目文節が漢字１文字の自立語を
含む文節であるかどうかを判定する（ｎ２３）。ｎ２３
で注目文節が漢字１文字の自立語を含む文節でないと判
定すると、ｎ２５に進んでｍａｅを０に設定し、ｎ２６
以降の処理を行う。ｎ２３で注目文節が漢字１文字の自
立語を含む文節であると判定すると、この注目文節に含
まれる自立語である漢字１文字を誤認識の文字として判
定し（ｎ２４）、ｎ２５でｍａｅを０に設定して、ｎ２
６以降の処理を行う。なお、この１文字の漢字の付属語
となって文節を構成しているひらがな等は認識誤りがあ
った文字として検出されない。When n14 determines that the bunsetsu of interest does not consist of one character, the process proceeds to n22 and determines whether the variable mae is 2, that is, whether the preceding bunsetsu consisted of one kanji character. If mae is not 2, the process proceeds to n25, sets mae to 0, and continues from n26. If mae is 2 at n22 (the preceding bunsetsu consisted of one kanji character), it is determined whether the bunsetsu of interest contains an independent word of one kanji character (n23). If it does not, the process proceeds to n25, sets mae to 0, and continues from n26. If it does, the one kanji character that is the independent word in the bunsetsu of interest is judged to be a misrecognized character (n24), mae is set to 0 at n25, and the process continues from n26. Hiragana and other characters that form the bunsetsu as attached words to this one-character kanji are not detected as misrecognized characters.
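The flow of FIG. 6 (n11 to n27) described above can be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the `Bunsetsu` class, the `independent` field (assumed to be supplied by the morphological analyzer), the `DELIMITERS` set, and the Unicode range used for "kanji" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Assumption: the source lists "、。〔〕 etc." as delimiters; only these four are used here.
DELIMITERS = set("、。〔〕")


@dataclass
class Bunsetsu:
    text: str         # surface string of the bunsetsu
    independent: str  # its independent word (assumed output of the analyzer)


def is_kanji(ch: str) -> bool:
    # Assumption: "kanji" is approximated by the CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def detect_misrecognized(phrases):
    """Return a set of (bunsetsu_index, char_offset) pairs judged misrecognized."""
    flagged = set()
    mae = 0  # n12: form of the previous bunsetsu (0, 1, or 2)
    for i, b in enumerate(phrases):  # n13, n26, n27: walk the bunsetsu in order
        if len(b.text) == 1:  # n14: one-character bunsetsu?
            ch = b.text
            if ch in DELIMITERS:  # n15: delimiter such as a punctuation mark
                mae = 0  # n25
                continue
            if mae != 0:  # n16: previous bunsetsu was also one character
                flagged.update({(i - 1, 0), (i, 0)})  # n17
            if is_kanji(ch):  # n18
                flagged.add((i, 0))  # n19: a lone one-kanji bunsetsu is flagged
                mae = 2  # n20
            else:
                mae = 1  # n21
        else:
            # n22, n23: previous bunsetsu was one kanji and this one contains
            # an independent word of one kanji character
            if mae == 2 and len(b.independent) == 1 and is_kanji(b.independent):
                # n24: flag only the kanji, not its attached hiragana
                flagged.add((i, b.text.index(b.independent)))
            mae = 0  # n25
    return flagged
```

For a hypothetical segmentation such as `この / 隼 / 金 / 支給の`, the sketch flags the two one-kanji bunsetsu, while a single hiragana bunsetsu standing between multi-character bunsetsu is left unflagged, in line with the conditions described for n5.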

【００４９】図７は、図５に示した文節切りされた文字
列のデータに対して誤認識箇所検出処理によって誤認識
と検出された文字列を示す図である。ここで、従来のよ
うに形態素解析された結果から文法上の接続だけで誤認
識の箇所を検出する方式では、「し／よ」の部分だけし
か文字の誤認識を検出することはできない。しかし、本
実施の形態では、誤認識された文字列を全て検出するこ
とができた。上記したｎ５の誤認識箇所検出処理が完了
すると、判定部１０がｎ６で誤認識箇所の有無を判定す
る。そして、誤認識箇所があると判定すると、訂正部１
１がｎ７の訂正処理を実行する。FIG. 7 shows the characters detected as misrecognized when the misrecognized-portion detection process is applied to the segmented character string data of FIG. 5. A conventional method that detects misrecognized portions from the morphological analysis result using grammatical connection alone can detect only the 「し／よ」 portion. In this embodiment, by contrast, all of the misrecognized character strings were detected. When the detection process of n5 is completed, the determination unit 10 determines at n6 whether any misrecognized portion exists. If one exists, the correction unit 11 executes the correction process of n7.

【００５０】また、本願発明で言う誤認文字検出装置１
ａでは、誤認識された文字を訂正する機能を有していな
いので、以下に示す訂正処理は実行されない。ただし、
表示部１２に、ここで判定した誤認識箇所とそれ以外の
箇所とを異なる表示形式で表示して処理を完了する。し
たがって、誤認識された文字を訂正をする作業者は、文
字が誤認識されている箇所を表示形式の違いから簡単に
見つけることができるので、訂正作業を簡単に行うこと
ができるようになる。The misrecognized character detection device 1a referred to in the present invention has no function for correcting misrecognized characters, so the correction process described below is not executed. Instead, the misrecognized portions determined here and the remaining portions are displayed on the display unit 12 in different display formats, and the process is completed. An operator correcting misrecognized characters can therefore easily find them from the difference in display format, which simplifies the correction work.

【００５１】誤認識と判定された箇所における文字の置
換は、以下のルールにしたがって実行する。前後に誤認識とした文字が連接していない部分では、
文字候補の優先度の順（第２候補、第３候補・・の順）
に置換する。誤認識文字が２文字連接している部分では、一文字の
み置換し、他方の文字を第１候補の文字とする。また、
文字候補の優先準位を加算したときにその値が小さいも
のから優先する。また、加算値が同じ場合には、後ろの
文字を第１候補の文字とする。文字候補の組み合わせに
おいて、どちらか一方の文字を第１候補とする全ての組
み合わせが完了たときには、第２候補の文字を第１候補
の文字とみなして同様の処理を行う。このようにして文
字を置換するのは、上記したように２文字連続して文字
が誤認識されることがほとんどないという理由からであ
る。この訂正処理における文字が置換される順番を示
す。１回目、前の文字を第２候補、後ろの文字は第１候補２回目、前の文字を第１候補、後ろの文字は第２候補３回目、前の文字を第３候補、後ろの文字は第１候補４回目、前の文字を第１候補、後ろの文字は第３候補５回目、前の文字を第４候補、後ろの文字は第１候補・・・１７回目、前の文字を第１０候補、後ろの文字は第１候
補１８回目、前の文字を第１候補、後ろの文字は第１０候
補１９回目、前の文字を第３候補、後ろの文字は第２候補２０回目、前の文字を第２候補、後ろの文字は第３候補・・・ｎ７における訂正処理行われると、ｎ３以降の処理を繰
り返す。すなわち、文字列から文字を誤認識している箇
所が無くなるまで、ｎ３〜ｎ７の処理が繰り返し実行さ
れる。Characters at portions determined to be misrecognized are replaced according to the following rules.
(1) Where no other character determined to be misrecognized is adjacent before or after, the character is replaced in order of character-candidate priority (second candidate, third candidate, and so on).
(2) Where two misrecognized characters are adjacent, only one character is replaced and the other is kept as its first candidate. Combinations with a smaller sum of candidate priority ranks are tried first. When the sums are equal, the combination in which the rear character is the first candidate is tried first. Once all combinations in which one of the two characters is the first candidate have been tried, the second candidate is treated as the first candidate and the same procedure is repeated.
Characters are replaced in this way because, as noted above, two consecutive characters are almost never misrecognized. The order in which characters are replaced in this correction process is as follows.
1st: front character is the 2nd candidate, rear character is the 1st candidate
2nd: front character is the 1st candidate, rear character is the 2nd candidate
3rd: front character is the 3rd candidate, rear character is the 1st candidate
4th: front character is the 1st candidate, rear character is the 3rd candidate
5th: front character is the 4th candidate, rear character is the 1st candidate
...
17th: front character is the 10th candidate, rear character is the 1st candidate
18th: front character is the 1st candidate, rear character is the 10th candidate
19th: front character is the 3rd candidate, rear character is the 2nd candidate
20th: front character is the 2nd candidate, rear character is the 3rd candidate
...
After the correction process at n7 is performed, the processing from n3 onward is repeated. That is, n3 to n7 are executed repeatedly until no misrecognized portion remains in the character string.
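The trial order listed above for two adjacent misrecognized characters can be reproduced by a short generator. The following Python sketch is an illustration only: the function name `replacement_order` and the parameter `num_candidates` are hypothetical, and the continuation beyond the 20th attempt is extrapolated from the listed pattern rather than stated in the source.

```python
def replacement_order(num_candidates):
    """Yield (front_rank, rear_rank) pairs in trial order; ranks start at 1.

    `base` is the candidate currently treated as the "first candidate".
    For each base, one character stays at `base` while the other steps
    through the lower-priority candidates; the front character is
    replaced before the rear character at each step.
    """
    for base in range(1, num_candidates):
        for k in range(base + 1, num_candidates + 1):
            yield (k, base)  # front replaced, rear kept at the base candidate
            yield (base, k)  # rear replaced, front kept at the base candidate


# With 10 candidates per character, the first attempts match the listed order.
order = list(replacement_order(10))
print(order[:5])     # 1st to 5th attempts
print(order[16:20])  # 17th to 20th attempts
```

With ten candidates per character, the first eighteen pairs keep one character at its first candidate, and the nineteenth pair is the first in which the second candidate is treated as the first candidate.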

【００５２】上記した「この隼金支給の間題について
も、当黙でしよ。」と言う文字列データは、図３に示す
文字ラティスを用いて図８に示すように訂正が行われ
る。図８からも明らかなように、この例では訂正処理が
５回行われたときに、誤認識文字の訂正が完了したと判
定されている（ｎ６で誤認識箇所が無いと判定され
る。）。The character string data "この隼金支給の間題についても、当黙でしよ。" described above is corrected as shown in FIG. 8 using the character lattice shown in FIG. 3. As FIG. 8 shows, in this example the correction of the misrecognized characters is judged complete after the correction process has been performed five times (n6 determines that no misrecognized portion remains).

【００５３】ｎ８では、この訂正処理が行われた文字列
を表示部１２に表示する。表示部１２における表示例を
図９に示す。訂正処理において置換された文字にはアン
ダラインが付されている。このアンダラインによって、
訂正された文字であるかどうかを示している。したがっ
て、操作者は訂正されて箇所が簡単に判断できるので、
正しく訂正されているかどうかを確認する作業を簡単に
行うことができる。At n8, the corrected character string is displayed on the display unit 12. FIG. 9 shows a display example on the display unit 12. Characters replaced in the correction process are underlined, and the underline indicates that a character has been corrected. The operator can therefore easily identify the corrected portions and easily check whether the corrections are right.

【００５４】なお、本実施の形態では、表示するときに
訂正した文字にはアンダラインを付けるとしたが、訂正
した文字のみ反転表示する等してそれ以外の文字（訂正
されていない文字）との表示形式を変えるようにしても
よい。また、本実施の形態ではパターンマッチングにお
いて、複数の文字候補を検出するとしたが、文字毎に形
状が類似する文字を記憶した類似辞書を設けておき、こ
の類似辞書から置換する文字を抽出するようにしてもよ
い。このようにすることで、文字候補記憶部５や、複数
の文字候補を検出する処理を不要にすることもできる。In this embodiment, corrected characters are underlined when displayed. Alternatively, only the corrected characters may be shown in reverse video, or otherwise displayed in a format different from that of the other (uncorrected) characters. Also, although this embodiment detects a plurality of character candidates during pattern matching, a similarity dictionary storing, for each character, characters of similar shape may be provided instead, and the replacement characters may be extracted from this dictionary. This makes the character candidate storage unit 5 and the process of detecting a plurality of character candidates unnecessary.
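The similarity-dictionary variant can be pictured with a small lookup sketch. Everything named here is hypothetical: the `SIMILAR` table, its entries (chosen from the character pairs that differ between the example strings of FIG. 3 and FIG. 5), the flagged positions, and the function `replace_flagged`. The sketch shows only the dictionary lookup; in the device, morphological analysis and misrecognition detection would be run again after each replacement.

```python
# Hypothetical similarity dictionary: character -> similar-shaped characters.
SIMILAR = {"隼": ["年"], "間": ["問"], "黙": ["然"], "よ": ["ょ"]}


def replace_flagged(text, flagged_positions, similar=SIMILAR):
    """Replace each flagged character with its first similar-shaped entry.

    Characters with no dictionary entry are left unchanged.
    """
    chars = list(text)
    for pos in flagged_positions:
        candidates = similar.get(chars[pos], [])
        if candidates:
            chars[pos] = candidates[0]
    return "".join(chars)


# Flagged positions are assumed for illustration (character offsets in the string).
print(replace_flagged("この隼金支給の間題についても、当黙でしよ。", [2, 3, 7, 16, 19]))
```

Restricting replacement to flagged positions matters: replacing every character that has a dictionary entry would also alter correctly recognized characters elsewhere in the text.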

【００５５】[0055]

【発明の効果】以上のように、この発明によれば、文字
列を画像データとして取り込み、文字毎にパターンマッ
チング等によって認識した文字の誤認識を確実に検出す
ることができる。As described above, according to the present invention, a character string can be captured as image data and erroneous recognition of a character recognized by pattern matching for each character can be reliably detected.

【００５６】また、誤認識を検出した文字とそれ以外の
文字を異なる表示形式で表示しているので、作業者は誤
認識されている箇所を簡単に知ることができ、訂正作業
が簡単に行える。Further, since characters detected as misrecognized and the other characters are displayed in different display formats, the operator can easily see which portions were misrecognized and can easily perform the correction work.

【００５７】また、この発明の誤認識文字訂正装置によ
れば、誤認識された文字が自動的に訂正されるので、訂
正作業を不要にすることができる。According to the erroneously recognized character correction device of the present invention, the erroneously recognized character is automatically corrected, so that the correction work can be eliminated.

【００５８】さらに、訂正した文字列とそれ以外の文字
列とを異なる表示形式で表示しているので、誤認識され
た文字の訂正が正しく行われているかどうかを簡単に確
認することができる。Further, since the corrected character string and the other character string are displayed in different display formats, it is possible to easily confirm whether or not the erroneously recognized character is corrected correctly.

【図面の簡単な説明】[Brief description of the drawings]

【図１】この発明の実施の形態である誤認識文字訂正装
置の構成を示すブロック図である。FIG. 1 is a block diagram showing a configuration of an erroneously recognized character correction device according to an embodiment of the present invention.

【図２】同実施の形態である誤認識文字訂正装置の処理
を示すフローチャートである。FIG. 2 is a flowchart showing a process of the erroneously recognized character correction device according to the same embodiment.

【図３】「この年金支給の問題についても、当然でし
ょ。」と言う文字列を画像データとして取り込んだとき
に検出された文字候補を示す図である。FIG. 3 is a diagram showing the character candidates detected when the character string "この年金支給の問題についても、当然でしょ。" is captured as image data.

【図４】形態素解析を行う形態素解析システムの構成を
示す図である。FIG. 4 is a diagram showing a configuration of a morphological analysis system that performs morphological analysis.

【図５】「この隼金支給の間題についても、当黙でし
ょ。」という文字列を形態素解析によって文節切りした
結果を示す図である。FIG. 5 is a diagram showing the result of segmenting the character string "この隼金支給の間題についても、当黙でしょ。" into bunsetsu by morphological analysis.

【図６】誤認識文字を検出する処理を示すフローチャー
トである。FIG. 6 is a flowchart showing a process of detecting a misrecognized character.

【図７】誤認識箇所検出処理によって誤認識と検出され
た文字列を示す図である。FIG. 7 is a diagram showing a character string detected as erroneous recognition by the erroneously recognized portion detection processing.

【図８】検出された誤認識文字の訂正の経過を示す図で
ある。FIG. 8 is a diagram showing a process of correcting detected misrecognized characters.

【図９】表示部における文字列データの表示例を示す図
である。FIG. 9 is a diagram showing a display example of character string data on a display unit.

【符号の説明】[Explanation of symbols]

１−誤認識文字訂正装置１ａ−誤認識文字検出装置２−入力部３−画像データ記憶部４−類似辞書５−文字認識部６−文字候補記憶部７−文字列データ作成部８−単語辞書９−文法辞書１０−形態素解析部１１−判定部１２−訂正部１３−表示部
1: misrecognized character correction device
1a: misrecognized character detection device
2: input unit
3: image data storage unit
4: similarity dictionary
5: character recognition unit
6: character candidate storage unit
7: character string data creation unit
8: word dictionary
9: grammar dictionary
10: morphological analysis unit
11: determination unit
12: correction unit
13: display unit

Claims

【特許請求の範囲】[Claims]

【請求項１】文字列である単語およびその単語の属性
を示すデータを登録した辞書ファイルと、前記辞書ファ
イルを用いて入力された文字列に対して文節切りを含む
形態素解析を行う形態素解析手段と、前記形態素解析手
段によって文節切りされた文字列において句読点等の区
切り記号以外の１文字からなる文節が連接している箇所
を文字が誤認識されている文字列であるとして検出する
誤認識検出手段と、を備えたことを特徴とする誤認識文
字検出装置。1. A misrecognized character detection device comprising: a dictionary file in which words, each being a character string, and data indicating attributes of the words are registered; morphological analysis means for performing, using the dictionary file, morphological analysis including bunsetsu segmentation on an input character string; and misrecognition detection means for detecting, in the character string segmented by the morphological analysis means, a portion where bunsetsu each consisting of one character other than a delimiter such as a punctuation mark are adjacent, as a character string in which a character has been misrecognized.

【請求項２】文字列である単語およびその単語の属性
を示すデータを登録した辞書ファイルと、前記辞書ファ
イルを用いて入力された文字列に対して文節切りを含む
形態素解析を行う形態素解析手段と、前記形態素解析手
段によって文節切りされた文字列において１文字の漢字
からなる文節を文字が誤認識されている文字列であると
して検出する誤認識検出手段と、を備えたことを特徴と
する誤認識文字検出装置。2. A misrecognized character detection device comprising: a dictionary file in which words, each being a character string, and data indicating attributes of the words are registered; morphological analysis means for performing, using the dictionary file, morphological analysis including bunsetsu segmentation on an input character string; and misrecognition detection means for detecting, in the character string segmented by the morphological analysis means, a bunsetsu consisting of one kanji character as a character string in which a character has been misrecognized.

【請求項３】文字列である単語およびその単語の属性
を示すデータを登録した辞書ファイルと、前記辞書ファ
イルを用いて入力された文字列に対して文節切りを含む
形態素解析を行う形態素解析手段と、前記形態素解析手
段によって文節切りされた文字列において１文字の漢字
からなる文節と漢字１文字の自立語を含む文節とが連接
している箇所を文字が誤認識されている文字列であると
して検出する誤認識検出手段と、を備えたことを特徴と
する誤認識文字検出装置。3. A misrecognized character detection device comprising: a dictionary file in which words, each being a character string, and data indicating attributes of the words are registered; morphological analysis means for performing, using the dictionary file, morphological analysis including bunsetsu segmentation on an input character string; and misrecognition detection means for detecting, in the character string segmented by the morphological analysis means, a portion where a bunsetsu consisting of one kanji character is adjacent to a bunsetsu containing an independent word of one kanji character, as a character string in which a character has been misrecognized.

【請求項４】入力された文字列を表示する表示手段を
備え、前記表示手段は、前記誤認識検出手段で誤認識であるこ
とを検出した文字の文字列とそれ以外の文字列とを異な
る表示形式で表示する手段であることを特徴とする請求
項１、２、または、３のいずれかに記載の誤認識文字検
出装置。4. The misrecognized character detection device according to any one of claims 1, 2, and 3, further comprising display means for displaying the input character string, wherein the display means displays the character string of characters detected as misrecognized by the misrecognition detection means and the other character strings in different display formats.

【請求項５】書面等に記載された文章を画像データと
して取り込み、該画像データに対してパターンマッチン
グ等の画像処理を行って該文章を構成する文字毎に文字
を認識する文字認識手段と、認識した文字からなる文字
列を入力する入力手段を備えたことを特徴とする請求項
１〜３、または、４のいずれかに記載の誤認識文字検出
装置。5. The misrecognized character detection device according to any one of claims 1 to 4, further comprising: character recognition means for capturing text written on a document or the like as image data and recognizing each character of the text by performing image processing such as pattern matching on the image data; and input means for inputting a character string composed of the recognized characters.

【請求項６】請求項５に記載の誤認識文字検出装置に
おいて、前記文字認識手段には、文字毎に認識した文字以外にも
複数の文字を文字候補として検出する手段を含み、前記誤認識検出手段によって文字の誤認識されている文
字列を検出したときに、該文字列の文字を前記文字候補
として検出されている文字で置換する置換手段を備え、前記置換手段での置換後、再び前記形態素解析および誤
認識検出を実行することを特徴とする誤認識文字訂正装
置。6. A misrecognized character correction device based on the misrecognized character detection device according to claim 5, wherein the character recognition means includes means for detecting, for each character, a plurality of characters as character candidates in addition to the recognized character; the device comprises replacement means for replacing, when the misrecognition detection means detects a character string in which a character has been misrecognized, a character of that character string with a character detected as one of the character candidates; and the morphological analysis and the misrecognition detection are executed again after the replacement by the replacement means.

【請求項７】文字候補として検出された文字には、優
先順位が付され、前記置換手段には、優先順位に基づいて置換する文字を
文字候補から抽出する手段を含むことを特徴とする請求
項６記載の誤認識文字訂正装置。7. The misrecognized character correction device according to claim 6, wherein the characters detected as character candidates are assigned priorities, and the replacement means includes means for extracting the replacement character from the character candidates on the basis of the priorities.

【請求項８】請求項５に記載の誤認識文字検出装置に
おいて、文字毎に形状が類似する文字を登録した類似辞書を備
え、前記誤認識検出手段によって文字の誤認識している文字
列を検出したときに、該文字列の文字の形状が類似する
文字を類似辞書から検出して、前記誤認識している文字
をこの検出した類似の文字で置換する置換手段を備え、前記置換手段での置換後、再び前記形態素解析および誤
認識検出を実行することを特徴とする誤認識文字訂正装
置。8. A misrecognized character correction device based on the misrecognized character detection device according to claim 5, comprising: a similarity dictionary in which, for each character, characters of similar shape are registered; and replacement means for, when the misrecognition detection means detects a character string in which a character has been misrecognized, finding in the similarity dictionary a character whose shape is similar to a character of that character string and replacing the misrecognized character with the similar character thus found, wherein the morphological analysis and the misrecognition detection are executed again after the replacement by the replacement means.

【請求項９】前記置換手段によって置換した文字の文
字列と、それ以外の文字列とを異なる表示形式で表示す
る手段を備えたことを特徴とする請求項６、７、また
は、８のいずれかに記載の誤認識文字訂正装置。9. The misrecognized character correction device according to any one of claims 6, 7, and 8, further comprising means for displaying the character string of characters replaced by the replacement means and the other character strings in different display formats.

【請求項１０】文字列である単語およびその単語の属
性を示すデータを登録した辞書ファイルを用いて入力さ
れた文字列に対して文節切りを含む形態素解析を行い、
文節切りされた文字列において１文字からなる文節が連
接している箇所を文字が誤認識されている文字列である
として検出することを特徴とする誤認識文字検出方法。10. A misrecognized character detection method comprising: performing morphological analysis including bunsetsu segmentation on an input character string using a dictionary file in which words, each being a character string, and data indicating attributes of the words are registered; and detecting, in the segmented character string, a portion where bunsetsu each consisting of one character are adjacent, as a character string in which a character has been misrecognized.

【請求項１１】文字列である単語およびその単語の属
性を示すデータを登録した辞書ファイルを用いて入力さ
れた文字列に対して文節切りを含む形態素解析を行い、
文節切りされた文字列において１文字の漢字からなる文
節を文字が誤認識されている文字列であるとして検出す
ることを特徴とする誤認識文字検出方法。11. A misrecognized character detection method comprising: performing morphological analysis including bunsetsu segmentation on an input character string using a dictionary file in which words, each being a character string, and data indicating attributes of the words are registered; and detecting, in the segmented character string, a bunsetsu consisting of one kanji character as a character string in which a character has been misrecognized.

【請求項１２】文字列である単語およびその単語の属
性を示すデータを登録した辞書ファイルを用いて入力さ
れた文字列に対して文節切りを含む形態素解析を行い、
文節切りされた文字列において１文字の漢字からなる文
節と漢字１文字の自立語を含む文節とが連接している箇
所を文字が誤認識されている文字列であるとして検出す
ることを特徴とする誤認識文字検出方法。12. A misrecognized character detection method comprising: performing morphological analysis including bunsetsu segmentation on an input character string using a dictionary file in which words, each being a character string, and data indicating attributes of the words are registered; and detecting, in the segmented character string, a portion where a bunsetsu consisting of one kanji character is adjacent to a bunsetsu containing an independent word of one kanji character, as a character string in which a character has been misrecognized.

【請求項１３】入力された文字列に対して、誤認識で
あることを検出した文字の文字列とそれ以外の文字列と
を異なる表示形式で表示することを特徴とする請求項１
０、１１、または、１２のいずれかに記載の誤認識文字
検出方法。13. The misrecognized character detection method according to any one of claims 10, 11, and 12, wherein, for the input character string, the character string of characters detected as misrecognized and the other character strings are displayed in different display formats.

【請求項１４】書面等に記載された文章を画像データ
として取り込み、パターンマッチング等の画像処理を行
って、該文章を構成する文字毎に文字を認識し、認識し
た文字からなる文字列を入力することを特徴とする請求
項１０〜１２、または、１３のいずれかに記載の誤認識
文字検出方法。14. The misrecognized character detection method according to any one of claims 10 to 13, wherein text written on a document or the like is captured as image data, each character of the text is recognized by image processing such as pattern matching, and a character string composed of the recognized characters is input.

【請求項１５】請求項１４に記載の誤認識文字検出方
法において、文字毎に認識した文字以外にも複数の文字を文字候補と
して検出しておき、文字が誤認識されている文字列を検出したときには、該
文字列の文字を前記文字候補として検出されている文字
で置換し、この置換後に再度形態素解析を行って、誤認
識している文字の有無を検出することを特徴とする誤認
識文字訂正方法。15. A misrecognized character correction method based on the misrecognized character detection method according to claim 14, wherein a plurality of characters other than the character recognized for each character are detected in advance as character candidates; when a character string in which a character has been misrecognized is detected, a character of that character string is replaced with a character detected as one of the character candidates; and after the replacement, morphological analysis is performed again to detect whether any misrecognized character remains.

【請求項１６】文字候補として検出されている文字に
優先順位を付し、この優先順位に基づいて、文字候補か
ら置換する文字を抽出することを特徴とする請求項１５
記載の誤認識文字訂正装置。16. The misrecognized character correction device according to claim 15, wherein the characters detected as character candidates are assigned priorities, and the replacement character is extracted from the character candidates on the basis of the priorities.

【請求項１７】請求項１４に記載の誤認識文字検出方
法において、文字が誤認識されている文字列を検出したときに、該文
字列の文字の形状が類似する文字を文字毎に形状が類似
する文字が登録された類似辞書から検出し、前記誤認識
している文字列の文字をこの検出した類似の文字で置換
し、この置換した後に形態素解析を行って、誤認識して
いる文字の有無を検出することを特徴とする誤認識文字
訂正方法。17. A misrecognized character correction method based on the misrecognized character detection method according to claim 14, wherein, when a character string in which a character has been misrecognized is detected, a character whose shape is similar to a character of that character string is found in a similarity dictionary in which, for each character, characters of similar shape are registered; the character of the misrecognized character string is replaced with the similar character thus found; and after the replacement, morphological analysis is performed to detect whether any misrecognized character remains.

【請求項１８】置換した文字の文字列と、それ以外の
文字列とを異なる表示形式で表示することを特徴とする
請求項１５、１６、または、１７のいずれかに記載の誤
認識文字訂正方法。18. The misrecognized character correction method according to any one of claims 15, 16, and 17, wherein the character string of replaced characters and the other character strings are displayed in different display formats.