JP2003178261A

JP2003178261A - Character recognizing device and program

Info

Publication number: JP2003178261A
Application number: JP2001376526A
Authority: JP
Inventors: Koichi Inoue; 浩一井上
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-12-10
Filing date: 2001-12-10
Publication date: 2003-06-27

Abstract

PROBLEM TO BE SOLVED: To provide a character recognition device and a character recognition program whose recognition rate does not fall even when a document contains words that are not registered in the word dictionary used for correcting recognition results.

SOLUTION: The character recognition device 1 has a scanner 2 for capturing document image data; an external storage device 3 that stores the captured document image data, recognition results, various dictionaries, and the programs that control character recognition and the network connection; and a network interface 4, and is connected to a network 5. A search server 6 that provides a search service is connected to the network 5. The search server 6 accepts search requests for unknown words from the character recognition device 1 and returns the search results by a prescribed protocol (such as HTTP).

COPYRIGHT: (C)2003, JPO

Description

[Detailed Description of the Invention]

[0001]

[Technical Field of the Invention] The present invention relates to an optical character reading device and program that extract character regions from a document image, perform character recognition, and output the character codes corresponding to the character information in the image.

[0002]

[Prior Art] Character recognition is often followed by post-processing that uses a word dictionary. A commonly used method prepares a word dictionary, determines whether a candidate character string formed from the recognized characters matches a word in the dictionary, and, when it does, gives that word priority so that it appears as the first candidate. For example, Japanese Unexamined Patent Publication No. 9-44606 proposes a method in which words expected to appear in the documents subject to character recognition or speech recognition are stored in a dictionary and, during the post-processing described above, the words that appear and the occurrence frequency information of their related words are consulted to judge whether the recognition result is correct. With this method, registering many words in the dictionary in advance reduces the number of unknown words in language processing and steers the recognition result in a more plausible direction.

[0003]

[Problems to Be Solved by the Invention] However, new words, especially technical terms, keep increasing, and they are always treated as unknown words. In some cases, competition with other registered words causes a result to be corrected in the wrong direction even though the recognition itself was correct. The language dictionary alone could be updated online at regular intervals to keep up with these ever-growing unknown words, but the dictionary provider would then have to keep finding new words, extracting from them the ones likely to be useful as recognition targets, incorporating them into the dictionary, and releasing it. The present invention was made to solve these problems, and its object is to provide a character recognition device and a character recognition program whose recognition rate does not fall even when there are words that are not registered in the dictionary.

[0004]

[Means for Solving the Problems] To solve the above problems, the invention of claim 1 is a character recognition device that performs language processing on a recognition result using a word dictionary and corrects the recognition result. Its principal feature is that, for a word judged by the language processing to be an unknown word absent from the word dictionary, the device accesses an external database over a network and, when a predetermined number or more of matching words are obtained, regards the word as a word in the word dictionary and uses it for the correction. The invention of claim 2 is characterized in that, in claim 1, an Internet search site is used as the external database and a search request is sent with the unknown word as the search term. The invention of claim 3 is characterized in that, in claim 2, when the number of words matched by the search is a predetermined number or more, the search term is regarded as a widely used word and is used for correcting the recognition result. The invention of claim 4 is characterized in that, in claim 2, when the number of words matched by the search is a predetermined number or more, the search term is registered in the word dictionary. The invention of claim 5 is characterized in that, in claim 2, the number of distinct data items obtained from the links in the search results is counted and, if it is a predetermined number or more, the search term is regarded as a widely used word and is used for correcting the recognition result. The invention of claim 6 is characterized in that, in claim 2, the number of distinct data items obtained from the links in the search results is counted and, if it is a predetermined number or more, the search term is registered in the word dictionary. The invention of claim 7 is characterized in that, in claim 5, when the search term is judged to be a widely used word, the occurrence frequency information is set high for words that exist in the linked documents containing the search term and that are in the word dictionary. Finally, the invention of claim 8 is a character recognition program that performs language processing on a recognition result using a word dictionary and corrects the recognition result. It comprises an unknown word search and judgment module that, for a word judged by the language processing to be an unknown word absent from the word dictionary, accesses an external database over a network and, when a predetermined number of matching words are obtained, regards the word as a word in the word dictionary, and a language processing module that performs the language processing using the words so regarded.

[0005]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a system configuration diagram showing one embodiment of the present invention. As shown in the figure, the character recognition device 1 of the present invention has a scanner 2 that captures document image data; an external storage device 3 that stores the captured document image data, recognition results, various dictionaries, and the programs that control character recognition and the network connection; and a network interface 4, and is connected to a network 5. The character recognition device 1 may be built from a personal computer with an ordinary microprocessor or from dedicated hardware. A search server 6 that provides a search service is connected to the network 5. The search server 6 is what is commonly called a search site: it explores the Internet by the usual methods, extracts keywords from the documents it obtains, and builds a database (not shown). The character recognition device 1 can search this database by accessing a specific URL, and the result is returned as a response in a prescribed protocol (such as HTTP). Other terminal devices and a personal computer (PC) 7 are also connected to the network 5, and recognition results and the like can be sent to them. The network 5 may be a wide-area communication network such as the Internet or a local area network such as Ethernet (registered trademark). The search server 6, however, must be connected to the Internet.

[0006] FIG. 2 is a block diagram showing an example of the character recognition device 1 of the present invention. The operation of the present invention is described in detail with reference to FIG. 2. In the figure, the image input unit 11 either captures document image data from a paper original using the scanner 2 or receives it through a network interface program by a protocol such as HTTP or SMTP, and places it in the external storage device 3. The network interface program may be a well-known program of the kind called a web browser. The document image data placed in the external storage device 3 is transferred to a main memory (not shown) in the character recognition device 1.

The preprocessing unit 12, prior to character recognition, extracts the document region and binarizes it, then divides the extracted document region into line regions and then into character regions. The division usually works by computing the projection along each scan line parallel to the text line direction and splitting the region at scan lines where the number of pixels in the character color (normally black after binarization) is zero or small. Within each line region, the projection perpendicular to the line direction is likewise computed, the region is split where the projection is small, and the pieces are then merged back into regions that each form one character according to conditions such as character size. Document region extraction and binarization may be performed in either order; for some kinds of originals, photographs, color graphics, and the like may be excluded more efficiently before binarization.

The character recognition unit 13 performs single-character recognition on each character region. For example, using a character pattern dictionary 14 in which feature values have been computed in advance by a prescribed method for every character to be recognized, it computes feature values from the character region in the same way and selects as recognition candidates the characters whose distance to them is small. This process yields, for each character region, one or more recognition candidate characters and a distance value for each.

Next, the language processing unit 15 refers to the word dictionary 16 and applies linguistic post-processing to this sequence of recognition results. The details are described later. In outline, it builds hypotheses for the word sequence from the beginning to the end of the line, assigns a cost to each hypothesis, infers the correct word sequence by choosing the hypothesis with the smallest (or largest) cost, and reorders the candidates for each recognized character to match the most plausible word sequence.
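The projection-based division performed by the preprocessing unit 12 can be sketched as follows. This is a minimal illustration, not the patented implementation: the image is assumed to be a binarized list of pixel rows with 1 for the character color, text lines are assumed to run horizontally, and the function names and the `max_ink` threshold are hypothetical. The step that merges pieces back into single characters by size is omitted.

```python
def split_by_projection(profile, max_ink=0):
    """Return (start, end) index pairs of runs whose projection exceeds max_ink."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > max_ink and start is None:
            start = i                      # a run of inked scan lines begins
        elif count <= max_ink and start is not None:
            runs.append((start, i))        # the run ends at a (nearly) blank line
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(image):
    """Split a binarized page into text-line regions using row projections."""
    row_profile = [sum(row) for row in image]
    return split_by_projection(row_profile)


def segment_columns(image, top, bottom):
    """Split one text-line region (rows top..bottom-1) using column projections."""
    width = len(image[0])
    col_profile = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return split_by_projection(col_profile)


page = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
lines = segment_lines(page)             # two line regions: rows 1-2 and row 4
pieces = segment_columns(page, 1, 3)    # two pieces in the first line region
```

Raising `max_ink` above zero corresponds to splitting at scan lines with "a small number" of character pixels, which tolerates noise at the price of occasionally cutting through thin strokes.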

[0007] The unknown word search and judgment unit 17 queries the search server 6 using, as the search key, a character string that the language processing unit 15 has judged to be an unknown word. The search could be run successively on every character string whose part of speech is still undetermined while hypotheses are being generated, but the search cost would be enormous. Instead, once the hypothesis to be adopted has been fixed, the substrings that could not be matched against the word dictionary 16 and remain unknown words are taken as the search targets. The unknown word search and judgment unit 17 sends the target character string as a query, for example as an HTTP GET request to a URL registered in advance. In response, the search server 6 returns a list of the network resources that were hit with that word as the key. From the number of entries in this list and their contents, the method described later determines whether the search term is widely used and is therefore a word that really exists. If this judgment succeeds, the word is regarded as a known word and is handled in subsequent processing in the same way as a word registered in the dictionary.

The documents listed in the search results returned for the unknown word may also be fetched for analysis. In that case, a fixed range of text around the place where the unknown word occurs in the fetched document is morphologically analyzed using the word dictionary 16. If the analysis result contains a word that has been given extremely low frequency information in the word dictionary 16, an entry keyed by that word with its frequency information set high is added to, or overwrites the existing entry in, the cache or the word dictionary 16.

The recognition result after post-processing, including that of the language processing unit 15, is output to the external storage device 3 either as a data structure that contains the group of recognition candidates for each character (with a table or value indicating their likelihood) or as document data obtained as the sequence of first candidates. The output unit 18 outputs this result to a display device (not shown) or transfers it to any destination through the network interface program.
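As a rough illustration of the query step, the sketch below builds the URL for an HTTP GET request that carries the unknown word as the search key. The endpoint `http://search.example/find` and the parameter names `q` and `num` are assumptions made for this example; the patent only says that a URL is registered in advance and that the search service prescribes the request format. No request is actually sent here.

```python
from urllib.parse import urlencode

# Hypothetical pre-registered search URL; a real service defines its own.
SEARCH_ENDPOINT = "http://search.example/find"


def build_query_url(unknown_word, max_results=100, endpoint=SEARCH_ENDPOINT):
    """Return the GET URL that asks the search service about one unknown word."""
    # urlencode percent-encodes non-ASCII text as UTF-8 by default.
    params = {"q": unknown_word, "num": max_results}
    return endpoint + "?" + urlencode(params)


url = build_query_url("OCR test", max_results=10)
# url == "http://search.example/find?q=OCR+test&num=10"
```

Whether the service expects UTF-8 or another encoding such as Shift_JIS for Japanese keywords depends on the service, so the `encoding` argument of `urlencode` may need to be set accordingly.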

[0008] Next, the processing performed by the language processing unit 15 is described in detail. The language processing unit 15 uses the word dictionary 16, which carries part-of-speech and frequency information, and a table of connection costs between parts of speech. The character recognition unit 13 outputs, for each character, multiple recognition results together with their recognition distance values. The language processing unit 15 builds word sequence hypotheses from left to right across the processing unit by the following procedure and changes the order of the recognition candidates to the best one accordingly.
1. Starting from the left of the processing unit, for each character candidate region, it forms word candidates that begin with any character in the candidate group at that position and continue rightward with any character from the candidate group at each following position, and it stores those that are in the word dictionary 16.
2. Each word is assigned a word cost derived from the ranks of the characters that make it up and from the occurrence frequency of the word (stored in the dictionary).
3. If, in some range, none of the words that can be formed by joining characters from the candidate groups in that range is in the word dictionary 16, a character string whose constituent characters' ranks satisfy a given criterion is added to the word sequence as an unknown word.
4. For the parts that became unknown words above, a character string judged valid by the unknown word search and judgment unit 17 (described later) is given a high word cost, and the others are given the default cost for unknown words. In addition, unknown words judged valid are added to the word cache and are handled in later recognition in the same way as words in the word dictionary 16.
5. At the end position of each word, word sequence hypotheses are generated by connecting the words from the start of the processing unit to that position.
6. Among the word sequence hypotheses, the one whose word cost and part-of-speech connection cost are largest (or smallest) is taken as the correct answer. Once the correct answer is fixed, the ranking of the candidate group at each position is changed so that the answer forms the sequence of first candidates.
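The six steps above can be sketched as a small dynamic program over the lattice of character candidates. This is a simplified illustration under stated assumptions, not the patent's scoring: scores are maximized (the "largest" convention), a word's score is the logarithm of an assumed relative frequency minus the sum of its characters' ranks, part-of-speech connection costs are left out, and an unknown word is allowed for any span made only of first-ranked characters. The function name, the score constants, and `max_len` are hypothetical.

```python
import math
from itertools import product


def best_word_sequence(candidates, dictionary, validated=frozenset(),
                       max_len=4, unknown_score=-5.0, validated_score=-1.0):
    """Pick the highest-scoring word sequence covering all character positions.

    candidates[i] lists the recognition candidates at position i, best first.
    dictionary maps each known word to a relative frequency in (0, 1].
    validated holds unknown words already judged to be widely used.
    """
    n = len(candidates)
    best = [(-math.inf, [])] * (n + 1)   # best[i]: (score, words) ending at i
    best[0] = (0.0, [])
    for start in range(n):
        base_score, base_words = best[start]
        if base_score == -math.inf:
            continue
        for end in range(start + 1, min(n, start + max_len) + 1):
            for chars in product(*candidates[start:end]):
                word = "".join(chars)
                # Sum of candidate ranks (0 = first candidate) of the characters.
                rank_penalty = sum(candidates[start + k].index(c)
                                   for k, c in enumerate(chars))
                if word in dictionary:
                    score = math.log(dictionary[word]) - rank_penalty
                elif rank_penalty == 0:
                    # Unknown word built only from first-ranked characters.
                    score = validated_score if word in validated else unknown_score
                else:
                    continue
                total = base_score + score
                if total > best[end][0]:
                    best[end] = (total, base_words + [word])
    return best[n][1]


cands = [["c", "e"], ["o", "a"], ["t", "l"]]
plain = best_word_sequence(cands, {"cat": 0.4})
checked = best_word_sequence(cands, {"cat": 0.4}, validated={"cot"})
```

In this toy run the first-ranked reading is "cot". Without validation the dictionary word "cat" wins and a correct recognition would be overwritten, which is the failure the patent describes; once "cot" has been judged valid it outranks the dictionary-driven correction.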

[0009] The processing of the unknown word search and judgment unit 17 is carried out as follows.
1. For an unknown word arising in the language processing unit 15, a query with that character string as the search keyword is sent to the Internet search server 6. The query is sent by whatever method the search service prescribes, for example the GET/POST method of an HTML form. If the number of results to display can be specified, it is set to a fairly large value, about 100.
2. As the search result, the number of hits and a list of results are obtained from the search server 6 in HTML form. Preferably, an arrangement is made to receive the results in a more suitable dedicated format such as XML.
3. In one embodiment of the present invention, the validity of the search term is judged from the number of hits.
a) If the number of hits is zero or small (fewer than 5), the word is either not used in resources on the network or, even if used, is likely to be an erroneous occurrence caused by a typing mistake or the like, so the character string is judged not to be a widely used word.
b) If the number of hits is at or above a predetermined number (about 5; this may be varied according to the search service used), the word is judged to be a widely used word.
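The hit-count judgment in step 3 reduces to a threshold test. The sketch below assumes the hit count has already been extracted from the search response, since the patent does not fix the response format, and uses a plain Python set as a stand-in for the word cache. The function names are hypothetical; the default threshold of 5 follows the "about 5" given in the text.

```python
def judge_by_hit_count(word, hit_count, word_cache, threshold=5):
    """Return True, and remember the word, when it looks widely used."""
    if hit_count >= threshold:
        word_cache.add(word)     # later lookups need no network access
        return True
    return False                 # zero or few hits: likely a typo or misreading


def is_known(word, dictionary, word_cache):
    """Treat cached words exactly like words registered in the dictionary."""
    return word in dictionary or word in word_cache


cache = set()
accepted = judge_by_hit_count("transmon", 120, cache)   # many hits
rejected = judge_by_hit_count("trqnsmon", 2, cache)     # too few hits
```

Keeping the threshold as a parameter reflects the note that the predetermined number may be changed to suit the search service in use.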

[0010] 4. In another embodiment of the present invention, the judgment is made after checking information about the linked documents in the result list. If necessary, the links are accessed, and the items in the result list are grouped by one or more of the following conditions.
・The documents have the same size: they are likely to be copies of the same document. Depending on the remote server, the size can be obtained without fetching the document.
・The URLs are the same: the links point to the same thing and were presumably included in the search results more than once for some reason. (The search server may already avoid this.)
・The text surrounding the word in the documents is the same: the documents are likely to originate from the same document.
If the number of documents classified into different groups is then at or above a given number (about 5; this may be changed according to the search service), the character string is judged to be a widely used word. If it is zero or below that number, it is judged not to be.
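One way to count "documents classified into different groups" is to merge result items that share a value under any of the chosen conditions and then count the merged groups. The sketch below does this with a small union-find structure. The record layout (dictionaries with `url`, `size`, and `context` fields) and the function name are assumptions for illustration; the patent does not prescribe a data format or a grouping algorithm.

```python
def count_distinct_documents(results, keys=("url", "size", "context")):
    """Count groups of results after merging items that agree on any key."""
    parent = list(range(len(results)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for key in keys:
        first_seen = {}
        for i, item in enumerate(results):
            value = item.get(key)
            if value is None:
                continue                      # unknown value: no evidence to merge
            if value in first_seen:
                parent[find(i)] = find(first_seen[value])
            else:
                first_seen[value] = i
    return len({find(i) for i in range(len(results))})


hits = [
    {"url": "http://a.example/1", "size": 100, "context": "x foo y"},
    {"url": "http://a.example/1", "size": 100, "context": "x foo y"},  # same URL
    {"url": "http://b.example/m", "size": 100, "context": "other"},    # same size
    {"url": "http://c.example/2", "size": 250, "context": "p foo q"},
    {"url": "http://d.example/3", "size": 300, "context": "p foo q"},  # same context
    {"url": "http://e.example/4", "size": None, "context": "z"},
]
distinct = count_distinct_documents(hits)
widely_used = distinct >= 5
```

Here six hits collapse into three groups, so the word would not pass a threshold of five even though the raw hit count would. Passing a shorter `keys` tuple corresponds to applying only some of the conditions.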

[0011] As described above, according to claims 1 and 2, in a character recognition device that performs language processing on a recognition result using a word dictionary and corrects the recognition result, a word (character string) judged by the language processing to be an unknown word absent from the dictionary is used as the search key for a query to a search service on the Internet. If the word is widely used, the query should return one or more search results. If no result is returned, the word (character string) has no entry as a search key and is probably not a widely used word.

In claim 3, the word (character string) is judged to be a widely used word if the query to the search service returns at least a given number of hits (for example, 5). Because the number of hits is obtained together with the search results, the judgment can be made with a single access.

In claim 5, the entries in the list of search results returned by the query are classified according to whether the linked data is actually different data or points to a different URL, and the word is judged to be widely used if the resulting number of distinct items is at or above a given number. The classification can group entries whose links point to the same URL, entries whose linked data has the same file name, entries whose linked data has the same size, and so on. This incurs extra processing and network traffic compared with claim 3, but by eliminating factors such as identical link targets and numerous mirror sites it can determine accurately how many documents use the word.

Claims 4 and 6 apply to the cases of claims 3 and 5 respectively: when the word is judged to be widely used, it is recorded in a writable user dictionary or a temporary cache storage area so that it is treated as a known word from then on. This reduces the cost of communication and judgment on later occasions.

Claim 7 applies to the method of claim 5: when a document indicated by the search results is fetched for analysis, the parameter representing occurrence frequency in the dictionary is set high, temporarily or permanently, for the words in the word dictionary that occur in the text around or near the place where the searched unknown word appears. This has the effect of adapting the dictionary to the subject matter of the text contained in the document image being recognized.

Claim 8 is a program for implementing the character recognition device of the present invention on a computer, and it provides the effects described above.
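The frequency adaptation of claim 7 can be illustrated with the sketch below. It is a simplification under stated assumptions: a plain substring test within a fixed character window stands in for the morphological analysis the patent describes, the frequencies are assumed to be relative values, and the function name, `window`, `low`, and `boosted` are all hypothetical. The function returns override entries instead of modifying the dictionary, so the caller can keep them in a temporary cache or write them back permanently.

```python
def boost_context_words(text, unknown_word, frequencies,
                        window=20, low=0.001, boosted=0.01):
    """Return raised frequencies for rare dictionary words near the unknown word."""
    pos = text.find(unknown_word)
    if pos < 0:
        return {}                              # the word does not occur in this text
    start = max(0, pos - window)
    end = min(len(text), pos + len(unknown_word) + window)
    context = text[start:end]                  # fixed range around the occurrence
    overrides = {}
    for word, freq in frequencies.items():
        # Only words that the dictionary rates as very rare are raised.
        if word != unknown_word and word in context and freq < low:
            overrides[word] = boosted
    return overrides


doc = "the new qubit device uses a transmon circuit"
freqs = {"transmon": 0.0001, "device": 0.05, "circuit": 0.0005}
near = boost_context_words(doc, "qubit", freqs, window=25)
wide = boost_context_words(doc, "qubit", freqs, window=40)
```

With the narrower window only "transmon" falls inside the context and is raised; the wider window also reaches "circuit". The already common word "device" is left unchanged in both cases.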

[0012]

[Effects of the Invention] As described above, the present invention provides the following direct or indirect effects.
・Widely used words are placed in the dictionary or the word cache sooner than the dictionary included in the character recognition device or software could be updated, so an improved recognition rate can be expected even for the most recent documents.
・Compared with not using the method of the present invention, the interval between updates of the word dictionary included in the device or software can be lengthened, which reduces the cost of creating and distributing dictionaries.
・An unknown word once judged to be widely used is placed in the user dictionary or the word cache, so no network access cost is incurred the next time a processing unit containing the same word is processed.
・Depending on the quality of the network access environment, the judgment can be made either from the number of search results alone or from further analysis of the search results, which limits any loss of performance.
・By using the resources obtained from a search keyed by the unknown word to update the frequency information of dictionary entries that appear to be related, the dictionary is adapted well to the material being recognized.

[Brief Description of the Drawings]

[FIG. 1] A system configuration diagram showing one embodiment of the present invention.

[FIG. 2] A block diagram showing an example of the character recognition device of the present invention.

[Explanation of Symbols]

1 Character recognition device
2 Scanner
3 External storage device
4 Network interface
5 Network
6 Search server
7 Personal computer
11 Image input unit
12 Preprocessing unit
13 Character recognition unit
14 Pattern dictionary
15 Language processing unit
16 Word dictionary
17 Unknown word search and judgment unit
18 Output unit

Claims

[Claims]

[Claim 1] A character recognition device that performs language processing on a recognition result using a word dictionary and corrects the recognition result, characterized in that, for a word judged by the language processing to be an unknown word absent from the word dictionary, the device accesses an external database over a network and, when a predetermined number or more of matching words are obtained, regards the word as a word in the word dictionary and uses it for the correction.

[Claim 2] The character recognition device according to claim 1, characterized in that an Internet search site is used as the external database and a search request is sent with the unknown word as the search term.

[Claim 3] The character recognition device according to claim 2, characterized in that, when the number of words matched by the search is a predetermined number or more, the search term is regarded as a widely used word and is used for correcting the recognition result.

[Claim 4] The character recognition device according to claim 2, characterized in that, when the number of words matched by the search is a predetermined number or more, the search term is registered in the word dictionary.

[Claim 5] The character recognition device according to claim 2, characterized in that the number of distinct data items obtained from the links in the search results is counted and, if it is a predetermined number or more, the search term is regarded as a widely used word and is used for correcting the recognition result.

[Claim 6] The character recognition device according to claim 2, characterized in that the number of distinct data items obtained from the links in the search results is counted and, if it is a predetermined number or more, the search term is registered in the word dictionary.

[Claim 7] The character recognition device according to claim 5, characterized in that, when the search term is judged to be a widely used word, the occurrence frequency information is set high for words that exist in the linked documents containing the search term and that are in the word dictionary.

[Claim 8] A character recognition program that performs language processing on a recognition result using a word dictionary and corrects the recognition result, characterized by comprising: an unknown word search and judgment module that, for a word judged by the language processing to be an unknown word absent from the word dictionary, accesses an external database over a network and, when a predetermined number of matching words are obtained, regards the word as a word in the word dictionary; and a language processing module that performs the language processing using the words regarded as being in the word dictionary.