JP3798582B2 - Contact character separation device and method, and recording medium - Google Patents

Contact character separation device and method, and recording medium

Info

Publication number
JP3798582B2
JP3798582B2 JP19534999A JP19534999A JP3798582B2 JP 3798582 B2 JP3798582 B2 JP 3798582B2 JP 19534999 A JP19534999 A JP 19534999A JP 19534999 A JP19534999 A JP 19534999A JP 3798582 B2 JP3798582 B2 JP 3798582B2
Authority
JP
Japan
Prior art keywords
character
cutting
point
contact
separation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP19534999A
Other languages
Japanese (ja)
Other versions
JP2001022885A (en)
Inventor
浩司 黒川
克仁 藤本
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to JP19534999A
Publication of JP2001022885A
Application granted
Publication of JP3798582B2
Anticipated expiration
Expired - Fee Related

Landscapes

  • Character Input (AREA)
  • Character Discrimination (AREA)

Description

【0001】
【発明の属する技術分野】
本発明は、文字認識におけるあらゆる接触英字に柔軟に対応し高精度な分離を行うことができる接触文字分離装置及び方法及び記録媒体に関する。
【0002】
日英混在文書では、誤って日本語を切断分離することが多くその結果、英字以外の文字の認識精度低下を伴うことが多いため、日本語の認識精度を維持しつつ、接触英字の分離を行うことが望まれている。
【0003】
【従来の技術】
従来の接触英字の分離方式では、中川らの「英文OCRにおける接触文字切り出しアルゴリズム」と題される電子情報通信学会技術報告書(PRU93−128(1994−01))による、英文中のある長さ以上の文字の上方からの輪郭を抽出するものがあった。
【0004】
図7は従来例の説明図であり、図7(a)は上部輪郭の抽出の説明、図7(b)は極大、極小の抽出の説明である。図7(a)において、集合英文字「separa」の上方の基準からの距離を求め上部の輪郭を抽出する。図7(b)において、前記抽出した上部の輪郭を追跡し、極大、極小の位置を求め、極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補とするものであった。
【0005】
ここでの極大、極小とは、基準からの距離が大きい極を極小、基準からの距離が小さい極を極大とする。
【0006】
図8は従来例のフローチャートである。以下、接触文字分離部が行う処理を図8の処理S21〜処理S26に従って説明する。
【0007】
S21:接触文字分離部は、入力された文字画像の文字の上方からの輪郭を抽出して、処理S22に移る。
【0008】
S22:接触文字分離部は、前記抽出した上部の輪郭を追跡し、極大、極小の位置を求め、処理S23に移る。
【0009】
S23:接触文字分離部は、極小点があるかどうか判断する。この判断で極小点がある場合は処理S24に移り、ない場合はこの処理を終了する。
【0010】
S24:接触文字分離部は、順番に極小点を選択し、処理S25に移る。
【0011】
S25:接触文字分離部は、前記選択した極小点の極小と直前の極大との差がしきい値より大きいかどうか判断する。この判断で極小と直前の極大との差がしきい値より大きい場合は処理S26に移り、小さい場合は処理S23に戻る。
【0012】
S26:接触文字分離部は、前記判断で極小と直前の極大との差がしきい値より大きい場合、その極小点を切断候補に決定し、処理S23に戻る。
【0013】
【発明が解決しようとする課題】
前記従来のものには、次のような課題があった。
【0014】
▲1▼:文字の上方からの輪郭のみの抽出しか行っていないことから、あらゆる英字接触に柔軟に対応することができなかった。
【0015】
▲2▼:また、全文字列処理のため、日英混在文書では、誤って日本語を切断対象とすることが多く、その結果、英字以外の文字の認識精度低下を伴うことが多かった。
【0016】
本発明は、このような従来の課題を解決し、従来技術では困難であった日本語中の接触英字の分離を、日本語に悪影響を与えることなく、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができるようにすることを目的とする。
【0017】
【課題を解決するための手段】
図1は本発明の接触英字分離装置の説明図である。図1中、1は入力部、2は接触文字分離部、3は認識手段、21は切断候補点選定手段、22は切断候補削減手段である。
【0018】
本発明は前記従来の課題を解決するため次のように構成した。
【0019】
(1):文字集合から日本語接触文字の分離を行い分離文字1を得る日本語接触文字分離手段と、得た分離文字1に対して文字認識を行い認識時の距離値によるDPマッチングを行って評価値を得る第一DPマッチング手段と、得た評価値が閾値より小さい前記分離文字1を対象に接触英字分離を行い分離文字2を得る接触英字分離手段と、得た分離文字2に対してDPマッチングを行い分離文字3を得る第二DPマッチング手段と、前記接触英字分離手段は、接触英字分離を行う前記分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手段21と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手段22とを備える。
【0020】
(2):前記(1)の接触文字分離装置において、前記接触英字分離手段は、前記分離文字1の中で一つでも評価値が閾値より小さい場合、分離前の文字集合を接触英字分離の対象とする
【0021】
(3):前記(1)又は(2)の接触文字分離装置において、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除し、次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を次の切断点と決定し、文字の大きさから切断点とはなりえない前記決定した次の切断点付近の切断候補点を削除する切断候補削減手段とを備える。
【0022】
(4):日本語接触文字分離手段で文字集合から日本語接触文字の分離を行い分離文字1を得、第一DPマッチング手段で得た分離文字1に対して文字認識を行い認識時の距離値によるDPマッチングを行って評価値を得、接触英字分離手段で得た評価値が閾値より小さい前記分離文字1を対象に接触英字分離を行い分離文字2を得、第二DPマッチング手段で得た分離文字2に対してDPマッチングを行い分離文字3を得、前記接触英字分離手段で、接触英字分離を行う前記分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該抽出した各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する接触文字分離方法とする。
【0023】
(5):文字集合から日本語接触文字の分離を行い分離文字1を得る日本語接触文字分離手順と、得た分離文字1に対して文字認識を行い認識時の距離値によるDPマッチングを行って評価値を得る第一DPマッチング手順と、得た評価値が閾値より小さい前記分離文字1を対象に接触英字分離を行い分離文字2を得る接触英字分離手順と、得た分離文字2に対してDPマッチングを行い分離文字3を得る第二DPマッチング手順と、接触英字分離を行う前記分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手順と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手順とを実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体とする。
【0024】
(作用)
前記構成に基づく作用を説明する。
【0025】
切断候補点選定手段21で接触英字分離を行う分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、切断候補削減手段22で前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する。このため、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる。
【0026】
また、前記極大、極小の位置を求めるのに、前記輪郭を微分する。このため、極大、極小の位置を容易に求めることができる。
【0027】
さらに、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除し、次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を次の切断点と決定し、文字の大きさから切断点とはなりえない前記次の切断点付近の切断候補点を削除する。このため、不要な切断候補点を容易に削除することができる。
【0028】
また、接触英字分離を行う分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該抽出した各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する接触文字分離方法とする。このため、あらゆる接触英字に柔軟に対応し高精度な分離を行う方法を提供することができる。
【0029】
さらに、接触英字分離を行う分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手順と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手順と、を実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体とする。このため、この記録媒体のプログラムをコンピュータにインストールすることで、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる接触文字分離装置を容易に提供することができる。
【0030】
【発明の実施の形態】
(1):接触英字分離装置の説明
図1は本発明の実施の形態における接触英字分離装置の説明図である。図1において、接触英字分離装置には、入力部1、接触文字分離部2、認識手段3が設けてあり、接触文字分離部2には、切断候補点選定手段21、切断候補削減手段22が設けてある。
【0031】
入力部1は、認識する文字画像を入力するものである。接触文字分離部2は、接触した文字の分離処理を行うものである。認識手段3は、接触文字分離部2で分離した文字の認識を行うものである。切断候補点選定手段21は、文字の上方、下方、内部上方、内部下方から輪郭を追跡して切断候補点を選定するものである。切断候補削減手段22は、切断候補点選定手段21で選定した切断候補点を削減するものである。
【0032】
なお、認識手段3を接触英字分離装置に含めて説明したが、別に設けることもできる。
【0033】
図2は輪郭の抽出方向の説明図である。図2において、接触文字分離部2では、「s」の文字の上方、下方、内部上方、内部下方から輪郭を抽出する。ここで、上方輪郭とは、矢印(a)のように上方の基準からの距離を求め上部の輪郭を抽出するものである。下方輪郭とは、矢印(d)のように下方の基準からの距離を求め下部の輪郭を抽出するものである。内部上方輪郭とは、矢印(b)のように上方輪郭をとる文字線の下部の輪郭を抽出するものである。内部下方輪郭とは、矢印(c)のように下方輪郭をとる文字線の上部の輪郭を抽出するものである。
【0034】
(接触英字分離動作の説明)
先ず、入力部1から入力された文字画像を、接触文字分離部2で日本語接触文字の切り出しを行い、認識手段3で文字認識を行う。この認識結果から得られた評価値を用いて認識精度の悪い文字を取り出し、接触文字分離部2で接触英字の分離法を用いて文字の分離を行う。
【0035】
接触文字分離部2は、切断候補点として、文字の上方、下方、内部上方、内部下方から輪郭を抽出する。そして、切断候補点選定手段21は、その輪郭を追跡し、それぞれの輪郭の極大、極小を求め、極小と直前の極大との差がしきい値より大きい場合に、その極小を切断候補とする。
【0036】
切断候補削減手段22は、この時得られた四つの方向からの切断候補点集合を、縦方向の黒画素数が小さい順に切断点とする。その際、決定した切断点付近(文字の大きさから考えて切断点とはなり得ない範囲)に切断候補点が存在した場合、その点は切断候補から削除する。
【0037】
次に、残った切断候補点のなかで黒画素数が小さいものを求め前記と同様の処理を行う。これを繰り返し切断点を決定する。決定された切断点で切断を行い、文字認識を行い、認識距離値によるDPマッチング(認識結果の一番良い切断点を決める処理)を行い、最適な切断点を求める。
【0038】
接触英字の分離を行った文字の認識結果が、記号等(文字でない)であった場合、漢字を誤切り出ししている可能性が高い。このため、認識評価値を下げてその認識結果の優先度を下げるようにする。
【0039】
このようにすると、従来技術では困難であった日本語中の接触英字の分離を、日本語に悪影響を与えることなく、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる。
【0040】
(2):認識結果を用いた接触英字分離の基本処理の説明
先ず、日本語接触文字の切り出しを行い、ついで文字認識を行い認識結果を得る。この認識結果から得られる評価値を用いて、文字認識の結果が確からしいかどうか判定を行う。例えば、文字認識時の距離値を評価値とし、距離がしきい値以上のものは英字が接触し、漢字などに誤読され距離が大きくなっていると推定し、接触英字の分離法を用いて、その文字の分離を行う対象文字として選定する。
【0041】
なお、評価値は距離値に限定されず、例えば、正読確率等を用いることも可能である。また、選定された文字が日本語接触文字の切り出しで分離された文字であった場合、その文字から分離されたその他の文字の評価値が良かった場合でも、分離前の文字を接触英字分離対象文字とする。
【0042】
図3は認識結果を用いた接触英字分離の基本処理フローチャートである。以下、図3の処理S1〜処理S3に従って説明する。
【0043】
S1:接触文字分離部2は、日本語接触文字の切り出しを行い、認識手段3で文字認識を行い、認識結果を得て、処理S2に移る。
【0044】
S2:接触文字分離部2は、認識結果から得られた評価値を用いて、文字認識の結果が確からしいかどうか判断する。この判断で文字認識の結果が確からしい場合はこの処理を終了し、もし確からしくない(評価値が悪い)場合は処理S3に移る。
【0045】
S3:接触文字分離部2は、英字接触文字分離法を用いて文字の分離を行い、この処理を終了する。
【0046】
(3):接触英字の分離処理の説明
先ず、文字認識時の評価値がしきい値以上のものは、英字が接触し漢字などに誤読されていると推定し、その文字の輪郭を抽出する。輪郭の抽出は、文字の上方、下方、内部上方、内部下方から行う。その輪郭を追跡し、それぞれの輪郭の極大、極小を求める。
【0047】
文字の上方から得られた極大点、極小点について極小点と直前の極大点とを比較し、その差がしきい値より大きい場合、その極小点を切断候補とする。それを全ての極小点に対して行い、文字の上方の輪郭から得られた切断候補点集合とする。なお、切断候補点集合を得る際、輪郭を微分したものから極大、極小点を求め前記処理を行うことも可能である。
【0048】
前記同様の接触英字分離処理を、文字の下方、内部上方、内部下方からも行い、それぞれの輪郭からの切断候補点集合を得る。
【0049】
次に、四つの輪郭から得られた切断候補点集合の中から不要な切断候補点を削除する。これは縦方向の黒画素数が最も小さい切断候補点を先ず切断点とする。そして、この決定した切断点付近に切断候補点が存在した場合、その点は不要な切断候補点と判断し、候補から削除する。
【0050】
次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を求め同様の処理を行う。これを繰り返し切断候補点の削減、切断点の決定を行う。
【0051】
決定された切断点に従い文字を分離し、文字認識、認識距離値によるDPマッチングを行い、最適な切断点を求め、接触英字分離後の認識結果を得る。そして、その認識結果が記号等であった場合、漢字を誤切り出ししている可能性が高いと考え、認識評価値を下げて、その認識結果の優先度を下げる(使わない)ようにする。
【0052】
図4は接触英字の分離処理フローチャートである。以下、図4の処理S11〜処理S15に従って説明する。
【0053】
S11:接触文字分離部2は、認識手段3での文字の認識評価値の悪かった文字を抽出し、処理S12に移る。
【0054】
S12:接触文字分離部2は、文字の上方、下方、内部上方、内部下方のそれぞれからの輪郭を抽出し、その極大、極小点から、切断候補点集合を求め、処理S13に移る。
【0055】
S13:接触文字分離部2は、切断候補点はあるかどうか判断する。この判断で切断候補点がある場合は処理S14に移り、ない場合は切断点の決定を行い処理を終了する。
【0056】
S14:接触文字分離部2は、縦方向の黒画素数が最も小さい切断候補点を切断点とし、処理S15に移る。
【0057】
S15:接触文字分離部2は、切断点付近にある切断候補点を削除し、処理S13に戻る。
【0058】
(4):接触英字の分離法の全体の流れの説明
図5は接触英字の分離法の全体の流れの説明図である。以下、図5の▲1▼〜▲6▼に従って接触英字の分離法の全体の流れを説明する。
【0059】
▲1▼ここで、文字集合(集合文字)A、B、C、D、E、Fを接触英字の分離を行う入力画像とする。
【0060】
▲2▼先ず、文字集合A、B、C、D、E、Fの日本語接触文字分離を行う。この分離で集合文字CはC1 、C2 、C3 、C4 に分離された。
【0061】
▲3▼文字認識を行い、認識時の距離値によるDPマッチングを行う。このDPマッチングでの評価値の良かったものは○、良くなかったもの(斜線のもの)×で示してある。
【0062】
▲4▼認識結果から得られた評価値が悪い文字を取り出し、接触文字分離法の対象とする(斜線の集合文字B、C、D)。ここで日本語接触文字の切り出しで分離された文字で、一つでも評価値の悪い文字があった場合、その文字から分離されたその他の文字の評価値が良かった場合でも、分離前の文字(前記▲1▼の文字集合)を接触英字分離対象文字とする。
【0063】
▲5▼斜線の集合文字B、C、Dの接触英字分離を行う。分離した文字の文字認識を行い、集合文字Bは文字B1 、B2 に分離し、認識結果はそれぞれ「英字」となった。また、集合文字Cは文字C1 、C2 、C3 、C4 、C5 に分離し、認識結果は文字C1 、C2 が「英字」、文字C3 、C4 、C5 が「かな」となった。さらに、集合文字Dは文字D1 、D2 、D3 に分離し認識結果はそれぞれ「記号」となった。なお、文字C3 、C4 、C5 、D1 、D2 、D3 のように認識結果が記号、かな等であった場合、漢字を誤切り出ししている可能性が高いと考え、認識評価値を下げてその認識結果の優先度を下げる。
【0064】
▲6▼文字認識時の距離値によるDPマッチングを行う。このDPマッチング後の結果、文字Aが「漢字」、文字B1 、B2 、C1 が「英字」、文字C2 とC3 を合わせた文字が「英字」、文字C4 とC5 を合わせた文字が「英字」、文字D1 とD2 とD3 を合わせた文字が「漢字」、文字EFが「漢字」として、分離して認識される。
【0065】
(5):切断候補点および候補削減結果の具体例による説明
図6は四つの輪郭から得られる切断候補点および候補削減結果の例の説明図である。図6において、集合文字「UNIX」の上方、下方、内部上方、内部下方から輪郭を追跡し、それぞれの輪郭の極大、極小を求める。そして、文字の上方から得られた極大点、極小点について極小点と直前の極大点とを比較し、その差がしきい値より大きい場合、その極小点を切断候補とする。それを全ての極小点に対して行い、文字の上方の輪郭から得られた切断候補点集合とする。同様の接触英字分離処理を、文字の下方、内部上方、内部下方からも行い、それぞれの輪郭からの切断候補点集合(矢印(a)、(b)、(c))を得る。
【0066】
次に、四つの輪郭から得られた切断候補点集合の中から不要な切断候補点(a)を削除する。この切断候補点(a)は、切断点(矢印(b)又は(c))付近にある。なお、この切断点付近とは、例えば分離対象文字の高さの何分の1とし、文字の大きさからは切断点とはなりえない長さとすることができる。
【0067】
次に、削減して残った切断候補点を切断点と決定する(矢印(b)、(c))。そして、決定された切断点に従い文字を分離し、文字認識、認識距離値によるDPマッチングを行い、最適な切断点を求める(矢印(c))。
【0068】
(6):プログラムのインストールの説明
入力部1、接触文字分離部2、認識手段3、切断候補点選定手段21、切断候補削減手段22等は、プログラムで構成でき、主制御部(CPU)が実行するものであり、主記憶に格納されているものである。このプログラムは、一般的な、コンピュータで処理されるものである。このコンピュータは、主制御部、主記憶、ファイル装置、表示装置、キーボード等の入力手段である入力装置などのハードウェアで構成されている。
【0069】
このコンピュータに、本発明のプログラムをインストールする。このインストールは、フロッピィ、光磁気ディスク等の可搬型の記録(記憶)媒体に、これらのプログラムを記憶させておき、コンピュータが備えている記録媒体に対して、アクセスするためのドライブ装置を介して、或いは、LAN等のネットワークを介して、コンピュータに設けられたファイル装置にインストールされる。そして、このファイル装置から処理に必要なプログラムステップを主記憶に読み出し、主制御部が実行するものである。
【0070】
【発明の効果】
以上説明したように、本発明によれば次のような効果がある。
【0071】
(1):切断候補点選定手段で接触英字分離を行う分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、切断候補削減手段で前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除するため、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる。
【0072】
(2):極大、極小の位置を求めるのに、輪郭を微分するため、極大、極小の位置を容易に求めることができる。
【0073】
(3):切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除し、次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を次の切断点と決定し、文字の大きさから切断点とはなりえない次の切断点付近の切断候補点を削除するため、不要な切断候補点を容易に削除することができる。
【0074】
(4):接触英字分離を行う分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該抽出した各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する接触文字分離方法とするため、あらゆる接触英字に柔軟に対応し高精度な分離を行う方法を提供することができる。
【0075】
(5):接触英字分離を行う分離文字1の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手順と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手順とを実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体とするため、この記録媒体のプログラムをコンピュータにインストールすることで、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる接触文字分離装置を容易に提供することができる。
【図面の簡単な説明】
【図1】本発明の接触英字分離装置の説明図である。
【図2】実施の形態における輪郭の抽出方向の説明図である。
【図3】実施の形態における認識結果を用いた接触英字分離の基本処理フローチャートである。
【図4】実施の形態における接触英字の分離処理フローチャートである。
【図5】実施の形態における接触英字の分離法の全体の流れの説明図である。
【図6】実施の形態における四つの輪郭から得られる切断候補点および候補削減結果の例の説明図である。
【図7】従来例の説明図である。
【図8】従来例のフローチャートである。
【符号の説明】
1 入力部
2 接触文字分離部
3 認識手段
21 切断候補点選定手段
22 切断候補削減手段
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a contact character separating apparatus and method, and a recording medium that can flexibly cope with any contact alphabetic characters in character recognition and perform high-accuracy separation.
[0002]
In documents that mix Japanese and English, Japanese characters are often cut apart by mistake, which lowers the recognition accuracy of non-English characters. It is therefore desirable to separate contact alphabetic characters while maintaining Japanese recognition accuracy.
[0003]
[Prior art]
Among the conventional methods for separating contact alphabetic characters is one described by Nakagawa et al. in the technical report of the Institute of Electronics, Information and Communication Engineers titled “Contact Character Extraction Algorithm in English OCR” (PRU93-128 (1994-01)), which extracts the contour, seen from above, of characters in English text that are longer than a certain length.
[0004]
FIG. 7 is an explanatory diagram of a conventional example: FIG. 7A illustrates extraction of the upper contour, and FIG. 7B illustrates extraction of the maxima and minima. In FIG. 7A, the distance from the upper reference is obtained for the character group “separa”, and the upper contour is extracted. In FIG. 7B, the extracted upper contour is traced, the positions of the maxima and minima are obtained, and when the difference between a minimum and the immediately preceding maximum is larger than a threshold value, that minimum point is taken as a cutting candidate.
[0005]
Here, an extremum whose distance from the reference is large is called a minimum, and an extremum whose distance from the reference is small is called a maximum.
[0006]
FIG. 8 is a flowchart of a conventional example. Hereinafter, processing performed by the contact character separation unit will be described according to processing S21 to processing S26 of FIG.
[0007]
S21: The contact character separation unit extracts a contour from above the character of the input character image, and proceeds to processing S22.
[0008]
S22: The contact character separation unit tracks the extracted outline of the upper part, obtains the maximum and minimum positions, and proceeds to step S23.
[0009]
S23: The contact character separation unit determines whether there is a minimum point. If there is a minimum point in this determination, the process proceeds to step S24, and if not, this process ends.
[0010]
S24: The contact character separation unit sequentially selects the minimum points, and proceeds to processing S25.
[0011]
S25: The contact character separation unit determines whether the difference between the minimum of the selected minimum point and the maximum immediately before is greater than a threshold value. If it is determined that the difference between the local minimum and the previous local maximum is larger than the threshold, the process proceeds to step S26, and if it is smaller, the process returns to step S23.
[0012]
S26: If the difference between the local minimum and the previous local maximum is greater than the threshold value in the above determination, the contact character separation unit determines the local minimum point as a cutting candidate, and returns to step S23.
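The conventional flow S21 to S26 can be illustrated with a minimal Python sketch. It is not taken from the cited report: the binary image layout (a list of rows, 1 for black), the function names, and the treatment of the first column as the starting maximum are all assumptions made for illustration. Following the convention stated above, a "minimum" is a point far from the upper reference and a "maximum" is a point close to it.

```python
def upper_contour(image):
    """Distance from the top edge to the first black pixel in each column.

    `image` is a list of rows with 1 = black and 0 = white (assumed layout).
    A column with no black pixel gets the image height.
    """
    height = len(image)
    width = len(image[0])
    contour = []
    for x in range(width):
        depth = height
        for y in range(height):
            if image[y][x]:
                depth = y
                break
        contour.append(depth)
    return contour


def cut_candidates(depth, threshold):
    """Return columns where a "minimum" lies more than `threshold`
    farther from the reference than the preceding "maximum"."""
    candidates = []
    direction = 0               # +1 while depth grows, -1 while it shrinks
    last_max_depth = depth[0]   # assumption: first column acts as the first maximum
    for x in range(1, len(depth)):
        step = depth[x] - depth[x - 1]
        if step > 0:
            if direction < 0:
                # depth stopped shrinking: a "maximum" (closest to the reference)
                last_max_depth = depth[x - 1]
            direction = 1
        elif step < 0:
            # depth stopped growing: a "minimum" (farthest from the reference)
            if direction > 0 and depth[x - 1] - last_max_depth > threshold:
                candidates.append(x - 1)
            direction = -1
    return candidates
```

For a plateau, this sketch reports the last column of the plateau as the extremum; any other tie-breaking rule would serve equally well for illustration.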
[0013]
[Problems to be solved by the invention]
The conventional device has the following problems.
[0014]
①: Since only the contour from above the characters is extracted, the method could not cope flexibly with every kind of contact between English characters.
[0015]
②: In addition, because the entire character string is processed, Japanese characters in mixed Japanese-English documents are often mistakenly treated as cutting targets, and as a result the recognition accuracy of non-English characters is often lowered.
[0016]
The present invention solves these conventional problems. Its object is to make it possible to separate contact alphabetic characters within Japanese text, which was difficult with the prior art, flexibly for any contact alphabetic characters and with high accuracy, without adversely affecting the Japanese text.
[0017]
[Means for Solving the Problems]
FIG. 1 is an explanatory view of a contact alphabet separating apparatus of the present invention. In FIG. 1, 1 is an input unit, 2 is a contact character separation unit, 3 is a recognition unit, 21 is a cutting candidate point selection unit, and 22 is a cutting candidate reduction unit.
[0018]
The present invention is configured as follows to solve the conventional problems.
[0019]
(1): The apparatus comprises: Japanese contact character separation means for separating Japanese contact characters from a character set to obtain separated characters 1; first DP matching means for performing character recognition on the obtained separated characters 1 and performing DP matching based on the distance values at the time of recognition to obtain evaluation values; contact alphabet separation means for performing contact alphabet separation on those separated characters 1 whose obtained evaluation value is smaller than a threshold value, to obtain separated characters 2; and second DP matching means for performing DP matching on the obtained separated characters 2 to obtain separated characters 3. The contact alphabet separation means comprises: cutting candidate point selection means 21 for extracting the contours from above, below, inside above, and inside below of the separated character 1 subjected to contact alphabet separation, tracing each contour, obtaining the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than a threshold value, taking that minimum point as a cutting candidate point; and cutting candidate reduction means 22 for determining, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deleting cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points.
[0020]
(2): In the contact character separating apparatus according to (1), when the evaluation value of even one of the separated characters 1 is smaller than the threshold value, the contact alphabet separation means takes the character set before separation as the target of contact alphabet separation.
[0021]
(3): In the contact character separating apparatus according to (1) or (2), the cutting candidate reduction means determines, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, deletes cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points, then determines, as the next cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction among the remaining cutting candidate points, and deletes cutting candidate points near that next cutting point that, judging from the character size, cannot be cutting points.
[0022]
(4): A contact character separation method in which: Japanese contact character separation means separates Japanese contact characters from a character set to obtain separated characters 1; first DP matching means performs character recognition on the obtained separated characters 1 and performs DP matching based on the distance values at the time of recognition to obtain evaluation values; contact alphabet separation means performs contact alphabet separation on those separated characters 1 whose obtained evaluation value is smaller than a threshold value, to obtain separated characters 2; second DP matching means performs DP matching on the obtained separated characters 2 to obtain separated characters 3; and the contact alphabet separation means extracts the contours from above, below, inside above, and inside below of the separated character 1 subjected to contact alphabet separation, traces each extracted contour, obtains the positions of the maxima and minima, takes a minimum point as a cutting candidate point when the difference between that minimum and the immediately preceding maximum is larger than a threshold value, determines, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deletes cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points.
[0023]
(5): A computer-readable recording medium on which is recorded a program for executing: a Japanese contact character separation procedure for separating Japanese contact characters from a character set to obtain separated characters 1; a first DP matching procedure for performing character recognition on the obtained separated characters 1 and performing DP matching based on the distance values at the time of recognition to obtain evaluation values; a contact alphabet separation procedure for performing contact alphabet separation on those separated characters 1 whose obtained evaluation value is smaller than a threshold value, to obtain separated characters 2; a second DP matching procedure for performing DP matching on the obtained separated characters 2 to obtain separated characters 3; a cutting candidate point selection procedure for extracting the contours from above, below, inside above, and inside below of the separated character 1 subjected to contact alphabet separation, tracing each contour, obtaining the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than a threshold value, taking that minimum point as a cutting candidate point; and a cutting candidate reduction procedure for determining, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deleting cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points.
[0024]
(Function)
The operation based on the above configuration will be described.
[0025]
The cutting candidate point selection means 21 extracts the contours from above, below, inside above, and inside below of the separated character 1 subjected to contact alphabet separation, traces each contour, obtains the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than the threshold value, takes that minimum point as a cutting candidate point. The cutting candidate reduction means 22 determines, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deletes cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points. This makes it possible to cope flexibly with any contact alphabetic characters and to perform high-accuracy separation.
[0026]
Further, the contour is differentiated in order to obtain the maximum and minimum positions. For this reason, the maximum and minimum positions can be easily obtained.
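One way to read "differentiating the contour" is to take the first difference of the contour and look for sign changes. The sketch below is only an illustration under that assumption; the function name and the list-of-distances input are hypothetical. As in the definition given for the conventional example, a "minimum" is far from the reference and a "maximum" is close to it.

```python
def extrema_by_differencing(depth):
    """Locate extrema of a contour from sign changes of its first difference.

    `depth[x]` is the distance from the reference at column x.
    Returns (maxima, minima) in the document's convention:
    maxima = small distance, minima = large distance.
    """
    diffs = [b - a for a, b in zip(depth, depth[1:])]
    maxima, minima = [], []
    prev_sign = 0
    for i, d in enumerate(diffs):
        if d == 0:
            continue  # flat stretch: keep the previous direction
        sign = 1 if d > 0 else -1
        if prev_sign > 0 and sign < 0:
            minima.append(i)   # distance rose, now falls: far from the reference
        elif prev_sign < 0 and sign > 0:
            maxima.append(i)   # distance fell, now rises: close to the reference
        prev_sign = sign
    return maxima, minima
```

On a flat stretch the extremum is reported at the last column of the stretch, which is a choice of this sketch and not something the text specifies.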
[0027]
Further, the cutting candidate point having the smallest number of black pixels in the vertical direction is determined as a cutting point, and cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points are deleted. Then, among the remaining cutting candidate points, the one having the smallest number of black pixels in the vertical direction is determined as the next cutting point, and cutting candidate points near that next cutting point that, judging from the character size, cannot be cutting points are deleted. For this reason, unnecessary cutting candidate points can be easily deleted.
[0028]
Further, a contact character separation method is provided in which the contours from above, below, inside above, and inside below of the separated character 1 subjected to contact alphabet separation are extracted, each extracted contour is traced, the positions of the maxima and minima are obtained, a minimum point is taken as a cutting candidate point when the difference between that minimum and the immediately preceding maximum is larger than the threshold value, the cutting candidate point having the smallest number of black pixels in the vertical direction is determined as a cutting point, and cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points are deleted. For this reason, it is possible to provide a method that copes flexibly with any contact alphabetic characters and performs high-accuracy separation.
[0029]
Further, a computer-readable recording medium is provided on which is recorded a program for executing: a cutting candidate point selection procedure for extracting the contours from above, below, inside above, and inside below of the separated character 1 subjected to contact alphabet separation, tracing each contour, obtaining the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than the threshold value, taking that minimum point as a cutting candidate point; and a cutting candidate reduction procedure for determining, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deleting cutting candidate points near the determined cutting point that, judging from the character size, cannot be cutting points. For this reason, by installing the program of the recording medium in a computer, it is possible to easily provide a contact character separation device that copes flexibly with any contact alphabetic characters and performs high-accuracy separation.
[0030]
DETAILED DESCRIPTION OF THE INVENTION
(1): Explanation of Contact Alphabet Separating Device
FIG. 1 is an explanatory diagram of a contact alphabet separating device in an embodiment of the present invention. In FIG. 1, the contact alphabet separating device is provided with an input unit 1, a contact character separation unit 2, and recognition means 3, and the contact character separation unit 2 is provided with cutting candidate point selection means 21 and cutting candidate reduction means 22.
[0031]
The input unit 1 inputs a character image to be recognized. The contact character separation unit 2 performs a process for separating the touched characters. The recognition unit 3 recognizes the character separated by the contact character separation unit 2. The cutting candidate point selection means 21 is to select cutting candidate points by tracing the outline from above, below, inside above, and inside below characters. The cutting candidate reduction means 22 reduces cutting candidate points selected by the cutting candidate point selection means 21.
[0032]
In addition, although the recognition means 3 was included in the contact alphabet separator, it can also be provided separately.
[0033]
FIG. 2 is an explanatory diagram of the contour extraction directions. In FIG. 2, the contact character separation unit 2 extracts contours from above, below, inside above, and inside below the character “s”. The upper contour is the top outline, extracted by obtaining the distance from the upper reference as indicated by arrow (a). The lower contour is the bottom outline, extracted by obtaining the distance from the lower reference as indicated by arrow (d). The inside upper contour is the lower edge of the character stroke that forms the upper contour, as indicated by arrow (b). The inside lower contour is the upper edge of the character stroke that forms the lower contour, as indicated by arrow (c).
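The four extraction directions can be sketched per column of a binary image. This is an illustrative reading only: the image layout (list of rows, 1 for black), the function name, the use of row indices in place of distances, and the `None` marker for empty columns are assumptions, not details given in the text.

```python
def four_contours(image):
    """Return the four contours as row indices per column (None if the column is empty).

    upper       : top edge of the topmost black run       (arrow (a))
    inner_upper : bottom edge of that same topmost run     (arrow (b))
    inner_lower : top edge of the bottommost black run     (arrow (c))
    lower       : bottom edge of the bottommost black run  (arrow (d))
    """
    height = len(image)
    width = len(image[0])
    result = {"upper": [], "inner_upper": [], "inner_lower": [], "lower": []}
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        black_rows = [y for y, value in enumerate(column) if value]
        if not black_rows:
            for contour in result.values():
                contour.append(None)
            continue
        top, bottom = black_rows[0], black_rows[-1]
        # Follow the topmost stroke downward until it ends.
        y = top
        while y + 1 < height and column[y + 1]:
            y += 1
        first_run_end = y
        # Follow the bottommost stroke upward until it ends.
        y = bottom
        while y - 1 >= 0 and column[y - 1]:
            y -= 1
        last_run_start = y
        result["upper"].append(top)
        result["inner_upper"].append(first_run_end)
        result["inner_lower"].append(last_run_start)
        result["lower"].append(bottom)
    return result
```

When a column contains a single stroke, the inside upper value coincides with the lower value and the inside lower value with the upper value; how such columns should be treated is not stated in the text.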
[0034]
(Explanation of contact alphabet separation operation)
First, the contact character separation unit 2 cuts out a Japanese contact character from the character image input from the input unit 1, and the recognition unit 3 performs character recognition. Characters with poor recognition accuracy are extracted using the evaluation value obtained from the recognition result, and the contact character separation unit 2 separates the characters using the contact alphabet separation method.
[0035]
The contact character separation unit 2 extracts the contours from above, below, inside above, and inside below the character in order to find cutting candidate points. The cutting candidate point selection means 21 then traces each contour, obtains its maxima and minima, and, when the difference between a minimum and the immediately preceding maximum is larger than the threshold value, takes that minimum as a cutting candidate.
[0036]
The cutting candidate reduction unit 22 sets the cutting candidate point sets from the four directions obtained at this time as cutting points in ascending order of the number of black pixels in the vertical direction. At that time, if a cutting candidate point exists in the vicinity of the determined cutting point (a range that cannot be a cutting point considering the size of the character), the point is deleted from the cutting candidates.
[0037]
Next, among the remaining cutting candidate points, the one with the fewest black pixels is obtained and the same processing as described above is performed. This is repeated to determine the cutting points. The character is cut at the determined cutting points, character recognition is performed, and DP matching based on the recognition distance value (processing that selects the cutting points giving the best recognition result) is performed to obtain the optimal cutting points.
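The repeated candidate reduction can be sketched as a greedy loop. The sketch assumes that "near" means within a fixed number of columns, passed in as the hypothetical parameter `min_gap`; the text says only that the range is one within which, judging from the character size, another cutting point cannot occur. The function names and the image layout are also assumptions.

```python
def column_black_counts(image):
    """Number of black pixels (value 1) in each column of a binary image."""
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def reduce_candidates(candidates, black_counts, min_gap):
    """Greedy reduction of cutting candidates.

    Repeatedly take the remaining candidate with the fewest black pixels in
    its column as a cutting point, then drop every candidate within
    `min_gap` columns of it (assumed definition of "near").
    """
    remaining = sorted(set(candidates))
    cut_points = []
    while remaining:
        # Ties are broken by the leftmost column (a choice of this sketch).
        best = min(remaining, key=lambda x: (black_counts[x], x))
        cut_points.append(best)
        remaining = [x for x in remaining if abs(x - best) > min_gap]
    return sorted(cut_points)
```

The candidates from all four contours would be merged into one list before calling `reduce_candidates`; the DP matching step that follows is not covered by this sketch.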
[0038]
If the recognition result for a character obtained by contact alphabet separation is a symbol or the like (not a letter), it is highly likely that a kanji has been cut apart by mistake. The recognition evaluation value is therefore lowered so that the recognition result is given lower priority.
[0039]
In this way, contact alphabetic characters within Japanese text, which were difficult to separate with the prior art, can be separated flexibly and with high accuracy for any contact alphabetic characters, without adversely affecting the Japanese text.
[0040]
(2): Description of basic processing for contact alphabet separation using recognition results
First, Japanese contact characters are cut out, and character recognition is then performed to obtain recognition results. Using the evaluation value obtained from the recognition result, it is determined whether the character recognition result is plausible. For example, with the distance value at the time of character recognition used as the evaluation value, a character whose distance is equal to or greater than a threshold value is presumed to contain contact alphabetic characters that were misread as a kanji or the like, making the distance large, and it is selected as a target character to be separated by the contact alphabet separation method.
[0041]
The evaluation value is not limited to the distance value; for example, a correct-reading probability or the like can also be used. In addition, if a selected character is one that was separated by the cutting out of Japanese contact characters, the character before separation is taken as the target of contact alphabet separation, even if the other characters separated from it have good evaluation values.
[0042]
FIG. 3 is a basic processing flowchart of contact alphabet separation using the recognition result. The description below follows processing S1 to S3 of FIG. 3.
[0043]
S1: The contact character separation unit 2 cuts out Japanese contact characters, performs character recognition by the recognition means 3, obtains a recognition result, and proceeds to processing S2.
[0044]
S2: The contact character separation unit 2 determines whether or not the character recognition result is probable using the evaluation value obtained from the recognition result. If it is determined that the character recognition result is probable, the process is terminated. If the character recognition result is uncertain (the evaluation value is bad), the process proceeds to process S3.
[0045]
S3: The contact character separation unit 2 performs character separation using the contact alphabet separation method, and ends this process.
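The flow of S1 to S3 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `Recognized`, `separate_with_recognition`, `cut_japanese`, `recognize`, and `separate_alphabet` are hypothetical, and the evaluation value is assumed to be a recognition distance for which a smaller value is better.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Recognized:
    label: str       # recognized character (hypothetical field)
    distance: float  # recognition distance; smaller is assumed to be better


def separate_with_recognition(
    image: Any,
    cut_japanese: Callable[[Any], List[Any]],
    recognize: Callable[[Any], Recognized],
    separate_alphabet: Callable[[Any], List[Any]],
    threshold: float,
) -> List[Recognized]:
    # S1: cut out Japanese contact characters and recognize each piece.
    results = [recognize(piece) for piece in cut_japanese(image)]
    # S2: if every result is plausible (distance within the threshold), finish.
    if all(r.distance <= threshold for r in results):
        return results
    # S3: otherwise apply contact alphabet separation to the image
    # as it was before the Japanese cutting, and recognize the new pieces.
    return [recognize(piece) for piece in separate_alphabet(image)]
```

In this sketch the three callables stand in for the contact character separation unit 2 and the recognition means 3; any real recognizer would have to be supplied by the caller.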
[0046]
(3): Explanation of contact alphabet separation processing
First, if the evaluation value at the time of character recognition is equal to or greater than a threshold value, it is presumed that alphabetic characters are touching and have been misread as a kanji, and the contours of the character are extracted. Contours are extracted from four directions: above, below, inner upper, and inner lower. Each contour is traced, and its local maxima and minima are obtained.
[0047]
For the local maxima and minima obtained from the contour above the character, each local minimum is compared with the local maximum immediately preceding it; if the difference is larger than a threshold value, that local minimum is taken as a cutting candidate. This is done for every local minimum, giving the set of cutting candidate points for the contour above the character. When obtaining the set of cutting candidate points, the local maxima and minima may also be found from the differentiated contour before applying the processing above.
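The comparison of each local minimum with the preceding local maximum might look like the sketch below. It assumes the contour has already been reduced to a one-dimensional profile of heights, one value per column, so that a notch in the contour appears as a local minimum. Extrema are located from sign changes of the first difference, which is one possible reading of the "differentiated contour"; the function names and the profile representation are assumptions, not taken from the patent.

```python
from typing import List, Tuple


def local_extrema(profile: List[int]) -> Tuple[List[int], List[int]]:
    """Return (maxima, minima) indices found from sign changes of the first difference."""
    maxima: List[int] = []
    minima: List[int] = []
    prev_sign = 0
    for i in range(1, len(profile)):
        diff = profile[i] - profile[i - 1]
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue  # flat run: wait for the next real slope
        if prev_sign > 0 and sign < 0:
            maxima.append(i - 1)  # rising then falling
        elif prev_sign < 0 and sign > 0:
            minima.append(i - 1)  # falling then rising
        prev_sign = sign
    return maxima, minima


def cut_candidates(profile: List[int], threshold: int) -> List[int]:
    """Keep each local minimum that lies more than `threshold` below the preceding maximum."""
    maxima, minima = local_extrema(profile)
    candidates = []
    for m in minima:
        preceding = [p for p in maxima if p < m]
        if preceding and profile[preceding[-1]] - profile[m] > threshold:
            candidates.append(m)
    return candidates
```

For example, with the profile `[0, 5, 5, 1, 5, 4, 5, 0]` and a threshold of 2, only the deep notch at index 3 is kept; the shallow dip at index 5 is rejected.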
[0048]
The same contact alphabet separation process as described above is also performed from the lower side of the character, the upper part of the interior, and the lower part of the interior to obtain a set of cutting candidate points from the respective contours.
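As a rough sketch of how the contours above and below the character could be turned into such profiles, the code below assumes a binary image given as a list of rows with 1 for a black pixel and 0 for white. Both profiles are oriented so that a notch cut into the character appears as a local minimum. The inner upper and inner lower contours depend on the extraction directions of FIG. 2 and are not sketched here; all names are hypothetical.

```python
from typing import List


def upper_profile(image: List[List[int]]) -> List[int]:
    """Height of the topmost black pixel in each column, measured from the bottom (0 if none)."""
    height = len(image)
    width = len(image[0]) if image else 0
    profile = []
    for x in range(width):
        rows = [y for y in range(height) if image[y][x]]
        profile.append(height - rows[0] if rows else 0)
    return profile


def lower_profile(image: List[List[int]]) -> List[int]:
    """Depth of the lowest black pixel in each column, measured from the top (0 if none)."""
    height = len(image)
    width = len(image[0]) if image else 0
    profile = []
    for x in range(width):
        rows = [y for y in range(height) if image[y][x]]
        profile.append(rows[-1] + 1 if rows else 0)
    return profile
```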
[0049]
Next, unnecessary cutting candidate points are deleted from the cutting candidate point set obtained from the four contours. The cutting candidate point having the smallest number of black pixels in the vertical direction is first set as a cutting point. If a cutting candidate point exists in the vicinity of the determined cutting point, the point is determined as an unnecessary cutting candidate point and deleted from the candidate.
[0050]
Next, the cutting candidate point having the smallest number of black pixels in the vertical direction among the remaining cutting candidate points is obtained and the same processing is performed. This is repeated to reduce cutting candidate points and determine cutting points.
[0051]
The character is separated at the determined cutting points, character recognition and DP matching based on the recognition distance value are performed, the optimal cutting points are obtained, and the recognition result after contact alphabet separation is obtained. If the recognition result is a symbol or the like, it is considered highly likely that a kanji has been cut apart by mistake, and the recognition evaluation value is lowered so that the recognition result has a lower priority (and is not used).
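One way to picture the DP matching over the determined cutting points is the sketch below. The patent does not spell out the objective function, so the sketch assumes that the best segmentation is the one minimizing the sum of recognition distances of the resulting segments. The `distance` callable, which returns the recognition distance of the segment between two column positions, is a hypothetical stand-in for the recognition means.

```python
from typing import Callable, List, Tuple


def best_segmentation(
    width: int,
    cuts: List[int],
    distance: Callable[[int, int], float],
) -> Tuple[List[int], float]:
    """Choose the subset of `cuts` that minimizes the total segment distance."""
    bounds = [0] + sorted(cuts) + [width]
    n = len(bounds)
    cost = [float("inf")] * n  # cost[j]: best total distance for columns 0..bounds[j]
    back = [-1] * n            # back[j]: index of the previous boundary on the best path
    cost[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            c = cost[i] + distance(bounds[i], bounds[j])
            if c < cost[j]:
                cost[j], back[j] = c, i
    # Walk back from the right edge to recover the boundaries actually used.
    path = []
    j = n - 1
    while j > 0:
        path.append(bounds[j])
        j = back[j]
    path.reverse()
    return path[:-1], cost[-1]  # drop the right edge; the rest are the chosen cuts
```

With a sum of distances as the objective, a segmentation with fewer pieces is favored unless the pieces recognize clearly better; a real system might normalize by segment count or width.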
[0052]
FIG. 4 is a flowchart of contact alphabet separation processing. The description below follows processing S11 to S15 of FIG. 4.
[0053]
S11: The contact character separation unit 2 extracts a character whose character recognition evaluation value from the recognition means 3 is bad, and proceeds to processing S12.
[0054]
S12: The contact character separation unit 2 extracts contours from the upper, lower, inner upper, and inner lower portions of the character, obtains a cutting candidate point set from the maximum and minimum points, and proceeds to processing S13.
[0055]
S13: The contact character separation unit 2 determines whether any cutting candidate point remains. If a cutting candidate point remains, the process proceeds to processing S14. If none remains, the cutting points are finalized and the process ends.
[0056]
S14: The contact character separation unit 2 sets a cutting candidate point having the smallest number of black pixels in the vertical direction as a cutting point, and proceeds to processing S15.
[0057]
S15: The contact character separation unit 2 deletes the cutting candidate point in the vicinity of the cutting point, and returns to the process S13.
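The loop of S13 to S15 can be sketched as below, assuming the cutting candidate points are column positions and the binary image is a list of rows with 1 for a black pixel. The `radius` parameter, which defines what counts as the vicinity of a cutting point, and both function names are assumptions made for illustration.

```python
from typing import List


def column_black_counts(image: List[List[int]]) -> List[int]:
    """Number of black pixels (value 1) in each column of a binary image."""
    width = len(image[0]) if image else 0
    return [sum(row[x] for row in image) for x in range(width)]


def reduce_candidates(candidates: List[int], black_counts: List[int], radius: int) -> List[int]:
    """Repeatedly fix the candidate with the fewest black pixels and drop its neighbours."""
    remaining = sorted(set(candidates))
    cuts = []
    while remaining:  # S13: does any cutting candidate point remain?
        # S14: the candidate with the fewest black pixels in the vertical
        # direction becomes a cutting point (ties go to the leftmost one).
        best = min(remaining, key=lambda x: (black_counts[x], x))
        cuts.append(best)
        # S15: delete candidates in the vicinity of the cutting point
        # (this also removes the chosen point itself from the candidates).
        remaining = [x for x in remaining if abs(x - best) > radius]
    return sorted(cuts)
```

The tie-breaking rule toward the leftmost candidate is a choice made here only to keep the sketch deterministic.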
[0058]
(4): Explanation of the overall flow of the contact alphabet separation method
FIG. 5 is an explanatory diagram of the overall flow of the contact alphabet separation method. The overall flow is described below with reference to (1) to (6) in FIG. 5.
[0059]
(1) Here, character sets (collective characters) A, B, C, D, E, and F are input images for separating contact alphabets.
[0060]
(2) First, Japanese contact character separation of character sets A, B, C, D, E, and F is performed. With this separation, the collective character C was separated into C1, C2, C3, and C4.
[0061]
(3) Character recognition is performed, and DP matching is performed based on the distance value at the time of recognition. In the figure, good evaluation values from this DP matching are marked ○ and poor ones are marked × (hatched).
[0062]
(4) Characters with poor evaluation values obtained from the recognition results are taken out and set as targets of the contact alphabet separation method (the hatched collective characters B, C, and D). If even one of the pieces produced by cutting out Japanese contact characters has a bad evaluation value, the character before separation (the character set of (1) above) is taken as the contact alphabet separation target, even if the other pieces cut from it have good evaluation values.
[0063]
(5) Contact alphabet separation is applied to the hatched collective characters B, C, and D, and the separated characters are recognized. The collective character B was separated into characters B1 and B2, and the recognition result was "English characters". The collective character C was separated into characters C1, C2, C3, C4, and C5; characters C1 and C2 were recognized as "English characters", and characters C3, C4, and C5 as "Kana". The collective character D was separated into characters D1, D2, and D3, and the recognition result was "symbol". When the recognition result is a symbol, kana, or the like, as for characters C3, C4, C5, D1, D2, and D3, the evaluation value is lowered to lower the priority of the recognition result.
[0064]
(6) DP matching is performed using the distance value at the time of character recognition. As a result of this DP matching, the character A is recognized as "Kanji"; the characters B1, B2, and C1 as "English characters"; the combination of characters C2 and C3 as an "English character"; the combination of characters C4 and C5 as an "English character"; the combination of characters D1, D2, and D3 as "Kanji"; and the character EF as "Kanji".
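Two rules from steps (4) and (5) above can be illustrated with the sketch below: a collective character is re-separated if any one of its pieces scores badly, and results recognized as symbols or kana are given a lower priority. The sketch assumes the evaluation value is a recognition distance, so that "lowering the evaluation" is modeled as adding a penalty to the distance; the penalty size, the category labels, and the function names are all hypothetical.

```python
from typing import Dict, List

# Categories whose results are given a lower priority after contact alphabet separation.
LOW_PRIORITY = {"symbol", "kana"}


def sets_to_reseparate(piece_distances: Dict[str, List[float]], threshold: float) -> List[str]:
    """Names of collective characters with at least one badly recognized piece."""
    return [
        name
        for name, distances in piece_distances.items()
        if any(d > threshold for d in distances)
    ]


def penalized_distance(distance: float, category: str, penalty: float = 1.0) -> float:
    """Worsen the distance of results that suggest a kanji was cut apart by mistake."""
    return distance + penalty if category in LOW_PRIORITY else distance
```

Using made-up distances that mirror FIG. 5, the sets B, C, and D are selected even though three of the four pieces of C score well:

```python
pieces = {"A": [0.1], "B": [0.8], "C": [0.1, 0.9, 0.2, 0.1], "D": [0.7], "E": [0.1], "F": [0.2]}
targets = sets_to_reseparate(pieces, threshold=0.5)  # B, C and D
```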
[0065]
(5): Explanation of specific examples of cutting candidate points and candidate reduction results
FIG. 6 is an explanatory diagram of an example of cutting candidate points and candidate reduction results obtained from the four contours. In FIG. 6, the contours of the collective character "UNIX" are traced from above, below, inner upper, and inner lower, and the local maxima and minima of each contour are obtained. Then, for the local maxima and minima obtained from above the character, each local minimum is compared with the immediately preceding local maximum, and if the difference is larger than the threshold value, the local minimum is taken as a cutting candidate. This is done for every local minimum, giving the set of cutting candidate points for the contour above the character. The same processing is applied from below the character, from the inner upper part, and from the inner lower part to obtain the cutting candidate point sets (arrows (a), (b), (c)) from the respective contours.
[0066]
Next, the unnecessary cutting candidate point (a) is deleted from the cutting candidate point set obtained from the four contours. This cutting candidate point (a) lies in the vicinity of a cutting point (arrow (b) or (c)). The vicinity of a cutting point is, for example, a fraction of the height of the character to be separated; it may be any distance so short that, given the size of the character, another cutting point cannot lie within it.
[0067]
Next, the cutting candidate points remaining after reduction are determined as cutting points (arrows (b) and (c)). Then, characters are separated according to the determined cutting point, and character recognition and DP matching based on the recognition distance value are performed to obtain an optimal cutting point (arrow (c)).
[0068]
(6): Description of program installation
The input unit 1, contact character separation unit 2, recognition means 3, cutting candidate point selection means 21, cutting candidate reduction means 22, and the like can be implemented as a program that is stored in the main memory and executed by the main control unit (CPU). This program is generally processed by a computer. The computer is composed of hardware such as a main control unit, a main memory, a file device, a display device, and an input device such as a keyboard.
[0069]
The program of the present invention is installed on this computer. For installation, the program is stored on a portable recording (storage) medium such as a floppy disk or a magneto-optical disk and is installed into the file device of the computer through a drive device, provided in the computer, that accesses the recording medium; alternatively, it is installed into the file device via a network such as a LAN. The program steps necessary for processing are then read from the file device into the main memory and executed by the main control unit.
[0070]
[Effects of the invention]
As described above, the present invention has the following effects.
[0071]
(1): The cutting candidate point selection means extracts contours from above, below, inner upper, and inner lower of the separated character 1 subject to contact alphabet separation, traces each contour, and finds the positions of the local maxima and minima; when the difference between a local minimum and the immediately preceding local maximum is larger than a threshold value, that local minimum is taken as a cutting candidate point. The cutting candidate reduction means determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deletes cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points. As a result, any contact alphabetic character can be handled flexibly and separated with high accuracy.
[0072]
(2): Since the contour is differentiated in order to obtain the maximum and minimum positions, the maximum and minimum positions can be easily obtained.
[0073]
(3): The cutting candidate point with the smallest number of black pixels in the vertical direction is determined as a cutting point, and cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points are deleted. Next, among the remaining cutting candidate points, the one with the smallest number of black pixels in the vertical direction is determined as the next cutting point, and cutting candidate points near that next cutting point that, given the character size, cannot be cutting points are deleted. Unnecessary cutting candidate points can therefore be deleted easily.
[0074]
(4): The contact character separation method extracts contours from above, below, inner upper, and inner lower of the separated character 1 subject to contact alphabet separation, traces each extracted contour, and finds the positions of the local maxima and minima; when the difference between a local minimum and the immediately preceding local maximum is larger than a threshold value, that local minimum is taken as a cutting candidate point. The cutting candidate point with the smallest number of black pixels in the vertical direction is determined as a cutting point, and cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points are deleted. This provides a method that can flexibly handle any contact alphabetic character and perform high-accuracy separation.
[0075]
(5): The recording medium is a computer-readable recording medium recording a program for executing a cutting candidate point selection procedure and a cutting candidate reduction procedure. The cutting candidate point selection procedure extracts contours from above, below, inner upper, and inner lower of the separated character 1 subject to contact alphabet separation, traces each contour, finds the positions of the local maxima and minima, and, when the difference between a local minimum and the immediately preceding local maximum is larger than a threshold value, takes that local minimum as a cutting candidate point. The cutting candidate reduction procedure determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point and deletes cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points. By installing the program from the recording medium on a computer, a contact character separation device that can flexibly handle any contact alphabetic character and perform high-accuracy separation can be provided easily.
[Brief description of the drawings]
FIG. 1 is an explanatory view of a contact alphabet separating apparatus of the present invention.
FIG. 2 is an explanatory diagram of a contour extraction direction in the embodiment.
FIG. 3 is a basic processing flowchart of contact alphabet separation using a recognition result in the embodiment.
FIG. 4 is a flowchart of contact alphabet separation processing in the embodiment.
FIG. 5 is an explanatory diagram of an overall flow of a method for separating contact alphabetic characters according to an embodiment.
FIG. 6 is an explanatory diagram of an example of cutting candidate points and candidate reduction results obtained from four contours in the embodiment.
FIG. 7 is an explanatory diagram of a conventional example.
FIG. 8 is a flowchart of a conventional example.
[Explanation of symbols]
1 Input unit
2 Contact character separation unit
3 Recognition means
21 Cutting candidate point selection means
22 Cutting candidate reduction means

Claims (5)

A contact character separation device comprising:
Japanese contact character separation means for separating Japanese contact characters from a character set to obtain separated character 1;
first DP matching means for performing character recognition on the obtained separated character 1 and performing DP matching based on the distance value at the time of recognition to obtain an evaluation value;
contact alphabet separation means for performing contact alphabet separation on the separated character 1 whose obtained evaluation value is smaller than a threshold value to obtain separated character 2; and
second DP matching means for performing DP matching on the obtained separated character 2 to obtain separated character 3,
wherein the contact alphabet separation means comprises:
cutting candidate point selection means for extracting contours from above, below, inner upper, and inner lower of the separated character 1 subject to contact alphabet separation, tracing each contour, finding the positions of the local maxima and minima, and, when the difference between a found local minimum and the immediately preceding local maximum is larger than a threshold value, taking that local minimum as a cutting candidate point; and
cutting candidate reduction means for determining the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deleting cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points.

The contact character separation device according to claim 1, wherein, when even one of the separated characters 1 has an evaluation value smaller than the threshold value, the contact alphabet separation means takes the character set before separation as the target of contact alphabet separation.

The contact character separation device according to claim 1 or 2, comprising cutting candidate reduction means that determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, deletes cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points, then determines, among the remaining cutting candidate points, the one with the smallest number of black pixels in the vertical direction as the next cutting point, and deletes cutting candidate points near the determined next cutting point that, given the character size, cannot be cutting points.

A contact character separation method, wherein:
Japanese contact character separation means separates Japanese contact characters from a character set to obtain separated character 1;
first DP matching means performs character recognition on the obtained separated character 1 and performs DP matching based on the distance value at the time of recognition to obtain an evaluation value;
contact alphabet separation means performs contact alphabet separation on the separated character 1 whose obtained evaluation value is smaller than a threshold value to obtain separated character 2;
second DP matching means performs DP matching on the obtained separated character 2 to obtain separated character 3; and
the contact alphabet separation means
extracts contours from above, below, inner upper, and inner lower of the separated character 1 subject to contact alphabet separation, traces each extracted contour, finds the positions of the local maxima and minima, and, when the difference between a found local minimum and the immediately preceding local maximum is larger than a threshold value, takes that local minimum as a cutting candidate point, and
determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deletes cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points.

A computer-readable recording medium recording a program for causing a computer to execute:
a Japanese contact character separation procedure for separating Japanese contact characters from a character set to obtain separated character 1;
a first DP matching procedure for performing character recognition on the obtained separated character 1 and performing DP matching based on the distance value at the time of recognition to obtain an evaluation value;
a contact alphabet separation procedure for performing contact alphabet separation on the separated character 1 whose obtained evaluation value is smaller than a threshold value to obtain separated character 2;
a second DP matching procedure for performing DP matching on the obtained separated character 2 to obtain separated character 3;
a cutting candidate point selection procedure for extracting contours from above, below, inner upper, and inner lower of the separated character 1 subject to contact alphabet separation, tracing each contour, finding the positions of the local maxima and minima, and, when the difference between a found local minimum and the immediately preceding local maximum is larger than a threshold value, taking that local minimum as a cutting candidate point; and
a cutting candidate reduction procedure for determining the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deleting cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points.
JP19534999A 1999-07-09 1999-07-09 Contact character separation device and method, and recording medium Expired - Fee Related JP3798582B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP19534999A JP3798582B2 (en) 1999-07-09 1999-07-09 Contact character separation device and method, and recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP19534999A JP3798582B2 (en) 1999-07-09 1999-07-09 Contact character separation device and method, and recording medium

Publications (2)

Publication Number Publication Date
JP2001022885A JP2001022885A (en) 2001-01-26
JP3798582B2 true JP3798582B2 (en) 2006-07-19

Family

ID=16339700

Family Applications (1)

Application Number Title Priority Date Filing Date
JP19534999A Expired - Fee Related JP3798582B2 (en) 1999-07-09 1999-07-09 Contact character separation device and method, and recording medium

Country Status (1)

Country Link
JP (1) JP3798582B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5439069B2 (en) * 2009-07-08 2014-03-12 三菱重工業株式会社 Character recognition device and character recognition method

Also Published As

Publication number Publication date
JP2001022885A (en) 2001-01-26

Legal Events

Date Code Title Description
A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20051027

A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20051108

A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20060106

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20060418

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20060420

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090428

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100428

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110428

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120428

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130428

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140428

Year of fee payment: 8

LAPS Cancellation because of no payment of annual fees