JP3798582B2

JP3798582B2 - Contact character separation device and method, and recording medium

Info

Publication number: JP3798582B2
Application number: JP19534999A
Authority: JP
Inventors: 浩司黒川; 克仁藤本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1999-07-09
Filing date: 1999-07-09
Publication date: 2006-07-19
Anticipated expiration: 2019-07-09
Also published as: JP2001022885A

Description

【０００１】
【発明の属する技術分野】
本発明は、文字認識におけるあらゆる接触英字に柔軟に対応し高精度な分離を行うことができる接触文字分離装置及び方法及び記録媒体に関する。
【０００２】
日英混在文書では、誤って日本語を切断分離することが多くその結果、英字以外の文字の認識精度低下を伴うことが多いため、日本語の認識精度を維持しつつ、接触英字の分離を行うことが望まれている。
【０００３】
【従来の技術】
従来の接触英字の分離方式では、中川らの「英文ＯＣＲにおける接触文字切り出しアルゴリズム」と題される電子情報通信学会技術報告書（ＰＲＵ９３−１２８（１９９４−０１））による、英文中のある長さ以上の文字の上方からの輪郭を抽出するものがあった。
【０００４】
図７は従来例の説明図であり、図７（ａ）は上部輪郭の抽出の説明、図７（ｂ）は極大、極小の抽出の説明である。図７（ａ）において、集合英文字「ｓｅｐａｒａ」の上方の基準からの距離を求め上部の輪郭を抽出する。図７（ｂ）において、前記抽出した上部の輪郭を追跡し、極大、極小の位置を求め、極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補とするものであった。
【０００５】
ここでの極大、極小とは、基準からの距離が大きい極を極小、基準からの距離が小さい極を極大とする。
【０００６】
図８は従来例のフローチャートである。以下、接触文字分離部が行う処理を図８の処理Ｓ２１〜処理Ｓ２６に従って説明する。
【０００７】
Ｓ２１：接触文字分離部は、入力された文字画像の文字の上方からの輪郭を抽出して、処理Ｓ２２に移る。
【０００８】
Ｓ２２：接触文字分離部は、前記抽出した上部の輪郭を追跡し、極大、極小の位置を求め、処理Ｓ２３に移る。
【０００９】
Ｓ２３：接触文字分離部は、極小点があるかどうか判断する。この判断で極小点がある場合は処理Ｓ２４に移り、ない場合はこの処理を終了する。
【００１０】
Ｓ２４：接触文字分離部は、順番に極小点を選択し、処理Ｓ２５に移る。
【００１１】
Ｓ２５：接触文字分離部は、前記選択した極小点の極小と直前の極大との差がしきい値より大きいかどうか判断する。この判断で極小と直前の極大との差がしきい値より大きい場合は処理Ｓ２６に移り、小さい場合は処理Ｓ２３に戻る。
【００１２】
Ｓ２６：接触文字分離部は、前記判断で極小と直前の極大との差がしきい値より大きい場合、その極小点を切断候補に決定し、処理Ｓ２３に戻る。
【００１３】
【発明が解決しようとする課題】
前記従来のものには、次のような課題があった。
【００１４】
①：文字の上方からの輪郭のみの抽出しか行っていないことから、あらゆる英字接触に柔軟に対応することができなかった。
【００１５】
②：また、全文字列処理のため、日英混在文書では、誤って日本語を切断対象とすることが多く、その結果、英字以外の文字の認識精度低下を伴うことが多かった。
【００１６】
本発明は、このような従来の課題を解決し、従来技術では困難であった日本語中の接触英字の分離を、日本語に悪影響を与えることなく、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができるようにすることを目的とする。
【００１７】
【課題を解決するための手段】
図１は本発明の接触英字分離装置の説明図である。図１中、１は入力部、２は接触文字分離部、３は認識手段、２１は切断候補点選定手段、２２は切断候補削減手段である。
【００１８】
本発明は前記従来の課題を解決するため次のように構成した。
【００１９】
（１）：文字集合から日本語接触文字の分離を行い分離文字１を得る日本語接触文字分離手段と、得た分離文字１に対して文字認識を行い認識時の距離値によるＤＰマッチングを行って評価値を得る第一ＤＰマッチング手段と、得た評価値が閾値より小さい前記分離文字１を対象に接触英字分離を行い分離文字２を得る接触英字分離手段と、得た分離文字２に対してＤＰマッチングを行い分離文字３を得る第二ＤＰマッチング手段と、前記接触英字分離手段は、接触英字分離を行う前記分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手段２１と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手段２２とを備える。
【００２０】
（２）：前記（１）の接触文字分離装置において、前記接触英字分離手段は、前記分離文字１の中で一つでも評価値が閾値より小さい場合、分離前の文字集合を接触英字分離の対象とする。
【００２１】
（３）：前記（１）又は（２）の接触文字分離装置において、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除し、次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を次の切断点と決定し、文字の大きさから切断点とはなりえない前記決定した次の切断点付近の切断候補点を削除する切断候補削減手段とを備える。
【００２２】
（４）：日本語接触文字分離手段で文字集合から日本語接触文字の分離を行い分離文字１を得、第一ＤＰマッチング手段で得た分離文字１に対して文字認識を行い認識時の距離値によるＤＰマッチングを行って評価値を得、接触英字分離手段で得た評価値が閾値より小さい前記分離文字１を対象に接触英字分離を行い分離文字２を得、第二ＤＰマッチング手段で得た分離文字２に対してＤＰマッチングを行い分離文字３を得、前記接触英字分離手段で、接触英字分離を行う前記分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該抽出した各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する接触文字分離方法とする。
【００２３】
（５）：文字集合から日本語接触文字の分離を行い分離文字１を得る日本語接触文字分離手順と、得た分離文字１に対して文字認識を行い認識時の距離値によるＤＰマッチングを行って評価値を得る第一ＤＰマッチング手順と、得た評価値が閾値より小さい前記分離文字１を対象に接触英字分離を行い分離文字２を得る接触英字分離手順と、得た分離文字２に対してＤＰマッチングを行い分離文字３を得る第二ＤＰマッチング手順と、接触英字分離を行う前記分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手順と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手順とを実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体とする。
【００２４】
（作用）
前記構成に基づく作用を説明する。
【００２５】
切断候補点選定手段２１で接触英字分離を行う分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、切断候補削減手段２２で前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する。このため、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる。
【００２６】
また、前記極大、極小の位置を求めるのに、前記輪郭を微分する。このため、極大、極小の位置を容易に求めることができる。
【００２７】
さらに、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除し、次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を次の切断点と決定し、文字の大きさから切断点とはなりえない前記次の切断点付近の切断候補点を削除する。このため、不要な切断候補点を容易に削除することができる。
【００２８】
また、接触英字分離を行う分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該抽出した各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する接触文字分離方法とする。このため、あらゆる接触英字に柔軟に対応し高精度な分離を行う方法を提供することができる。
【００２９】
さらに、接触英字分離を行う分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手順と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手順と、を実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体とする。このため、この記録媒体のプログラムをコンピュータにインストールすることで、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる接触文字分離装置を容易に提供することができる。
【００３０】
【発明の実施の形態】
（１）：接触英字分離装置の説明
図１は本発明の実施の形態における接触英字分離装置の説明図である。図１において、接触英字分離装置には、入力部１、接触文字分離部２、認識手段３が設けてあり、接触文字分離部２には、切断候補点選定手段２１、切断候補削減手段２２が設けてある。
【００３１】
入力部１は、認識する文字画像を入力するものである。接触文字分離部２は、接触した文字の分離処理を行うものである。認識手段３は、接触文字分離部２で分離した文字の認識を行うものである。切断候補点選定手段２１は、文字の上方、下方、内部上方、内部下方から輪郭を追跡して切断候補点を選定するものである。切断候補削減手段２２は、切断候補点選定手段２１で選定した切断候補点を削減するものである。
【００３２】
なお、認識手段３を接触英字分離装置に含めて説明したが、別に設けることもできる。
【００３３】
図２は輪郭の抽出方向の説明図である。図２において、接触文字分離部２では、「ｓ」の文字の上方、下方、内部上方、内部下方から輪郭を抽出する。ここで、上方輪郭とは、矢印（ａ）のように上方の基準からの距離を求め上部の輪郭を抽出するものである。下方輪郭とは、矢印（ｄ）のように下方の基準からの距離を求め下部の輪郭を抽出するものである。内部上方輪郭とは、矢印（ｂ）のように上方輪郭をとる文字線の下部の輪郭を抽出するものである。内部下方輪郭とは、矢印（ｃ）のように下方輪郭をとる文字線の上部の輪郭を抽出するものである。
【００３４】
（接触英字分離動作の説明）
先ず、入力部１から入力された文字画像を、接触文字分離部２で日本語接触文字の切り出しを行い、認識手段３で文字認識を行う。この認識結果から得られた評価値を用いて認識精度の悪い文字を取り出し、接触文字分離部２で接触英字の分離法を用いて文字の分離を行う。
【００３５】
接触文字分離部２は、切断候補点として、文字の上方、下方、内部上方、内部下方から輪郭を抽出する。そして、切断候補点選定手段２１は、その輪郭を追跡し、それぞれの輪郭の極大、極小を求め、極小と直前の極大との差がしきい値より大きい場合に、その極小を切断候補とする。
【００３６】
切断候補削減手段２２は、この時得られた四つの方向からの切断候補点集合を、縦方向の黒画素数が小さい順に切断点とする。その際、決定した切断点付近（文字の大きさから考えて切断点とはなり得ない範囲）に切断候補点が存在した場合、その点は切断候補から削除する。
【００３７】
次に、残った切断候補点のなかで黒画素数が小さいものを求め前記と同様の処理を行う。これを繰り返し切断点を決定する。決定された切断点で切断を行い、文字認識を行い、認識距離値によるＤＰマッチング（認識結果の一番良い切断点を決める処理）を行い、最適な切断点を求める。
【００３８】
接触英字の分離を行った文字の認識結果が、記号等（文字でない）であった場合、漢字を誤切り出ししている可能性が高い。このため、認識評価値を下げてその認識結果の優先度を下げるようにする。
【００３９】
このようにすると、従来技術では困難であった日本語中の接触英字の分離を、日本語に悪影響を与えることなく、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる。
【００４０】
（２）：認識結果を用いた接触英字分離の基本処理の説明
先ず、日本語接触文字の切り出しを行い、ついで文字認識を行い認識結果を得る。この認識結果から得られる評価値を用いて、文字認識の結果が確からしいかどうか判定を行う。例えば、文字認識時の距離値を評価値とし、距離がしきい値以上のものは英字が接触し、漢字などに誤読され距離が大きくなっていると推定し、接触英字の分離法を用いて、その文字の分離を行う対象文字として選定する。
【００４１】
なお、評価値は距離値に限定されず、例えば、正読確率等を用いることも可能である。また、選定された文字が日本語接触文字の切り出しで分離された文字であった場合、その文字から分離されたその他の文字の評価値が良かった場合でも、分離前の文字を接触英字分離対象文字とする。
【００４２】
図３は認識結果を用いた接触英字分離の基本処理フローチャートである。以下、図３の処理Ｓ１〜処理Ｓ３に従って説明する。
【００４３】
Ｓ１：接触文字分離部２は、日本語接触文字の切り出しを行い、認識手段３で文字認識を行い、認識結果を得て、処理Ｓ２に移る。
【００４４】
Ｓ２：接触文字分離部２は、認識結果から得られた評価値を用いて、文字認識の結果が確からしいかどうか判断する。この判断で文字認識の結果が確からしい場合はこの処理を終了し、もし確からしくない（評価値が悪い）場合は処理Ｓ３に移る。
【００４５】
Ｓ３：接触文字分離部２は、英字接触文字分離法を用いて文字の分離を行い、この処理を終了する。
【００４６】
（３）：接触英字の分離処理の説明
先ず、文字認識時の評価値がしきい値以上のものは、英字が接触し漢字などに誤読されていると推定し、その文字の輪郭を抽出する。輪郭の抽出は、文字の上方、下方、内部上方、内部下方から行う。その輪郭を追跡し、それぞれの輪郭の極大、極小を求める。
【００４７】
文字の上方から得られた極大点、極小点について極小点と直前の極大点とを比較し、その差がしきい値より大きい場合、その極小点を切断候補とする。それを全ての極小点に対して行い、文字の上方の輪郭から得られた切断候補点集合とする。なお、切断候補点集合を得る際、輪郭を微分したものから極大、極小点を求め前記処理を行うことも可能である。
【００４８】
前記同様の接触英字分離処理を、文字の下方、内部上方、内部下方からも行い、それぞれの輪郭からの切断候補点集合を得る。
【００４９】
次に、四つの輪郭から得られた切断候補点集合の中から不要な切断候補点を削除する。これは縦方向の黒画素数が最も小さい切断候補点を先ず切断点とする。そして、この決定した切断点付近に切断候補点が存在した場合、その点は不要な切断候補点と判断し、候補から削除する。
【００５０】
次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を求め同様の処理を行う。これを繰り返し切断候補点の削減、切断点の決定を行う。
【００５１】
決定された切断点に従い文字を分離し、文字認識、認識距離値によるＤＰマッチングを行い、最適な切断点を求め、接触英字分離後の認識結果を得る。そして、その認識結果が記号等であった場合、漢字を誤切り出ししている可能性が高いと考え、認識評価値を下げて、その認識結果の優先度を下げる（使わない）ようにする。
【００５２】
図４は接触英字の分離処理フローチャートである。以下、図４の処理Ｓ１１〜処理Ｓ１５に従って説明する。
【００５３】
Ｓ１１：接触文字分離部２は、認識手段３での文字の認識評価値の悪かった文字を抽出し、処理Ｓ１２に移る。
【００５４】
Ｓ１２：接触文字分離部２は、文字の上方、下方、内部上方、内部下方のそれぞれからの輪郭を抽出し、その極大、極小点から、切断候補点集合を求め、処理Ｓ１３に移る。
【００５５】
Ｓ１３：接触文字分離部２は、切断候補点はあるかどうか判断する。この判断で切断候補点がある場合は処理Ｓ１４に移り、ない場合は切断点の決定を行い処理を終了する。
【００５６】
Ｓ１４：接触文字分離部２は、縦方向の黒画素数が最も小さい切断候補点を切断点とし、処理Ｓ１５に移る。
【００５７】
Ｓ１５：接触文字分離部２は、切断点付近にある切断候補点を削除し、処理Ｓ１３に戻る。
【００５８】
（４）：接触英字の分離法の全体の流れの説明
図５は接触英字の分離法の全体の流れの説明図である。以下、図５の①〜⑥に従って接触英字の分離法の全体の流れを説明する。
【００５９】
①ここで、文字集合（集合文字）Ａ、Ｂ、Ｃ、Ｄ、Ｅ、Ｆを接触英字の分離を行う入力画像とする。
【００６０】
②先ず、文字集合Ａ、Ｂ、Ｃ、Ｄ、Ｅ、Ｆの日本語接触文字分離を行う。この分離で集合文字ＣはＣ₁、Ｃ₂、Ｃ₃、Ｃ₄に分離された。
【００６１】
③文字認識を行い、認識時の距離値によるＤＰマッチングを行う。このＤＰマッチングでの評価値の良かったものは○、良くなかったもの（斜線のもの）×で示してある。
【００６２】
④認識結果から得られた評価値が悪い文字を取り出し、接触文字分離法の対象とする（斜線の集合文字Ｂ、Ｃ、Ｄ）。ここで日本語接触文字の切り出しで分離された文字で、一つでも評価値の悪い文字があった場合、その文字から分離されたその他の文字の評価値が良かった場合でも、分離前の文字（前記①の文字集合）を接触英字分離対象文字とする。
【００６３】
⑤斜線の集合文字Ｂ、Ｃ、Ｄの接触英字分離を行う。分離した文字の文字認識を行い、集合文字Ｂは文字Ｂ₁、Ｂ₂に分離し、認識結果はそれぞれ「英字」となった。また、集合文字Ｃは文字Ｃ₁、Ｃ₂、Ｃ₃、Ｃ₄、Ｃ₅に分離し、認識結果は文字Ｃ₁、Ｃ₂が「英字」、文字Ｃ₃、Ｃ₄、Ｃ₅が「かな」となった。さらに、集合文字Ｄは文字Ｄ₁、Ｄ₂、Ｄ₃に分離し認識結果はそれぞれ「記号」となった。なお、文字Ｃ₃、Ｃ₄、Ｃ₅、Ｄ₁、Ｄ₂、Ｄ₃のように認識結果が記号、かな等であった場合、漢字を誤切り出ししている可能性が高いと考え、認識評価値を下げてその認識結果の優先度を下げる。
【００６４】
⑥文字認識時の距離値によるＤＰマッチングを行う。このＤＰマッチング後の結果、文字Ａが「漢字」、文字Ｂ₁、Ｂ₂、Ｃ₁が「英字」、文字Ｃ₂とＣ₃を合わせた文字が「英字」、文字Ｃ₄とＣ₅を合わせた文字が「英字」、文字Ｄ₁とＤ₂とＤ₃を合わせた文字が「漢字」、文字ＥＦが「漢字」として、分離して認識される。
【００６５】
（５）：切断候補点および候補削減結果の具体例による説明
図６は四つの輪郭から得られる切断候補点および候補削減結果の例の説明図である。図６において、集合文字「ＵＮＩＸ」の上方、下方、内部上方、内部下方から輪郭を追跡し、それぞれの輪郭の極大、極小を求める。そして、文字の上方から得られた極大点、極小点について極小点と直前の極大点とを比較し、その差がしきい値より大きい場合、その極小点を切断候補とする。それを全ての極小点に対して行い、文字の上方の輪郭から得られた切断候補点集合とする。同様の接触英字分離処理を、文字の下方、内部上方、内部下方からも行い、それぞれの輪郭からの切断候補点集合（矢印（ａ）、（ｂ）、（ｃ））を得る。
【００６６】
次に、四つの輪郭から得られた切断候補点集合の中から不要な切断候補点（ａ）を削除する。この切断候補点（ａ）は、切断点（矢印（ｂ）又は（ｃ））付近にある。なお、この切断点付近とは、例えば分離対象文字の高さの何分の１とし、文字の大きさからは切断点とはなりえない長さとすることができる。
【００６７】
次に、削減して残った切断候補点を切断点と決定する（矢印（ｂ）、（ｃ））。そして、決定された切断点に従い文字を分離し、文字認識、認識距離値によるＤＰマッチングを行い、最適な切断点を求める（矢印（ｃ））。
【００６８】
（６）：プログラムのインストールの説明
入力部１、接触文字分離部２、認識手段３、切断候補点選定手段２１、切断候補削減手段２２等は、プログラムで構成でき、主制御部（ＣＰＵ）が実行するものであり、主記憶に格納されているものである。このプログラムは、一般的な、コンピュータで処理されるものである。このコンピュータは、主制御部、主記憶、ファイル装置、表示装置、キーボード等の入力手段である入力装置などのハードウェアで構成されている。
【００６９】
このコンピュータに、本発明のプログラムをインストールする。このインストールは、フロッピィ、光磁気ディスク等の可搬型の記録（記憶）媒体に、これらのプログラムを記憶させておき、コンピュータが備えている記録媒体に対して、アクセスするためのドライブ装置を介して、或いは、ＬＡＮ等のネットワークを介して、コンピュータに設けられたファイル装置にインストールされる。そして、このファイル装置から処理に必要なプログラムステップを主記憶に読み出し、主制御部が実行するものである。
【００７０】
【発明の効果】
以上説明したように、本発明によれば次のような効果がある。
【００７１】
（１）：切断候補点選定手段で接触英字分離を行う分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、切断候補削減手段で前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除するため、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる。
【００７２】
（２）：極大、極小の位置を求めるのに、輪郭を微分するため、極大、極小の位置を容易に求めることができる。
【００７３】
（３）：切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除し、次に、残った切断候補点の中で縦方向の黒画素数が最も小さい切断候補点を次の切断点と決定し、文字の大きさから切断点とはなりえない次の切断点付近の切断候補点を削除するため、不要な切断候補点を容易に削除することができる。
【００７４】
（４）：接触英字分離を行う分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該抽出した各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とし、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する接触文字分離方法とするため、あらゆる接触英字に柔軟に対応し高精度な分離を行う方法を提供することができる。
【００７５】
（５）：接触英字分離を行う分離文字１の上方、下方、内部上方、内部下方からの輪郭を抽出して、該各輪郭を追跡し、極大、極小の位置を求め、該求めた極小と直前の極大との差がしきい値より大きい場合に、その極小点を切断候補点とする切断候補点選定手順と、前記切断候補点の縦方向の黒画素数が最も小さい切断候補点を切断点と決定し、文字の大きさから切断点とはなりえない前記決定した切断点付近の切断候補点を削除する切断候補削減手順とを実行させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体とするため、この記録媒体のプログラムをコンピュータにインストールすることで、あらゆる接触英字に柔軟に対応し高精度な分離を行うことができる接触文字分離装置を容易に提供することができる。
【図面の簡単な説明】
【図１】本発明の接触英字分離装置の説明図である。
【図２】実施の形態における輪郭の抽出方向の説明図である。
【図３】実施の形態における認識結果を用いた接触英字分離の基本処理フローチャートである。
【図４】実施の形態における接触英字の分離処理フローチャートである。
【図５】実施の形態における接触英字の分離法の全体の流れの説明図である。
【図６】実施の形態における四つの輪郭から得られる切断候補点および候補削減結果の例の説明図である。
【図７】従来例の説明図である。
【図８】従来例のフローチャートである。
【符号の説明】
１入力部
２接触文字分離部
３認識手段
２１切断候補点選定手段
２２切断候補削減手段
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a contact character separating apparatus and method, and a recording medium that can flexibly cope with any contact alphabetic characters in character recognition and perform high-accuracy separation.
[0002]
In mixed Japanese-English documents, Japanese characters are often cut apart by mistake, which often lowers the recognition accuracy of non-English characters. It is therefore desirable to separate contact alphabetic characters while maintaining the recognition accuracy for Japanese.
[0003]
[Prior art]
Among conventional methods for separating contact alphabetic characters, there is one that extracts the contour, as seen from above, of characters of a certain length or more in English text. It is described by Nakagawa et al. in the technical report of the Institute of Electronics, Information and Communication Engineers (PRU93-128 (1994-01)) titled “Contact Character Extraction Algorithm in English OCR”.
[0004]
FIG. 7 is an explanatory diagram of the conventional example: FIG. 7A illustrates extraction of the upper contour, and FIG. 7B illustrates extraction of the maxima and minima. In FIG. 7A, the distance from a reference above the English character group “separa” is obtained, and the upper contour is extracted. In FIG. 7B, the extracted upper contour is traced and the positions of the maxima and minima are obtained; when the difference between a minimum and the immediately preceding maximum is larger than a threshold value, that minimum point is taken as a cutting candidate.
[0005]
Here, an extremum whose distance from the reference is large is called a local minimum, and an extremum whose distance from the reference is small is called a local maximum.
[0006]
FIG. 8 is a flowchart of a conventional example. Hereinafter, processing performed by the contact character separation unit will be described according to processing S21 to processing S26 of FIG.
[0007]
S21: The contact character separation unit extracts a contour from above the character of the input character image, and proceeds to processing S22.
[0008]
S22: The contact character separation unit tracks the extracted outline of the upper part, obtains the maximum and minimum positions, and proceeds to step S23.
[0009]
S23: The contact character separation unit determines whether there is a minimum point. If there is a minimum point in this determination, the process proceeds to step S24, and if not, this process ends.
[0010]
S24: The contact character separation unit sequentially selects the minimum points, and proceeds to processing S25.
[0011]
S25: The contact character separation unit determines whether the difference between the minimum of the selected minimum point and the maximum immediately before is greater than a threshold value. If it is determined that the difference between the local minimum and the previous local maximum is larger than the threshold, the process proceeds to step S26, and if it is smaller, the process returns to step S23.
[0012]
S26: If the difference between the local minimum and the previous local maximum is greater than the threshold value in the above determination, the contact character separation unit determines the local minimum point as a cutting candidate, and returns to step S23.
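The conventional flow S21 to S26 can be sketched as follows. This is a minimal illustration, not the cited report's implementation: the binary image representation (rows of 0/1 with 1 as black), the function names `upper_contour` and `cut_candidates`, the treatment of the contour's first column as an initial maximum, and the handling of flat runs are all assumptions. Following the convention above, a "minimum" is a point far from the upper reference and a "maximum" is a point close to it.

```python
from typing import List, Sequence


def upper_contour(image: Sequence[Sequence[int]]) -> List[int]:
    """Distance from the top edge to the first black pixel (1) in each column.

    Columns without any black pixel get the image height (assumed convention).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    contour = []
    for x in range(width):
        depth = height
        for y in range(height):
            if image[y][x]:
                depth = y
                break
        contour.append(depth)
    return contour


def cut_candidates(contour: Sequence[int], threshold: int) -> List[int]:
    """Columns of minima (large distance) that exceed the immediately
    preceding maximum (small distance) by more than `threshold`."""
    candidates: List[int] = []
    if not contour:
        return candidates
    last_maximum = contour[0]  # assumption: the first column acts as a maximum
    direction = 0              # +1 while the distance grows, -1 while it shrinks
    for x in range(1, len(contour)):
        step = contour[x] - contour[x - 1]
        if step > 0:
            if direction < 0:
                last_maximum = contour[x - 1]  # distance stopped shrinking
            direction = 1
        elif step < 0:
            # distance stopped growing at x - 1: that column is a minimum
            if direction > 0 and contour[x - 1] - last_maximum > threshold:
                candidates.append(x - 1)
            direction = -1
    return candidates
```

For the contour `[5, 2, 1, 2, 6, 2, 1, 3, 5]` and a threshold of 3, the only minimum is at column 4 (distance 6), and it exceeds the preceding maximum (distance 1) by 5, so column 4 becomes a cutting candidate.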
[0013]
[Problems to be solved by the invention]
The conventional device has the following problems.
[0014]
① Because only the contour from above the characters is extracted, the method could not flexibly cope with every kind of contact between alphabetic characters.
[0015]
② Also, because the entire character string is processed, Japanese characters in mixed Japanese-English documents are often mistakenly targeted for cutting, and as a result the recognition accuracy of non-English characters is often lowered.
[0016]
The present invention solves these conventional problems. Its object is to make it possible to separate contact alphabetic characters within Japanese text, which was difficult with the prior art, flexibly for any contact alphabetic characters and with high accuracy, without adversely affecting the Japanese text.
[0017]
[Means for Solving the Problems]
FIG. 1 is an explanatory view of a contact alphabet separating apparatus of the present invention. In FIG. 1, 1 is an input unit, 2 is a contact character separation unit, 3 is a recognition unit, 21 is a cutting candidate point selection unit, and 22 is a cutting candidate reduction unit.
[0018]
The present invention is configured as follows to solve the conventional problems.
[0019]
(1): The apparatus comprises: Japanese contact character separation means for separating Japanese contact characters from a character set to obtain separated characters 1; first DP matching means for performing character recognition on the obtained separated characters 1 and performing DP matching based on the distance values at the time of recognition to obtain evaluation values; contact alphabet separation means for performing contact alphabet separation on those separated characters 1 whose obtained evaluation value is smaller than a threshold to obtain separated characters 2; and second DP matching means for performing DP matching on the obtained separated characters 2 to obtain separated characters 3. The contact alphabet separation means comprises: cutting candidate point selection means 21 for extracting contours from above, below, inside above, and inside below of the separated character 1 subject to contact alphabet separation, tracing each contour to obtain the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than a threshold, taking that minimum point as a cutting candidate point; and cutting candidate reduction means 22 for determining the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point and deleting cutting candidate points near the determined cutting point that cannot be cutting points given the character size.
[0020]
(2): In the contact character separating apparatus according to (1), the contact alphabet separation means takes the character set before separation as the target of contact alphabet separation when the evaluation value of even one of the separated characters 1 is smaller than the threshold.
[0021]
(3): In the contact character separating apparatus according to (1) or (2), the cutting candidate reduction means determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size, then determines, among the remaining cutting candidate points, the one with the smallest number of black pixels in the vertical direction as the next cutting point, and deletes cutting candidate points near that next cutting point that cannot be cutting points given the character size.
[0022]
(4): A contact character separation method in which: Japanese contact character separation means separates Japanese contact characters from a character set to obtain separated characters 1; first DP matching means performs character recognition on the obtained separated characters 1 and performs DP matching based on the distance values at the time of recognition to obtain evaluation values; contact alphabet separation means performs contact alphabet separation on those separated characters 1 whose obtained evaluation value is smaller than a threshold to obtain separated characters 2; second DP matching means performs DP matching on the obtained separated characters 2 to obtain separated characters 3; and the contact alphabet separation means extracts contours from above, below, inside above, and inside below of the separated character 1 subject to contact alphabet separation, traces each extracted contour to obtain the positions of the maxima and minima, takes a minimum point as a cutting candidate point when the difference between that minimum and the immediately preceding maximum is larger than a threshold, determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size.
[0023]
(5): A computer-readable recording medium on which is recorded a program for executing: a Japanese contact character separation procedure for separating Japanese contact characters from a character set to obtain separated characters 1; a first DP matching procedure for performing character recognition on the obtained separated characters 1 and performing DP matching based on the distance values at the time of recognition to obtain evaluation values; a contact alphabet separation procedure for performing contact alphabet separation on those separated characters 1 whose obtained evaluation value is smaller than a threshold to obtain separated characters 2; a second DP matching procedure for performing DP matching on the obtained separated characters 2 to obtain separated characters 3; a cutting candidate point selection procedure for extracting contours from above, below, inside above, and inside below of the separated character 1 subject to contact alphabet separation, tracing each contour to obtain the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than a threshold, taking that minimum point as a cutting candidate point; and a cutting candidate reduction procedure for determining the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point and deleting cutting candidate points near the determined cutting point that cannot be cutting points given the character size.
[0024]
(Function)
The operation based on the above configuration will be described.
[0025]
The cutting candidate point selection means 21 extracts contours from above, below, inside above, and inside below of the separated character 1 subject to contact alphabet separation, traces each contour to obtain the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than the threshold, takes that minimum point as a cutting candidate point. The cutting candidate reduction means 22 determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point and deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size. For this reason, it is possible to flexibly cope with any contact alphabetic characters and perform high-precision separation.
[0026]
Further, the contour is differentiated in order to obtain the maximum and minimum positions. For this reason, the maximum and minimum positions can be easily obtained.
[0027]
Further, the cutting candidate point with the smallest number of black pixels in the vertical direction is determined as a cutting point, and cutting candidate points near the determined cutting point that cannot be cutting points given the character size are deleted. Then, among the remaining cutting candidate points, the one with the smallest number of black pixels in the vertical direction is determined as the next cutting point, and cutting candidate points near that next cutting point that cannot be cutting points given the character size are deleted. For this reason, unnecessary cutting candidate points can be easily deleted.
[0028]
Further, the contact character separation method extracts contours from above, below, inside above, and inside below of the separated character 1 subject to contact alphabet separation, traces each extracted contour to obtain the positions of the maxima and minima, takes a minimum point as a cutting candidate point when the difference between that minimum and the immediately preceding maximum is larger than the threshold, determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size. For this reason, it is possible to provide a method for flexibly dealing with any contact alphabetic characters and performing high-precision separation.
[0029]
Further, a computer-readable recording medium records a program for executing: a cutting candidate point selection procedure that extracts contours from above, below, inside above, and inside below of the separated character 1 subject to contact alphabet separation, traces each contour to obtain the positions of the maxima and minima, and, when the difference between an obtained minimum and the immediately preceding maximum is larger than the threshold, takes that minimum point as a cutting candidate point; and a cutting candidate reduction procedure that determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point and deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size. For this reason, by installing the program of the recording medium in a computer, it is possible to easily provide a contact character separation device that can flexibly handle any contact alphabetic characters and perform high-precision separation.
[0030]
DETAILED DESCRIPTION OF THE INVENTION
(1): Explanation of the contact alphabet separating device
FIG. 1 is an explanatory diagram of the contact alphabet separating device in an embodiment of the present invention. In FIG. 1, the contact alphabet separating device includes an input unit 1, a contact character separation unit 2, and recognition means 3, and the contact character separation unit 2 includes cutting candidate point selection means 21 and cutting candidate reduction means 22.
[0031]
The input unit 1 inputs a character image to be recognized. The contact character separation unit 2 performs a process for separating the touched characters. The recognition unit 3 recognizes the character separated by the contact character separation unit 2. The cutting candidate point selection means 21 is to select cutting candidate points by tracing the outline from above, below, inside above, and inside below characters. The cutting candidate reduction means 22 reduces cutting candidate points selected by the cutting candidate point selection means 21.
[0032]
In addition, although the recognition means 3 was included in the contact alphabet separator, it can also be provided separately.
[0033]
FIG. 2 is an explanatory diagram of the contour extraction direction. In FIG. 2, the contact character separation unit 2 extracts contours from above, below, inside above, and inside below the character “s”. Here, the upper contour is for obtaining the distance from the upper reference as shown by the arrow (a) and extracting the upper contour. The lower contour is to obtain the distance from the lower reference as shown by the arrow (d) and extract the lower contour. The internal upper contour is for extracting a lower contour of a character line having an upper contour as indicated by an arrow (b). The internal lower outline is for extracting an upper outline of a character line having a lower outline as indicated by an arrow (c).
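The four contour directions of FIG. 2 can be sketched per pixel column as follows. This is only an illustrative reading of the description: the function name `four_contours`, the 0/1 image representation, and the choice to report every contour as a row index counted from the top (with `None` for empty columns) are assumptions. A distance from the lower reference, as used for the lower contour, could be derived as `height - 1 - row`.

```python
from typing import Dict, List, Optional, Sequence


def four_contours(image: Sequence[Sequence[int]]) -> Dict[str, List[Optional[int]]]:
    """Per-column row indices of the four contours (1 = black pixel)."""
    height = len(image)
    width = len(image[0]) if height else 0
    names = ("upper", "inner_upper", "inner_lower", "lower")
    contours: Dict[str, List[Optional[int]]] = {name: [] for name in names}
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        black_rows = [y for y, value in enumerate(column) if value]
        if not black_rows:
            for name in names:
                contours[name].append(None)  # no stroke in this column
            continue
        top, bottom = black_rows[0], black_rows[-1]
        # Inner upper contour: underside of the stroke that forms the upper contour.
        underside = top
        while underside + 1 < height and column[underside + 1]:
            underside += 1
        # Inner lower contour: top side of the stroke that forms the lower contour.
        topside = bottom
        while topside - 1 >= 0 and column[topside - 1]:
            topside -= 1
        contours["upper"].append(top)
        contours["inner_upper"].append(underside)
        contours["inner_lower"].append(topside)
        contours["lower"].append(bottom)
    return contours
```

In a column crossed by a single stroke, the inner upper contour coincides with the lower contour and the inner lower contour with the upper contour; the inner contours only carry extra information where a column crosses two or more strokes, as in the letter "s".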
[0034]
(Explanation of contact alphabet separation operation)
First, the contact character separation unit 2 cuts out a Japanese contact character from the character image input from the input unit 1, and the recognition unit 3 performs character recognition. Characters with poor recognition accuracy are extracted using the evaluation value obtained from the recognition result, and the contact character separation unit 2 separates the characters using the contact alphabet separation method.
[0035]
To obtain cutting candidate points, the contact character separation unit 2 extracts contours from above, below, inside above, and inside below the character. The cutting candidate point selection means 21 then traces each contour, obtains its maxima and minima, and, when the difference between a minimum and the immediately preceding maximum is larger than the threshold, takes that minimum as a cutting candidate.
[0036]
The cutting candidate reduction unit 22 sets the cutting candidate point sets from the four directions obtained at this time as cutting points in ascending order of the number of black pixels in the vertical direction. At that time, if a cutting candidate point exists in the vicinity of the determined cutting point (a range that cannot be a cutting point considering the size of the character), the point is deleted from the cutting candidates.
[0037]
Next, the remaining cutting candidate points having a small number of black pixels are obtained and the same processing as described above is performed. This is repeated to determine the cutting point. Cutting is performed at the determined cutting point, character recognition is performed, DP matching based on the recognition distance value (processing for determining the cutting point with the best recognition result) is performed, and an optimal cutting point is obtained.
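The candidate reduction performed by the cutting candidate reduction means 22 amounts to a greedy selection, which might look like the sketch below. The names `column_black_counts` and `reduce_candidates` and the parameter `min_gap` are hypothetical; `min_gap` stands for the "vicinity" range, which the text only describes as a range that cannot contain another cutting point given the character size. Ties in black pixel count are broken by column position, which is also an assumption.

```python
from typing import List, Sequence


def column_black_counts(image: Sequence[Sequence[int]]) -> List[int]:
    """Number of black pixels (1) in each column of a binary image."""
    height = len(image)
    width = len(image[0]) if height else 0
    return [sum(image[y][x] for y in range(height)) for x in range(width)]


def reduce_candidates(candidates: Sequence[int],
                      black_counts: Sequence[int],
                      min_gap: int) -> List[int]:
    """Greedily pick cutting points in ascending order of vertical black
    pixel count, discarding candidates closer than `min_gap` columns to a
    point that has already been chosen."""
    remaining = sorted(set(candidates), key=lambda x: (black_counts[x], x))
    cut_points: List[int] = []
    while remaining:
        best = remaining.pop(0)          # fewest black pixels among the rest
        cut_points.append(best)
        remaining = [x for x in remaining if abs(x - best) >= min_gap]
    return sorted(cut_points)
```

With candidates at columns 3, 4, 10, 11 and 20, black pixel counts of 2, 1, 3, 3 and 1 respectively, and `min_gap=3`, column 4 is chosen first and removes column 3, column 20 is chosen next, and column 10 is chosen last and removes column 11.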
[0038]
If the recognition result of the character from which the contact alphabet has been separated is a symbol or the like (not a character), there is a high possibility that the Chinese character has been erroneously cut out. For this reason, the recognition evaluation value is lowered to lower the priority of the recognition result.
[0039]
In this way, contact alphabetic characters within Japanese text, which were difficult to separate with the prior art, can be separated with high accuracy and flexibly for any contact alphabetic characters, without adversely affecting the Japanese text.
[0040]
(2): Description of basic processing for separating contact alphabetic characters using recognition results
First, Japanese contact characters are cut out, and then character recognition is performed to obtain recognition results. The evaluation value obtained from the recognition result is used to judge whether the character recognition result is plausible. For example, the distance value at the time of character recognition is used as the evaluation value; when the distance is equal to or greater than a threshold, it is presumed that alphabetic characters are touching and have been misread as kanji or the like, making the distance large, and the character is selected as a target character to be separated using the contact alphabet separation method.
[0041]
The evaluation value is not limited to the distance value; for example, a correct-reading probability or the like can also be used. Also, if the selected character is one that was separated by the Japanese contact character cut-out, the character before separation is taken as the target of contact alphabet separation, even if the other characters separated from it had good evaluation values.
[0042]
FIG. 3 is a basic processing flowchart of contact alphabet separation using the recognition result. Hereinafter, description will be given according to the processing S1 to processing S3 of FIG.
[0043]
S1: The contact character separation unit 2 cuts out Japanese contact characters, performs character recognition by the recognition means 3, obtains a recognition result, and proceeds to processing S2.
[0044]
S2: The contact character separation unit 2 determines whether or not the character recognition result is probable using the evaluation value obtained from the recognition result. If it is determined that the character recognition result is probable, the process is terminated. If the character recognition result is uncertain (the evaluation value is bad), the process proceeds to process S3.
[0045]
S3: The contact character separation unit 2 performs character separation using the English character contact character separation method, and ends this process.
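The decision in S2, together with the rule that one poorly recognized piece sends the whole pre-separation character back for contact alphabet separation, can be sketched as below. The `Piece` record, its field names, and `select_targets` are hypothetical, and the evaluation value is assumed to be a recognition distance where larger means worse, as in the example given in the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Piece:
    source_id: int    # index of the input character set this piece was cut from
    distance: float   # recognition distance of the piece (smaller is better)


def select_targets(pieces: Sequence[Piece], threshold: float) -> List[int]:
    """Return the ids of the pre-separation character sets that should go
    through contact alphabet separation (S3).

    A set is selected as soon as any one of its pieces has a distance at or
    above the threshold, even if its other pieces were recognized well.
    """
    targets: List[int] = []
    for piece in pieces:
        if piece.distance >= threshold and piece.source_id not in targets:
            targets.append(piece.source_id)
    return targets
```

For example, if set 2 was cut into four pieces with distances 0.1, 0.8, 0.2 and 0.3 and the threshold is 0.5, the single poor piece is enough to select the whole of set 2.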
[0046]
(3): Explanation of contact alphabet separation processing
First, a character whose evaluation value at the time of character recognition is equal to or greater than the threshold is presumed to be alphabetic characters that are touching and have been misread as kanji or the like, and the contour of that character is extracted. Contours are extracted from above, below, inside above, and inside below the character. Each contour is traced, and its maxima and minima are obtained.
[0047]
For the local maxima and minima obtained from the contour above the character, each local minimum is compared with the local maximum immediately preceding it; if the difference is larger than a threshold value, that local minimum is taken as a cutting candidate. This is done for all local minima, yielding the set of cutting candidate points obtained from the contour above the character. When obtaining the set of cutting candidate points, the local maxima and minima may also be found from the differentiated contour before the above processing is applied.
[0048]
The same processing is also applied to the contours taken from below the character, from its inner upper side, and from its inner lower side, yielding a set of cutting candidate points from each contour.
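Candidate selection on a single contour can be sketched as below. The sketch follows the convention stated in paragraph [0005]: a "minimum" is the point farthest from the reference and a "maximum" the point nearest to it, so on a distance profile the minima appear as peaks. Extrema are located from the sign of the first difference, in the spirit of the differentiated contour. The function name `cut_candidates`, the handling of flat runs, and the use of the first sample as the initial "maximum" are assumptions.

```python
def cut_candidates(profile, threshold):
    """Return column indices of cutting candidates for one distance profile.

    profile[x] is the distance from the reference line to the contour at column x.
    A candidate is a point farthest from the reference (a "minimum" in the
    document's terms) whose distance exceeds that of the immediately preceding
    nearest point (a "maximum") by more than `threshold`.
    """
    if not profile:
        return []
    candidates = []
    last_max = profile[0]   # assumed: the first sample acts as the initial "maximum"
    prev_sign = 0
    for x in range(1, len(profile)):
        diff = profile[x] - profile[x - 1]      # first difference of the contour
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue                            # flat run: wait for the next change
        if prev_sign > 0 and sign < 0:
            # Distance rose, then fell: x - 1 is a "minimum" (deepest point).
            if profile[x - 1] - last_max > threshold:
                candidates.append(x - 1)
        elif prev_sign < 0 and sign > 0:
            # Distance fell, then rose: x - 1 is a "maximum" (shallowest point).
            last_max = profile[x - 1]
        prev_sign = sign
    return candidates
```

Running the same function on each of the four profiles and taking the union of the results gives the combined candidate set.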
[0049]
Next, unnecessary cutting candidate points are deleted from the cutting candidate point set obtained from the four contours. The cutting candidate point having the smallest number of black pixels in the vertical direction is first set as a cutting point. Any cutting candidate point that lies in the vicinity of the determined cutting point is judged unnecessary and deleted from the candidates.
[0050]
Next, the cutting candidate point having the smallest number of black pixels in the vertical direction among the remaining cutting candidate points is obtained and the same processing is performed. This is repeated to reduce cutting candidate points and determine cutting points.
[0051]
Characters are separated at the determined cutting points, character recognition and DP matching based on the recognition distance values are performed, the optimal cutting points are obtained, and the recognition result after contact alphabet separation is obtained. If the recognition result is a symbol or the like, it is considered highly likely that a kanji has been cut apart by mistake, so the recognition evaluation value is lowered to reduce the priority of that recognition result (so that it is not used).
[0052]
FIG. 4 is a flowchart of contact alphabet separation processing. Hereinafter, description will be given according to processing S11 to processing S15 of FIG.
[0053]
S11: The contact character separation unit 2 takes out a character for which the recognition means 3 produced a poor recognition evaluation value, and proceeds to processing S12.
[0054]
S12: The contact character separation unit 2 extracts contours from the upper, lower, inner upper, and inner lower portions of the character, obtains a cutting candidate point set from the maximum and minimum points, and proceeds to processing S13.
[0055]
S13: The contact character separation unit 2 determines whether any cutting candidate point remains. If one remains, the process proceeds to processing S14. If none remains, the cutting points are finalized and the process ends.
[0056]
S14: The contact character separation unit 2 sets a cutting candidate point having the smallest number of black pixels in the vertical direction as a cutting point, and proceeds to processing S15.
[0057]
S15: The contact character separation unit 2 deletes the cutting candidate point in the vicinity of the cutting point, and returns to the process S13.
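The S13 to S15 loop amounts to a greedy reduction, sketched below. The function names, the tie-break on the leftmost column, and the `fraction` parameter are assumptions; the text says only that the vicinity is a fraction of the character height and gives no value.

```python
def vertical_black_counts(img):
    """Number of black pixels (value 1) in each column of a binary image."""
    return [sum(col) for col in zip(*img)]


def reduce_candidates(candidates, black_counts, char_height, fraction=0.25):
    """Greedy candidate reduction following S13-S15.

    candidates   -- column indices proposed by the four contours
    black_counts -- black_counts[x] is the vertical black pixel count at column x
    fraction     -- assumed neighbourhood size as a share of the character height
    """
    radius = char_height * fraction
    remaining = sorted(set(candidates))
    cut_points = []
    while remaining:                                               # S13
        # S14: pick the candidate with the fewest black pixels (leftmost on ties).
        best = min(remaining, key=lambda x: (black_counts[x], x))
        cut_points.append(best)
        # S15: drop the chosen point and every candidate in its vicinity.
        remaining = [x for x in remaining if abs(x - best) > radius]
    return sorted(cut_points)
```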
[0058]
(4): Explanation of the overall flow of the contact alphabet separation method
FIG. 5 is an explanatory diagram of the overall flow of the contact alphabet separation method. The overall flow is described below with reference to (1) to (6) in FIG. 5.
[0059]
(1) Here, character sets (collective characters) A, B, C, D, E, and F are the input images on which contact alphabet separation is to be performed.
[0060]
(2) First, Japanese contact character separation is performed on character sets A, B, C, D, E, and F. As a result of this separation, collective character C is separated into C₁, C₂, C₃, and C₄.
[0061]
(3) Character recognition is performed, and DP matching is performed based on the distance values at the time of recognition. In FIG. 5, good evaluation values from this DP matching are marked ◯ and poor ones are marked × (hatched).
[0062]
(4) Characters with poor evaluation values obtained from the recognition results are taken out and made targets of the contact alphabet separation method (the hatched collective characters B, C, and D). Here, if any of the characters produced by cutting out Japanese contact characters has a poor evaluation value, the character before that separation (the character set in (1) above) is made the target of contact alphabet separation, even if the other characters separated from it have good evaluation values.
[0063]
(5) Contact alphabet separation is applied to the hatched collective characters B, C, and D, and character recognition is performed on the separated characters. Collective character B is separated into characters B₁ and B₂, and the recognition result is “English characters”. Collective character C is separated into characters C₁, C₂, C₃, C₄, and C₅; characters C₁ and C₂ are recognized as “English characters”, and characters C₃, C₄, and C₅ as “kana”. Collective character D is separated into characters D₁, D₂, and D₃, and the recognition result is “symbol”. When the recognition result is a symbol, kana, or the like, as for characters C₃, C₄, C₅, D₁, D₂, and D₃, the evaluation value is lowered to reduce the priority of that recognition result.
[0064]
(6) DP matching is performed using the distance values at the time of character recognition. As a result of this DP matching, character A is recognized as “kanji”; characters B₁, B₂, and C₁ as “English characters”; the character formed by combining C₂ and C₃ as an “English character”; the character formed by combining C₄ and C₅ as an “English character”; the character formed by combining D₁, D₂, and D₃ as “kanji”; and character EF as “kanji”.
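Steps (5) and (6) can be illustrated with a small dynamic-programming sketch that chooses how to regroup consecutive pieces so that the total recognition distance is smallest, after penalizing symbol and kana results. Everything here is an assumption made for illustration: the name `dp_match`, the `recognize` callback returning a (class, distance) pair for a group of pieces, the multiplicative `penalty`, the `max_merge` limit, and the plain sum of distances as the cost.

```python
def dp_match(pieces, recognize, max_merge=3, penalty=2.0, demoted=("symbol", "kana")):
    """Pick the grouping of consecutive pieces with the smallest total distance.

    pieces    -- separated pieces in left-to-right order
    recognize -- hypothetical callback: tuple of pieces -> (label_class, distance)
    penalty   -- assumed factor that worsens the distance of demoted classes
    Returns (total_cost, [(group, label_class), ...]).
    """
    n = len(pieces)
    best = [float("inf")] * (n + 1)   # best[i]: minimal cost of covering pieces[:i]
    back = [None] * (n + 1)           # back[i]: (start, group, label_class) of last group
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_merge), end):
            group = tuple(pieces[start:end])
            label_class, dist = recognize(group)
            if label_class in demoted:
                dist *= penalty       # lower the priority of symbol/kana results
            cost = best[start] + dist
            if cost < best[end]:
                best[end] = cost
                back[end] = (start, group, label_class)
    # Walk the back-pointers from the end to recover the chosen grouping.
    chosen = []
    end = n
    while end > 0:
        start, group, label_class = back[end]
        chosen.append((group, label_class))
        end = start
    return best[n], chosen[::-1]
```

A plain sum tends to favour fewer, larger groups, so a practical system would probably normalize the cost in some way; the text does not say how the distances are combined.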
[0065]
(5): Explanation of specific examples of cutting candidate points and candidate reduction results
FIG. 6 is an explanatory diagram of an example of cutting candidate points and candidate reduction results obtained from four contours. In FIG. 6, the contours of the collective character “UNIX” are traced from above, from below, from the inner upper side, and from the inner lower side, and the local maxima and minima of each contour are obtained. Then, for the local maxima and minima obtained from above the character, each local minimum is compared with the immediately preceding local maximum, and if the difference is larger than the threshold value, that local minimum is taken as a cutting candidate. This is done for all local minima, yielding the set of cutting candidate points obtained from the contour above the character. The same processing is also performed from below the character, from its inner upper side, and from its inner lower side to obtain cutting candidate point sets (arrows (a), (b), (c)) from the respective contours.
[0066]
Next, the unnecessary cutting candidate point (a) is deleted from the cutting candidate point set obtained from the four contours. This cutting candidate point (a) lies in the vicinity of a cutting point (arrow (b) or (c)). The vicinity of a cutting point is, for example, a range of a fraction of the height of the character to be separated; it need only be a distance within which, given the character size, another cutting point cannot exist.
[0067]
Next, the cutting candidate points remaining after reduction are determined as cutting points (arrows (b) and (c)). Then, characters are separated according to the determined cutting point, and character recognition and DP matching based on the recognition distance value are performed to obtain an optimal cutting point (arrow (c)).
[0068]
(6): Program installation description
The input unit 1, contact character separation unit 2, recognition means 3, cutting candidate point selection means 21, cutting candidate reduction means 22, and the like can be implemented as a program that is stored in the main memory and executed by the main control unit (CPU). This program is processed by an ordinary computer. The computer is composed of hardware such as a main control unit, a main memory, a file device, a display device, and an input device such as a keyboard.
[0069]
The program of the present invention is installed on this computer. For installation, the program is stored on a portable recording (storage) medium such as a floppy disk or a magneto-optical disk, and is installed into the file device of the computer either from that medium, through a drive device provided in the computer for accessing the recording medium, or over a network such as a LAN. The program steps necessary for processing are then read from the file device into the main memory and executed by the main control unit.
[0070]
[Effects of the invention]
As described above, the present invention has the following effects.
[0071]
(1): The cutting candidate point selection means extracts the contours from above, below, the inner upper side, and the inner lower side of the separated character 1 on which contact alphabet separation is performed, traces each contour, finds the positions of the local maxima and minima, and, when the difference between a found local minimum and the immediately preceding local maximum is larger than a threshold value, takes that local minimum as a cutting candidate point. The cutting candidate reduction means determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deletes cutting candidate points near the determined cutting point that, given the character size, cannot be cutting points. Any contact alphabetic characters can therefore be handled flexibly and separated with high accuracy.
[0072]
(2): Since the contour is differentiated in order to obtain the maximum and minimum positions, the maximum and minimum positions can be easily obtained.
[0073]
(3): The cutting candidate point having the smallest number of black pixels in the vertical direction is determined as a cutting point, and cutting candidate points near the determined cutting point that cannot be cutting points given the character size are deleted. Next, among the remaining cutting candidate points, the one with the smallest number of black pixels in the vertical direction is determined as the next cutting point, and cutting candidate points near that next cutting point that cannot be cutting points given the character size are deleted. Unnecessary cutting candidate points can therefore be deleted easily.
[0074]
(4): The contact character separation method extracts the contours from above, below, the inner upper side, and the inner lower side of the separated character 1 on which contact alphabet separation is performed, traces each extracted contour, finds the positions of the local maxima and minima, takes a local minimum as a cutting candidate point when its difference from the immediately preceding local maximum is larger than a threshold value, determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point, and deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size. A method that flexibly handles any contact alphabetic characters and performs high-accuracy separation can therefore be provided.
[0075]
(5): The computer-readable recording medium records a program for executing a cutting candidate point selection procedure and a cutting candidate reduction procedure. The selection procedure extracts the contours from above, below, the inner upper side, and the inner lower side of the separated character 1 on which contact alphabet separation is performed, traces each contour, finds the positions of the local maxima and minima, and takes a local minimum as a cutting candidate point when its difference from the immediately preceding local maximum is larger than a threshold value. The reduction procedure determines the cutting candidate point with the smallest number of black pixels in the vertical direction as a cutting point and deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size. By installing the program from this recording medium on a computer, a contact character separation device that flexibly handles any contact alphabetic characters and performs high-accuracy separation can therefore be provided easily.
[Brief description of the drawings]
FIG. 1 is an explanatory view of a contact alphabet separating apparatus of the present invention.
FIG. 2 is an explanatory diagram of a contour extraction direction in the embodiment.
FIG. 3 is a basic processing flowchart of contact alphabet separation using a recognition result in the embodiment.
FIG. 4 is a flowchart of contact alphabet separation processing in the embodiment.
FIG. 5 is an explanatory diagram of an overall flow of a method for separating contact alphabetic characters according to an embodiment.
FIG. 6 is an explanatory diagram of an example of cutting candidate points and candidate reduction results obtained from four contours in the embodiment.
FIG. 7 is an explanatory diagram of a conventional example.
FIG. 8 is a flowchart of a conventional example.
[Explanation of symbols]
1 Input unit
2 Contact character separation unit
3 Recognition means
21 Cutting candidate point selection means
22 Cutting candidate reduction means

Claims

A contact character separation device comprising:
Japanese contact character separation means for separating Japanese contact characters from a character set to obtain separated character 1;
first DP matching means for performing character recognition on the obtained separated character 1 and performing DP matching based on the distance value at the time of recognition to obtain an evaluation value;
contact alphabet separation means for performing contact alphabet separation on the separated character 1 whose obtained evaluation value is smaller than a threshold value to obtain separated character 2; and
second DP matching means for performing DP matching on the obtained separated character 2 to obtain separated character 3,
wherein the contact alphabet separation means comprises:
cutting candidate point selection means for extracting the contours from above, below, the inner upper side, and the inner lower side of the separated character 1 on which contact alphabet separation is performed, tracing each contour, finding the positions of the local maxima and minima, and, when the difference between a found local minimum and the immediately preceding local maximum is larger than a threshold value, taking that local minimum as a cutting candidate point; and
cutting candidate reduction means for determining, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deleting cutting candidate points near the determined cutting point that cannot be cutting points given the character size.

The contact character separation device according to claim 1, wherein, when even one of the separated characters 1 has an evaluation value smaller than the threshold value, the contact alphabet separation means takes the character set before separation as the target of contact alphabet separation.

The contact character separation device according to claim 1 or 2, comprising cutting candidate reduction means that determines, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size, then determines, as the next cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction among the remaining cutting candidate points, and deletes cutting candidate points near the determined next cutting point that cannot be cutting points given the character size.

A contact character separation method comprising:
separating, by Japanese contact character separation means, Japanese contact characters from a character set to obtain separated character 1;
performing, by first DP matching means, character recognition on the obtained separated character 1 and DP matching based on the distance value at the time of recognition to obtain an evaluation value;
performing, by contact alphabet separation means, contact alphabet separation on the separated character 1 whose obtained evaluation value is smaller than a threshold value to obtain separated character 2; and
performing, by second DP matching means, DP matching on the obtained separated character 2 to obtain separated character 3,
wherein the contact alphabet separation means
extracts the contours from above, below, the inner upper side, and the inner lower side of the separated character 1 on which contact alphabet separation is performed, traces each extracted contour, finds the positions of the local maxima and minima, and, when the difference between a found local minimum and the immediately preceding local maximum is larger than a threshold value, takes that local minimum as a cutting candidate point, and
determines, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deletes cutting candidate points near the determined cutting point that cannot be cutting points given the character size.

A computer-readable recording medium recording a program for causing a computer to execute:
a Japanese contact character separation procedure for separating Japanese contact characters from a character set to obtain separated character 1;
a first DP matching procedure for performing character recognition on the obtained separated character 1 and performing DP matching based on the distance value at the time of recognition to obtain an evaluation value;
a contact alphabet separation procedure for performing contact alphabet separation on the separated character 1 whose obtained evaluation value is smaller than a threshold value to obtain separated character 2;
a second DP matching procedure for performing DP matching on the obtained separated character 2 to obtain separated character 3;
a cutting candidate point selection procedure for extracting the contours from above, below, the inner upper side, and the inner lower side of the separated character 1 on which contact alphabet separation is performed, tracing each contour, finding the positions of the local maxima and minima, and, when the difference between a found local minimum and the immediately preceding local maximum is larger than a threshold value, taking that local minimum as a cutting candidate point; and
a cutting candidate reduction procedure for determining, as a cutting point, the cutting candidate point having the smallest number of black pixels in the vertical direction, and deleting cutting candidate points near the determined cutting point that cannot be cutting points given the character size.