JPH0344788A

JPH0344788A - Area extracting method for document picture

Info

Publication number: JPH0344788A
Application number: JP1179070A
Authority: JP
Inventors: Shoji Shimomura; 下村　正二; Masatoshi Okada; 岡田　正年
Original assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Current assignee: Fuji Electric Co Ltd; Fuji Facom Corp
Priority date: 1989-07-13
Filing date: 1989-07-13
Publication date: 1991-02-26

Abstract

PURPOSE: To correctly extract the document structure even from a skewed document by dividing the document into several blocks and calculating the peripheral distribution of each block.

CONSTITUTION: A general printed document is divided into a plurality of unit blocks, and the peripheral distributions in both the vertical and horizontal directions are calculated for each unit block. The periodicity of each peripheral distribution is then examined to identify blocks that are likely to be text regions, and the character size and line spacing in the document are estimated from the features of those distributions. The unit blocks are further subdivided based on the estimated line spacing, and whether each resulting block is a text region is decided from the features of its peripheral distribution. Blocks judged to be text that are separated by small gaps are then merged to form text regions. Thus, even in a skewed document, text regions can be extracted separately from the other regions.

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

The present invention relates to a character recognition method in which characters are segmented and recognized from a document image input by optical means such as a scanner, and in particular to a region extraction method for document images.

[Prior Art]

The following methods are known for extracting regions such as text, figures, and tables from a document.

I) A method that applies a smoothing (blurring) operation using run-length runs as the unit of observation, and extracts and classifies each component (text, separator rules, figures, and photographs) (see, for example, K. Y. Wong, R. G. Casey and F. M. Wahl: "Document analysis system", Proc. 6th ICPR, pp. 496-499).

II) A method that extracts each character string as a single connected component by merging and shrinking black pixels (see, for example, Nakamura, Kaimoto, Niwata, Minami: "Extraction algorithm for character regions in European-language text images", IEICE Trans., Vol. J66-D, No. 4, pp. 437-444).

III) A method that uses the two-dimensional Fourier transform to detect the periodicity of character strings and extract the text regions in a document (see, for example, Hase, Hoshino: "Document image region extraction method using two-dimensional Fourier transform", IEICE Trans., Vol. J67-D, No. 9, pp. 1044-1054).

IV) A method that uses the horizontal and vertical peripheral distributions (projection profiles) of the document image and extracts the positions of character strings from their peaks and valleys (see, for example, Akiyama, Masuda: "Region segmentation of document images using peripheral distribution, line density, and circumscribed-rectangle features", IEICE Trans., Vol. J69-D, No. 8, pp. 1187-1196).

[Problem to Be Solved by the Invention]

Methods I and II above require memory for image operations in addition to the memory that stores the document image, so the total memory capacity becomes very large. Method III involves a large amount of computation and takes a long time to process. Method IV is simple and needs only a small working memory, but has the drawback of being vulnerable to skew.

[Means for Solving the Problem]

A general printed document having text regions and other regions including figures, tables, and photographs is divided into a plurality of unit blocks. The peripheral distributions in both the vertical and horizontal directions are obtained for each unit block, and their periodicity is examined to identify blocks that are likely to be text regions. The character size and line spacing in the document are estimated from the features of those distributions. The unit blocks are then subdivided in finer detail based on the estimated line spacing, and whether each resulting block is a text region is judged from the features of its peripheral distribution. Finally, blocks judged to be text that are separated by small gaps are merged to form text regions, so that text regions are extracted separately from the other regions.

[Operation]

When the peripheral distribution is taken over the whole of a skewed document such as the one shown in Fig. 2A, it has a break at only one point (X), as shown in Fig. 2D, and is otherwise continuous. By dividing the document image into a plurality of unit blocks and extracting text regions from the divided regions, that is, by working on small regions as in Fig. 2E, the distribution can be cut at every character string.
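The effect can be reproduced on synthetic data. The following Python sketch is illustrative only and is not taken from the patent: the page size, line pitch, amount of skew, and the helper names `make_skewed_page`, `horizontal_profile`, and `count_gaps` are all assumptions. It draws skewed text lines as solid bands and counts the blank gaps in the row-wise peripheral distribution, first over the full page width and then over a narrow block.

```python
def make_skewed_page(width=200, height=120, pitch=12, thickness=6, skew_step=10):
    """Synthetic binary page (assumed data): text lines are solid bands
    that drop by one pixel every `skew_step` columns."""
    page = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if (y - x // skew_step) % pitch < thickness:
                page[y][x] = 1
    return page


def horizontal_profile(page, x0, x1):
    """Black-pixel count of every row, restricted to columns x0..x1-1."""
    return [sum(row[x0:x1]) for row in page]


def count_gaps(profile):
    """Number of maximal runs of empty rows (profile value 0)."""
    gaps, in_gap = 0, False
    for value in profile:
        if value == 0 and not in_gap:
            gaps += 1
        in_gap = value == 0
    return gaps


page = make_skewed_page()
whole_page_gaps = count_gaps(horizontal_profile(page, 0, 200))
narrow_block_gaps = count_gaps(horizontal_profile(page, 0, 20))
print(whole_page_gaps, narrow_block_gaps)
```

On this synthetic page the full-width distribution has no usable gaps, because the skew smears each line across the neighbouring line positions, while the 20-column block still shows one gap per text line.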

[Embodiment]

Fig. 1 is a schematic flowchart for explaining the invention. Processes A (creation of a coarse image) and B (removal of the surrounding blank margins) in the figure are preprocessing steps that make the extraction faster. Process C and the subsequent processes, which characterize the invention, are described below.

Process C: Division into unit blocks
The input document image is divided into blocks of n × m pixels (unit blocks), which are the units of processing. Fig. 2B shows an example of this division for the document image of Fig. 2A. In Fig. 2B, reference numerals 1, 2, and 3 denote regions; 1 and 2 are text, and 3 is a photograph.
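A minimal sketch of this division, assuming the image is held as a list of rows of 0/1 pixels. The function names and the parameters `block_h` and `block_w` (standing in for the n × m block size) are hypothetical; the patent does not specify how blocks at the image edge are handled, so here they are simply allowed to be smaller. The second helper computes the two peripheral distributions that Process D works from.

```python
def split_into_unit_blocks(page, block_h, block_w):
    """Yield (top, left, block) for each unit block of a binary image.
    Blocks at the right and bottom edges may be smaller (an assumption)."""
    height, width = len(page), len(page[0])
    for top in range(0, height, block_h):
        for left in range(0, width, block_w):
            block = [row[left:left + block_w] for row in page[top:top + block_h]]
            yield top, left, block


def peripheral_distributions(block):
    """Return (row_counts, column_counts) of black pixels for one block."""
    row_counts = [sum(row) for row in block]
    column_counts = [sum(column) for column in zip(*block)]
    return row_counts, column_counts


page = [
    [1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
for top, left, block in split_into_unit_blocks(page, 2, 3):
    print(top, left, peripheral_distributions(block))
```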

Process D: Detection of character size and line spacing
For each block, the peripheral distributions are taken in the vertical and horizontal directions as shown in Fig. 3, and the periodicity α is calculated from the values B and T shown in that figure by equation (1). [Equation (1) is illegible in the source text; only the fragment "n−2 a−1 (1)" survives.] Here n is the number of regions into which the block is divided, with n > 2. Next, the mean value αs of the α values obtained by equation (1) for the unit blocks is computed, the blocks whose α differs greatly from αs are excluded, and for the remaining blocks the mean αa of α and the means Ba and Ta of B and T are obtained. From these values, the character size H and the line spacing W in the document can be estimated by the following equations.

H = Ba,  W = Ta − Ba   (2)

Process E: Detailed division of unit blocks
The interior of each unit block is divided at blank spaces or ruled lines.

The conditions for division are as follows.

Condition 1: Divide at any blank area wider than the line spacing W.

Condition 2: Where the blank is narrower than the line spacing W, a block whose width is smaller than the character size H and which is long and thin is removed as a ruled line.

Fig. 2C shows an example of dividing the document of Fig. 2A under these conditions. In that figure, 1A, 2A, 2B, 3A, and so on denote the regions obtained by detailed division.
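Equation (2) and the two division conditions can be sketched on a single row-wise peripheral distribution. This is a rough illustration under stated assumptions, not the patent's procedure: Fig. 3 is not available here, so B is assumed to be the thickness of a black run in the distribution and T the distance between the starts of consecutive runs; the "long and thin" test of Condition 2 is approximated by requiring the thin run to fill most of the block width (`min_fill` is a made-up parameter); and all function names are hypothetical.

```python
def black_runs(profile):
    """Return (start, length) of each maximal run of non-empty entries."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(profile) - start))
    return runs


def estimate_size_and_spacing(profile):
    """Equation (2): H = Ba, W = Ta - Ba, with Ba and Ta taken as means
    of the assumed B (run thickness) and T (run pitch) in one profile."""
    runs = black_runs(profile)
    if len(runs) < 2:
        return None  # no pitch can be measured from fewer than two runs
    ba = sum(length for _, length in runs) / len(runs)
    pitches = [b[0] - a[0] for a, b in zip(runs, runs[1:])]
    ta = sum(pitches) / len(pitches)
    return ba, ta - ba


def split_block(profile, h, w, block_width, min_fill=0.8):
    """Split a block into (start, end) pieces along one axis.
    Condition 1: cut at blank gaps wider than w.
    Condition 2 (approximated): a run thinner than h that fills most of
    the block width is treated as a ruled line, removed, and used as a cut."""
    pieces, current = [], None
    for start, length in black_runs(profile):
        end = start + length
        is_rule = length < h and max(profile[start:end]) >= min_fill * block_width
        if is_rule or (current is not None and start - current[1] > w):
            if current is not None:
                pieces.append(tuple(current))
            current = None
        if is_rule:
            continue
        if current is None:
            current = [start, end]
        else:
            current[1] = end
    if current is not None:
        pieces.append(tuple(current))
    return pieces


text_only = [10] * 4 + [0] * 2 + [12] * 4 + [0] * 2 + [9] * 4
h, w = estimate_size_and_spacing(text_only)

# Three text lines, a full-width rule, two text lines, a wide blank, one line.
profile = (text_only + [0] * 2 + [20] + [0] * 2 + [11] * 4 + [0] * 2
           + [10] * 4 + [0] * 8 + [8] * 4)
print(h, w, split_block(profile, h, w, block_width=20))
```

In this example the rule at row 18 and the 8-row blank both act as cut points, so the block is split into three pieces.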

Process F: Discrimination of text blocks
For each block obtained by detailed division, the periodicity α and the values B and T are measured and compared with the previously obtained αa, Ba, and Ta, and whether the block is a text block is judged by the following condition.

Condition: αa ≈ α, Ba ≈ B, and Ta ≈ T.

For a block that contains two or fewer character strings, so that the periodicity α cannot be calculated, the black-pixel density D within the block is compared with the mean density Da of the blocks already judged to be text blocks; if Da ≈ D, the block is judged to be a text block.
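The comparison can be sketched as follows. The patent states only that the measured values should be approximately equal to the reference values, so the relative tolerance `tol` and the outlier limit `max_dev` below are assumptions, as are the function names. The periodicity α is taken as an input here, because equation (1) is not recoverable from this text.

```python
def approx_equal(value, reference, tol=0.2):
    """Assumed reading of the 'approximately equal' test:
    relative difference no larger than tol."""
    return abs(value - reference) <= tol * abs(reference)


def reference_values(measurements, max_dev=0.5):
    """measurements: one (alpha, B, T) tuple per unit block.
    Blocks whose alpha is far from the overall mean alpha_s are excluded,
    then alpha, B and T are averaged over the remaining blocks."""
    alpha_s = sum(m[0] for m in measurements) / len(measurements)
    kept = [m for m in measurements
            if abs(m[0] - alpha_s) <= max_dev * abs(alpha_s)]
    if not kept:  # nothing survives the filter: fall back to all blocks
        kept = measurements
    return tuple(sum(m[i] for m in kept) / len(kept) for i in range(3))


def is_text_block(measured, reference, density=None, ref_density=None, tol=0.2):
    """measured: (alpha, B, T), or None when the block has too few
    character strings for alpha to be computed."""
    if measured is None:
        # Fallback described in the text: compare black-pixel density D with Da.
        if density is None or ref_density is None:
            return False
        return approx_equal(density, ref_density, tol)
    return all(approx_equal(m, r, tol) for m, r in zip(measured, reference))


unit_blocks = [(1.0, 4.0, 6.0), (1.0, 5.0, 7.0), (1.0, 3.0, 5.0), (5.0, 20.0, 30.0)]
ref = reference_values(unit_blocks)
print(ref, is_text_block((1.1, 4.2, 6.3), ref), is_text_block((1.0, 20.0, 30.0), ref))
```

The fourth unit block, whose α lies far from the mean, is excluded before averaging, so it does not distort the reference values.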

Process G: Merging of text blocks
For the blocks judged to be text blocks, the gap between adjacent blocks is compared with the line spacing W, blocks separated by a narrow gap are merged, and each merged area is taken as a text region. As a result, in the example of Fig. 2A, 1A and 2B of Fig. 2C are extracted as text regions.
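One way to carry out this merging is sketched below. It assumes that each text block is represented by a bounding box `(top, left, bottom, right)` with exclusive bottom and right edges, and that "narrow" means a gap of at most W; the patent gives neither the exact threshold nor the data representation, and the function names are hypothetical.

```python
def gap_between(a, b):
    """Separation between two boxes (top, left, bottom, right);
    0 when they touch or overlap."""
    dy = max(b[0] - a[2], a[0] - b[2], 0)
    dx = max(b[1] - a[3], a[1] - b[3], 0)
    return max(dx, dy)


def merge_text_blocks(boxes, w):
    """Repeatedly fuse any two boxes whose separation is at most w,
    replacing them with their common bounding box."""
    regions = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                a, b = regions[i], regions[j]
                if gap_between(a, b) <= w:
                    regions[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                  max(a[2], b[2]), max(a[3], b[3]))
                    del regions[j]
                    merged = True
                    break
            if merged:
                break
    return regions


text_boxes = [(0, 0, 16, 20), (18, 0, 31, 20), (0, 22, 16, 40), (60, 0, 70, 20)]
print(merge_text_blocks(text_boxes, w=2))
```

The first three boxes lie within W of one another and end up in one text region, while the fourth stays separate. The quadratic restart loop is chosen for clarity; a union-find structure would scale better on pages with many blocks.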

[Effect of the Invention]

According to this invention, the document is divided into several blocks and the peripheral distribution is taken for each block, so text regions can be extracted correctly even from a skewed document.

[Brief Description of the Drawings]

Fig. 1 is a schematic flowchart for explaining the invention.
Fig. 2A is an explanatory diagram of an example input document.
Fig. 2B is an explanatory diagram of an example of block division.
Fig. 2C is an explanatory diagram of an example of detailed division of regions.
Fig. 2D is an explanatory diagram of an example in which the peripheral distribution is extracted from the whole of Fig. 2A.
Fig. 2E is an explanatory diagram of an example in which the image of Fig. 2A is divided into blocks before the peripheral distribution is extracted.
Fig. 3 is an explanatory diagram of the method of detecting character size and line spacing according to the invention.

Explanation of reference numerals: 1, 2, 3: regions; 1A, 2A, 2B, 3A: detailed regions.

Claims

[Scope of Claims]

1) A region extraction method for document images, characterized in that a general printed document having text regions and other regions including figures, tables, and photographs is divided into a plurality of unit blocks; the peripheral distributions in both the vertical and horizontal directions are obtained for each of the divided unit blocks; the periodicity of those peripheral distributions is examined to identify blocks that are likely to be text regions; the character size and line spacing in the document are estimated from the features of those peripheral distributions; the unit blocks are further divided in detail based on the estimated line spacing; whether each block is a text region is judged from the features of its peripheral distribution; and blocks judged to be text that are separated by small gaps are merged to form text regions, whereby text regions are extracted separately from the other regions.