JPH0344788A - Area extracting method for document picture - Google Patents

Area extracting method for document picture

Info

Publication number
JPH0344788A
Authority
JP
Japan
Prior art keywords
document
blocks
block
area
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP1179070A
Other languages
Japanese (ja)
Inventor
Shoji Shimomura
下村 正二
Masatoshi Okada
岡田 正年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuji Electric Co Ltd
Fuji Facom Corp
Original Assignee
Fuji Electric Co Ltd
Fuji Facom Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
1989-07-13
Filing date
1989-07-13
Publication date
1991-02-26
Application filed by Fuji Electric Co Ltd, Fuji Facom Corp
Priority to JP1179070A
Publication of JPH0344788A

Landscapes

  • Character Input (AREA)

Abstract

PURPOSE: To correctly extract the structure of a document, even when the document is skewed, by dividing the document into several blocks and computing the marginal distribution (projection profile) of each block. CONSTITUTION: A general printed document is divided into a plurality of unit blocks, and the marginal distribution in both the vertical and horizontal directions is computed for each unit block. The periodicity of each marginal distribution is examined to identify the blocks that are likely to be text areas. The character size and the line spacing of the document are then estimated from the features of those marginal distributions. The unit blocks are subdivided further on the basis of the estimated line spacing, and each resulting block is judged to be a text area or not from the features of its marginal distribution. Finally, blocks judged to be text that are separated by only a small gap are merged to form text areas. In this way, text areas can be extracted and distinguished from other areas even in a skewed document.

Description

[Detailed Description of the Invention]

[Industrial Field of Application]

The present invention relates to a character recognition method in which characters are segmented from a document image input by optical means such as a scanner and are then recognized, and in particular to a method for extracting regions from a document image.

[Prior Art]

The following methods are known for extracting regions such as text, figures, and tables from a document.

I) A method that applies a blurring process using run-length runs as the unit of observation, and extracts and classifies each component (text, separating ruled lines, figures, and photographs). See, for example, K. Y. Wong, R. G. Casey and F. M. Wahl: "Document analysis system", Proc. 6th ICPR, pp. 496-499.

II) A method that extracts each character string as a single connected component by merging and shrinking black pixels. See, for example, Nakamura, Kaimoto, Niwata and Minami: "Extraction algorithm for character regions in European text images", IEICE Transactions, Vol. J66-D, No. 4, pp. 437-444.

III) A method that uses the two-dimensional Fourier transform to detect the periodicity of character strings and extract the text areas in a document. See, for example, Hasegawa and Hoshino: "Document image area extraction method using two-dimensional Fourier transform", IEICE Transactions, Vol. J67-D, No. 9, pp. 1044-1054.

IV) A method that uses the marginal distributions of the document image in the horizontal and vertical directions and extracts the positions of character strings from their peaks and valleys. See, for example, Akiyama and Masuda: "Region segmentation of document images using marginal distribution, line density, and circumscribed rectangle features", IEICE Transactions, Vol. J69-D, No. 8, pp. 1187-1196.

[Problem to Be Solved by the Invention]

Methods I and II above require memory for image operations in addition to the memory that stores the document image, so the total memory requirement becomes very large. Method III involves a large amount of computation and therefore takes a long time to process. Method IV is simple and needs only a small working memory, but it has the drawback of being sensitive to skew.

[Means for Solving the Problem]

A general printed document containing text areas and other areas, including figures, tables, and photographs, is divided into a plurality of unit blocks. For each unit block, the marginal distributions in both the vertical and horizontal directions are obtained, and their periodicity is examined to identify the blocks that are likely to be text areas. The character size and the line spacing of the document are estimated from the features of those marginal distributions. The unit blocks are then subdivided in finer detail on the basis of the estimated line spacing, and each resulting block is judged to be a text area or not from the features of its marginal distribution. Blocks judged to be text that are separated by only a small gap are merged to form text areas, so that text areas are extracted separately from the other areas.

[Operation]

The document image is divided into a plurality of unit blocks, and text areas are extracted from the divided regions. For a skewed document such as the one shown in Fig. 2A, the marginal distribution taken over the whole image has a break at only one location (X) and is otherwise continuous, as shown in Fig. 2D. By taking the distribution over a small region instead, as shown in Fig. 2E, the distribution can be cut at every character string.
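The effect described here can be reproduced on synthetic data. The following is a minimal sketch, not the patent's implementation: the page generator, its parameters (`pitch`, `thickness`, `cols_per_row`) and the 40-column block width are all illustrative assumptions. It builds a binary page of skewed text rows, then counts the blank gaps in the row-wise marginal (peripheral) distribution of the whole page and of one narrow block.

```python
def make_skewed_page(height=48, width=200, pitch=8, thickness=3, cols_per_row=20):
    """Synthetic binary page: text rows that drift down one pixel every cols_per_row columns."""
    page = [[0] * width for _ in range(height)]
    for x in range(width):
        shift = x // cols_per_row          # skew: vertical offset grows with x
        for top in range(0, height, pitch):
            for dy in range(thickness):
                y = top + shift + dy
                if y < height:
                    page[y][x] = 1
    return page


def row_profile(image):
    """Marginal distribution along the vertical axis: black-pixel count per row."""
    return [sum(row) for row in image]


def count_gaps(profile):
    """Number of blank runs lying between the first and last non-empty rows."""
    ink = [v > 0 for v in profile]
    if True not in ink:
        return 0
    first = ink.index(True)
    last = len(ink) - 1 - ink[::-1].index(True)
    return sum(1 for i in range(first + 1, last + 1) if not ink[i] and ink[i - 1])


page = make_skewed_page()
block = [row[:40] for row in page]          # one narrow block of the same page

whole_gaps = count_gaps(row_profile(page))  # 0: the skew fills every valley
block_gaps = count_gaps(row_profile(block)) # 5: one gap between each pair of rows
print(whole_gaps, block_gaps)
```

With these assumed dimensions the accumulated skew across the full width exceeds the row pitch, so the whole-page profile never drops to zero, while the narrow block still separates all six text rows.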

[Embodiment]

Fig. 1 is a schematic flowchart for explaining the invention. Processes A (creation of a coarse image) and B (removal of the surrounding blank area) in the figure are preprocessing steps that make the extraction faster. Process C and the subsequent processes, which characterize the invention, are described below.

Process C: Division into unit blocks
The input document image is divided into blocks of n × m pixels (unit blocks), which are the units of processing. Fig. 2B shows an example of this division for the document image of Fig. 2A. Reference numerals 1, 2, and 3 in Fig. 2B denote regions; 1 and 2 are text and 3 is a photograph.
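Process C can be sketched as a plain tiling of the image. In this illustrative sketch the image is a list of rows of 0/1 pixels, and the function name `split_into_blocks`, the argument names, and the decision to keep smaller blocks at the right and bottom edges are assumptions; the text does not say how partial blocks are handled.

```python
def split_into_blocks(image, block_h, block_w):
    """Tile a binary image into unit blocks.

    Returns a list of (top, left, block) tuples, where block is a list of rows.
    Blocks on the right and bottom edges may be smaller than block_h x block_w.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    blocks = []
    for top in range(0, height, block_h):
        for left in range(0, width, block_w):
            block = [row[left:left + block_w] for row in image[top:top + block_h]]
            blocks.append((top, left, block))
    return blocks


# Usage: a 5 x 7 image cut into blocks of at most 2 x 3 pixels.
image = [[(x + y) % 2 for x in range(7)] for y in range(5)]
blocks = split_into_blocks(image, block_h=2, block_w=3)
print(len(blocks))
```

Keeping the block origin alongside the pixel data lets later steps map subdivided regions back to page coordinates.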

Process D: Detection of character size and line spacing
For each block, the marginal distributions in the vertical and horizontal directions are taken as shown in Fig. 3. Using the quantities B and T indicated in that figure, the periodicity α is calculated by equation (1) (the body of equation (1) is illegible in this text). Here n is the number of regions into which the block is divided, and n > 2.

Using the mean of the α values obtained for the unit blocks from equation (1), blocks whose α differs greatly from that mean are excluded. For the remaining blocks, the mean periodicity $\alpha_a$ and the means $B_a$ and $T_a$ of B and T are computed. From these values, the character size H and the line spacing W of the document can be estimated as follows:

$$H = B_a, \qquad W = T_a - B_a \tag{2}$$
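Equation (2) can be illustrated with a short sketch. Because Fig. 3 and equation (1) are not available here, the sketch assumes that B is the thickness of a black run in the marginal (peripheral) distribution and T is the distance between the starts of consecutive runs; this reading is consistent with H = B_a and W = T_a - B_a but is an assumption. The outlier rejection based on α is omitted, and all function names are hypothetical.

```python
def black_runs(profile):
    """Return (start, length) for each maximal run of non-zero profile values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(profile) - start))
    return runs


def widths_and_periods(profile):
    """Assumed meaning of B and T: run thicknesses and start-to-start distances."""
    runs = black_runs(profile)
    widths = [length for _, length in runs]
    periods = [b[0] - a[0] for a, b in zip(runs, runs[1:])]
    return widths, periods


def estimate_size_and_spacing(profiles):
    """Estimate (H, W) from the profiles of several unit blocks via H = B_a, W = T_a - B_a."""
    all_b, all_t = [], []
    for profile in profiles:
        widths, periods = widths_and_periods(profile)
        if len(widths) > 2:                # the text requires n > 2 regions per block
            all_b.extend(widths)
            all_t.extend(periods)
    if not all_t:
        raise ValueError("no block with more than two runs")
    b_a = sum(all_b) / len(all_b)
    t_a = sum(all_t) / len(all_t)
    return b_a, t_a - b_a


# Usage: three text rows, each 3 pixels thick, starting every 5 pixels.
profile = [0, 5, 5, 5, 0, 0, 4, 4, 4, 0, 0, 6, 6, 6, 0]
H, W = estimate_size_and_spacing([profile])
print(H, W)
```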
Process E: Detailed division of unit blocks
Each unit block is subdivided at blank spaces or ruled lines. The conditions for division are as follows.

Condition 1: Divide at any blank region wider than the line spacing W.

Condition 2: Where the blank is narrower than the line spacing W, a long, thin block whose width is smaller than the character size H is removed as a ruled line.

Fig. 2C shows an example of the document of Fig. 2A divided under these conditions. Reference symbols 1A, 2A, 2B, and 3A in the figure denote the subdivided regions.
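The two conditions of Process E can be sketched in one dimension on a row profile. This is a simplified illustration under several assumptions: only the thickness of a run is checked (the "long and thin" shape test would need the two-dimensional block), a run counts as a ruled line when it is thinner than `rule_fraction * char_size`, and `rule_fraction` is a hypothetical parameter that does not appear in the source.

```python
def black_runs(profile):
    """Return (start, length) for each maximal run of non-zero profile values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(profile) - start))
    return runs


def subdivide(profile, line_spacing, char_size, rule_fraction=0.5):
    """Split a profile into (start, end) segments, end exclusive."""
    # Condition 1: a blank wider than the line spacing W starts a new segment.
    groups = []
    for start, length in black_runs(profile):
        if groups:
            prev_start, prev_length = groups[-1][-1]
            if start - (prev_start + prev_length) <= line_spacing:
                groups[-1].append((start, length))
                continue
        groups.append([(start, length)])

    # Condition 2: inside a segment (blanks no wider than W), thin runs are
    # treated as ruled lines and removed.
    segments = []
    for runs in groups:
        if len(runs) > 1:
            runs = [r for r in runs if r[1] >= rule_fraction * char_size]
        if runs:
            segments.append((runs[0][0], runs[-1][0] + runs[-1][1]))
    return segments


# Usage: two text rows, a 1-pixel ruled line, a wide blank, two more text rows.
profile = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
           0, 0, 0, 0, 0, 0,
           1, 1, 1, 0, 0, 1, 1, 1]
segments = subdivide(profile, line_spacing=2, char_size=3)
print(segments)
```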

Process F: Discrimination of text blocks
For each subdivided block, the periodicity α and the values B and T are measured and compared with the previously obtained $\alpha_a$, $B_a$, and $T_a$. Whether the block is a text block is judged by the following condition.

Condition: $\alpha_a \approx \alpha$ and $B_a \approx B$ and $T_a \approx T$

For a block that contains two or fewer character rows, so that the periodicity α cannot be calculated, the black-pixel density D within the block is compared with the mean density $D_a$ of the blocks already judged to be text blocks. If $D_a \approx D$, the block is judged to be a text block.
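The decision rule of Process F can be sketched as follows. The source does not say how close two values must be to count as approximately equal, so the relative tolerance `tol` is a hypothetical parameter, as are the function names. The periodicity values are taken as inputs because equation (1) is not available here.

```python
def approx_equal(value, reference, tol=0.2):
    """True when value lies within a relative tolerance of reference (assumed criterion)."""
    return abs(value - reference) <= tol * abs(reference)


def black_density(block):
    """Fraction of black pixels (1s) in a binary block given as a list of rows."""
    total = sum(len(row) for row in block)
    return sum(sum(row) for row in block) / total if total else 0.0


def is_text_block(alpha, b, t, alpha_a, b_a, t_a,
                  density=None, density_a=None, tol=0.2):
    """Judge a subdivided block against the document-wide means.

    alpha is None when the block has too few rows for a periodicity value;
    the black-pixel density is then compared with density_a instead.
    """
    if alpha is not None:
        return (approx_equal(alpha, alpha_a, tol)
                and approx_equal(b, b_a, tol)
                and approx_equal(t, t_a, tol))
    if density is None or density_a is None:
        return False
    return approx_equal(density, density_a, tol)


# Usage with made-up measurements.
periodic_ok = is_text_block(0.5, 3.0, 5.0, alpha_a=0.52, b_a=3.1, t_a=5.0)
periodic_bad = is_text_block(0.5, 9.0, 5.0, alpha_a=0.52, b_a=3.1, t_a=5.0)
short_ok = is_text_block(None, None, None, 0.52, 3.1, 5.0,
                         density=0.18, density_a=0.2)
print(periodic_ok, periodic_bad, short_ok)
```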

Process G: Merging of text blocks
For the blocks judged to be text blocks, the gap between adjacent blocks is compared with the line spacing W, blocks separated by a narrow gap are merged, and each merged region is taken as a text area. As a result, in the example of Fig. 2A, regions 1A and 2B of Fig. 2C are extracted as text areas.
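The merging step of Process G can be sketched with bounding boxes. This sketch assumes that each text block is a box `(top, left, bottom, right)` with exclusive bottom and right edges, and that a gap is "narrow" when it does not exceed W; the exact threshold and the function names are assumptions.

```python
def box_gap(a, b):
    """Largest axis-wise separation between two boxes; 0 if they touch or overlap."""
    dy = max(b[0] - a[2], a[0] - b[2], 0)
    dx = max(b[1] - a[3], a[1] - b[3], 0)
    return max(dx, dy)


def merge_text_blocks(boxes, line_spacing):
    """Repeatedly merge any two boxes whose gap is at most line_spacing."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) <= line_spacing:
                    a, b = boxes[i], boxes[j]
                    # Replace the pair with their common bounding box.
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


# Usage: the first two boxes are 2 pixels apart, the third is 20 pixels away.
text_boxes = [(0, 0, 10, 50), (12, 0, 20, 50), (40, 0, 60, 50)]
areas = merge_text_blocks(text_boxes, line_spacing=3)
print(areas)
```

Restarting the scan after every merge keeps the logic simple; it is quadratic per pass, which is acceptable for the small number of blocks on a page.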

[Effects of the Invention]

According to the present invention, the document is divided into several blocks and the marginal distribution is taken for each block, so text areas can be extracted correctly even from a skewed document.

[Brief Description of the Drawings]

Fig. 1 is a schematic flowchart for explaining the invention. Fig. 2A is an explanatory diagram of an example input document. Fig. 2B is an explanatory diagram of an example of block division. Fig. 2C is an explanatory diagram of an example of the detailed division of regions. Fig. 2D is an explanatory diagram of an example in which the marginal distribution is extracted from the whole of Fig. 2A. Fig. 2E is an explanatory diagram of an example in which the image of Fig. 2A is divided into blocks before the marginal distribution is extracted. Fig. 3 is an explanatory diagram of the method of detecting the character size and the line spacing according to the invention.

Explanation of reference symbols: 1, 2, 3: regions; 1A, 2A, 2B, 3A: detailed regions.

Claims (1)

[Claims]

1) A method for extracting regions from a document image, characterized in that a general printed document having text areas and other areas including figures, tables, and photographs is divided into a plurality of unit blocks; the marginal distributions in both the vertical and horizontal directions are obtained for each unit block; the periodicity of the marginal distributions is examined to identify blocks that are likely to be text areas; the character size and the line spacing of the document are estimated from the features of those marginal distributions; the unit blocks are subdivided in finer detail on the basis of the estimated line spacing; each block is judged to be a text area or not from the features of its marginal distribution; and blocks judged to be text that are separated by a small gap are merged to form text areas, whereby the text areas are extracted separately from the other areas.
JP1179070A 1989-07-13 1989-07-13 Area extracting method for document picture Pending JPH0344788A (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
JP1179070A | JPH0344788A (en) | 1989-07-13 | 1989-07-13 | Area extracting method for document picture

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
JP1179070A | JPH0344788A (en) | 1989-07-13 | 1989-07-13 | Area extracting method for document picture

Publications (1)

Publication Number | Publication Date
JPH0344788A (en) | 1991-02-26

Family

ID=16059574

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
JP1179070A | Pending | JPH0344788A (en) | 1989-07-13 | 1989-07-13 | Area extracting method for document picture

Country Status (1)

Country | Link
JP (1) | JPH0344788A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5091964A (en) * 1990-04-06 1992-02-25 Fuji Electric Co., Ltd. Apparatus for extracting a text region in a document image

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS63292381A (en) * 1987-05-26 1988-11-29 Fujitsu Ltd Detector for character line
JPS6453281A (en) * 1987-05-01 1989-03-01 Ricoh Kk Area extraction method
JPH01130293A (en) * 1987-11-16 1989-05-23 Nec Corp Document image analyzing system
JPH01169686A (en) * 1987-12-25 1989-07-04 Fujitsu Ltd Character line detecting system
