JP2682873B2

JP2682873B2 - Recognition device for tabular documents

Info

Publication number: JP2682873B2
Application number: JP1214930A
Authority: JP
Inventors: 勝美細川
Original assignee: Fuji Electric Co Ltd
Current assignee: Fuji Electric Co Ltd
Priority date: 1989-08-23
Filing date: 1989-08-23
Publication date: 1997-11-26
Anticipated expiration: 2012-11-26
Also published as: JPH0378891A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a recognition device for documents that contain a table or an equivalent structure (hereinafter also called tabular documents).

[Conventional technology]

Various methods exist for recognizing tabular documents composed of ruled lines and inter-line blank space, but continuous processing has mainly targeted fixed-length tabular documents.

For example, one method prepares a fixed format for fixed-length tabular documents. In the fixed-format method, the character-frame information for the positions where characters are written is specified in advance, positional misalignment is corrected by placing marks or the like at fixed positions on the document, and the image data of the continuously input tabular documents is superimposed on the character-frame information of the fixed format to recognize the characters inside the frames. With this method, recognition becomes possible by creating one format for each tabular document of a different structure.
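The misalignment correction used by the fixed-format method can be illustrated with a minimal sketch. It assumes a single reference mark and a pure translation (no rotation or scaling); the function name `shift_frames` and the tuple layout `(left, top, right, bottom)` are hypothetical and are not taken from the patent.

```python
def shift_frames(frames, expected_mark, detected_mark):
    """Translate preset character frames by the offset between the position
    where the reference mark should be and where it was actually found.

    frames: list of (left, top, right, bottom) tuples from the fixed format.
    expected_mark, detected_mark: (x, y) positions of the reference mark.
    """
    dx = detected_mark[0] - expected_mark[0]
    dy = detected_mark[1] - expected_mark[1]
    return [(l + dx, t + dy, r + dx, b + dy) for (l, t, r, b) in frames]


# Hypothetical fixed format with two frames; the scanned page is shifted by (+3, -2).
preset = [(10, 10, 60, 30), (60, 10, 120, 30)]
corrected = shift_frames(preset, expected_mark=(5, 5), detected_mark=(8, 3))
```

Because the frames themselves are fixed, this approach only works while the number of rows and columns stays the same, which is the limitation the invention addresses.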

Meanwhile, one way to improve character recognition accuracy is to specify in advance a character frame for each region where characters are to be recognized and to restrict the character type entered in that frame to, for example, digits. Another method detects the character frames that contain characters by recognizing the ruled lines and inter-line blank space every time. With that method, however, it is not known what kind of table will be input, so the character types must also be specified each time, which is cumbersome and time-consuming.

[Problems to be solved by the invention]

Tabular documents (rosters, address books, lists, data sheets) come in many forms, but some are structurally identical and differ only in the amount of data, such as the number of rows (variable-length, identical-structure tabular documents). When such documents are read continuously, the fixed-format method requires the format to be recreated every time the amount of data changes, which also hinders automation of the input work. The first object of the present invention is therefore to make it easy to handle variable-length, identical-structure tabular documents.

In addition, when variable-length, identical-structure tabular documents are read continuously, the number of character frames varies, so some character frames end up with an undefined character type and accurate character recognition is no longer possible. The second object of the present invention is therefore to enable accurate character recognition.

[Means for solving the problems]

The device comprises an image input unit that inputs image data of a document containing a table or an equivalent structure (a tabular document), a ruled line recognition unit that recognizes ruled lines from this image data, and a character recognition unit that recognizes the characters enclosed by the ruled lines in the image data. A tabular description file is created in advance by adding the tabular type and the variable direction to the various data on the table structure extracted from a standard tabular document. When a tabular document of the same type is recognized thereafter, at least the data in the variable direction of the file is updated, and the characters in the table are recognized on the basis of the updated file. Furthermore, a character type is specified for each table element (cell) and stored in the tabular description file; when the number of table elements in the document to be recognized changes, the character type data is extended in the variable direction of the tabular description file and the characters are recognized.

[Operation]

Ruled lines and inter-line blank space are extracted every time, but the tabular formats to be processed continuously are classified into types in advance according to the table structure. That is, a tabular description file is created from the first sheet of the tabular document, together with information such as the fact that the number of data items in the table varies in the row direction (the row direction is variable length) or that all table elements (cells) are enclosed by ruled lines. Missing ruled lines are then added and surplus ones deleted, and that information is also saved in the tabular description file, so that continuous reading of variable-length, identical-structure tabular documents can be realized easily.

As a means of improving recognition accuracy, when the first sheet of the tabular document is recognized, attribute data that assigns a character type (digits, Latin letters, hiragana, katakana, kanji and so on) to each table element (cell) is also created in the tabular description file. This improves recognition accuracy when tabular documents whose amount of data varies are processed continuously.
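How a per-cell character type can raise accuracy may be sketched as follows. This is an illustration only: the patent does not describe the recognizer's internals, so the candidate list with scores, the names `in_class` and `best_candidate`, and the Unicode ranges used for each class are all assumptions.

```python
def in_class(ch, char_type):
    """Return True if character ch belongs to the given character type."""
    code = ord(ch)
    if char_type == "digit":
        return "0" <= ch <= "9"
    if char_type == "alpha":
        return ch.isascii() and ch.isalpha()
    if char_type == "hiragana":
        return 0x3041 <= code <= 0x3096
    if char_type == "katakana":
        return 0x30A1 <= code <= 0x30FA
    if char_type == "kanji":
        return 0x4E00 <= code <= 0x9FFF
    raise ValueError(f"unknown character type: {char_type}")


def best_candidate(candidates, char_type):
    """Pick the highest-scoring candidate that matches the cell's character type.

    candidates: list of (character, score) pairs from a hypothetical recognizer.
    Falls back to the unrestricted best candidate if none matches.
    """
    allowed = [(c, s) for c, s in candidates if in_class(c, char_type)]
    pool = allowed or candidates
    return max(pool, key=lambda cs: cs[1])[0]


# The letter "O" scores slightly higher, but the cell is declared numeric.
scores = [("O", 0.60), ("0", 0.55), ("Q", 0.20)]
digit_result = best_candidate(scores, "digit")
alpha_result = best_candidate(scores, "alpha")
```

Restricting the candidate set in this way resolves look-alike pairs such as "O" and "0" that a recognizer cannot separate by shape alone.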

[Embodiment]

FIG. 1 is a block diagram showing an embodiment of the present invention. In the figure, 1 is a host CPU, 2 is a CRT, 3 is a keyboard, 4 is an image input unit, 5 is an image memory, 6 is a ruled line recognition unit, 7 is a character recognition unit, 8 is an auxiliary storage unit, and TX is a tabular document. Tabular documents are classified here into three types, as shown in FIG. 2. Type 1, shown in FIG. 2(a), is a tabular document in which every table element (cell) is enclosed by ruled lines. Type 2, shown in FIG. 2(b), is a tabular document in which the table as a whole and its columns are delimited by ruled lines, while the rows are delimited by a combination of ruled lines and inter-line blank space. Type 3, shown in FIG. 2(c), is a tabular document in which the ruled lines enclosing the whole table are partly or entirely absent and the ruled lines separating the columns are also partly or entirely absent. Each type is further divided into documents whose number of data items varies in the row direction and documents whose number of data items varies in the column direction. FIG. 2A shows an example of type 1, FIG. 2B an example of type 2, and FIG. 2C an example of type 3.

To process such tabular documents continuously with the arrangement of FIG. 1, the tabular document TX is first set in the image input unit 4. The image input unit 4 can also stock a plurality of documents and read them in one sheet at a time on command from the host CPU 1. The tabular document TX input to the image input unit 4 is converted into image data there and stored in the image memory 5. Next, to create the tabular description file, the tabular type and the variable direction are selected (specified) from the host CPU 1. Here the tabular type is assumed to be type 1, as shown for example in FIG. 3(a), and the row direction is assumed to be variable. Because the tabular type is 1, the ruled line recognition unit 6 extracts ruled lines only. Various techniques exist for this, and any such known technique can be used. Needless to say, if the type requires extraction of blank rows that play the same role as ruled lines, the processing for that purpose is performed. Selecting the type in this way avoids unnecessary processing and shortens the processing time. FIG. 4 shows an example of the ruled line data extracted from a type 1 table such as that of FIG. 3(a). In the figure, K denotes a ruled line, and X₀₁ to X₀₄ and Y₀₁ to Y₁₁ denote the start and end positions of the ruled lines.
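The type-dependent choice of processing can be sketched as below. The patent leaves the extraction technique open ("known" methods), so the row-projection approach in `blank_row_bands` is just one common way to find inter-line blank space and is not claimed to be the patent's method. The step names, the function names, and the 0/1 pixel convention are assumptions made for illustration.

```python
def extraction_steps(table_type):
    """Return the extraction steps needed for a tabular type (1, 2 or 3).

    Type 1 needs ruled lines only; types 2 and 3 also need blank space
    that plays the role of a ruled line.
    """
    if table_type not in (1, 2, 3):
        raise ValueError("table_type must be 1, 2 or 3")
    steps = ["ruled_lines"]
    if table_type in (2, 3):
        steps.append("blank_space")
    return steps


def blank_row_bands(image, min_height=2):
    """Find runs of all-white pixel rows that can act as row separators.

    image: list of pixel rows, each a list of 0 (white) or 1 (black).
    Returns (first_row, last_row) index pairs for runs of at least min_height rows.
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        if not any(row):
            if start is None:
                start = y          # a blank run begins here
        else:
            if start is not None and y - start >= min_height:
                bands.append((start, y - 1))
            start = None
    if start is not None and len(image) - start >= min_height:
        bands.append((start, len(image) - 1))
    return bands


# Tiny 6-row image: rows 1-2 form a blank band, row 4 is too thin to count.
page = [[1, 1], [0, 0], [0, 0], [1, 0], [0, 0], [1, 1]]
bands = blank_row_bands(page)
```

For a type 1 table only the first step runs, which is where the stated saving in processing time comes from.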

The extracted ruled line data is then displayed on the CRT 2, and where correction is needed the operator adds, deletes or moves ruled lines with the keyboard 3 while watching the CRT 2. Next, character frames such as those indicated by symbol S in FIG. 5 are extracted from the ruled line data, a character type is assigned to each table element (cell) enclosed by a character frame, as indicated by A to F in FIG. 4, and the tabular description file is created and stored in the auxiliary storage unit 8. FIG. 6 shows the structure of the tabular description file. For a type 1 table such as that of FIG. 3(a), the tabular description file is as shown in FIG. 7. Here D1 denotes the tabular type, D2 the variable direction, D3 the ruled line coordinate data, and D4 the character type data.
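The four parts D1 to D4 of the tabular description file can be pictured as a small record. The sketch below is an assumption about one possible in-memory layout. The patent's FIG. 6 and FIG. 7 define the actual structure and are not reproduced here, so the class names, field names, and the string labels for character types are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TableType(Enum):
    FULLY_RULED = 1       # type 1: every cell is enclosed by ruled lines
    RULED_AND_BLANK = 2   # type 2: rows separated by ruled lines and blank space
    PARTLY_RULED = 3      # type 3: outer or column ruled lines partly or wholly absent


class VariableDirection(Enum):
    ROW = "row"
    COLUMN = "column"


@dataclass
class RuledLine:
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass
class TableDescription:
    table_type: TableType                  # D1: tabular type
    variable_direction: VariableDirection  # D2: variable direction
    ruled_lines: List[RuledLine]           # D3: ruled line coordinate data
    char_types: List[List[str]]            # D4: character type per cell, row by row


desc = TableDescription(
    table_type=TableType.FULLY_RULED,
    variable_direction=VariableDirection.ROW,
    ruled_lines=[RuledLine(0, 0, 300, 0), RuledLine(0, 0, 0, 200)],
    char_types=[["kanji", "digit", "katakana"]],
)
```

Keeping D1 and D2 alongside the geometry is what lets later sheets be processed without asking the operator again.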

After that, the character recognition unit 7 recognizes the characters on the basis of the character frame data and character types thus obtained, and the result is saved in the auxiliary storage unit 8.
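The step from ruled line coordinates to character frames can be sketched for the simplest case, a type 1 table whose ruled lines all span the full table. Under that assumption each cell is the rectangle between two neighbouring vertical lines and two neighbouring horizontal lines. The function name `character_frames` and the frame layout `(left, top, right, bottom)` are hypothetical.

```python
def character_frames(xs, ys):
    """Build one character frame per cell from ruled line positions.

    xs: x positions of the vertical ruled lines.
    ys: y positions of the horizontal ruled lines.
    Returns a list of rows, each a list of (left, top, right, bottom) frames.
    """
    xs, ys = sorted(xs), sorted(ys)
    frames = []
    for top, bottom in zip(ys, ys[1:]):
        row = [(left, top, right, bottom) for left, right in zip(xs, xs[1:])]
        frames.append(row)
    return frames


# Three vertical and three horizontal lines give a 2 x 2 grid of cells.
grid = character_frames(xs=[0, 100, 250], ys=[0, 40, 80])
```

Each frame can then be paired with the character type stored for the same row and column before it is handed to the recognizer.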

Suppose now that the next tabular document is as shown in FIG. 8. Ruled lines are extracted on the basis of the tabular type data D1 and the variable direction data D2 of the tabular description file, and the resulting ruled line data is the coordinate data shown in FIG. 9. Comparison with the ruled line data D3 of the tabular description file shows four ruled lines in the non-variable vertical direction in the file against five extracted ruled lines. Checking the pitch of the vertical ruled lines in both sets therefore reveals that the ruled line K₁ with coordinates (X₀₃′, Y₀₁′) to (X₀₃′, Y₁₁′) is unnecessary. This ruled line data is deleted, the character frame data is calculated from the remaining ruled line data, and it is sent to the character recognition unit 7 together with the character type data D4 of the tabular description file for character recognition.
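The pitch check can be sketched as follows. The patent does not give the comparison rule, so this is only one plausible reading: offsets from the leftmost line are compared with the offsets stored in the description file, and any extracted line with no counterpart is treated as spurious. The function name, the pixel tolerance `tol`, and the assumption that the leftmost extracted line is genuine are all illustrative.

```python
def drop_extra_vertical_lines(template_xs, extracted_xs, tol=3):
    """Split extracted vertical ruled lines into kept and dropped lines.

    template_xs: x positions of vertical ruled lines in the description file (D3).
    extracted_xs: x positions found on the current sheet.
    A line is kept when its offset from the leftmost line matches a template
    offset within tol pixels; otherwise it is dropped.
    """
    template_xs = sorted(template_xs)
    extracted_xs = sorted(extracted_xs)
    template_offsets = [x - template_xs[0] for x in template_xs]
    kept, dropped = [], []
    for x in extracted_xs:
        offset = x - extracted_xs[0]
        if any(abs(offset - t) <= tol for t in template_offsets):
            kept.append(x)
        else:
            dropped.append(x)
    return kept, dropped


# Four template lines, five extracted lines; the sheet is shifted by about 2 pixels.
kept, dropped = drop_extra_vertical_lines(
    template_xs=[10, 60, 160, 210],
    extracted_xs=[12, 62, 110, 161, 212],
)
```

Working with offsets rather than absolute positions makes the comparison insensitive to a small shift of the whole sheet.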

Suppose instead that the next tabular document is as shown in FIG. 3(b). It is judged to be a tabular document whose ruled lines have increased in the variable direction (the row direction), and the extracted character frame data, the tabular type data D1 and the character type data D4 of the tabular description file are sent to the character recognition unit 7. In the character recognition unit 7, the character type data D4 runs out once recognition has reached table element E2 of FIG. 3(b). For table element E3 in row 11, column 1, for example, the character type data D4 is therefore taken to be the same as that of table element E0 immediately above it in row 10, column 1. In other words, the character type attribute data is extended in the variable direction, which improves the character recognition rate. Table elements E4 and E5 can thus also be recognized accurately, and the same applies to subsequent elements.
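Extending the character type data in the variable direction amounts to repeating the last known row. The sketch below covers the row-variable case only; the function name and the string labels are hypothetical, and truncating when the new sheet has fewer rows is an assumption that the text does not spell out.

```python
def extend_char_types(char_types, n_rows):
    """Return character types for n_rows rows of a row-variable table.

    char_types: list of rows from the description file (D4), each a list of
    character type labels, one per column.
    Rows beyond the stored data copy the types of the row directly above.
    """
    if not char_types:
        raise ValueError("the description file must contain at least one row")
    extended = [list(row) for row in char_types[:n_rows]]
    while len(extended) < n_rows:
        extended.append(list(extended[-1]))
    return extended


# Two stored rows (header and one data row), but the new sheet has four rows.
stored = [["kanji", "kanji"], ["kanji", "digit"]]
types_for_sheet = extend_char_types(stored, 4)
```

Copying from the row above works because, in an identical-structure table, every data row shares the column layout of the rows before it.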

[Effects of the invention]

According to the present invention:

i) Because a tabular description file is created, a wide range of variable-length, identical-structure tabular documents such as rosters, address books and lists can be read continuously without human intervention.

ii) Because a tabular description file is created, only the processing actually needed for ruled line extraction has to be performed, which shortens the processing time.

iii) Ruled line extraction errors can be corrected automatically by means of the tabular description file, enabling highly accurate recognition.

iv) Because the character types of the table elements are propagated in the variable direction, a wide range of variable-length, identical-structure tabular documents can be recognized with high accuracy using the most suitable character types.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention. FIG. 2 is an explanatory diagram of the basic tabular types. FIGS. 2A, 2B and 2C are explanatory diagrams of specific table examples. FIG. 3 is an explanatory diagram of examples of tables to be recognized. FIG. 4 is an explanatory diagram of an example of ruled line data. FIG. 5 is an explanatory diagram of the relationship between ruled lines and character frames. FIG. 6 is an explanatory diagram of the general structure of the tabular description file. FIG. 7 is an explanatory diagram of a specific example of the tabular description file. FIG. 8 is an explanatory diagram of an example containing an unnecessary ruled line. FIG. 9 is an explanatory diagram of the ruled line data corresponding to FIG. 8.

Explanation of symbols: 1: host CPU; 2: CRT; 3: keyboard; 4: image input unit; 5: image memory; 6: ruled line recognition unit; 7: character recognition unit; 8: auxiliary storage unit; TX: tabular document; K: ruled line; E0 to E5: table elements; S: character frame (cell); D1: tabular type; D2: variable direction; D3: ruled line coordinate data; D4: character type data; K₁: unnecessary ruled line.

Claims

(57) [Claims]

[Claim 1] A recognition device for tabular documents, comprising: an image input unit that inputs image data of a document containing a table or an equivalent structure (a tabular document); a ruled line recognition unit that recognizes ruled lines from the image data; and a character recognition unit that recognizes characters enclosed by the ruled lines in the image data, wherein a tabular description file is created in advance by adding the tabular type and the variable direction to the various data on the table structure extracted from a standard tabular document, and, when a tabular document of the same type is recognized thereafter, at least the data in the variable direction of the file is updated and the characters in the table are recognized on the basis of the updated file.

[Claim 2] The recognition device for tabular documents according to claim 1, wherein a character type is specified for each table element (cell) and stored in the tabular description file, and, when the number of table elements in the document to be recognized changes, the character type data is extended in the variable direction of the tabular description file and the characters are recognized.