JP2915175B2

JP2915175B2 - Word space detection method

Info

Publication number: JP2915175B2
Application number: JP3165100A
Authority: JP
Inventors: 保夫本郷; 正年岡田; 一郎小倉
Original assignee: Efu Efu Shii Kk; Fuji Electric Co Ltd
Current assignee: Efu Efu Shii Kk; Fuji Electric Co Ltd
Priority date: 1990-10-01
Filing date: 1991-06-10
Publication date: 1999-07-05
Anticipated expiration: 2014-07-05
Also published as: JPH056459A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a character reading method, and in particular to a method for detecting the spaces between words (inter-word spaces) in a document image of proportionally printed English text. Proportional printing here refers to a printing technique in which each line is adjusted so that its English words fit within that line.

[0002]

[Prior Art] A known method of detecting inter-word spaces in proportionally printed English text obtains, for each line, the frequency of the spaces between characters (inter-character spaces) and uses as the detection threshold a point of frequency 0 in the valley between the peak of the frequency distribution that represents inter-character spaces and the peak that represents inter-word spaces (see, for example, JP-A-63-158678). That is, the threshold is determined on the assumption that the frequency distribution of spaces taken in one-dot steps contains, as in FIG. 13 for example, a point of frequency 0 (valley V) between two groups: the group M1 of inter-character spaces and the group M2 of inter-word spaces. The present applicant has also devised a method that, instead of taking the frequency distribution in one-dot steps as in FIG. 13, uses a histogram whose classes have a width determined in advance from the character size (the class width), which prevents inappropriate valleys from arising through variation in the spaces. That is, as shown in FIG. 14, a frequency histogram with a class width of 5% of the standard character size (a few dots) is created, and the threshold is determined on the assumption that a point of frequency 0 (valley V) exists between group M1 and group M2.
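As an illustration only, the difference between a one-dot frequency distribution and a class-width histogram might be sketched as follows. The space widths are hypothetical, and the helper names `class_histogram` and `empty_classes` are assumptions introduced for this sketch; they do not come from the cited publication.

```python
from collections import Counter

def class_histogram(spaces, class_width):
    """Count space widths (in dots) per class of `class_width` dots."""
    return Counter(s // class_width for s in spaces)

def empty_classes(hist):
    """Indices of zero-frequency classes between the lowest and highest occupied class."""
    return [k for k in range(min(hist), max(hist) + 1) if hist[k] == 0]

# Hypothetical space widths in dots for one text line.
spaces = [2, 3, 3, 5, 3, 2, 14, 15, 17, 3, 2]

one_dot = empty_classes(class_histogram(spaces, 1))
three_dot = empty_classes(class_histogram(spaces, 3))
print(one_dot)    # includes isolated gaps at 4 and 16 dots inside the groups
print(three_dot)  # a single contiguous valley between the two groups
```

With one-dot steps, small variations leave isolated empty values inside each group; with the wider class those gaps vanish and only the valley between the groups remains empty.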

[0003]

[Problems to Be Solved by the Invention] However, these conventional methods presuppose, as shown in FIGS. 13 and 14, that the frequency of space widths forms a bimodal distribution with two groups, a group of inter-character spaces and a group of inter-word spaces, and they detect inter-word spaces using a threshold that corresponds to the valley between the groups. In actual documents, however, the frequency distribution may contain three or more groups, as in FIG. 15. That is, in addition to the inter-character space group M1 and the inter-word space group M2, a group M3 of spaces that cannot be clearly classified as either inter-character or inter-word spaces (an undetermined space group) may occur.

[0004] Such a group arises for the following two reasons.
(1) An inter-word space that falls between a character combination for which kerning occurs within words (for example, f and t) is smaller than a normal inter-word space.
(2) In sans-serif fonts such as Gothic and bold faces, the inter-character space between u and m, between m and p, and so on is smaller than a normal inter-character space.
Because actual documents exhibit these typesetting tendencies, the smaller inter-word spaces and the larger inter-character spaces form groups of their own.

[0005] When the frequency distribution contains three or more groups in this way, the threshold cannot be determined correctly, which can cause words to be wrongly merged, as marked with "*" in FIG. 16, or wrongly split, as marked with "*" in FIG. 17. Moreover, English documents use many typefaces and are printed in various forms, such as typewriting and phototypesetting, so a single method alone sometimes misdetects inter-word spaces. The present invention was made to solve these problems, and its object is to provide an inter-word space detection method that can always detect the spaces between words accurately regardless of typeface or printing form.

[0006]

[Means for Solving the Problems] To achieve the above object, the first invention is an inter-word space detection method in which individual characters are cut out from an input English document image and the space widths between characters are calculated; the frequency distribution of the space widths obtained is expressed as a histogram by class width; the space width corresponding to the valley formed in the histogram between the hump of space widths representing inter-character spaces and the higher-positioned hump of space widths representing inter-word spaces is taken as the threshold that separates inter-character space widths from inter-word space widths; and an inter-character space whose calculated width is larger than the threshold is detected as an inter-word space. The method is characterized in that, after the individual cut-out characters have been recognized, the calculated inter-character space widths are corrected using correction coefficients determined by typesetting rules for the characters located before and after each space, and the histogram is created afterwards.

[0007] The second invention is characterized in that, in the first invention, when none of the valleys of the histogram representing the frequency distribution of inter-character space widths contains a class with a frequency of 0, the class width is narrowed step by step, and when a class with a frequency of 0 appears in a valley, a space width within that class is taken as the threshold.

[0008] The third invention is characterized in that, in the first or second invention, when the threshold is obtained from the space widths within a class located in a valley, the space width at the median of that class is taken as the threshold.

[0009] The fourth invention is an inter-word space detection method in which individual characters are cut out from an input English document image and the space widths between characters are calculated; the frequency distribution of the space widths obtained is expressed as a histogram by class width; the space width corresponding to the valley formed in the histogram between the hump of space widths representing inter-character spaces and the higher-positioned hump of space widths representing inter-word spaces is taken as the threshold that separates inter-character space widths from inter-word space widths; and an inter-character space whose calculated width is larger than the threshold is detected as an inter-word space. The method is characterized as follows. Learning English document images in various typefaces and printing forms are input in advance; individual characters are cut out from each image and the space widths between characters are calculated; their frequency distribution is expressed as a histogram; the peak space width is detected for each of the hump representing inter-word spaces and the hump representing inter-character spaces formed in the histogram; the ratio of the peak space width to each space width other than the peak is calculated as a correction coefficient; and from the correction coefficients obtained, a correction coefficient table for correcting the inter-character space width to the peak space width for each combination of preceding and following characters is created for each learning English document image. Then the individual characters cut out as the detection target are recognized; the calculated space widths between characters are corrected by referring, for each combination of characters located before and after a space, to the plurality of correction coefficient tables created in advance; a histogram showing the frequency distribution of space widths is created for each correction coefficient table referred to, and the histograms are compared; the optimum correction result is selected; and the calculated inter-character space widths are corrected using that correction result.

[0010]

[Operation] In the first invention, after the individual cut-out characters have been recognized, the calculated inter-character space widths are corrected using correction coefficients determined by the typesetting rules for the characters located before and after each space; the histogram is then created, and the inter-word spaces are detected.

[0011] In the second invention, when none of the valleys of the histogram representing the frequency distribution of inter-character space widths contains a class with a frequency of 0, the frequency distribution is obtained again with successively narrower class widths; once a class with a frequency of 0 appears in a valley, a space width within that class is taken as the threshold, and the inter-word spaces are detected from the resulting histogram.

[0012] In the third invention, when the threshold is obtained from the space widths within a class located in a valley, the space width at the median of that class is taken as the threshold, and the inter-word spaces are detected.

[0013] In the fourth invention, when learning English document images in various typefaces and printing forms are input in advance, individual characters are cut out from each image and the space widths between characters are calculated. Their frequency distribution is expressed as a histogram, and the peak space width is detected for each of the hump representing inter-word spaces and the hump representing inter-character spaces formed in the histogram. The ratio of the peak space width to each space width other than the peak is then calculated to give a correction coefficient, and from the correction coefficients obtained, a correction coefficient table for correcting the inter-character space width to the peak space width for each combination of preceding and following characters is created for each learning English document image. Next, the individual characters cut out as the detection target are recognized, and the calculated space widths between characters are corrected by referring, for each combination of characters located before and after a space, to the plurality of correction coefficient tables created in advance. A histogram showing the frequency distribution of space widths is created for each correction coefficient table referred to, the histograms are compared, and the optimum correction result is selected. The calculated inter-character space widths are corrected using that correction result, and the inter-word spaces are then detected.

[0014]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a flowchart showing character recognition processing, including inter-word space detection, according to an embodiment of the first to third inventions. FIG. 2 is an explanatory diagram showing an example of a character string to be processed. FIG. 3 is a histogram showing the frequency of the space widths obtained from the character string of FIG. 2; the histogram was created with a class width of 5% (a few dots) of a predetermined standard character size. FIG. 4 is a histogram showing the frequency of the corrected space widths when the frequency histogram of FIG. 3 is corrected by the correction coefficients. FIG. 5 shows the final recognition result when the character string of FIG. 2 is processed.

[0015] Next, an embodiment of the inter-word space detection method is described with reference to FIGS. 1 to 5. In FIG. 1, when a document image of proportionally printed English text is input (S11), each character is cut out by its circumscribed rectangle (S12). A well-known method can be used for this character segmentation, so a detailed description is omitted. From the coordinates of the circumscribed rectangles, the space widths between characters are calculated and stored in the inter-character space array SP[] (S13). Here SP[i] denotes the inter-character space (in dots) between character number i and character number (i+1), and is defined by the following equation.

[0016] SP[i] = ST[i+1] - ED[i]
where ST[i] and ED[i] are the start coordinate and end coordinate of character number i. In other words, the inter-character space of the character of interest is the start coordinate of the next character minus the end coordinate of the character of interest (in dots). When the inter-character spaces calculated in S13 from the character string of FIG. 2 are plotted as a histogram with a class width of 5% of the standard character size, the frequency histogram of FIG. 3 results, and three groups are present. The group at space widths of 20 to 25% is caused by the spaces between the characters 'f' and 'f' and between 'f' and 't'.
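As a minimal sketch of step S13, assuming each circumscribed rectangle is reduced to a hypothetical (ST, ED) pair of horizontal start and end coordinates listed from left to right:

```python
def inter_character_spaces(boxes):
    """Return SP[i] = ST[i+1] - ED[i] for consecutive character boxes.

    `boxes` is a list of (ST, ED) horizontal coordinates in dots,
    ordered left to right along one text line.
    """
    return [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]

# Hypothetical coordinates for four characters on one line.
boxes = [(0, 10), (13, 22), (25, 30), (45, 55)]
sp = inter_character_spaces(boxes)
print(sp)  # [3, 3, 15]
```

The array has one entry fewer than the number of characters, since SP[i] describes the gap after character i.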

[0017] Next, recognition is performed on each cut-out character (S14). This is ordinary recognition processing. Based on the recognition results obtained here, the inter-character space widths calculated in S13 are corrected with correction values determined by the typesetting rules for the characters before and after each space (S15). The space width is corrected by the following equation.
SP'[i] = H(ch[i], ch[i+1]) × SP[i]
where ch[i] is the recognition result for character number i, H(a, b) is the correction coefficient for the space between a preceding character a and a following character b, and SP'[i] is the corrected space width.

[0018] Values such as those in Table 1 are obtained for the correction coefficients, for example by experiment. The correction coefficient for characters not listed in Table 1 is 1.0.

[0019]

[Table 1]
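The correction of step S15, SP'[i] = H(ch[i], ch[i+1]) × SP[i], might be sketched as below. The contents of Table 1 are not reproduced in this text, so the coefficient 1.5 for the pair ('f', 't') is a made-up value for illustration; only the default of 1.0 for unlisted pairs comes from the description.

```python
def correct_spaces(sp, chars, table):
    """Apply SP'[i] = H(ch[i], ch[i+1]) * SP[i].

    `table` maps (preceding char, following char) to a coefficient;
    pairs that are not listed use 1.0.
    """
    return [table.get((chars[i], chars[i + 1]), 1.0) * sp[i]
            for i in range(len(sp))]

# Hypothetical recognition result for "of the" and its measured spaces in dots.
chars = list("ofthe")
sp = [2, 10, 2, 3]
table = {("f", "t"): 1.5}  # illustrative value, not taken from Table 1

sp_corrected = correct_spaces(sp, chars, table)
print(sp_corrected)  # [2.0, 15.0, 2.0, 3.0]
```

The kerned gap between 'f' and 't' is enlarged toward the ordinary inter-word width, while the other gaps are left unchanged.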

[0020] The frequency histogram shown in FIG. 4 is created from the spaces corrected by the correction coefficients (S16). As FIG. 4 shows, the correction moves the group M3 at 20 to 25% in FIG. 3 into the group M2 at 30 to 35%, and group M3 disappears. After the frequency histogram has been reduced to two groups in this way, a class with a frequency of 0 is detected in the valley between the two groups, and the space threshold TH is determined within that class (S17). If no class with a frequency of 0 exists, the class width is reduced and a finer frequency histogram is created again to find a class with a frequency of 0. If a class with a frequency of 0 is still not found, the class width is reduced further, ultimately down to single-pixel units. Once a class with a frequency of 0 has been found in this way, the space threshold TH is determined within that class.
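The narrowing of the class width in step S17 could be sketched as follows. The description does not say by how much the class width is reduced at each step, so decrementing by one dot is an assumption, as is treating every empty class between the lowest and highest occupied class as part of the valley. The function name `find_zero_valley` is likewise hypothetical.

```python
from collections import Counter

def find_zero_valley(spaces, class_width):
    """Narrow the class width until a zero-frequency class appears.

    Returns (class width used, indices of empty classes), or None if no
    empty class is found even at a width of one dot.
    """
    width = class_width
    while width >= 1:
        counts = Counter(int(s // width) for s in spaces)
        empty = [k for k in range(min(counts), max(counts) + 1)
                 if counts[k] == 0]
        if empty:
            return width, empty
        width -= 1  # assumed schedule: shrink by one dot per retry
    return None

# Hypothetical corrected space widths in dots.
spaces = [2, 3, 4, 5, 6, 7, 8, 14, 15, 16]
result = find_zero_valley(spaces, 4)
print(result)  # (3, [3])
```

Here a class width of 4 dots leaves no empty class, while 3 dots exposes class 3 (9 to 11 dots) as the valley.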

[0021] The threshold TH is determined as follows. In the example of FIG. 4, the classes with a frequency of 0 are the three classes at 15 to 20% (7 to 8 dots), 20 to 25% (9 to 10 dots), and 25 to 30% (10 to 12 dots) of the character size, so the median of these classes, 10 dots, is chosen as the threshold TH. Once TH has been determined, it is compared with each corrected space length SP'[i]: if TH < SP'[i], the space is judged to be an inter-word space; otherwise it is judged to be an inter-character space. A space character is inserted at each location judged to be an inter-word space (S18).
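A minimal sketch of the threshold choice and step S18 follows. Taking the upper median of the dot values 7 to 12 is an assumption chosen because it reproduces the 10-dot threshold of the example; the description does not state how an even-sized range is resolved. The character string and space widths are hypothetical.

```python
import statistics

def insert_word_spaces(chars, sp_corrected, th):
    """Insert a space character wherever TH < SP'[i]."""
    out = [chars[0]]
    for ch, space in zip(chars[1:], sp_corrected):
        if th < space:
            out.append(" ")
        out.append(ch)
    return "".join(out)

# Empty classes span 7 to 12 dots in the example; take the upper median.
th = statistics.median_high(range(7, 13))

# Hypothetical recognized characters and corrected spaces SP'[i] in dots.
text = insert_word_spaces("ofthe", [2.0, 15.0, 2.0, 3.0], th)
print(th, repr(text))  # 10 'of the'
```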

[0022] FIG. 5 shows the final recognition result obtained when this processing is applied to the character string of FIG. 2, the inter-word spaces are detected, and space characters are inserted at them. FIG. 5 shows that a blank has been correctly inserted between each pair of words. The character string with correctly inserted inter-word blanks is then post-processed, for example by correcting misread characters and changing between upper and lower case (S19).

[0023] Next, an embodiment of the fourth invention is described. FIG. 6 is a flowchart of the processing by which a correction coefficient table is created from a sample document by learning. In the figure, when an English document image is first input for learning (S61), the characters are cut out, the inter-character space widths are calculated, and the characters are recognized (S62 to S64), as in the processing of FIG. 1. Next, a histogram is created from the frequency distribution of the calculated inter-character spaces, and the peak value of each of the inter-character space hump and the inter-word space hump appearing in the histogram is detected (S65). FIG. 7 shows an example of the histogram created. FIG. 8 shows the peak value Sc obtained from the inter-character space hump of FIG. 7 and the peak value Sw obtained from the inter-word space hump.

[0024] Next, using the detected peak values Sc and Sw as reference values, a correction coefficient Ki is obtained for each character combination from the space widths SP[i] belonging to the inter-character space hump and to the inter-word space hump, respectively (S66). For the inter-character space hump, it is obtained by
Ki = Sc / SP[i]
and for the inter-word space hump, by
Ki = Sw / SP[i]
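Steps S65 and S66 might be sketched as follows. The description does not say how each space is assigned to a hump during learning, so the `split` value separating the two humps is a hypothetical parameter, and the peak is taken as the most frequent width on each side. Space widths are assumed to be positive.

```python
from collections import Counter

def peaks_and_coefficients(sp, split):
    """Return (Sc, Sw, [Ki]) with Ki = Sc/SP[i] or Ki = Sw/SP[i].

    Spaces no larger than `split` are treated as the inter-character
    hump, the rest as the inter-word hump (an assumption of this sketch).
    """
    char_side = [s for s in sp if s <= split]
    word_side = [s for s in sp if s > split]
    sc = Counter(char_side).most_common(1)[0][0]  # peak of inter-character hump
    sw = Counter(word_side).most_common(1)[0][0]  # peak of inter-word hump
    ki = [(sc if s <= split else sw) / s for s in sp]
    return sc, sw, ki

# Hypothetical space widths in dots from a learning document.
sp = [3, 3, 4, 15, 3, 12, 15, 2]
sc, sw, ki = peaks_and_coefficients(sp, split=8)
print(sc, sw, ki)  # 3 15 [1.0, 1.0, 0.75, 1.0, 1.0, 1.25, 1.0, 1.5]
```

Multiplying each space by its Ki brings it to the peak width of its own hump, which is the purpose stated for the correction coefficient table.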

[0025] The correction coefficients Ki obtained here are aggregated for each combination of characters before and after a space and averaged to create a correction coefficient table (S67). By inputting sample documents in various typefaces, including English documents that are proportionally printed and ones that are not, and repeating this processing, a plurality of correction coefficient tables is created. The initial value of each table entry is 1.0. Table 2 shows part of the correction coefficient table created from a typewritten sample document.

[0026]

[Table 2]

[0027] Table 3 shows part of the correction coefficient table created from a sample English-language magazine set in a Modern typeface.

[0028]

[Table 3]
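The aggregation of step S67 could be sketched as below. The values in Tables 2 and 3 are not reproduced in this text, so the characters and coefficients are made up, and the function name `build_table` is an assumption of this sketch; the default of 1.0 for pairs never seen reflects the stated initial value of the table.

```python
from collections import defaultdict

def build_table(chars, ki):
    """Average the coefficients Ki per (preceding, following) character pair."""
    sums = defaultdict(lambda: [0.0, 0])  # pair -> [running total, count]
    for i, k in enumerate(ki):
        pair = (chars[i], chars[i + 1])
        sums[pair][0] += k
        sums[pair][1] += 1
    return {pair: total / n for pair, (total, n) in sums.items()}

# Hypothetical recognized characters and their per-space coefficients.
chars = ["o", "f", "t", "h", "e", "f", "t"]
ki = [1.0, 1.25, 1.0, 1.0, 1.0, 1.75]

table = build_table(chars, ki)
print(table[("f", "t")])             # 1.5
print(table.get(("x", "y"), 1.0))    # 1.0 for a pair absent from the sample
```

Running this once per sample document would give one table per typeface or printing form, as the description requires.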

[0029] FIG. 9 is a flowchart of the processing in which the input English document image is recognized, the space widths are then corrected using the correction coefficient tables, and the inter-word spaces are detected. In the figure, when a document image is first input (S91), the characters are cut out, the inter-character spaces are calculated, and the characters are recognized (S92 to S94), as in the processing of FIG. 1.

[0030] Next, using the character recognition results, the blanks, that is, the space widths, are corrected with each of the correction coefficient tables created in advance, according to the combination of characters located before and after each space (S95). A histogram representing the frequency distribution of the corrected space widths is then created for each correction coefficient table, and the histogram in which the inter-character space hump and the inter-word space hump each spread the least is chosen as the optimum correction (S96). FIGS. 10 to 12 show, as histograms, the frequency distributions obtained by correcting the space widths of the same input English document with three different correction coefficient tables.

[0031] Among these figures, the width hc of the inter-character space hump and the width hw of the inter-word space hump are smallest in FIG. 12, so the histogram of FIG. 12 is selected as the most suitable correction. Using the optimum histogram obtained in this way, the threshold separating inter-character spaces from inter-word spaces is determined, and only space widths larger than the threshold are judged to be inter-word spaces (S97). A blank symbol is then inserted at the position of each space judged to be an inter-word space (S98). The processing from S97 onward is the same as that from S17 onward in FIG. 1.
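The table selection of steps S95 and S96 might look like the sketch below. The description only says that the histogram whose humps spread the least is chosen; measuring the spread as hc + hw, with each width taken as the range of values on either side of a hypothetical `split`, is an assumption of this sketch, as are the table names and coefficients. Both sides of the split are assumed to be non-empty.

```python
def corrected(sp, chars, table):
    """SP'[i] = H(ch[i], ch[i+1]) * SP[i], with 1.0 for unlisted pairs."""
    return [table.get((chars[i], chars[i + 1]), 1.0) * sp[i]
            for i in range(len(sp))]

def spread(spaces, split):
    """Assumed spread measure: hump width hc plus hump width hw."""
    char_side = [s for s in spaces if s <= split]
    word_side = [s for s in spaces if s > split]
    hc = max(char_side) - min(char_side)
    hw = max(word_side) - min(word_side)
    return hc + hw

def select_table(sp, chars, tables, split):
    """Name of the table whose corrected spaces spread the least."""
    return min(tables,
               key=lambda name: spread(corrected(sp, chars, tables[name]), split))

# Hypothetical input "is of the" and two illustrative tables.
chars = list("isofthe")
sp = [2, 15, 2, 10, 2, 3]
tables = {"typewriter": {}, "magazine": {("f", "t"): 1.5}}

best = select_table(sp, chars, tables, split=8)
print(best)  # magazine
```

Because the chosen table is tied to a known sample document, its name also indicates the likely typeface of the input, which matches the remark that this information is useful for later processing.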

[0032] In the embodiment of the fourth invention, when an English document is input, a correction coefficient table specific to that document is created automatically by learning, so inter-word blanks can always be detected accurately even when a new English document is input for recognition. In this embodiment, once the optimum histogram has been determined, the typeface and typesetting information of the document associated with the correction coefficient table used for it is also obtained, which is convenient for subsequent processing. In this way, in each embodiment of the present invention, for the spaces between characters and between words, which conventionally could not appropriately be judged by a uniform rule, the most suitable threshold is set automatically for each input English document, and accurate inter-word blanks are inserted in each recognized line of text.

[0033]

[Effects of the Invention] As described above, according to the first invention, after the individual cut-out characters have been recognized, the calculated inter-character space widths are corrected using correction coefficients determined by the typesetting rules for the characters located before and after each space, and the histogram is created afterwards before the inter-word spaces are detected, so inter-word spaces can be detected among the inter-character spaces without error. According to the second invention, when none of the valleys of the histogram representing the distribution of inter-character space widths contains a class with a value of 0, the class width is narrowed step by step, and when a class with a value of 0 appears in a valley, a space width within that class is taken as the threshold. Therefore, even when, for example, the accuracy of character segmentation is poor, the most accurate threshold for that condition is obtained, and a drop in detection accuracy can be prevented.

[0034] According to the third invention, when the threshold is obtained from the space widths within a class located in a valley, the space width at the median of that class is taken as the threshold, so the most reasonable value becomes the threshold even when the class width is wide, and detection accuracy improves accordingly. According to the fourth invention, learning English document images in various typefaces and printing forms are input in advance to create various correction coefficient tables, the calculated space widths between characters are corrected with each correction coefficient table and compared, and the optimum correction result is selected from the outcome. Therefore, even when the English document to be processed uses any of various typefaces or printing forms, the inter-word spaces can be detected accurately in accordance with them.

[Brief Description of the Drawings]

FIG. 1 is a flowchart showing the processing of an embodiment of the first to third inventions.

FIG. 2 is a diagram showing an example of an English character string to be processed.

FIG. 3 is a histogram showing the frequency distribution of space widths.

FIG. 4 is a histogram after the space widths have been corrected.

FIG. 5 is a diagram showing an example of the reproduced English character string.

FIG. 6 is a flowchart showing part of the processing of an embodiment of the fourth invention.

FIG. 7 is a histogram showing the frequency distribution of space widths in the same embodiment.

FIG. 8 is a histogram showing the frequency distribution of space widths.

FIG. 9 is a flowchart showing part of the processing.

FIG. 10 is a histogram showing the frequency distribution of space widths.

FIG. 11 is a histogram showing the frequency distribution of space widths.

FIG. 12 is a histogram showing the frequency distribution of space widths.

FIG. 13 is a histogram created by a conventional method.

FIG. 14 is a histogram created by a conventional method.

FIG. 15 is a histogram created by a conventional method.

FIG. 16 is a diagram showing an example of the erroneous merging that occurs in a conventional method.

FIG. 17 is a diagram showing an example of the erroneous splitting that occurs in a conventional method.

[Explanation of Symbols]

M1: inter-character space group
M2: inter-word space group
M3: undetermined space group
V: portion with a frequency of 0 (valley)
Sc: peak value of the inter-character space peak
Sw: peak value of the inter-word space peak
hc: width of the inter-character space peak
hw: width of the inter-word space peak

Continuation of the front page
(72) Inventor: Ichiro Ogura, c/o Fuji Facom Control Co., Ltd., 1 Fuji-cho, Hino-shi, Tokyo
(56) References cited: JP-A-4-139594 (JP, A)
(58) Fields searched (Int. Cl.⁶, DB name): G06K 9/20, G06K 9/34

Claims

(57) [Claims]

[Claim 1] An inter-word space detection method in which individual characters are segmented from an input English document image and the space widths between characters are calculated; the frequency distribution of the obtained space widths is represented as a histogram with a given class width; the space width corresponding to the valley formed between the peak of space widths representing inter-character spaces and the peak of space widths located above it and representing inter-word spaces is used as a threshold separating inter-character space widths from inter-word space widths; and, when a calculated inter-character space width is larger than the threshold, that space is detected as an inter-word space, the method being characterized in that, after the individual segmented characters are recognized, the calculated inter-character space widths are corrected using correction coefficients determined by a character-composition rule for the characters located before and after each inter-character space, and the histogram is created thereafter.

[Claim 2] The inter-word space detection method according to claim 1, characterized in that, when none of the valleys of the histogram representing the frequency distribution of inter-character space widths contains a class with a frequency of 0, the class width is narrowed step by step, and when a class with a frequency of 0 appears in a valley, a space width within that class is used as the threshold.

[Claim 3] The inter-word space detection method according to claim 1 or claim 2, characterized in that, when the threshold is derived from the space widths within the class located at the valley, the space width that is the central value (median) of that class is used as the threshold.

[Claim 4] An inter-word space detection method in which individual characters are segmented from an input English document image and the space widths between characters are calculated; the frequency distribution of the obtained space widths is represented as a histogram with a given class width; the space width corresponding to the valley formed between the peak of space widths representing inter-character spaces and the peak of space widths located above it and representing inter-word spaces is used as a threshold separating inter-character space widths from inter-word space widths; and, when a calculated inter-character space width is larger than the threshold, that space is detected as an inter-word space, the method being characterized in that: in advance, learning English document images in various typefaces and print styles are input, individual characters are segmented from each image and the space widths between characters are calculated, their frequency distribution is represented as a histogram, the space width at the peak of each of the peak representing inter-word spaces and the peak representing inter-character spaces formed in the histogram is detected, the ratio between each peak space width and the respective non-peak space widths is calculated as a correction coefficient, and from the obtained correction coefficients a correction coefficient table for correcting the inter-character space width to the peak space width for each combination of preceding and following characters is created for each learning English document image; and then, the individual characters segmented from the detection target are recognized, the calculated inter-character space widths are corrected by referring, for each combination of characters located before and after an inter-character space, to the plurality of correction coefficient tables created in advance, a histogram showing the frequency distribution of space widths is created for each referenced correction coefficient table and the histograms are compared, the optimum correction result is selected, and the calculated inter-character space widths are corrected using that correction result.
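The learning step of claim 4 can be pictured with a small Python sketch that builds one correction coefficient table from one learning image. This is only an illustration under stated assumptions: the input is assumed to be already restricted to inter-character gaps, the peak is taken as the most frequent width, and the ratios observed for the same character pair are averaged. The function name and the data layout are hypothetical and are not taken from the patent.

```python
from collections import Counter, defaultdict
from statistics import mean

def learn_table(samples):
    """Build a {(prev_char, next_char): coefficient} table.

    samples: list of (prev_char, next_char, gap_width) tuples taken
    from one learning image, assumed to contain only inter-character
    gaps. Each coefficient is peak_width / observed_width, so that
    multiplying a gap by it pulls the gap toward the peak width."""
    # Peak width: the most frequent gap width (assumed definition).
    peak = Counter(w for _, _, w in samples).most_common(1)[0][0]
    ratios = defaultdict(list)
    for prev_char, next_char, width in samples:
        if width > 0:  # skip touching characters to avoid division by zero
            ratios[(prev_char, next_char)].append(peak / width)
    # Average the ratios seen for the same character pair (assumption).
    return {pair: mean(values) for pair, values in ratios.items()}
```

A table learned this way stretches tightly kerned pairs and shrinks loosely set ones, which narrows the inter-character peak before the histogram is built. Repeating the call for each learning image yields the plurality of tables that the claim compares.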