TW308677B - Method of automatically separating text boxes and segmenting words in a Chinese/English recognition system


Info

Publication number
TW308677B
Authority
TW
Taiwan
Prior art keywords
word
string
blank
block
item
Prior art date
Application number
TW82108380A
Other languages
Chinese (zh)
Inventor
Yeong-Shuenn Lin
Chyi-Pyng Yeh
Menq-Jia Tsay
Liang-Shyan Liou
Original Assignee
Ind Tech Res Inst
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ind Tech Res Inst
Priority to TW82108380A
Application granted
Publication of TW308677B

Landscapes

  • Character Input (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method of automatically separating text boxes and segmenting words in a Chinese/English recognition system comprises: (1) inputting binary image data; (2) computing all components from the binary image data; (3) combining components in the same line into words according to component adjacency and character height and size; (4) checking the rotation angle using the centers of the components in each word; (5) when the rotation angle is larger than a preset value, rotating the coordinates of each component into a correct rectangle and returning to step (3), and when the rotation angle is less than or equal to the preset value, proceeding to step (6); (6) using an adaptive profile, with words as the projected units, segmenting the page into windows at the large blank areas to the left/right and top/bottom; (7) dividing the word shapes in each window into several shape types and counting the horizontal and vertical ones in order to distinguish text from graphics and horizontal from vertical layout; and (8) outputting the text regions.

Description

308677 A7 B7 V. Description of the invention
Paper size: Chinese National Standard (CNS) A4 (210 x 297 mm). Printed by the Employees' Consumer Cooperative of the Central Bureau of Standards, Ministry of Economic Affairs.

5-1 Background of the invention

Methods for extracting text from an image can be roughly divided into the following categories.

(1) Projection (profile) method
(i) Traditional approach: the image is projected onto the X and Y axes, and positions where the profile is blank or small are used to separate line from line and character from character. The drawback is that it fails on tilted images and on images that mix text and graphics.
(ii) Recursive X-Y: the image is again projected onto the X and Y axes, but for a problematic part, where no blank position can be found, the range is narrowed and only that part is projected again to find the line coordinates. This still cannot handle strongly tilted images, although it resolves some cases of mixed text and graphics.

(2) Dilation and erosion method: the image is dilated and eroded to obtain connected regions, which are used to separate lines and to identify graphics. The drawbacks are low speed and a large memory requirement.

(3) Segmentation method, of which there are two kinds
(i) Small-region cutting: the image is divided into fixed regions along the X and Y axes, the projection within each small block is computed to decide whether text is present, and adjacent regions are tested for overlap to check whether they belong to a connected line. The drawbacks are that the number of cuts varies, and that tilted documents, complex mixtures of text and graphics, and multi-column documents are hard to handle.
(ii) Two-dimensional small-region cutting: the image is cut into small two-dimensional cells, and rows separated by a small blank segment mark the boundary between upper and lower units. This produces many connected units spread over the image, and their projections on the Y axis are then used to resolve the lines. Oversized units are treated as graphics or noise. The drawbacks are that images mixing text and graphics remain hard to handle, and documents that combine horizontal and vertical typesetting cannot be processed.

(4) Background analysis: features of the blank background are used to extract chains of large and small blank areas, from which text regions and graphic regions are identified. This approach has not yet produced significant results.

5-2 Purpose and summary of the invention

The prior art above cannot effectively process printed Chinese that is mixed with graphics and is tilted, whether the tilt is in the document or is introduced by optical scanning, and this limits its practical use.

The method of the present invention processes an optically scanned printed Chinese/English document in the following steps.
Step 1: Using a fast table look-up (see Republic of China Invention Patent No. 54855 and US Patent No. 5357582), compute all connected components in the image.
Step 2: Use statistics to estimate the character size of the document.
Step 3: Using the data from step 2, connect adjacent components into words; the word is the basic unit from which lines are assembled.
Step 4: From the center points of the components of a word, compute the tilt angle of that word; statistics over the tilt angles of all words give the tilt angle of the whole document. Rotate the coordinates of each component upright.
Step 5: Compute the line spacing of the document from statistics of the word spacing.
Step 6: Using the line spacing, apply adaptive X and Y projection (profile) to obtain the blocks.
Step 7: Analyze the shape tendency of the words in each block to decide whether it is a text block, a title region or a graphic region, and whether it is typeset horizontally or vertically.

This method not only solves the problem of tilted documents but also handles complex mixtures of text and graphics; Figure 2 shows a processing result.

The method of the present invention handles printed Chinese documents that mix text and graphics and achieves the following: (1) horizontal and vertical typesetting may coexist and are distinguished automatically; (2) text and graphics may coexist and are separated automatically; (3) tilted documents are handled; (4) text is extracted.

5-3 Brief description of the drawings

Figure 1 shows a row of tilted printed characters and the rotation of the element (component) coordinates.
Figure 2 shows a document with both horizontal and vertical typesetting and the result of block segmentation.
Figure 3 is the system flowchart of the present invention.
Figure 4 is the detailed flowchart of step 302 in Figure 3.
Figure 5 is the detailed flowchart of step 303 in Figure 3.
Figure 6 is the detailed flowchart of step 306 in Figure 3.
Figure 7 is the detailed flowchart of step 308 in Figure 3.
Figure 8 shows that, even when the document is tilted, projection (profile) on the rotated coordinates overcomes the tilt.
Figure 9 shows the word tilt-angle calculation of step 304 in Figure 3.

5-5 Detailed description of the invention

Before the invention is explained, three terms used in the description are defined.
(1) Element (Component): the smallest rectangle containing a connected piece of the image. For example, the character 言 may consist of 4 elements.
(2) Word: elements whose distance from one another is less than the line spacing computed in the present invention are combined into a word, much as letters form a word in English. In a Chinese document a word corresponds to a clause between punctuation marks.
(3) Window (block): a region whose contents share the same attribute, for example a text region or a graphic region.

For example, the character 羊 shown in the drawings consists of three elements, 101, 102 and 103, while 201, 202, 203, 204 and 205 in Figure 2 are all regarded as windows in the present invention.

The method of the present invention builds upward from the elements (components) of the scanned document and applies statistics and numerical analysis. Its flow is shown in Figure 3.
Step 301: The system starts. The input to the system is binary image data.
Step 302: Build the element list (component list) from the binary image data. A table look-up method is combined with two linked-list data structures to compute the coordinates of every element, meaning the coordinates of the smallest rectangle containing it. In other words, the left and right boundaries of all runs in one row of image data are found and then checked for connection with the elements of the previous row, either enlarging an element or creating a new one. See Figure 4.
Step 303: Using the adjacency of the elements and the character height and size, combine elements in the same line into words. See Figure 5.
Step 304: Check the rotation angle. The center points of the elements within each word are fitted to the line Y = aX + b by minimum square error (MSE) to obtain the probable rotation angle. If the angle θ is too large (> 10°), or large enough that adjacent lines overlap (the original states this condition with tan⁻¹ and max(image width, image length), but the expression is not legible), the flow jumps back to step 303.
Step 305: Rotate each element into its correct rectangle and re-string the words (execute step 303); then compute the line spacing statistically from these words.
Step 306: Using adaptive projection with words as the projected units, separate the page into windows at the large blank areas to the left, right, above and below. See Figure 6.
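As a rough illustration of the idea behind step 302 (scan each row for runs, then attach each run to a rectangle from the previous row or start a new one), the following Python sketch may help. It is not the patented table look-up implementation: it uses a plain per-pixel scan and a union-find structure in place of the two sorted linked lists, and it assumes 8-connectivity, which the text does not specify. The names row_runs and connected_rectangles are hypothetical.

```python
def row_runs(row):
    """Return (left, right) inclusive column spans of consecutive foreground pixels."""
    runs, start = [], None
    for x, value in enumerate(row):
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def connected_rectangles(image):
    """Single pass over the rows; returns sorted boxes (left, top, right, bottom)."""
    parent = []  # union-find parent per component id (swapped-in technique)
    boxes = []   # bounding box per component id

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    previous = []  # (left, right, id) for the runs of the previous row
    for y, row in enumerate(image):
        current = []
        for left, right in row_runs(row):
            cid = None
            for p_left, p_right, p_id in previous:
                # Assumed 8-connectivity: runs touch if they overlap within one column.
                if p_left <= right + 1 and p_right >= left - 1:
                    root = find(p_id)
                    if cid is None:
                        cid = root
                    elif root != cid:
                        # The run bridges two rectangles: merge them.
                        parent[root] = cid
                        a, b = boxes[root], boxes[cid]
                        boxes[cid] = [min(a[0], b[0]), min(a[1], b[1]),
                                      max(a[2], b[2]), max(a[3], b[3])]
            if cid is None:
                # No rectangle above touches this run: start a new one.
                cid = len(parent)
                parent.append(cid)
                boxes.append([left, y, right, y])
            else:
                b = boxes[cid]
                boxes[cid] = [min(b[0], left), b[1], max(b[2], right), y]
            current.append((left, right, cid))
        previous = current
    return sorted(tuple(boxes[i]) for i in range(len(parent)) if find(i) == i)
```

Like the procedure in the text, the sketch reads each row once and only compares a run against the runs of the row directly above it.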

Step 307: Classify the shape of each word in a window into six types and count the horizontal and vertical ones, to decide whether the window is a text block and whether it is typeset horizontally or vertically.
Step 308: Cut the text blocks into lines and characters.
Step 309: End.

The steps in Figure 3 are described in more detail below. The detailed procedure of step 302 is shown in the flowchart of Figure 4. To compute all the elements (components) in the image, two terms are defined first:
1. RUN (consecutive bits): within a single row (ROW) of the image, a unit formed by consecutive meaningful bits is called a RUN. Its data structure records the left and right boundary positions.

2. Connecting rectangle: the smallest enclosing rectangle formed by RUNs that are connected from one row to the next. Its data structure records the top, bottom, left and right positions, together with the (left, right) position of the RUN in its last ROW, so that connection with the next ROW can be checked.

Step 401: For row i of the image, set the connecting rectangles to the cleared state (NULL).
Step 402: Using the boundary-test method, taking 8 bits as a unit, quickly find the positions and the number of the RUNs in row i.
Step 403: The RUN data structures are chained in one linked list (in ascending order), and the connecting rectangles that have not yet ended are chained in a second linked list, sorted by their position in their last ROW. Whether a connecting rectangle touches a RUN can therefore be tested quickly by walking the two lists in a manner similar to insertion sort. If they are connected, the RUN is merged into the existing connecting rectangle; otherwise a new connecting rectangle is created and inserted into the rectangle list so that the list stays sorted, which speeds up the next pass.
Step 404: Continue with the next row (ROW).
Step 405: When all image elements (components) have been computed, output all the connecting rectangles. Because two lists and the fast rectangle boundary test are used, the image needs to be traced only once.

Figure 5 is a more detailed flowchart of step 303.

Step 501: Statistically, with element (component) size on the horizontal axis and count on the vertical axis, find the size with the largest count and take it as the representative character height.
Step 502: Remove elements that are too small, too large, or long and narrow.
Step 503: Initially set the inter-character spacing (HC) to 1/3 of the character height; this is an estimate.
Step 504: String elements whose distance is smaller than the spacing HC into words, each represented by a linked list.
Step 505: Check whether the result is reasonable, that is, whether the representative height after stringing is close to the representative height computed originally. If it exceeds the value T1 (for example 80%), the result is unreasonable and the spacing HC is at fault. If too many single elements fail to be strung into words, HC is increased; if too many words have an excessive height, HC is decreased. A reasonable word contains only one line of text, not two or more. For example, with the two text lines 工研院電通所 and 新竹縣竹東鎮, a reasonable grouping puts each line in its own word, while an unreasonable grouping encloses both lines in a single word.
Step 506: According to the check in step 505, increase or decrease the spacing HC and re-string the words.

This completes the explanation of the flow in Figure 5.

Step 304: Check the rotation angle. With HC as the representative character height parameter, the words in the document are divided into five classes by their width w and height h:
(a) h ≈ HC, (w/h) > 3
(b) w ≈ HC, (h/w) > 3
(c) w, h ≈ HC
(d) w, h ≪ HC
(e) w, h ≫ HC
The figure also gives the threshold: current height > 1.2 × HC.

Words of class (a) or (b) are selected. For each such word, the centroids of the elements (components) it contains are fitted to the equation Y = aX + b by minimum square error (MSE, see Fig. 9) to obtain the slope of the row of components in that word, and the angles of all such words are collected statistically to find the tilt θ of the image.
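The line fit in step 304 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the text only says the word angles are combined "statistically", so the median used here is an assumption, and the sketch fits Y against X only, which suits horizontally laid out words of class (a). The function names are hypothetical.

```python
import math
from statistics import median


def word_skew_degrees(centers):
    """Least-squares slope of y = a*x + b through component centers, as degrees."""
    n = len(centers)
    if n < 2:
        return None  # a single component gives no direction
    mean_x = sum(x for x, _ in centers) / n
    mean_y = sum(y for _, y in centers) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in centers)
    if sxx == 0:
        return None  # all centers share one x: slope undefined for this fit
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in centers)
    return math.degrees(math.atan(sxy / sxx))


def page_skew_degrees(words):
    """Combine per-word angles; the median is an assumed choice of statistic."""
    angles = [a for a in (word_skew_degrees(w) for w in words) if a is not None]
    return median(angles) if angles else 0.0


def rotate_point(x, y, degrees):
    """Rotate (x, y) about the origin by -degrees to undo the measured tilt."""
    t = math.radians(-degrees)
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))
```

Rotating the corner coordinates of each element with rotate_point, rather than the pixels, matches the description's point that only coordinates need to be rotated before the words are re-strung.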

Step 305: After the coordinates have been rotated upright in this way, each rectangle recovers its true size, so that an element's bounding rectangle is not enlarged by the tilt.

The detailed flowchart of step 306 is shown in Figure 6. It uses an adaptive profile to string words into windows. Because the elements reaching step 306 have already been rotated, this step can cut windows using words and the projection (profile) method, tilt normally being the hardest problem for projection.

Step 601: Sort all words by Y coordinate into a linked list.
Step 602: Take 1/2 of the line spacing as the clock (scan segment) and scan the whole image clock by clock, which speeds up the projection (profile).
Step 603: When a word meets the current clock for the first time, project it onto the X axis.
Step 604: Check whether there is a blank line that the clock has passed twice and that is larger than the projection of the word on the X axis. If so, a line or a window may have formed; if not, jump to step 607 and continue.
Step 605: Decide whether the merge conditions hold. There are two conditions:
(1) the blank below the word is greater than block_blank;
(2) the blanks to the left and right of the word are smaller than block_blank;
where block_blank = line spacing + a statistically determined offset.
Step 606: Decide whether an independent window has formed. There are three conditions:
(1) a condition that is illegible in the source scan (it involves 2 × block_blank and the start of a new word);
(2) a window made up of a single word, apparently requiring a side blank wider than 2 × block_blank and a blank below wider than block_blank;
(3) once the clock pass has finished, neighbouring lines or windows that overlap on the Y axis and whose left and right blanks are smaller than block_blank are collected and merged, and can form a window of their own.
Step 607: clock + 1.
Step 608: If the whole image has been scanned, exit; otherwise return to step 603.

This completes the detailed flow of step 306.

Step 307 uses the characteristics of the words to decide whether a window is a text region and whether it is typeset horizontally or vertically. The words were already divided into five classes in step 304, and the following rules apply:

(1) (a) + (c) + (d) account for T2 or more: horizontally typeset text region
(2) (b) + (c) + (d) account for T2 or more: vertically typeset text region
(3) otherwise: non-text region
In conditions (1) and (2), ((a) + (b) + (c) + (d)) / total > T3 must also hold.
T1: the reasonable percentage of word height relative to the representative character height.
T2: the reasonable proportion taken up by words of the same nature.
T3: the reasonable percentage of words in a text region.

The detailed flow of step 308 is shown in Figure 7.
Step 701: Because the image has been rotated there is no tilt problem, and the window is already known to be typeset horizontally or vertically, so lines can be cut by projecting the elements (components), as in Figure 8.
Step 702: Shadows are then taken to determine the character boundaries. These steps complete the automatic separation of text boxes and the segmentation of characters.

In summary, the present invention has the following advantages:
1. Horizontal and vertical typesetting may coexist.
2. Because the element is the smallest unit, errors are reduced compared with other methods.
3. Because the parameters are obtained from statistics, the range of application is more flexible.
4. Graphics close to text, or a headline close to the body text, can still be separated.
5. A large tilt angle is acceptable, and block segmentation is little affected by it.
6. Line cutting and character cutting are also more tolerant of tilt.
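The word classes of step 304 and the window rules of step 307 can be sketched in Python as below. Several points are assumptions rather than facts from the text: the tolerance used for "≈ HC" (20%, chosen to agree with the 1.2 × HC figure), the numeric values of T2 and T3, the reading of the proportions as fractions of all words in the window, and the reading of the vertical rule as (b) + (c) + (d). The function names are hypothetical.

```python
from collections import Counter


def classify_word(w, h, hc, tol=0.2):
    """Return the class 'a'..'e' of a word box, or 'other' if none applies.

    tol is an assumed tolerance for "approximately equal to HC".
    """
    def near(v):
        return abs(v - hc) <= tol * hc

    if near(h) and w / h > 3:
        return "a"  # one character high, long: horizontal run of text
    if near(w) and h / w > 3:
        return "b"  # one character wide, tall: vertical run of text
    if near(w) and near(h):
        return "c"  # about one character in both directions
    if w < hc * (1 - tol) and h < hc * (1 - tol):
        return "d"  # much smaller than a character
    if w > hc * (1 + tol) and h > hc * (1 + tol):
        return "e"  # much larger than a character
    return "other"


def classify_window(word_sizes, hc, t2=0.7, t3=0.8):
    """Apply the step 307 rules; t2 and t3 are placeholder thresholds."""
    total = len(word_sizes)
    if total == 0:
        return "non-text"
    counts = Counter(classify_word(w, h, hc) for w, h in word_sizes)
    text_like = counts["a"] + counts["b"] + counts["c"] + counts["d"]
    if text_like / total <= t3:
        return "non-text"
    if (counts["a"] + counts["c"] + counts["d"]) / total >= t2:
        return "horizontal text"
    if (counts["b"] + counts["c"] + counts["d"]) / total >= t2:
        return "vertical text"
    return "non-text"
```

When a window satisfies both rules, for instance when it holds only class (c) words, the sketch reports horizontal text simply because that rule is tested first; the description does not say how such ties are resolved.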

Claims (1)

308677 A8 B8 C8 D8 Scope of the patent application

1. A method of automatically separating text boxes and segmenting characters in a Chinese/English recognition system, comprising the following steps:
Step 1: input binary image data;
Step 2: compute all elements from the binary image data;
Step 3: combine elements in the same line into "words" according to the adjacency of the elements and the character height and size;
Step 4: check the rotation angle using the center points of the elements in each word;
Step 5: when the rotation angle is greater than a predetermined value, rotate the coordinates of each element into a correct rectangle and return to step 3; when the rotation angle is less than or equal to the predetermined value, execute step 6;
Step 6: using adaptive projection, with words as the projected units, separate the page into windows at the large blank areas to the left and right and above and below;
Step 7: divide the shapes of the words in each window into a plurality of shape types and count the horizontal and vertical ones, to distinguish text from graphics and horizontal from vertical typesetting;
Step 8: output the text regions.

2. The method of claim 1, wherein step 2 comprises the following steps:
Step 9: set the connecting rectangles of row i of the image to the cleared state;
Step 10: taking a plurality of bits as a unit, find the positions and the number of the consecutive-bit runs (RUN) in row i;
Step 11: for every run (RUN), check whether a connecting rectangle is connected to it; if so, join them into a new connecting rectangle, and if not, create a new connecting rectangle;
Step 12: set i to i + 1; at the last row execute step 13, otherwise go to step 10;
Step 13: output all the connecting rectangles.

3. The method of claim 1, wherein step 3 comprises the following steps:
Step 14: with element size on the horizontal axis and count on the vertical axis, find the character size with the largest count and take it as the representative character height;
Step 15: remove elements that are too small, too large, or long and narrow;
Step 16: set the character spacing to 1/N of the character height;
Step 17: string elements whose distance is less than the representative character height into words;
Step 18: examine the representative height after stringing; if it is greater than a fixed value T1, the words are unreasonable; if too many single elements fail to be strung into words, increase the character spacing; if too many words have an excessive representative height, decrease the character spacing;
Step 19: according to step 18, increase or decrease the character spacing and re-string the words.

4. The method of claim 1, wherein step 6 comprises the following steps:
Step 20: sort all words by their Y coordinate in the image;
Step 21: take 1/L of the line spacing as the clock and scan the whole image clock by clock (clock scan), to speed up the projection;
Step 22: project the words falling within the clock onto the X axis;
Step 23: check whether there is a blank line that the clock has passed twice and that is larger than the projection of the word on the X axis; if so, a line or a window may have formed and step 24 is executed; if not, step 26 is executed;
Step 24: decide whether the merge conditions hold;
Step 25: decide whether an independent window has formed (the remainder of this step is illegible in the source);
Step 26: clock + 1;
Step 27: if the image has been scanned completely, end; otherwise execute step 22.

5. The method of claim 1, wherein step 7 comprises the following steps:
Step 28: cut lines by projecting the elements within the window;
Step 29: cut characters by taking the shadows of the elements in each line, as the boundary test.

6. The method of claim 1, wherein the predetermined value in step 5 is 10 degrees.

7. The method of claim 3, wherein N in the 1/N character height of step 16 is preset to 3.

8. The method of claim 3, wherein the fixed value T1 is 80%.

9. The method of claim 4, wherein L in step 21 is 2.

10. The method of claim 4, wherein step 24 comprises two conditions:
Condition 1: the blank below the word is greater than the block blank;
Condition 2: the blanks to the left and right of the word are smaller than the block blank;
As in the method of claim 4, the above step 24 contains 2 conditions: Condition 1: the space under the word is greater than the block —Blank (bi〇ck); Condition 2: The left and right spaces of the word are less than the area (Block) - Blank (block); This paper applies China National scale-like quasi (CNS) Α4 size (210X297 mm) 六、申請專利範 距十估計値 308677 其中區段(block)—空白(blank)等於行間 (offset)。 1L如申鱗利範15第卿之方法,其中上述之步卿所述之 判斷是否可獨立為區塊的條件包含: 條件3 ·合併字串(merge w〇rds)下方空白超過區段卬1〇也) 空白(blank)且下方有新的字,則定為區塊 (window); 條件4 :單獨之字串組成之區塊,左、右之空白要超過區段 —空白,且下方空白超過2*區段一空白; 條件5 :段落(clock)執行完,收集鄰近之字或區塊,Y軸有 重疊,且左、右空白小於區段一空白是合併字可自 成區塊。 12.如申請專利範圍第1項之方法,其中之步驟7所述之多數種 形狀可以是5種形狀。 I n n I I .^1 I n n. - I I ...... 丁 —-Μ. .-.O 气 I --= (請先閱讀背面之注意事項再填寫本頁) 經濟部中央標準局Μ工消費合作社印t 本紙張尺度適用中國國家標準(CNS ) A4規格(210X297公釐)Sixth, the patent application range is estimated to be 308,677 where block-blank is equal to offset. 1L is the method of Shen Qingli Fan 15, No. 1 Qing, in which the conditions described in the above step Qing for judging whether it can be independently a block include: Condition 3 • The space below the merge string (merge w〇rds) exceeds the section 卬 1〇 (Blank) and there is a new word below, it is defined as a window (condition): Condition 4: a block consisting of a separate string, the left and right blanks must exceed the section-blank, and the lower blank exceeds 2 * Section 1 blank; Condition 5: After the execution of the paragraph (clock), collect adjacent words or blocks, the Y axis overlaps, and the left and right blanks are smaller than the section 1 blanks. The merged words can form blocks by themselves. 12. The method as claimed in item 1 of the patent application, wherein most of the shapes mentioned in step 7 can be 5 shapes. I nn II. ^ 1 I n n.-II ...... 丁 —-Μ. .-. O 气 I-= (Please read the notes on the back before filling this page) Central Bureau of Standards, Ministry of Economic Affairs Printed by Μ 工 consumer cooperatives. 
This paper scale is applicable to the Chinese National Standard (CNS) A4 specification (210X297mm)
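The row-by-row extraction of connected rectangles in claim 2 (steps 9 to 13) can be illustrated with a short Python sketch. It is not the patented implementation: the function names, the inclusive `(left, top, right, bottom)` rectangle format, the use of 8-connectivity, and the union-find bookkeeping for runs that touch more than one existing rectangle are all assumptions made for illustration. The sketch also scans one pixel at a time, whereas step 10 processes several bits as a unit.

```python
from typing import List, Optional, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def find_runs(row: List[int]) -> List[Tuple[int, int]]:
    """Step 10 (simplified): inclusive (start, end) columns of each run of 1-pixels."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def connected_rectangles(image: List[List[int]]) -> List[Rect]:
    """Bounding rectangles of connected regions, built one image row at a time."""
    parent: List[int] = []        # union-find forest over rectangle ids
    rects: List[List[int]] = []   # rects[id] = [left, top, right, bottom]

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> int:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
            r, s = rects[ra], rects[rb]
            rects[ra] = [min(r[0], s[0]), min(r[1], s[1]),
                         max(r[2], s[2]), max(r[3], s[3])]
        return ra

    prev: List[Tuple[int, int, int]] = []  # runs of the previous row with their ids
    for y, row in enumerate(image):
        cur = []
        for start, end in find_runs(row):
            rid: Optional[int] = None
            # Step 11: look for rectangles touching this run (8-connectivity assumed).
            for p_start, p_end, p_id in prev:
                if p_start <= end + 1 and p_end >= start - 1:
                    rid = find(p_id) if rid is None else union(rid, p_id)
            if rid is None:                      # nothing touches: new rectangle
                rid = len(parent)
                parent.append(rid)
                rects.append([start, y, end, y])
            else:                                # grow the existing rectangle
                r = rects[rid]
                rects[rid] = [min(r[0], start), r[1], max(r[2], end), y]
            cur.append((start, end, rid))
        prev = cur                               # step 12: advance to the next row
    # Step 13: output one rectangle per surviving root.
    return sorted(tuple(rects[i]) for i in range(len(parent)) if find(i) == i)


example = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]
print(connected_rectangles(example))
```

In the example, the run in the last row touches two separate rectangles from the row above, so the two are merged into one; this is the case that makes some form of merging bookkeeping necessary on top of the plain "join or create" rule in step 11.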
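Steps 14 to 17 of claim 3 (representative character height and linking of elements into strings) can be sketched in the same way. The sketch below is a rough illustration only: the filter bounds in step 15, the tie-breaking rule for the histogram mode, the requirement that linked elements overlap vertically, and the use of the 1/N spacing from step 16 as the linking threshold are assumptions, since the claims do not give these details. The default `n=3` follows claim 7. The feedback loop of steps 18 and 19 is omitted.

```python
from collections import Counter
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def representative_height(rects: List[Rect]) -> int:
    """Step 14: the element height that occurs most often (histogram mode)."""
    heights = Counter(bottom - top + 1 for _, top, _, bottom in rects)
    # Ties go to the larger height; this tie rule is an assumption.
    return max(heights, key=lambda h: (heights[h], h))


def link_strings(rects: List[Rect], n: int = 3) -> List[List[Rect]]:
    """Steps 15 to 17: filter elements, then chain horizontal neighbours into strings."""
    height = representative_height(rects)

    def h(r: Rect) -> int:
        return r[3] - r[1] + 1

    def w(r: Rect) -> int:
        return r[2] - r[0] + 1

    # Step 15: drop elements that are too small, too large, or long and narrow.
    # The factors 1/4, 2 and 4 are illustrative assumptions.
    kept = [r for r in rects
            if height / 4 <= h(r) <= height * 2 and w(r) <= 4 * h(r)]

    gap = height / n  # step 16: character spacing is 1/N of the character height

    strings: List[List[Rect]] = []
    for rect in sorted(kept, key=lambda r: (r[0], r[1])):
        for string in strings:
            last = string[-1]
            horizontal_gap = rect[0] - last[2] - 1
            vertical_overlap = min(last[3], rect[3]) - max(last[1], rect[1])
            # Step 17 (assumed form): link if close enough and on the same line.
            if horizontal_gap <= gap and vertical_overlap >= 0:
                string.append(rect)
                break
        else:
            strings.append([rect])
    return strings


elements = [
    (0, 0, 9, 9), (12, 0, 21, 9), (24, 0, 33, 9),   # three characters on one line
    (60, 0, 69, 9),                                  # far to the right
    (0, 20, 9, 29), (12, 20, 21, 29),                # second line
    (40, 0, 41, 1),                                  # small speck, filtered out
]
for string in link_strings(elements):
    print(string)
```

A full implementation along the lines of steps 18 and 19 would inspect the resulting strings and re-run the linking with a larger or smaller spacing when too many elements stay isolated or too many strings come out too tall.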
TW82108380A 1993-10-06 1993-10-06 Method of automatic separating text box and segmenting word in Chinese/English recognition system TW308677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW82108380A TW308677B (en) 1993-10-06 1993-10-06 Method of automatic separating text box and segmenting word in Chinese/English recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW82108380A TW308677B (en) 1993-10-06 1993-10-06 Method of automatic separating text box and segmenting word in Chinese/English recognition system

Publications (1)

Publication Number Publication Date
TW308677B (en) 1997-06-21

Family

ID=51566222

Family Applications (1)

Application Number Priority Date Filing Date Title
TW82108380A TW308677B (en) 1993-10-06 1993-10-06 Method of automatic separating text box and segmenting word in Chinese/English recognition system

Country Status (1)

Country Link
TW (1) TW308677B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7269298B2 (en) 2000-09-14 2007-09-11 Sharp Kabushiki Kaisha Image processing device, image processing method, and record medium on which the same is recorded
TWI396099B (en) * 2010-02-11 2013-05-11 Jieh Hsiang The typographical method of video text file

Similar Documents

Publication Publication Date Title
US20210256253A1 (en) Method and apparatus of image-to-document conversion based on ocr, device, and readable storage medium
AU2019202677B2 (en) System and method for automated conversion of interactive sites and applications to support mobile and other display environments
WO2020140698A1 (en) Table data acquisition method and apparatus, and server
US5465304A (en) Segmentation of text, picture and lines of a document image
US6173073B1 (en) System for analyzing table images
US8855413B2 (en) Image reflow at word boundaries
JP3950498B2 (en) Image processing method and apparatus
KR101985612B1 (en) Method for manufacturing digital articles of paper-articles
US8995768B2 (en) Methods and devices for processing scanned book's data
JPH0798765A (en) Direction-detecting method and image analyzer
JPH0652354A (en) Skew correcting method, skew angle detecting method, document segmentation system and skew angle detector
WO2007022460A2 (en) Post-ocr image segmentation into spatially separated text zones
US20110222776A1 (en) Form template definition method and form template definition apparatus
JPH01253077A (en) Detection of string
US7929772B2 (en) Method for generating typographical line
JP5412903B2 (en) Document image processing apparatus, document image processing method, and document image processing program
US5923782A (en) System for detecting and identifying substantially linear horizontal and vertical lines of engineering drawings
JP5950700B2 (en) Image processing apparatus, image processing method, and program
US8611666B2 (en) Document image processing apparatus, document image processing method, and computer-readable recording medium having recorded document image processing program
WO2015021737A1 (en) Method for converting paper file into electronic file
TW308677B (en) Method of automatic separating text box and segmenting word in Chinese/English recognition system
US20020085755A1 (en) Method for region analysis of document image
JP7215176B2 (en) Display comparison program, apparatus and method
JPH04352295A (en) System and device for identifing character string direction
US20240193217A1 (en) Information processing apparatus, method of controlling information processing apparatus, and storage medium

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees