TW308677B - Method of automatic separating text box and segmenting word in Chinese/English recognition system
- Publication number
- TW308677B
- Authority
- TW
- Taiwan
- Prior art keywords
- word
- string
- blank
- block
- item
- Prior art date
Landscapes
- Character Input (AREA)
- Processing Or Creating Images (AREA)
Abstract
Description
308677 A7 B7 V. Description of the Invention
This paper size conforms to the Chinese National Standard (CNS) A4 specification (210 x 297 mm). Printed by the Employees' Consumer Cooperative of the Central Bureau of Standards, Ministry of Economic Affairs.

5-1 Background of the Invention

Existing methods for extracting text from an image fall roughly into the following categories:

(1) Projection (profile) methods
(i) Traditional approach: the image is projected onto the X and Y axes, and the positions where the profile is blank or small are used to separate line from line and character from character. Its weakness is that it cannot cope with skewed images or with images that mix text and graphics.
(ii) Recursive X-Y: the image is again projected onto the X and Y axes, but for a problematic part, where no blank position can be found, the range is narrowed and only that part is projected again to find the line coordinates. This prior art still cannot handle strongly skewed images, although it resolves some cases of mixed text and graphics.
(2) Dilation/erosion method: the image is dilated and eroded to obtain connected regions, which are then used to separate lines and to identify graphics. Its drawbacks are low speed and a large memory requirement.
(3) Segmentation methods, of which there are two:
(i) Small-region cutting: the image is divided into fixed regions along the X and Y axes, and the projection of each small block is computed to decide whether it contains text. Adjacent regions are tested for overlap to check whether they belong to the same line. Its drawbacks are that the number of cuts is not uniform, so skewed documents and complex mixtures of text and graphics are hard to handle, as are multi-column documents.
(ii) Two-dimensional small-region cutting: the image is cut into small two-dimensional cells, and a small blank segment between an upper and a lower row separates the upper and lower units. This produces many connected units spread across the image; their projection relationship on the Y axis is then used to resolve the lines. Units that are too large are treated as graphics or noise. Its drawbacks are that images mixing text and graphics remain hard to handle, and documents in which horizontal and vertical typesetting coexist cannot be processed.
(4) Background analysis: features of the blank background are used to extract chains of large and small blank regions, from which text areas and graphic areas are identified. This approach has not yet produced notable results.
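To make the traditional projection approach in (1)(i) concrete, the following minimal Python sketch computes the X and Y profiles of a small binary image and reads off the non-blank bands. The function names `projection_profile` and `text_bands`, the `threshold` parameter, and the tiny `page` array are illustrative assumptions and do not come from the patent.

```python
from typing import List, Tuple


def projection_profile(image: List[List[int]], axis: str) -> List[int]:
    """Count foreground (1) pixels per row (axis='y') or per column (axis='x')."""
    if axis == "y":
        return [sum(row) for row in image]
    if axis == "x":
        return [sum(col) for col in zip(*image)]
    raise ValueError("axis must be 'x' or 'y'")


def text_bands(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, where the profile exceeds threshold."""
    bands = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a non-blank band begins
        elif value <= threshold and start is not None:
            bands.append((start, i))       # the band ends at a blank position
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


# Hypothetical page: two text lines separated by one blank row.
page = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
rows = text_bands(projection_profile(page, "y"))
cols = text_bands(projection_profile(page, "x"))
```

On a skewed page the blank rows between text lines can fill in, so the row profile no longer shows a gap; this is the weakness the text attributes to the method.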
5-2 Purpose and Summary of the Invention

The prior art above cannot effectively handle printed Chinese documents that mix text and graphics and are skewed, whether inherently or because of optical scanning, which limits its practical use.

The method of the present invention processes an optically scanned printed Chinese/English document in the following steps:
Step 1: Using a fast table look-up (see the earlier patents, Republic of China Invention Patent No. 54855 and US Patent No. 5357582), compute all connected components in the image.
Step 2: Use statistics to estimate the approximate character size of the document.
Step 3: Using the data from Step 2, link adjacent components into words; a word is the basic unit from which lines are assembled.
Step 4: From the center points of the components of a word, compute the skew angle of that word; the statistics of the skew angles of all words give the skew angle of the whole document. The coordinates of every component are then rotated upright.
Step 5: From the spacing between words, derive the line spacing of the document statistically.
Step 6: Using the line spacing as the unit of division, apply adaptive X and Y projection (profile) to obtain windows.
Step 7: Analyze the shape tendencies of the words in each window to classify it as a text block, a title area or a graphic area, and to determine whether it is typeset horizontally or vertically.

This method handles not only document skew but also complex mixtures of text and graphics; Figure 2 shows a processing result.

The method of the invention handles printed Chinese documents with mixed text and graphics and achieves the following: (1) horizontal and vertical typesetting may coexist and are distinguished automatically; (2) text and graphics may coexist and are separated automatically; (3) skewed documents are handled; (4) text is extracted.
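Steps 2 and 3 can be illustrated with a short Python sketch. It assumes each component is already available as a bounding box, takes the most frequent box height as the character size, and chains boxes along a horizontal line when the gap between them is small. The names `representative_height` and `link_words`, the tie-breaking rule, the horizontal-only linking and the sample boxes are assumptions made for illustration; the one-third starting gap follows the initial value the description later gives in step 503.

```python
from collections import Counter
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive pixel coordinates


def representative_height(boxes: List[Box]) -> int:
    """Most frequent component height, i.e. the peak of the height histogram."""
    heights = Counter(bottom - top + 1 for _, top, _, bottom in boxes)
    # Assumption: ties between equally frequent heights go to the larger height.
    return max(heights.items(), key=lambda kv: (kv[1], kv[0]))[0]


def link_words(boxes: List[Box], max_gap: int) -> List[List[Box]]:
    """Chain boxes left to right when they overlap vertically and the
    horizontal gap to the previous box of a word is below max_gap."""
    words: List[List[Box]] = []
    for box in sorted(boxes):
        left, top, right, bottom = box
        for word in words:
            _, prev_top, prev_right, prev_bottom = word[-1]
            overlaps = top <= prev_bottom and prev_top <= bottom
            if overlaps and left - prev_right - 1 < max_gap:
                word.append(box)
                break
        else:
            words.append([box])            # no existing word accepts this box
    return words


# Hypothetical components: three close boxes, one distant box, one box on a lower line.
boxes = [(0, 0, 8, 9), (11, 0, 19, 9), (22, 1, 30, 9), (60, 0, 68, 9), (0, 30, 8, 39)]
height = representative_height(boxes)
words = link_words(boxes, max_gap=max(1, height // 3))
```

Vertical text would need the same test with the axes exchanged, and a real implementation would re-check the result and adjust the gap, as the detailed description does.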
5-3 Brief Description of the Drawings

Figure 1 shows a row of skewed printed characters and the rotation of the component coordinates.
Figure 2 shows the block segmentation result for a document in which horizontal and vertical typesetting coexist.
Figure 3 is the system flowchart of the invention.
Figure 4 is the detailed flowchart of step 302 in Figure 3.
Figure 5 is the detailed flowchart of step 303 in Figure 3.
Figure 6 is the detailed flowchart of step 306 in Figure 3.
Figure 7 is the detailed flowchart of step 308 in Figure 3.
Figure 8 shows that, even for a skewed document, the projection (profile) method overcomes the skew when the rotated coordinates are used.
Figure 9 shows how the word skew angle is computed in step 304 of Figure 3.

5-5 Detailed Description of the Invention

Before the invention is explained, three terms used in the description are defined:
(1) Component: the smallest rectangle that contains a connected piece of the image. For example, the character "言" may consist of four components.
(2) Word: a string formed by joining components whose distance is smaller than the line spacing computed in the invention; the idea resembles the way letters form a word in English. In a Chinese document a word corresponds to a clause between punctuation marks.
(3) Window: a region whose contents share the same attribute, such as a text area or a graphic area.
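To make definition (1) concrete, the sketch below extracts the bounding rectangle of every connected piece of a small binary image from its row-wise runs. It is not the patent's procedure: step 302 of the description uses two sorted linked lists and a look-up table, whereas this sketch swaps in a union-find structure for brevity. The use of 8-connectivity, the names `row_runs` and `component_boxes`, and the sample image are all assumptions.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]        # (row, left, right), right inclusive
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), inclusive


def row_runs(row: List[int], y: int) -> List[Run]:
    """Maximal stretches of foreground (1) pixels in one image row."""
    runs, start = [], None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            runs.append((y, start, x - 1))
            start = None
    if start is not None:
        runs.append((y, start, len(row) - 1))
    return runs


def component_boxes(image: List[List[int]]) -> List[Box]:
    """Bounding rectangle of every 8-connected component, reading each row once."""
    parent: List[int] = []          # union-find parent per run id
    boxes: List[List[int]] = []     # bounding box per run id, valid at root ids

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        parent[rb] = ra
        p, q = boxes[ra], boxes[rb]
        boxes[ra] = [min(p[0], q[0]), min(p[1], q[1]), max(p[2], q[2]), max(p[3], q[3])]

    previous: List[Tuple[int, Run]] = []
    for y, row in enumerate(image):
        current: List[Tuple[int, Run]] = []
        for run in row_runs(row, y):
            _, left, right = run
            run_id = len(parent)
            parent.append(run_id)
            boxes.append([left, y, right, y])
            for prev_id, (_, prev_left, prev_right) in previous:
                # Assumption: runs touch if their column ranges overlap or meet diagonally.
                if left <= prev_right + 1 and prev_left <= right + 1:
                    union(prev_id, run_id)
            current.append((run_id, run))
        previous = current
    roots = {find(i) for i in range(len(parent))}
    return sorted((boxes[r][0], boxes[r][1], boxes[r][2], boxes[r][3]) for r in roots)


# Hypothetical image containing three separate connected pieces.
image = [
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
components = component_boxes(image)
```

Each returned tuple is one component in the sense of definition (1); a character such as "言" would yield several such rectangles.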
As shown in Figure 2, the word "羊" consists of three components (101, 102 and 103), and 201, 202, 203, 204 and 205 in Figure 2 are all regarded as windows in the present invention.

The method of the invention builds upward from the components on the scanned document and applies statistics and numerical analysis. Its flow is shown in Figure 3.

Step 301: The system starts. Its input is binary image data.
Step 302: Build the component list from the binary image data. A table look-up is combined with two linked-list data structures to compute the coordinates of every component (that is, the coordinates of the smallest rectangle containing it): the left and right boundaries of all runs in one row of image data are found and checked for connection with the components of the previous row, so that a component is either enlarged or a new component is created. See Figure 4.
Step 303: Use the adjacency of the components and their character heights to join components on the same line into words. See Figure 5.
Step 304: Check the rotation angle. The center points of the components in each word are fitted to the line Y = aX + b by MSE (minimum squared error) to obtain the likely rotation angle. If the angle θ is too large (> 10°) or causes lines to overlap, return to step 303. The line-overlap criterion is only partly legible in the source; it has the form tan⁻¹(… / max(image width, image length)).
Step 305: Rotate each component to its correct rectangle and link the words again (repeat step 303); then compute the line spacing statistically from these words.
Step 306: Using adaptive projection, project the words and use the large blank areas to the left, right, above and below to divide the page into windows. See Figure 6.
Step 307: Classify the shapes of the words in each window into six types and count the horizontal and vertical ones, to decide whether the window is a text block and whether it is typeset horizontally or vertically.
Step 308: Cut the text block into lines and characters.
Step 309: End.

The steps of Figure 3 are described in more detail below. The detailed procedure of step 302 is shown in the flowchart of Figure 4. To compute all components in the image, two terms are first defined:
1. Run: within a single row of the image, a unit formed by consecutive meaningful bits. Its data structure records the left and right boundary positions.
2. Connecting rectangle: the smallest enclosing rectangle formed by joining runs that are connected vertically. Its data structure records the top, bottom, left and right positions, together with the (left, right) positions of the runs in its last row, so that connection with the next row can be checked.

Step 401: Initialize: i indexes the i-th row of the image, and the connecting rectangles are set to the cleared state (NULL).
Step 402: Using the boundary-detection method, which handles 8 bits as one unit, quickly find the positions and the number of runs in row i.
Step 403: The run data structures are linked into one linked list (in ascending order), and the connecting rectangles that are still open are linked into another linked list, sorted by the position of their last row. Whether a run connects to a connecting rectangle can therefore be checked quickly by walking the two lists in a manner similar to insertion sort. If a run is connected, it is merged into the existing connecting rectangle; otherwise a new connecting rectangle is created and inserted into the rectangle list so that the list stays sorted, which speeds up the next pass.
Step 404: Continue with the next row.
Step 405: When all image components have been computed, output all connecting rectangles. Because two lists and the fast rectangle-boundary test are used, the image needs to be traced only once.

Figure 5 is the more detailed flowchart of step 303.
Step 501: Statistically, with component size on the horizontal axis and count on the vertical axis, find the character size with the largest count and take it as the representative character height.
Step 502: Remove components that are too small, too large, or long and narrow.
Step 503: Initialize the character spacing (HC) to 1/3 of the character height; this is an estimate.
Step 504: Link components whose distance is smaller than the character spacing (HC) into words (represented by a linked list).
Step 505: Check whether the result is reasonable, that is, whether the representative height after linking is close to the representative height computed originally. If the threshold T1 is exceeded (for example, more than 80% is unreasonable), the character spacing (HC) is at fault. If too many single components failed to link into words, HC is increased; if too many words have a representative height that is too large, HC is decreased. A reasonable word contains only one line of text, not two or more. For example:
Reasonable words: "工研院電通所" and "新竹縣竹東鎮", each line forming its own word.
Unreasonable word: "工研院電通所" and "新竹縣竹東鎮" joined across two lines into a single word.
Step 506: According to the result of step 505, increase or decrease the character spacing (HC) and link the words again.

This completes the description of the flow in Figure 5.

Step 304 checks the rotation angle. With HC as the representative character-height parameter, the words in the document are divided into five classes by their width w and height h:
(a) h ≈ HC, w/h > 3
(b) w ≈ HC, h/w > 3
(c) w, h ≈ HC
(d) w, h ≪ HC
(e) w, h ≫ HC
The source also gives the condition (current height) > 1.2 × HC at this point.

Words of class (a) or (b) are selected. For each word, the centroids of its components are fitted to the equation Y = aX + b by MSE (minimum squared error, see Figure 9) to obtain the slope at which the components are arranged within the word; these angles are then collected statistically to find the skew angle θ of the image.
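The classification and the line fit of step 304 can be sketched as follows. The sketch reads "≈ HC" as "within a tolerance band around HC" and "≪" or "≫" as "outside that band", and it combines the per-word angles with a median; the patent only says the angles are treated statistically, so the tolerance `tol`, the median, and the names `classify_word`, `fit_angle` and `page_skew` are assumptions. Only the horizontal fit Y = aX + b is shown; words of class (b) would need the axes exchanged.

```python
import math
from statistics import median
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def classify_word(w: float, h: float, hc: float, tol: float = 0.3) -> str:
    """Return 'a'..'e' for the five classes, or 'other' if none applies."""
    def near(v: float) -> bool:
        return abs(v - hc) <= tol * hc

    if near(h) and w / h > 3:
        return "a"                                    # long horizontal word
    if near(w) and h / w > 3:
        return "b"                                    # long vertical word
    if near(w) and near(h):
        return "c"                                    # about one character
    if w < (1 - tol) * hc and h < (1 - tol) * hc:
        return "d"                                    # much smaller than HC
    if w > (1 + tol) * hc and h > (1 + tol) * hc:
        return "e"                                    # much larger than HC
    return "other"


def fit_angle(points: List[Point]) -> Optional[float]:
    """Least-squares fit of y = a*x + b; returns atan(a) in degrees.

    Returns None when fewer than two points exist or x has no spread.
    """
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan(sxy / sxx))


def page_skew(words: List[List[Point]]) -> Optional[float]:
    """Median of the per-word angles (the median is an assumed statistic)."""
    angles = [a for a in (fit_angle(w) for w in words) if a is not None]
    return median(angles) if angles else None


# Hypothetical centroids: two words sloping at 0.1, one outlier, one single-component word.
sample_words = [
    [(0, 0), (10, 1), (20, 2)],
    [(0, 5), (10, 6), (20, 7)],
    [(0, 0), (10, 5)],
    [(3, 1)],
]
skew = page_skew(sample_words)
```

With image coordinates that grow downward, a positive angle here means the text line descends to the right; the sign convention would have to match the rotation applied in step 305.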
Step 305: Once the coordinates have been rotated upright in this way, each coordinate rectangle regains its true size, so that the rectangle of a component is not enlarged by the skew.

The detailed flowchart of step 306 is shown in Figure 6. The step links windows by adaptive projection (adaptive profile). Because the components have already been rotated by the time step 306 is reached, this step can cut windows using words and the projection (profile) method; skew is normally the hardest problem for projection.
Step 601: Sort all words by coordinate into a linked list.
Step 602: Take 1/2 of the line spacing as the clock (scan band) and scan the whole image clock by clock, which speeds up the projection.
Step 603: When a word meets the current clock for the first time, project it onto the X axis.
Step 604: Check whether there is a blank line, that is, a blank that the clock has passed twice and that is larger than the projection of a word on the X axis. If so, a line or a window may have been formed; if not, skip to step 607.
Step 605: Decide whether the merge conditions hold. There are two conditions, which compare the left blank and the right blank with the block blank, where block blank = line spacing + offset (a statistical value). The exact inequalities are illegible in the source.
Step 606: Decide whether an independent window has been formed. There are three conditions. Conditions (1) and (2) are largely illegible in the source; the surviving fragments refer to a new word, to a window made up of a single word, and to blanks exceeding 2 × the block blank. Condition (3): when the clock pass is complete, neighboring lines or windows that overlap on the Y axis and whose left and right blanks are smaller than the block blank are merged, and the result can form a window of its own.
Step 607: clock + 1 (advance to the next clock).
Step 608: If the whole image has been scanned, exit; otherwise return to step 603.

This completes the detailed flow of step 306.

Step 307 uses the characteristics of the words to decide whether a window is a text area and whether it is typeset horizontally or vertically. Step 304 has already divided the words into five classes:
(1) (a) + (c) + (d) account for at least T2: text area, horizontal typesetting.
(2) (b) + (c) + (d) account for at least T2: text area, vertical typesetting.
(3) Otherwise: non-text area.
In conditions (1) and (2), ((a) + (b) + (c) + (d)) / total > T3 must also hold.
T1: the reasonable percentage of a word's height relative to the representative character height.
T2: the reasonable proportion occupied by words of the same nature.
T3: the reasonable percentage occupied by words in a text area.

The detailed flow of step 308 is shown in Figure 7.
Step 701: Because the image has been rotated there is no skew problem, and the window is already known to be horizontal or vertical; the components can therefore be projected (profile) to cut the lines, as in Figure 8.
Step 702: The shadows are then used to determine the character boundaries. These steps complete the automatic separation of text frames and the segmentation of characters.

In summary, the invention has the following advantages:
1. Horizontal and vertical typesetting can coexist.
2. Because the component is the smallest unit, errors are lower than with other methods.
3. The parameters are obtained statistically, so the range of application is more flexible.
4. Text close to graphics, or a headline close to the body text, can still be separated.
5. A large skew angle is acceptable, and window segmentation is little affected by it.
6. Line cutting and character cutting are likewise more tolerant of skew.
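The window decision rule of step 307, described earlier in this section, can be sketched in a few lines of Python. The sketch assumes that rule (2) counts class (b) words, that a window satisfying both rules goes to the larger share, and that T2 and T3 take the placeholder values 0.6 and 0.8; the patent gives no numeric values, and the name `classify_window` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def classify_window(word_classes: Iterable[str], t2: float = 0.6, t3: float = 0.8) -> str:
    """Label a window from the classes ('a'..'e') of the words it contains.

    t2 and t3 stand in for the thresholds T2 and T3; their values are placeholders.
    """
    counts = Counter(word_classes)
    total = sum(counts.values())
    if total == 0:
        return "non-text"
    # Shared requirement of rules (1) and (2): classes a-d must dominate the window.
    text_like = (counts["a"] + counts["b"] + counts["c"] + counts["d"]) / total
    if text_like <= t3:
        return "non-text"
    horizontal = (counts["a"] + counts["c"] + counts["d"]) / total
    vertical = (counts["b"] + counts["c"] + counts["d"]) / total
    # Assumption: when both shares pass T2, the larger one wins, ties going to horizontal.
    if horizontal >= t2 and horizontal >= vertical:
        return "horizontal text"
    if vertical >= t2:
        return "vertical text"
    return "non-text"


# Hypothetical windows, each given as the list of its word classes.
labels = [
    classify_window(["a", "a", "a", "a", "c", "d"]),
    classify_window(["b", "b", "b", "b", "c", "d"]),
    classify_window(["e", "e", "e", "e", "a"]),
]
```

A window made up only of class (c) and (d) words satisfies both rules equally, so a real implementation would need a further cue, such as the spacing between words, to settle the direction.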
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW82108380A TW308677B (en) | 1993-10-06 | 1993-10-06 | Method of automatic separating text box and segmenting word in Chinese/English recognition system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW82108380A TW308677B (en) | 1993-10-06 | 1993-10-06 | Method of automatic separating text box and segmenting word in Chinese/English recognition system |
Publications (1)
Publication Number | Publication Date |
---|---|
TW308677B (en) | 1997-06-21 |
Family
ID=51566222
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW82108380A TW308677B (en) | 1993-10-06 | 1993-10-06 | Method of automatic separating text box and segmenting word in Chinese/English recognition system |
Country Status (1)
Country | Link |
---|---|
TW (1) | TW308677B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7269298B2 (en) | 2000-09-14 | 2007-09-11 | Sharp Kabushiki Kaisha | Image processing device, image processing method, and record medium on which the same is recorded |
TWI396099B (en) * | 2010-02-11 | 2013-05-11 | Jieh Hsiang | The typographical method of video text file |
- 1993-10-06: TW application TW82108380A, patent TW308677B (en); status: not active, IP right cessation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210256253A1 (en) | Method and apparatus of image-to-document conversion based on ocr, device, and readable storage medium | |
AU2019202677B2 (en) | System and method for automated conversion of interactive sites and applications to support mobile and other display environments | |
WO2020140698A1 (en) | Table data acquisition method and apparatus, and server | |
US5465304A (en) | Segmentation of text, picture and lines of a document image | |
US6173073B1 (en) | System for analyzing table images | |
US8855413B2 (en) | Image reflow at word boundaries | |
JP3950498B2 (en) | Image processing method and apparatus | |
KR101985612B1 (en) | Method for manufacturing digital articles of paper-articles | |
US8995768B2 (en) | Methods and devices for processing scanned book's data | |
JPH0798765A (en) | Direction-detecting method and image analyzer | |
JPH0652354A (en) | Skew correcting method, skew angle detecting method, document segmentation system and skew angle detector | |
WO2007022460A2 (en) | Post-ocr image segmentation into spatially separated text zones | |
US20110222776A1 (en) | Form template definition method and form template definition apparatus | |
JPH01253077A (en) | Detection of string | |
US7929772B2 (en) | Method for generating typographical line | |
JP5412903B2 (en) | Document image processing apparatus, document image processing method, and document image processing program | |
US5923782A (en) | System for detecting and identifying substantially linear horizontal and vertical lines of engineering drawings | |
JP5950700B2 (en) | Image processing apparatus, image processing method, and program | |
US8611666B2 (en) | Document image processing apparatus, document image processing method, and computer-readable recording medium having recorded document image processing program | |
WO2015021737A1 (en) | Method for converting paper file into electronic file | |
TW308677B (en) | Method of automatic separating text box and segmenting word in Chinese/English recognition system | |
US20020085755A1 (en) | Method for region analysis of document image | |
JP7215176B2 (en) | Display comparison program, apparatus and method | |
JPH04352295A (en) | System and device for identifing character string direction | |
US20240193217A1 (en) | Information processing apparatus, method of controlling information processing apparatus, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |