JP2010176364A

JP2010176364A - Document processor

Info

Publication number: JP2010176364A
Application number: JP2009017998A
Authority: JP
Inventors: Takeshi Itami; 剛伊丹
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2009-01-29
Filing date: 2009-01-29
Publication date: 2010-08-12

Abstract

PROBLEM TO BE SOLVED: To provide a document processor that extracts text from a page laid out in multiple columns while taking into account the reading order within each block and the reading order between blocks, since sorting text solely by the coordinates of the text drawing commands in the page is not sufficient for text extraction.

SOLUTION: The document processor obtains the "offset coordinates" and "end coordinates" of the baseline of each text string; groups into blocks the text strings that satisfy a parallel-and-adjacent condition based on that baseline coordinate information; sorts the text within each block by Y coordinate; judges the linkage between blocks and aligns them using various conditional expressions in order to determine the reading order between blocks; and extracts text in the aligned order.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an electronic document processing method and an electronic document processing apparatus, and more particularly to a method and apparatus for extracting text from an electronic document that holds layout information.

Text search in an electronic document is performed by extracting the text contained in the document and determining whether the extracted text contains the text specified for the search. The common search method thus judges whether the search text appears anywhere, even partially, in the extracted characters.

A more restrictive search method is exact-match search, which determines whether the search text is contained in its entirety. There is also full-text search, which checks whether the search text is contained in multiple electronic documents placed in a target file folder or the like and, if a document with matching text is found, returns as the search result the location in the document where that text appears.

Furthermore, there are advanced search methods such as concept search, in which the content to be searched for is entered as a sentence and information with similar content is retrieved. For example, if the text extracted from an electronic document is "He is a good boy. But, she is a bad girl.", a concept search hits not only for "good boy" but also for "nice boy", while "bad boy" does not hit. This is because the search process evaluates whether a query conforms to the concept expressed by the extracted text.

Therefore, for concept search, the extracted text must retain its meaning as prose: it must read as coherent Japanese if the document is in Japanese, or as coherent English if it is in English.

On the other hand, in an electronic document that preserves character layout, the order of the commands that draw text (hereinafter, text drawing commands) and the drawing start positions those commands specify on the page may be independent of each other. For example, the first text drawing command may start near the center of the page, the second near the bottom, and the last near the top.

Such a representation can occur in actual electronic document formats such as PDF (Portable Document Format) and PDL (Page Description Language). For example, one kind of software for creating PDF is of the printer-driver type. When the user issues a print instruction from the word-processing or drawing application that created the original document and selects this PDF creation driver instead of an ordinary printer driver, the software creates a PDF file from the print commands.

In this case, the order in which the application that created the original document passes text drawing print commands to the PDF creation driver depends on that application. That is, if the application is a layout-free electronic document creation application that draws text without regard to the composition of the prose on the page, the text drawing commands may be written to the PDF file produced by the PDF creation driver in an order that badly breaks the prose. In other words, the page coordinates expressed by each text drawing command are correct, but the order of the commands within the PDF file is scattered.

A layout-free electronic document creation application manages text objects by numbering them sequentially in the order in which the operator created them. However, because the operator takes advantage of the layout-free operation when composing the document, text objects are not necessarily placed so that the prose follows their creation order. When a PDF file is created from such an electronic document, a PDF file such as the one shown in FIG. 1 is generated.

FIG. 1 shows an example of a preview (101) of a PDF file created from a layout-free electronic document creation application and the sequence (102) of text drawing commands in that PDF file. The command sequence 102 results from the way the text objects were created in the layout-free application. The text objects were created in the order "Michael", "Confidential", "sushi", "Michael", "Possibly", "appreciates". The operator then rearranged them into the layout shown in preview 101 to compose the intended text. When this electronic document is converted into a PDF file, the text drawing commands are written in the order shown in 102.

If text is extracted from such a PDF file in the order in which the text drawing commands appear, the prose is not preserved. A search engine receiving such a result can manage little more than word search, and accuracy drops in advanced searches such as concept search.

To address this basic problem, Patent Document 1, "Apparatus and method for identifying words described in a page description language file", discloses a prior-art example that sorts text drawing commands by their coordinates when acquiring the text in a page. Instead of extracting text in the order in which the text drawing commands are written in the electronic document, the technique first retrieves all text drawing commands together with their associated resource information, such as coordinates. It then sorts the commands by their offset coordinates (the start positions of text drawing) and extracts text in the sorted order, obtaining an extraction result that follows the placement of the text.

However, a page may be laid out in multiple columns as shown in FIG. 2, and simply sorting by the coordinates of the text drawing commands in the page then yields extracted text whose prose is badly broken.
Japanese Patent Application Laid-Open No. 08-194697

As described above, for a page laid out in columns, simply sorting by the coordinates of the text drawing commands in the page is insufficient for text extraction. Text extraction must therefore take into account both the reading order within each block and the reading order between blocks.

FIG. 3 shows a PDF file laid out in columns. Here, the text that correctly follows text 301 in reading order is text 302. However, if the text drawing objects in the page are sorted by coordinates without regard to the column layout, text 303 is extracted after text 301.
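The failure described for FIG. 3 can be reproduced with a minimal sketch. The `TextRun` type, the `naive_sort` helper, and the sample coordinates are hypothetical illustrations, not taken from the publication; the sketch assumes PDF-style coordinates with the origin at the bottom-left, so a larger y is higher on the page.

```python
from typing import List, NamedTuple


class TextRun(NamedTuple):
    """One text drawing command: its text and baseline start point (hypothetical type)."""
    text: str
    x: float
    y: float  # origin at bottom-left, so larger y is nearer the top of the page


def naive_sort(runs: List[TextRun]) -> List[TextRun]:
    """Coordinate-only sort: top of the page first, then left to right."""
    return sorted(runs, key=lambda r: (-r.y, r.x))


# A two-column page: L1, L2 form the left column and R1, R2 the right column.
runs = [
    TextRun("L1", 50.0, 700.0), TextRun("L2", 50.0, 680.0),
    TextRun("R1", 320.0, 700.0), TextRun("R2", 320.0, 680.0),
]
order = [r.text for r in naive_sort(runs)]
# The coordinate sort interleaves the columns (L1, R1, L2, R2)
# instead of reading the left column through before the right one.
```

The interleaving arises because the sort key sees only coordinates and has no notion of which lines belong to the same column.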

Accordingly, in such a multi-column case, blocks of text must be detected and the transitions between blocks must be detected correctly as the reading order. An object of the present invention is to provide an information processing apparatus, a control method therefor, and a program for that purpose.

When the electronic document from which text is extracted is laid out in columns as described above, the information processing apparatus according to the present invention has a function of sequentially executing the following steps in order to detect the chaining of text.

1) Acquire the "offset coordinates" and "end coordinates" of the baseline of each text string.

2) Based on the baseline coordinate information created in 1), group text strings that satisfy the parallel-and-adjacent condition into blocks. Then sort the text within each block in descending order of the vertical coordinate on the page (that is, from the top of the page).

3) To determine the reading order between blocks, judge the linkage between blocks and align them using various conditional expressions. If the content of the page clearly indicates the direction (block) to move to next, that information takes top priority in deciding the direction. Otherwise, a linkage index is derived from the distance between blocks and from page content relevant to linkage, and the direction is decided from it.

4) Extract text in the aligned order.

According to the present invention, even for an electronic document that originates from an application with a layout-free document creation function and whose page text is laid out in columns, the linkage between column blocks is detected and the text strings are connected, so text can be acquired without impairing the prose of the page as a whole.

Consequently, a search engine that uses the present invention improves its accuracy in advanced searches, such as concept search, that require coherent prose across the whole page.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(Configuration example of the document processing apparatus of this embodiment)
FIG. 4 is a block diagram showing a configuration example of the document processing apparatus of this embodiment. Reference numeral 401 denotes a CPU for arithmetic and control operations. 402 denotes a keyboard for entering data and instructions. 403 denotes a display that displays electronic document images. 404 denotes a hard disk that stores electronic documents. 405 denotes a ROM that stores in advance the program for controlling the apparatus and other necessary information. 406 denotes a RAM used as various work areas. 407 denotes a layout analysis processing unit, which corresponds to analysis means for analyzing the structure of an electronic document and analyzes the placement characteristics of the text strings indicated by the text drawing commands in the document. 408 denotes a text string combination processing unit: when two text strings are parallel and adjacent under conditions defined in advance in this apparatus, it judges that the two text strings are linked, connects them, and groups the connected text strings into a single block. 409 denotes an inter-block alignment processing unit that determines the reading order between blocks.

The layout analysis processing unit 407, the text string combination processing unit 408, and the inter-block alignment processing unit 409 are realized by the CPU 401 executing a computer program stored in the ROM 405. That is, the computer program can cause a computer to function as the layout analysis processing unit 407, the text string combination processing unit 408, and the inter-block alignment processing unit 409. The computer-readable storage medium that stores the computer program is not limited to the ROM 405 and may be, for example, the hard disk 404.
(Operation example of the document processing apparatus of this embodiment)
FIG. 5 shows the steps of the document processing according to the present invention. The processing consists of layout analysis processing S501, corresponding to the layout analysis processing unit 407; text string combination processing S502, corresponding to the text string combination processing unit 408; inter-block alignment processing S503, corresponding to the inter-block alignment processing unit 409; and S504, which extracts the text of the text strings in the aligned order based on the results of S501 to S503.

FIG. 6 shows a flowchart of the layout analysis processing together with supplementary diagrams. In step S601, all text drawing commands in the page held by the electronic document are acquired. Taking a PDF file as an example, a text drawing command is expressed by Tj and is written in the electronic document as shown in 102. In step S602, as shown in 6A, the baseline of the text string represented by each text drawing command is acquired from the font information corresponding to that command. As shown by 701 in FIG. 7, the baseline is an element of a typeface and one of the components of the font information. It is the line that guides the eye along a text string: when a text string is set, its characters are aligned on the straight line of the baseline. In step S603, as shown in 6B, the start coordinates and end coordinates of the baseline are acquired. The start coordinates of the baseline are the position where the baseline of the first character begins, and the end coordinates are the position where the baseline of the last character ends. After step S603, the layout analysis processing ends.
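As a rough illustration of step S603, the sketch below derives baseline start and end coordinates from a start point, the total advance width of the string, and a rotation angle. The publication does not state how the end point is computed; the `Baseline` type, the `baseline_of` helper, and its parameters are assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Baseline:
    """Start and end coordinates of one text string's baseline (hypothetical type)."""
    start: Point
    end: Point


def baseline_of(origin: Point, advance_width: float, angle_deg: float = 0.0) -> Baseline:
    """Assumed model: the baseline runs from the drawing origin along the
    text direction for the total advance width of the string."""
    x0, y0 = origin
    theta = math.radians(angle_deg)
    end = (x0 + advance_width * math.cos(theta),
           y0 + advance_width * math.sin(theta))
    return Baseline(start=(x0, y0), end=end)


# Horizontal text starting at (72, 700) whose glyph advances sum to 120 units.
b = baseline_of((72.0, 700.0), 120.0)
```

In a real PDF the origin and direction would come from the text matrix in effect when the text is shown, and the advance width from the font's glyph metrics; both are outside the scope of this sketch.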

FIG. 8 shows a flowchart of the text string combination processing together with supplementary diagrams. In step S801, as many buffers as there are text drawing commands are prepared. Each buffer holds the text that makes up a text string and the coordinates of the start and end of that text string's baseline. In step S802, to group text strings into blocks, every combination of two text strings chosen from all n1 buffers (n1C2 combinations, as shown in 8A, that is, n1(n1-1)/2 pairs) is checked to see whether the two text strings are parallel and adjacent. As shown in 8B, two text strings are candidates for blocking when their baselines are parallel, the vertical spacing between the baselines is within a specific multiple of the font size, and the baselines share a common extent in the horizontal direction. The tolerance for the vertical baseline spacing may be made selectable through a user interface, as shown in FIG. 13. In step S803, it is determined from the result of step S802 whether blocking is possible; if Yes, the process proceeds to step S804, and if No, the text string combination processing ends.
In step S804, the text strings that satisfy the blocking condition are grouped into blocks. As shown in 8C, the text strings in each block are sorted so that they run from top to bottom in the vertical direction. Each block is defined as a rectangle sized to contain all the text strings that make up the block. Each block also holds information from which the size and position of its rectangle on the page can be determined.
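The pairwise check of step S802 and the grouping of step S804 might be sketched as follows. This is an illustrative reading, not the publication's implementation: the `Line` type, the default tolerance of 1.5 times the font size, the use of union-find to chain pairs into blocks, and the assumption of roughly horizontal text are all choices made for this example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Line:
    """A text string with its baseline endpoints (hypothetical type)."""
    text: str
    start: Point
    end: Point
    font_size: float


def is_parallel(a: Line, b: Line, tol: float = 1e-6) -> bool:
    # Baseline direction vectors are parallel when their cross product is ~0.
    ax, ay = a.end[0] - a.start[0], a.end[1] - a.start[1]
    bx, by = b.end[0] - b.start[0], b.end[1] - b.start[1]
    return abs(ax * by - ay * bx) <= tol


def is_adjacent(a: Line, b: Line, max_gap_factor: float) -> bool:
    # Assumes roughly horizontal baselines: vertical gap within a multiple
    # of the font size, and a shared horizontal extent.
    gap = abs(a.start[1] - b.start[1])
    overlap = min(a.end[0], b.end[0]) - max(a.start[0], b.start[0])
    return gap <= max_gap_factor * max(a.font_size, b.font_size) and overlap > 0


def build_blocks(lines: List[Line], max_gap_factor: float = 1.5) -> List[List[Line]]:
    parent = list(range(len(lines)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Check all n1C2 pairs and chain qualifying pairs into the same block.
    for i, j in combinations(range(len(lines)), 2):
        if is_parallel(lines[i], lines[j]) and is_adjacent(lines[i], lines[j], max_gap_factor):
            parent[find(i)] = find(j)

    groups: Dict[int, List[Line]] = {}
    for i, line in enumerate(lines):
        groups.setdefault(find(i), []).append(line)
    # Within each block, order lines from the top of the page (largest y) down.
    return [sorted(g, key=lambda l: -l.start[1]) for g in groups.values()]


def bounding_rect(block: List[Line]) -> Tuple[float, float, float, float]:
    # Rectangle around the baselines only; glyph heights are ignored here.
    xs = [p[0] for l in block for p in (l.start, l.end)]
    ys = [p[1] for l in block for p in (l.start, l.end)]
    return (min(xs), min(ys), max(xs), max(ys))


lines = [
    Line("left 2", (50.0, 686.0), (250.0, 686.0), 12.0),
    Line("right 1", (320.0, 700.0), (520.0, 700.0), 12.0),
    Line("left 1", (50.0, 700.0), (250.0, 700.0), 12.0),
    Line("right 2", (320.0, 686.0), (520.0, 686.0), 12.0),
]
blocks = build_blocks(lines)
```

With this input the two left-column lines and the two right-column lines end up in separate blocks, each ordered top to bottom regardless of input order. Union-find lets lines that are not directly adjacent join the same block through an intermediate line, which is one plausible way to realize the "connected" text strings described above.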

FIG. 9 shows a flowchart of the alignment processing between the blocks created in step S502, together with supplementary diagrams. In step S901, the upper-left block, which is the block closest to the start position of the page, is selected from the blocks in the page, and alignment between blocks begins. Step S902 performs the first stage of calculating the linkage index used to decide which block to move to next from the block selected in step S901; in this step, the distance between blocks is obtained. As shown in 9A, the linkage index for deciding the next block after the block started from the upper left is derived from two things, the branch point and the start points of the remaining blocks, using the calculation shown in 9B. According to 9B, the start point nearest to the branch point is first identified in each of the first and third quadrants. A block in the second quadrant is exceptional, and linkage to the fourth quadrant is also exceptional, so only linkage to the first and third quadrants is checked.
Let the start point of the starting block be (Ax2, Ay2), the branch point be (0, 0), the start point of the nearest block in the first-quadrant direction be (Ax1, Ay1), and the start point of the nearest block in the third quadrant be (Ax3, Ay3). For the first-quadrant start point, the linkage index between blocks is calculated as the sum of "Ax1" and "Ay1-Ay2". Likewise, for the third-quadrant start point, the linkage index is calculated as the sum of "Ay3" and "Ax2-Ax3". At this stage, therefore, the closer two blocks are, the smaller the linkage index, and a smaller linkage index indicates stronger linkage.
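The distance-based linkage index of step S902 can be sketched as below. The text gives the sums "Ax1 + (Ay1-Ay2)" and "Ay3 + (Ax2-Ax3)" without stating how signs are handled; this sketch assumes each term is taken as a magnitude so that the index behaves as a distance. That assumption, the function names, and the sample coordinates are not from the publication.

```python
from typing import Tuple

Point = Tuple[float, float]  # coordinates relative to the branch point at (0, 0)


def linkage_index_q1(start: Point, candidate: Point) -> float:
    """First-quadrant candidate: Ax1 plus (Ay1 - Ay2), terms taken as magnitudes (assumption)."""
    _, ay2 = start
    ax1, ay1 = candidate
    return abs(ax1) + abs(ay1 - ay2)


def linkage_index_q3(start: Point, candidate: Point) -> float:
    """Third-quadrant candidate: Ay3 plus (Ax2 - Ax3), terms taken as magnitudes (assumption)."""
    ax2, _ = start
    ax3, ay3 = candidate
    return abs(ay3) + abs(ax2 - ax3)


# Starting block begins up and to the left of the branch point.
start = (-200.0, 300.0)
right_block = (30.0, 300.0)    # first quadrant: next column, same top edge
lower_block = (-200.0, -40.0)  # third quadrant: block below, same left edge

q1 = linkage_index_q1(start, right_block)
q3 = linkage_index_q3(start, lower_block)
# A smaller index means stronger linkage.
next_direction = "first quadrant" if q1 < q3 else "third quadrant"
```

Here the column to the right is 30 units from the branch point while the block below is 40 units away, so the first-quadrant block is the stronger candidate before any content-based adjustment.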

Step S903 takes the content information in the page into account and modifies the linkage index calculated in step S902; this processing is shown in FIG. 10. First, in step S1001, if the page content contains guidance information that should take top priority in deciding the reading order between blocks, top-priority determination processing that follows that information is performed (FIG. 11). In step S1101, it is checked whether a symbol that guides the reading order between column blocks is present at the end of the text in the block. As shown in 11A, a document laid out in columns may contain arrows that guide the reading order. For example, with the arrow "↙", Unicode 0x2199 is placed at the end of the text, and in that case the block in the third-quadrant direction is preferentially chosen as the transition direction.

When such a guiding arrow is present, the block in the quadrant the arrow points to is given top priority for linkage, so the process proceeds to step S1103 (Yes in step S1101), where the block in the direction of the matching quadrant is determined to have top priority, and the top-priority determination processing then ends. If step S1101 in FIG. 11 yields No, step S1102 checks whether another top-priority condition applies. In step S1102, it is checked whether a graphic that guides the reading order between column blocks is laid out in the page. For example, as shown in 11B, documents laid out in columns often have lines separating the columns; such separator lines are common in documents with a newspaper-like layout. In the case of 11B, moving in the first-quadrant direction would cross a separator line, so the third-quadrant direction is determined to have top priority. When such a separating graphic is present, the block in the quadrant whose direction does not cross the separator is given top priority, so the process proceeds to step S1103 (Yes in step S1102), where the block in the matching quadrant direction is determined to have top priority, and the top-priority determination processing then ends. If step S1102 yields No, the top-priority determination processing ends without further action.

After the top-priority determination processing of step S1001, step S1002 checks whether a top-priority condition for judging linkage was satisfied. If Yes in step S1002, the page content consideration processing ends. If No, the process proceeds to step S1003, where linkage index editing processing is performed.

In the linkage index editing processing, step S1201 in FIG. 12 first checks whether an image is inserted between the blocks. This can be determined by checking whether the image area lies on the line connecting the branch point and the start point between the blocks. As shown in 12A, if an image is inserted (Yes in step S1201), the process proceeds to step S1202, where the linkage index is reduced by the extent that overlaps the image, and then to step S1203. If step S1201 yields No, the process proceeds directly to step S1203. Step S1203 checks whether the font size differs between the blocks. A change in font size between blocks suggests weak linkage, so this is reflected in the linkage index. Font sizes are compared using either the text at the head of each block or the average over each block. If the font size differs between blocks (Yes in step S1203), the linkage index is increased according to the size difference; for example, if the font size of the source block is 6 pt and that of the candidate destination block is 9 pt, the linkage index between those blocks is multiplied by 1.5. The process then proceeds to step S1205. If step S1203 yields No, the process proceeds directly to step S1205. Step S1205 checks whether the typeface differs between the blocks. A change of typeface between blocks suggests weak linkage, so if the typeface differs (Yes in step S1205), the linkage index is increased substantially, for example by doubling it. The linkage index editing processing then ends. If step S1205 yields No, the linkage index editing processing ends as is. When the linkage index editing processing ends, the page content consideration processing also ends.
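The content-based adjustments of FIGS. 11 and 12 could be sketched as follows. Only the arrow U+2199, the 6 pt to 9 pt example giving a factor of 1.5, and the doubling for a typeface change come from the text; the function and field names, the boolean separator flag, and the use of a larger-over-smaller ratio when the destination font is smaller are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional

# Only U+2199 (south-west arrow) is named in the text; other arrows would
# need their own entries.
ARROW_TO_QUADRANT = {"\u2199": 3}


def priority_quadrant(last_char: str, separator_blocks_q1: bool) -> Optional[int]:
    """Top-priority determination: return a quadrant if page content dictates one."""
    if last_char in ARROW_TO_QUADRANT:   # guiding symbol at the end of the block
        return ARROW_TO_QUADRANT[last_char]
    if separator_blocks_q1:              # a separator line lies toward quadrant 1
        return 3
    return None                          # fall back to the linkage index


@dataclass(frozen=True)
class Candidate:
    """A candidate destination block (hypothetical type)."""
    index: float          # distance-based linkage index
    image_overlap: float  # length of the branch-to-start segment covered by an image
    font_size: float
    typeface: str


def edit_index(src_font_size: float, src_typeface: str, cand: Candidate) -> float:
    idx = cand.index - cand.image_overlap  # an image in between does not count as distance
    if cand.font_size != src_font_size:
        # Assumed rule: scale by the ratio of the larger to the smaller size.
        idx *= max(cand.font_size, src_font_size) / min(cand.font_size, src_font_size)
    if cand.typeface != src_typeface:
        idx *= 2  # a typeface change suggests weak linkage
    return idx


same_face = edit_index(6.0, "Mincho", Candidate(30.0, 0.0, 9.0, "Mincho"))
other_face = edit_index(6.0, "Mincho", Candidate(30.0, 0.0, 9.0, "Gothic"))
```

`priority_quadrant` short-circuits the index entirely when the page gives explicit guidance, which mirrors the order of checks in the flowcharts: symbol first, then separator graphic, then the edited index.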

Returning to the inter-block alignment processing, in step S904, if a top-priority determination was made in step S903, the direction of the block transition follows the result of that determination. If no top-priority determination applied, the direction of the block transition is decided using the linkage index derived in steps S902 and S903. In step S905, once the transition direction has been decided as the first quadrant, only the first quadrant is treated as a transition target thereafter. Likewise, once the transition direction has been decided as the third quadrant, only the third quadrant is treated as a transition target thereafter. Then, in step S906, when no block remains in the target direction (target quadrant) after the processing of step S905, the transition processing is reset, and the process proceeds to step S907. Step S907 checks whether any block that has not yet been processed (aligned) remains in the page. If so (Yes in step S907), the process returns to step S901 to repeat the processing. If not (No in step S907), the inter-block alignment processing ends.

Thereafter, in step S504, the text in each block is extracted in the block order produced by the alignment process of step S503. Extracting text in accordance with the document layout in this way makes it possible to obtain the text without impairing the textual coherence of the page as a whole.
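As a rough sketch of this final step, the function below concatenates text block by block. The name `extract_text` and the `(y, text)` data layout are hypothetical. The within-block sort on Y coordinates follows the abstract, and the sketch assumes horizontal text in PDF default user space, where Y grows upward so the topmost line has the largest Y.

```python
def extract_text(ordered_blocks):
    """Join text from blocks that are already in reading order.

    Each block is a list of (y, text) pairs, one per text string.
    """
    lines = []
    for block in ordered_blocks:
        # Assumption: larger Y means higher on the page, so read top to bottom.
        for _, text in sorted(block, key=lambda item: item[0], reverse=True):
            lines.append(text)
    return "\n".join(lines)


left_column = [(700, "second line"), (712, "first line")]
right_column = [(712, "third line")]
print(extract_text([left_column, right_column]))
```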

Therefore, when a search engine uses the present invention, accuracy improves in advanced searches, such as concept searches, that depend on the textual coherence of the page as a whole.

The present invention may be applied to a system or integrated apparatus composed of a plurality of devices (for example, a host computer, an interface device, and a printer), or to an apparatus composed of a single device.

The object of the present invention is also achieved by supplying a system or apparatus with a storage medium (or recording medium) on which the program code of software that realizes the functions of the above-described embodiment is recorded. In this case, the program code read from the storage medium itself realizes the functions of the above-described embodiment, and the storage medium storing the program code constitutes the present invention. This also covers the case where an operating system (OS) or the like performs part or all of the actual processing based on instructions of the program code read by the computer, and the functions of the above-described embodiment are realized by that processing.

Furthermore, this also covers the case where the program code read from the storage medium is written to a memory provided in a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided in the function expansion card or function expansion unit performs part or all of the actual processing based on instructions of the program code, and the functions of the above-described embodiment are realized by that processing.

When the present invention is applied to such a storage medium, the storage medium stores program code corresponding to the flowcharts described above.

FIG. 1 is a diagram showing an example preview of a PDF file and an example of the sequence of text drawing commands in that PDF file.
FIG. 2 is a diagram showing an example of a PDF laid out in multiple columns.
FIG. 3 is FIG. 2 with an illustration of the baselines added.
FIG. 4 is a block diagram showing an example hardware configuration that realizes the document processing system of this embodiment.
FIG. 5 is a flowchart showing an example of the document processing procedure of this embodiment.
FIG. 6 is a flowchart showing the layout analysis process of this embodiment, with a supplementary diagram.
FIG. 7 is a diagram showing an example of a baseline, which is a constituent element of a typeface.
FIG. 8 is a flowchart showing the text string combination process of this embodiment, with a supplementary diagram.
FIG. 9 is a flowchart showing the alignment process between blocks of this embodiment, with a supplementary diagram.
FIG. 10 is a flowchart showing the page content consideration process of this embodiment, with a supplementary diagram.
FIG. 11 is a flowchart showing the highest-priority determination process of this embodiment, with a supplementary diagram.
FIG. 12 is a flowchart showing the linkage index editing process of this embodiment, with a supplementary diagram.
FIG. 13 is a diagram showing an example of the user interface for condition input in this embodiment.

101 PDF file preview example
102 Text drawing commands represented by the PDF file 101
201 PDF file preview example
301 Baseline
302 Baseline
303 Baseline
401 CPU
402 Keyboard
403 Display
404 Hard disk
405 ROM
406 RAM
407 Layout analysis processing unit
408 Text string combination processing unit
409 Inter-block alignment processing unit
6A Supplementary explanation of flowchart
6B Supplementary explanation of flowchart
701 Baseline
8A Explanation using combination operator notation
8B Supplementary explanation of flowchart
8C Supplementary explanation of flowchart
9A Supplementary explanation of flowchart
9B Supplementary explanation of flowchart
11A Supplementary explanation of flowchart
11B Supplementary explanation of flowchart
12A Supplementary explanation of flowchart
1301 User interface for condition setting

Claims

A document processing apparatus (410) comprising: blocking means (S502) for performing a layout analysis process (S501) on an electronic document including layout information and grouping text strings that are parallel and close to one another into a single block; and block linkage determination means (S503) for determining linkage in order to decide the reading order between the blocks, wherein text strings are extracted in an order that follows the linkage ranking (S504).

The document processing apparatus (410) according to claim 1, wherein text strings and coordinate information of the text strings are acquired from the electronic document including layout information (S801), and the text strings are grouped into a block (S804) when the coordinate information satisfies the conditions for blocking (S802, S803).

The document processing apparatus (410) according to claim 1, wherein the block linkage determination means (S503) comprises highest-priority determination means (S1001) for interpreting decisive information for determining linkage.

The document processing apparatus (410) according to claim 1, wherein, when the highest-priority determination (S1001) is not made, a linkage index is calculated and a determination is made by comparing the indexes (S1003).

The document processing apparatus (410) according to claim 1, wherein a barometer of the conditions used for the blocking determination can be switched through a user interface (1301).