JP2002073598A

JP2002073598A - Document processor and method of processing document

Info

Publication number: JP2002073598A
Application number: JP2000254053A
Authority: JP
Inventors: Kazuyuki Saito; 和之齋藤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-08-24
Filing date: 2000-08-24
Publication date: 2002-03-12

Abstract

PROBLEM TO BE SOLVED: To extract contents such as "text", "picture", and "table" from a digitized document so that they can be handled in an integrated manner and reused. SOLUTION: In a document processing apparatus that processes digitized documents, a digitized document creation unit 103 analyzes the layout of image data, divides it into regions of predetermined attributes, and creates a digitized document 104 that contains the content of each region in a form that can be extracted by specifying its attribute. A content detection unit 109 detects the contents in the digitized document 104, and a content management unit 110 registers and manages the detected contents based on information indicating their attributes.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to document processing apparatus technology, and in particular to the handling of document data when information is exchanged as digitized document data between digital devices that are centered on a computer and connected by a network.

[0002]

[Description of the Related Art] Conventionally, when a digitized document obtained from a paper document and an electronic document created with a word processor, spreadsheet application, or the like are to be processed in an integrated manner (that is, edited, searched, output, and so on as one "document"), the generally known approach is to fix on a single application in advance and then either optically read the paper document and input it as a document image, or input the "text" obtained by applying character recognition only to the text regions of that document image.

[0003]

[Problems to Be Solved by the Invention] However, these methods have the problem that contents (parts) such as "text", "picture", and "table" cannot be extracted from a digitized document or an electronic document and then handled in an integrated manner or reused.

[0004] The present invention has been made in view of the above problem. Its object is to provide a document processing apparatus and method that can extract contents (parts) such as "text", "picture", and "table" from a digitized document or an electronic document and handle them in an integrated manner or reuse them.

[0005]

[Means for Solving the Problems] A document processing apparatus according to the present invention for achieving the above object has the following configuration. Namely, it is a document processing apparatus for processing a digitized document, comprising: content detection means for detecting content in the digitized document; management means for registering and managing the content detected by the detection means based on information indicating its attribute; and output means for outputting content, in units of the content managed by the management means.

[0006] A document processing method according to the present invention for achieving the above object is a document processing method for processing a digitized document, comprising: a content detection step of detecting content in the digitized document; a management step of registering and managing the content detected in the detection step based on information indicating its attribute; and an output step of outputting content, in units of the content managed in the management step.

[0007]

[Embodiments of the Invention] Preferred embodiments of the present invention are described below with reference to the accompanying drawings.

[0008] [First Embodiment] FIG. 1 is a block diagram showing the configuration of a document handling system according to the first embodiment.

[0009] In FIG. 1, 101 is a document, which is either a paper document or an electronic document; 102 is an image input unit such as a scanner; 103 is a digitized document creation unit that creates a digitized document from an input image; 104 is a digitized document; 105 is an electronic document input unit; 106 is an electronic document; 107 is an image conversion unit that converts an electronic document into an image when its format is unknown; 108 is an image produced by the image conversion unit; 109 is a content detection unit that detects content in an electronic document or digitized document; 110 is a content management unit that manages content; 111 is content; 112 is a content reuse unit that reuses content; and 113 is an output unit such as a printer or display.

[0010] Next, the processing flow of the document handling system shown in FIG. 1 is described with reference to FIGS. 2, 3, 4, 5, and 6.

[0011] First, when the document 101 to be input is an image, a document image is input through the image input unit 102 (step S201). Next, the digitized document creation unit 103 creates the digitized document 104 from the input document image (step S202).

[0012] The creation of the digitized document 104 in step S202 is now described. FIG. 3 is a flowchart explaining the digitized document creation process (step S202).

[0013] In this example, layout analysis is performed in step S301, and the input image is divided into regions by attribute, such as title ("text"), body ("text"), and non-text regions ("picture", "table"). Next, the attribute of each divided region is determined and the following processing is performed (step S302).

[0014] For a region whose attribute is "text", the partial image is extracted (step S307), OCR processing (character recognition) is performed (step S308), and character codes are extracted (step S309). The extracted data is converted into XML data (step S310).

[0015] For a region whose attribute is "table", the partial image is extracted (step S304), table analysis is performed (step S305), and table data is extracted (step S306). The extracted data is converted into XML data in the same way as for text (step S310).

[0016] For a region whose attribute is "picture", the partial image is extracted (step S303). The extracted data is converted into XML data in the same way as above (step S310).
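The per-attribute branching of steps S302 to S310 can be sketched as follows. This is only an illustration under stated assumptions: the element names (`document`, `text`, `table`, `picture`, `row`, `cell`), the `image` attribute, and the dictionary keys describing a region are hypothetical, since the source does not specify the XML schema, and the layout analysis, OCR, and table analysis are assumed to have already produced the region data.

```python
import xml.etree.ElementTree as ET


def build_digitized_document(regions):
    """Build an XML tree with one element per layout region (cf. step S310).

    `regions` is a list of dicts with hypothetical keys:
      attribute: "text", "table", or "picture"
      image:     file name of the extracted partial image
      chars:     recognized character codes (text regions only)
      rows:      list of rows of cell strings (table regions only)
    """
    root = ET.Element("document")
    for region in regions:
        attribute = region["attribute"]
        if attribute not in ("text", "table", "picture"):
            raise ValueError(f"unknown attribute: {attribute}")
        # Every region keeps a reference to its partial image.
        node = ET.SubElement(root, attribute, {"image": region["image"]})
        if attribute == "text":
            node.text = region["chars"]          # OCR result
        elif attribute == "table":
            for row in region["rows"]:           # table analysis result
                row_node = ET.SubElement(node, "row")
                for cell in row:
                    ET.SubElement(row_node, "cell").text = cell
    return root


regions = [
    {"attribute": "text", "image": "r1.png", "chars": "Quarterly report"},
    {"attribute": "table", "image": "r2.png", "rows": [["Q1", "10"], ["Q2", "12"]]},
    {"attribute": "picture", "image": "r3.png"},
]
xml_text = ET.tostring(build_digitized_document(regions), encoding="unicode")
```

Using the attribute itself as the element name means a later stage can pull out all content of one kind with a single tag lookup, which is the property the description relies on.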

[0017] Returning to FIG. 2, the XML document created as described above is treated as the digitized document. It is classified by tag for each attribute, such as "text", "table", and "picture", and registered in the content management unit 110 (step S203), which manages the content 111. In this embodiment, content is managed as follows. The extracted contents (images, text, tables, and so on) are saved as files at predetermined locations on the network. The XML document holds information that associates each file's name, attribute, storage location, and the like (and, further, data such as the summaries and translations created in the second and third embodiments described later). The contents are then classified and managed by the tags in this XML document that indicate their attributes.
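A minimal sketch of such an index is shown below, assuming a hypothetical `ContentManager` class, a `contents` root element, and `name` and `location` attributes; none of these names come from the source, which only states that file name, attribute, and storage location are associated in an XML document.

```python
import xml.etree.ElementTree as ET


class ContentManager:
    """Keeps an XML index that associates each content file with its attribute."""

    def __init__(self):
        self.index = ET.Element("contents")

    def register(self, attribute, file_name, location, **extra):
        # The tag name carries the attribute ("text", "table", "picture").
        entry = ET.SubElement(
            self.index, attribute, {"name": file_name, "location": location}
        )
        for key, value in extra.items():   # e.g. links to a summary or translation
            entry.set(key, value)
        return entry

    def find_by_attribute(self, attribute):
        """Return the file names of all contents with the given attribute."""
        return [entry.get("name") for entry in self.index.findall(attribute)]

    def locate(self, file_name):
        """Return the storage location of a content file, or None if unknown."""
        for entry in self.index:
            if entry.get("name") == file_name:
                return entry.get("location")
        return None


manager = ContentManager()
manager.register("text", "body.txt", "//server/contents/body.txt")
manager.register("picture", "fig1.png", "//server/contents/fig1.png")
manager.register("text", "title.txt", "//server/contents/title.txt")
```

The files themselves stay where they were saved; only the index is consulted to classify or find them.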

[0018] When the input target is an electronic document, it is input through the electronic document input unit 105. In this embodiment, if the input electronic document is a markup language document (HTML, XML, or the like), it is input as the original electronic document 106 without modification (FIG. 4, step S401).

[0019] Next, tag analysis (step S402) is performed on the input electronic document (that is, a markup language document such as HTML or XML). The tags are classified into those corresponding to attributes such as "text", "picture", and "table"; it is determined whether each attribute is "text", "picture", or "table" (step S403); and the data of each tag is detected as "text code data", "table data", or "image data" accordingly (steps S404, S405, S406, and S407). The tags corresponding to each attribute are then registered in the content management unit 110 (step S408), and the content is managed. Under this management method, each content item contained in the electronic document (text, picture, table, and so on) is classified and managed using the tag that indicates its attribute.
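The tag analysis of steps S402 to S407 might look like the following sketch for an HTML input. The mapping from HTML tags to attributes in `TAG_ATTRIBUTES` is an assumption made for illustration; the source does not say which tags correspond to which attribute.

```python
from html.parser import HTMLParser

# Assumed mapping from markup tags to content attributes.
TAG_ATTRIBUTES = {"p": "text", "h1": "text", "img": "picture", "table": "table"}


class TagAnalyzer(HTMLParser):
    """Collects (attribute, data) pairs from a markup language document."""

    def __init__(self):
        super().__init__()
        self.contents = []    # detected contents, in document order
        self._open = None     # tag whose character data is being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attribute = TAG_ATTRIBUTES.get(tag)
        if attribute == "picture":
            # Image data is referenced by its source location.
            self.contents.append(("picture", dict(attrs).get("src", "")))
        elif attribute and self._open is None:
            self._open, self._buffer = tag, []

    def handle_data(self, data):
        if self._open:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            text = " ".join(" ".join(self._buffer).split())
            self.contents.append((TAG_ATTRIBUTES[tag], text))
            self._open = None


analyzer = TagAnalyzer()
analyzer.feed(
    "<h1>Report</h1><p>Sales rose.</p><img src='chart.png'>"
    "<table><tr><td>Q1</td><td>10</td></tr></table>"
)
detected = analyzer.contents
```

Here table data is flattened to a string for brevity; a fuller version would keep the row and cell structure.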

[0020] If the electronic document is not a markup language document (for example, a plain-text document or a document in a format specific to a particular word processor), it is input as an unknown electronic document (step S501). The image conversion unit 107 converts the electronic document into an image (step S502), the converted image is input again through the image input unit 102 (step S503), the digitized document 104 is created (step S504, the same processing as in FIG. 3), and the content is registered (step S505).
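The three input paths of FIGS. 2, 4, and 5 can be summarized as a small routing function. Deciding the path from the file extension, and the particular extension sets, are assumptions for this sketch; the source does not say how the system recognizes the input type.

```python
from pathlib import Path

# Hypothetical extension sets used only for this illustration.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
MARKUP_SUFFIXES = {".html", ".htm", ".xml"}


def choose_input_path(file_name):
    """Return which registration flow an input document should follow."""
    suffix = Path(file_name).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "layout_analysis"      # FIG. 2: image input, then FIG. 3 processing
    if suffix in MARKUP_SUFFIXES:
        return "tag_analysis"         # FIG. 4: input as is, then tag analysis
    return "image_conversion"         # FIG. 5: convert to image, then FIG. 3 processing
```

Any format the system does not understand falls through to image conversion, so every document ends up with registered contents.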

[0021] The registered content is then output (displayed, in this example) by the output unit 113 as shown in FIG. 6. The form of output is of course not limited to display and may be printing or the like. In the content reuse described in the second and subsequent embodiments, the output goes to the application that performs the reuse. FIG. 6 shows an example application of the document handling system displaying a list of registered contents. Here, the left window shows the storage location on the hard disk, and the middle window shows thumbnails of the original electronic or digitized documents from which the contents were taken. The right window shows the thumbnail or text of each content item together with its attribute, size, and so on. In the thumbnail display of the original documents, a "W" indicates that the original is an electronic document (an MS-WORD document); in practice a reduced image of the document is displayed inside it (that is, a thumbnail of the document that contains the content shown in the right window).

[0022] As described above, according to the first embodiment, contents (parts) such as "text", "picture", and "table" can be extracted from a digitized document or an electronic document, and the extracted contents can be handled in an integrated manner as one document.

[0023] [Second Embodiment] Selected content can also be reused. In the second embodiment, as one example of content reuse, text content is summarized and the resulting summary document is registered as new content.

[0024] FIG. 7 is a flowchart explaining content reuse according to the second embodiment. FIG. 8 is a diagram explaining the display state according to the second embodiment.

[0025] First, the content to be reused is selected (step S701). When summarization is selected as the content reuse process (step S702), the content management unit 110 looks up the content selected in step S701 from the tags of the XML document and extracts its data (step S703). Summary creation is then executed on the extracted data to create a summary document (step S704). After that, it is determined whether the created summary document is to be re-registered as content (step S705); if so, it is registered in the content management unit 110 in step S706. If not, processing ends. Whether to re-register is indicated by responding YES or NO to a dialog such as "Re-register?" that is displayed after creation of the summary data, translation data, or the like has finished. The content (summary document) registered in step S706 is displayed as shown in FIG. 8.

[0026] The content selection in step S701 is made by selecting a thumbnail in the right window of FIG. 6 with a pointing device such as a mouse. Based on the file name of the selected content (an ID or the like may be used instead), the content management unit 110 searches the XML document, which holds related information such as the content's file location, file name, and attribute, determines the actual file location of the content, extracts that file as data, and passes it to the summarization or translation process. As described above, according to the second embodiment, contents (parts) such as "text", "picture", and "table" can be extracted from a digitized document or an electronic document and reused.
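The flow of steps S703 to S706 can be sketched as below. Several things here are assumptions: the index layout (`name` and `location` attributes), the in-memory `store` dictionary standing in for files on the network, the `.summary` naming, and above all the `summarize` function, which simply keeps the first sentence and is a placeholder rather than the summarization method of the source.

```python
import xml.etree.ElementTree as ET


def summarize(text, max_sentences=1):
    """Placeholder summarizer: keep the leading sentences of the text."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return ""
    return ". ".join(sentences[:max_sentences]) + "."


def reuse_as_summary(index, store, file_name, reregister):
    """Look up a content file, summarize it, and optionally register the result."""
    # S703: find the entry by file name and resolve its storage location.
    entry = next((e for e in index if e.get("name") == file_name), None)
    if entry is None:
        raise KeyError(file_name)
    data = store[entry.get("location")]
    # S704: create the summary document.
    summary = summarize(data)
    # S705/S706: register the summary as new text content if requested.
    if reregister:
        location = entry.get("location") + ".summary"
        store[location] = summary
        ET.SubElement(
            index, "text", {"name": file_name + ".summary", "location": location}
        )
    return summary


index = ET.Element("contents")
ET.SubElement(index, "text", {"name": "body.txt", "location": "/store/body.txt"})
store = {"/store/body.txt": "Sales rose in Q1. Costs fell. Outlook is stable."}
result = reuse_as_summary(index, store, "body.txt", reregister=True)
```

The `reregister` flag plays the role of the YES or NO answer to the dialog; with `False`, the summary is returned but the index is left unchanged.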

[0027] [Third Embodiment] The third embodiment describes another example of content reuse, in which the selected text content is translated and the resulting translated document is registered as new content.

[0028] FIG. 9 is a flowchart explaining content reuse according to the third embodiment. FIG. 10 is a diagram explaining the display state according to the third embodiment.

[0029] First, content is selected (step S901). When translation is selected as the content reuse process (step S902), the content management unit 110 looks up the registered content from the tags of the XML document and extracts its data (step S903). A translated document is then created from the extracted data by the translation creation process (step S904), and it is determined whether the translated document is to be re-registered as content (step S905). If so, it is registered in the content management unit 110 by the re-registration process (step S906). If not, processing ends. The registered content (translated document) is displayed as shown in FIG. 10.

[0030] As described above, according to the third embodiment, the new content obtained by translating text content can be handled and reused.

[0031] [Fourth Embodiment] The fourth embodiment describes the case where text content is transferred to an application such as a web browser, word processor, spreadsheet application, or presentation tool and reused in that application.

[0032] FIG. 11 is a flowchart explaining content reuse according to the fourth embodiment. First, content is selected (step S1101), application transfer is selected as the content reuse process (step S1102), and the destination application is selected (step S1103).

[0033] The content management unit 110 then looks up the registered content from the tags of the XML document and extracts its data (step S1104). An HTML document is created from the extracted data by the HTML conversion process (step S1105) and input to the destination application (step S1106), which makes the content reusable in that application.
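A minimal sketch of the HTML conversion in step S1105 for text content follows. The function name, the page title, and the choice to turn each blank-line-separated block into a paragraph are assumptions; the source only states that an HTML document is created from the extracted data.

```python
from html import escape


def text_content_to_html(text, title="Reused content"):
    """Wrap extracted text content in a small standalone HTML document."""
    # Treat blank lines as paragraph breaks; escape markup characters.
    paragraphs = [block.strip() for block in text.split("\n\n") if block.strip()]
    body = "".join(f"<p>{escape(block)}</p>" for block in paragraphs)
    return (
        "<!DOCTYPE html><html><head>"
        f"<title>{escape(title)}</title>"
        f"</head><body>{body}</body></html>"
    )


html_document = text_content_to_html("Sales rose in Q1.\n\nMargin < 5% & falling.")
```

Escaping matters here because the extracted text may itself contain characters such as `<` or `&` that the destination application would otherwise interpret as markup.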

[0034] As described above, according to the fourth embodiment, text content can be reused in an application.

[0035] Needless to say, the reuse processes shown in the second to fourth embodiments may all be made executable so that a desired process can be selected for reuse. For example, in step S1102, processing may proceed to step S704 of FIG. 7 if summary creation is selected, and to step S904 of FIG. 9 if translation is selected.
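One way to picture this selectable reuse is a dispatch table keyed by the chosen process. The three handlers below are deliberately trivial stand-ins (first line as summary, a tiny hypothetical glossary as translation, a paragraph wrapper as transfer) and the process names are assumptions; only the branching itself reflects the paragraph above.

```python
def make_summary(text):
    # Stand-in for summary creation (S704): keep only the first line.
    return text.splitlines()[0] if text else ""


def make_translation(text):
    # Stand-in for translation (S904): word lookup in a hypothetical glossary.
    glossary = {"文書": "document", "表": "table"}
    return " ".join(glossary.get(word, word) for word in text.split())


def transfer_to_application(text):
    # Stand-in for HTML conversion before transfer (S1105).
    return f"<p>{text}</p>"


REUSE_PROCESSES = {
    "summary": make_summary,
    "translation": make_translation,
    "transfer": transfer_to_application,
}


def run_reuse(process, data):
    """Run the reuse process selected by the user on the extracted data."""
    if process not in REUSE_PROCESSES:
        raise ValueError(f"unsupported reuse process: {process}")
    return REUSE_PROCESSES[process](data)
```

Adding another reuse process then only requires a new entry in the table, without touching the selection logic.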

[0036] As described above, the embodiments provide effects such as the following: contents (parts) can be extracted from paper documents and electronic documents; contents (parts) of both paper documents and electronic documents can be reused; and contents (parts) of both paper documents and electronic documents can be managed in an integrated manner.

[0037] [Other Embodiments] The present invention may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, a reader, and a printer) or to an apparatus consisting of a single device (for example, a copying machine or a facsimile machine).

[0038] Needless to say, the object of the present invention is also achieved by supplying a system or apparatus with a storage medium (or recording medium) on which the program code of software realizing the functions of the above embodiments is recorded, and having the computer (or CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium. In this case, the program code read from the storage medium itself realizes the functions of the above embodiments, and the storage medium storing that program code constitutes the present invention. It also goes without saying that this includes not only the case where the functions of the above embodiments are realized by the computer executing the program code it has read, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of that program code, and the functions of the above embodiments are realized by that processing.

[0039] It further goes without saying that this includes the case where the program code read from the storage medium is written to memory provided on a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are realized by that processing.

[0040] When the present invention is applied to such a storage medium, the storage medium stores program code corresponding to the flowcharts described above.

[0041]

[Effects of the Invention] As described above, according to the present invention, contents (parts) such as "text", "picture", and "table" can be extracted from a digitized document or an electronic document and handled in an integrated manner or reused.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of the system according to the first embodiment.

FIG. 2 is a flowchart showing the processing flow up to content registration in the system according to the first embodiment when the input document is an image.

FIG. 3 is a flowchart showing the flow of the digitized document creation process according to the first embodiment.

FIG. 4 is a flowchart showing the processing flow up to content registration in the system according to the first embodiment when the input document is a known electronic document.

FIG. 5 is a flowchart showing the processing flow up to content registration in the system according to the first embodiment when the input document is an unknown electronic document.

FIG. 6 is a diagram showing an example of the content output produced by the processing according to the first embodiment.

FIG. 7 is a flowchart showing an example of the processing flow that enables content reuse (summary creation) according to the second embodiment.

FIG. 8 is a diagram showing an example of the content output produced by the processing according to the second embodiment.

FIG. 9 is a flowchart showing an example of the processing flow that enables content reuse (translation creation) according to the third embodiment.

FIG. 10 is a diagram showing an example of the content output produced by the processing according to the third embodiment.

FIG. 11 is a flowchart showing an example of the processing flow that enables content reuse (application transfer) according to the fourth embodiment.

Continuation of front page
(51) Int. Cl.7, identification symbol / FI
G06F 12/00 546 / G06F 12/00 546A
G06F 17/30 220 / G06F 17/30 220B
G06F 17/30 310 / G06F 17/30 310C

Claims

[Claims]

1. A document processing apparatus for processing a digitized document, comprising: content detection means for detecting content in the digitized document; management means for registering and managing the content detected by the detection means based on information indicating its attribute; and output means for outputting content, in units of the content managed by the management means.

2. The document processing apparatus according to claim 1, further comprising: reading means for optically reading a document image of an original document to obtain image data; and digitized document creation means for creating a digitized document based on the image data.

3. The document processing apparatus according to claim 2, wherein the digitized document creation means comprises: layout analysis means for dividing the input image into regions of predetermined attributes; and generation means for generating a digitized document that contains the content of each region divided by the layout analysis means in a form that can be extracted by specifying an attribute.

4. The document processing apparatus according to claim 2 or 3, further comprising image conversion means for converting an electronic document into image data, wherein the digitized document creation means creates a digitized document based on the image data obtained by the image conversion means.

5. The document processing apparatus according to any one of claims 2 to 4, wherein the digitized document creation means creates, based on the image data, a digitized document described in a markup language, and the content detection means detects, as content, information classified by tags from the digitized document created in the markup language.

6. The document processing apparatus according to claim 3, further comprising: attribute identification means for identifying, among the divided regions, a region that contains a recognizable character string; and OCR means for executing character recognition on the region identified as containing a recognizable character string and obtaining character codes, wherein the digitized document creation means creates the digitized document based on the images in the divided regions and the character codes obtained by the OCR means.

7. The document processing apparatus according to claim 1, further comprising content reuse means for extracting and using predetermined information from the content registered and managed by the management means.

8. The document processing apparatus according to claim 7, wherein the processing of the content reuse means includes generating, based on text data contained in the content, text representing a summary of that text data.

9. The document processing apparatus according to claim 8, wherein the content reuse means registers the text representing the summary as new content.

10. The document processing apparatus according to claim 7, wherein the processing of the content reuse means includes creating, based on text data contained in the content, a translated text of that text data.

11. The document processing apparatus according to claim 10, wherein the content reuse means registers the translated text as new content.

12. The document processing apparatus according to claim 7, wherein the processing performed by the content reuse means includes creating a markup language document using content selected from the content managed by the management means.

13. The document processing apparatus according to claim 12, wherein the content reuse means transfers the markup language document to a predetermined application.

14. A document processing method for processing a digitized document, comprising: a content detecting step of detecting content in the digitized document; a management step of registering and managing the content detected in the content detecting step based on information indicating its attribute; and an output step of outputting the substance of the content, taking the content managed in the management step as the unit of output.

15. The document processing method according to claim 14, further comprising: a reading step of optically reading a document image of an original document to obtain image data; and a digitized document creation step of creating a digitized document based on the image data.

16. The document processing method according to claim 15, wherein the digitized document creation step comprises: a layout analysis step of dividing the input image into areas of predetermined attributes; and a generation step of generating a digitized document that includes the content of each area divided in the layout analysis step in such a way that the content can be extracted by designating an attribute.

17. The document processing method according to claim 15 or 16, further comprising an image conversion step of converting an electronic document into image data, wherein the digitized document creation step creates a digitized document based on the image data obtained in the image conversion step.

18. The document processing method according to any one of claims 15 to 17, wherein the digitized document creation step creates a digitized document described in a markup language based on the image data, and the content detecting step detects, as content, information classified by tags from the digitized document created in the markup language.

19. The document processing method according to claim 16, further comprising: an attribute identifying step of identifying, among the divided areas, an area that includes a recognizable character string; and an OCR step of executing character recognition processing on the area identified as including a recognizable character string and obtaining character codes, wherein the digitized document creation step creates the digitized document based on the images in the divided areas and the character codes obtained in the OCR step.

20. The document processing method according to claim 14, further comprising a content reuse step of extracting and using predetermined information from the content registered and managed in the management step.

21. The document processing method according to claim 20, wherein the content reuse step includes generating, based on text data included in the content, text representing a summary of that text data.

22. The document processing method according to claim 21, wherein, in the content reuse step, the text representing the summary is registered as new content.

23. The document processing method according to claim 20, wherein the content reuse step includes creating, based on text data included in the content, a translated text of that text data.

24. The document processing method according to claim 23, wherein, in the content reuse step, the translated text is registered as new content.

25. The document processing method according to claim 20, wherein the content reuse step includes creating a markup language document using content selected from the content managed in the management step.

26. The document processing method according to claim 25, wherein, in the content reuse step, the markup language document is transferred to a predetermined application.

27. A storage medium storing a control program for causing a computer to implement the document processing method according to any one of claims 14 to 26.
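
For illustration only, the content detecting, management, and output steps recited in the method claims above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in the patent: the tag names (text, picture, table), the sample document, and the names ContentRegistry, detect_contents, and build_registry are all hypothetical, and the markup language is assumed to be XML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tag names; the claims only say content is "classified by tags".
CONTENT_TAGS = {"text", "picture", "table"}

# Hypothetical digitized document written in an XML-like markup language.
SAMPLE = """<document>
  <text>Quarterly summary.</text>
  <picture src="chart.png"/>
  <table><row><cell>Q1</cell><cell>10</cell></row></table>
  <text>Outlook for next year.</text>
</document>"""


def detect_contents(markup):
    """Content detecting step: return (attribute, element) pairs for known tags."""
    root = ET.fromstring(markup)
    return [(el.tag, el) for el in root if el.tag in CONTENT_TAGS]


class ContentRegistry:
    """Management step: register content under the attribute given by its tag."""

    def __init__(self):
        self._by_attribute = defaultdict(list)

    def register(self, attribute, payload):
        self._by_attribute[attribute].append(payload)

    def contents(self, attribute):
        # Return a copy so callers cannot modify the registry by accident.
        return list(self._by_attribute.get(attribute, []))


def build_registry(markup):
    """Detect every content item in the markup and register it by attribute."""
    registry = ContentRegistry()
    for attribute, el in detect_contents(markup):
        if attribute == "picture":
            # Assumption: a picture is represented by a reference to its file.
            payload = el.get("src", "")
        else:
            # Flatten nested text (for example table cells) into one string.
            payload = " ".join(t.strip() for t in el.itertext() if t.strip())
        registry.register(attribute, payload)
    return registry


def output_contents(registry, attribute):
    """Output step: print each registered content item, one item per line."""
    for item in registry.contents(attribute):
        print(f"[{attribute}] {item}")


if __name__ == "__main__":
    output_contents(build_registry(SAMPLE), "text")
```

Keying the registry by attribute mirrors the claim language "based on information indicating its attribute"; a reuse step such as summarization or translation could read items from the registry and register its result as new content in the same way.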