JP2007164705A

JP2007164705A - Method and program for converting computerized document

Info

Publication number: JP2007164705A
Application number: JP2005363603A
Authority: JP
Inventors: Toru Takazawa; 通高澤; Hiroyuki Iwabuchi; 博之岩渕
Original assignee: S TEN NINE KYOTO KK
Current assignee: S TEN NINE KYOTO KK
Priority date: 2005-12-16
Filing date: 2005-12-16
Publication date: 2007-06-28

Abstract

PROBLEM TO BE SOLVED: To automatically convert a computerized document containing various typesetting formats into a predetermined structured document.

SOLUTION: A structure style definition, in which structure styles corresponding to the document-structure elements of a predetermined schema are preset, is read for the computerized document 10, and a preset basic analysis rule to be applied to the computerized document 10 is read. Paragraphs are extracted from the character strings of the computerized document 10 alone, based on the basic analysis rule (32), and the document-structure element of each extracted paragraph is determined. The hierarchical level within the paragraph is judged from the character positions and constituent character types in the paragraph, based on the basic analysis rule (34), and the document-structure element corresponding to the judged level is determined (37). The structure styles set in the structure style definition for the determined paragraph elements and in-paragraph elements are then applied in place of the format styles (44).

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an improvement in techniques for converting the representation format of a computerized document, and in particular to a method for converting a computerized document that contains typesetting formats and format styles.

In recent years, with the spread of the Internet, organizations such as government agencies and companies have come to manage documents electronically as structured documents in formats such as SGML (Standard Generalized Markup Language) and XML (eXtensible Markup Language) (for example, electronic case reports of adverse drug reactions and infections, http://www.info.pmda.go.jp./info/pi_index.html). Computerized documents written in SGML or XML are stored in databases and made available to the public over the Internet and similar networks. As is well known, SGML focuses on the hierarchical logical structure of a document and structures the document by attaching, to each element that makes up the content, a tag expressing that element's meaning in the logical structure. Unifying the tagging rules keeps the logical structure of documents of the same kind consistent and improves the efficiency of document search and reuse.

To create SGML or XML computerized documents, the document data must be prepared with a predetermined document structure in mind (a DTD (Document Type Definition), or a schema description language or schema language).

Computerized documents are generally created with DTP (Desk Top Publishing) software or typesetting and editing software. Because such software is aimed at printing and PDF output, authors rely on its style-setting function, which is applied to paragraphs and characters to specify the typesetting format (hereinafter, a format style).

These format styles are unnecessary in XML and SGML and are to be removed, so they are used instead to identify the document structure. That is, to convert an existing computerized document into XML or SGML data, a style that identifies the XML or SGML document structure (hereinafter, a structure style) is applied in place of the format style applied to the existing document.

As a technique for converting an existing computerized document into SGML or XML, a method is known in which document elements containing character strings that require style setting are extracted, and a list of style tags corresponding to those document elements is displayed for the user to choose from (for example, Patent Document 1). A method is also known in which an input computerized document is converted into another representation format according to the content of a conversion definition statement (for example, Patent Document 2).
Patent Document 1: JP 2000-339307 A
Patent Document 2: JP 2001-357030 A

In the conventional examples above, however, the existing document that serves as the conversion source was created with such software for printing or PDF output. The format styles applied to it therefore concern the typesetting format and have no correspondence with the structure tags that express an XML or SGML document structure, so the styles cannot simply be swapped one for one.

Existing computerized documents also contain formatting commands such as indentation and line breaks, and whitespace characters inserted to adjust the layout. In XML and SGML these are unnecessary noise and must be removed.

Doing this work by hand is hardly different from preparing the source data for XML or SGML with the predetermined document structure in mind, so converting from an existing document brings no gain in efficiency.

Manual work also risks deleting information that is actually needed, which creates additional work such as proofreading the data again.

Furthermore, Patent Documents 1 and 2 analyze a computerized document with fixed rules and apply structure styles, but they cannot cope with computerized documents in a wide variety of format styles. In the electronic case reports of adverse drug reactions and infections mentioned above, in securities reports, and in package inserts for pharmaceuticals and medical devices, many organizations supply the information. The items to be shown are prescribed, but within each item every manufacturer is free to apply its own document structure.

Some structured documents have a fixed content layout that is nevertheless defined redundantly to absorb variety, with its components described selectively. Others define many hierarchical levels beneath a structure or component, and the level must be decided from the semantic content of the text. For such documents, even with Patent Documents 1 and 2, an existing document cannot be converted into a structured document unambiguously or with simple rules.

For example, for the electronic case reports of adverse drug reactions and infections and the package inserts for pharmaceuticals and medical devices, the print typesetting format of the input document varies from manufacturer to manufacturer, and even a single manufacturer may use different formats for different kinds of devices or drugs. Applying Patent Documents 1 and 2 as they are therefore does not yield the intended structured document.

The present invention was made in view of these problems, and its object is to automatically convert a computerized document containing diverse typesetting formats into a predetermined structured document.

The present invention is a method for converting a computerized document by reading the computerized document and converting the format styles in it into predetermined structure styles. The method reads the computerized document; reads a structure style definition in which structure styles corresponding to the document-structure elements of a predetermined schema are preset for the computerized document; reads a preset basic analysis rule to be applied to the computerized document; extracts paragraphs from the character strings of the read document alone, based on the basic analysis rule; determines the document-structure element of each extracted paragraph; judges the hierarchical level in the document structure within the paragraph from the character positions and constituent character types in the paragraph, based on the basic analysis rule; determines the document-structure element corresponding to the judged level; and applies, in place of the format styles, the structure styles set in the structure style definition to the determined paragraph elements and to the determined elements within the paragraphs.

According to the present invention, the read computerized document is divided into paragraphs by the basic analysis rule, and the hierarchical structure within each paragraph is determined. By setting, for each paragraph, the structure style that corresponds to its hierarchical level in place of the format style, the intended structured document can be obtained automatically.

An embodiment of the present invention is described below with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a computer to which the present invention is applied.

The computer 1 includes a controller 2 composed of a CPU and the like; a memory 3 that the controller 2 reads and writes; an interface unit 4 that controls I/O devices; a disk device 5, connected to the interface unit 4, that stores the computerized document 10 as the input document and the structured document 20 as the output document in a predetermined representation format; a keyboard 6 and a mouse 7 that pass operator input to the interface unit 4; and a display device 8, connected to the interface unit 4, that displays messages and data.

The computerized document 10 serving as the input document consists of document data containing format styles, created with the DTP (Desk Top Publishing) software or typesetting and editing software mentioned above, or of a document obtained by converting image data read from a scanner or the like into text data by OCR.

The structured document 20 serving as the output document is document data obtained by analyzing the input computerized document 10, as described later, and applying structure styles that identify the XML or SGML document structure described above.

A computerized document conversion program 30, which reads the computerized document 10 and converts it into the structured document 20 in a preset representation format, is loaded into the memory 3 and executed on the operator's instruction.

A document creation and editing program 50, which creates and edits the computerized document 10 used as input, is also loaded into the memory 3 as needed and executed on the operator's instruction. The document creation and editing program 50 consists of DTP (Desk Top Publishing) software, typesetting and editing software, or the like.

An outline of the computerized document conversion program 30 follows.

First, the computerized document conversion program 30 is started, and the document reading unit 31 reads either a preset computerized document 10 or a computerized document 10 specified by the operator of the computer 1.

To convert the read computerized document 10 into a format with the predetermined structure styles, the conversion program 30, on the operator's instruction, has the applied structure style analysis unit 37 read the structure style definition for the target structured document and store it in the structure style definition storage unit 41 of the structure style storage unit 40. The structure style definition predefines structure styles that describe the document-structure elements of a schema such as XML or SGML, so that the structured document 20 targeted by the conversion can be obtained.

Next, the block extraction unit 32 (heading analysis unit 33) analyzes the document-structure elements (heading parts) of the read computerized document 10, as described later, using the heading analysis rules (format style, typesetting format, wording, and characters) stored in the style setting rule storage unit 36, and extracts the group of paragraphs from one heading to the next as one block. A block contains one or more paragraphs, with the heading part as the paragraph at the top hierarchical level.

The in-block paragraph analysis unit 34 then analyzes the character-string patterns of the paragraphs in each extracted block according to analysis rule 2, described later.

Based on the results from the in-block paragraph analysis unit 34, the applied structure style analysis unit 37 judges the hierarchical level of each paragraph, determines its document-structure element, and applies the structure style definition.

For this purpose, as described later, the unit includes a paragraph-head identifier match determination unit 38, which judges the hierarchical level by comparing whether the leading character type of the previous paragraph matches that of the current target paragraph, and a character-string pattern match determination unit 39, which judges the hierarchical level by comparing whether the character-string pattern of the previous paragraph matches that of the current target paragraph. Determination units 38 and 39 refer to the structure style storage unit 40. The structure style storage unit 40 includes the structure style definition storage unit 41, which stores the structure style definition read above; an applied style storage unit 42, which stores a style setting table 420 recording the character-string pattern of each paragraph and the structure style applied to it; and an applied hierarchical level storage unit 43, which stores the hierarchical levels applied to the leading character types of the paragraphs in a block.
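
The role of the applied hierarchical level storage unit 43 can be pictured with a small sketch. The patent leaves the actual decision logic to the later flowchart, so everything here is an assumption made for illustration: the `char_type` categories, the `LevelTracker` class, and the rule that a marker type seen for the first time is placed one level below the previous paragraph's marker type.

```python
from typing import Dict, Optional


def char_type(ch: str) -> str:
    """Classify a leading marker character (hypothetical categories)."""
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if ch in "・*＊※★":
        return "symbol"
    return "other"


class LevelTracker:
    """Remembers which hierarchical level each leading character type received."""

    def __init__(self, max_level: int = 3) -> None:
        self.levels: Dict[str, int] = {}      # plays the role of storage unit 43
        self.prev_type: Optional[str] = None  # type seen on the previous paragraph
        self.max_level = max_level            # the source defines levels L1 to L3

    def level_for(self, marker: str) -> int:
        t = char_type(marker)
        if t not in self.levels:
            # Assumed rule: a new marker type nests under the previous one.
            parent = self.levels.get(self.prev_type, 0)
            self.levels[t] = min(parent + 1, self.max_level)
        self.prev_type = t
        return self.levels[t]


tracker = LevelTracker()
levels = [tracker.level_for(m) for m in ["1", "a", "b", "2", "a"]]
```

A marker type that matches one already recorded reuses its stored level, which is the effect the comparison with the previous paragraph is meant to achieve.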

The document-structure elements are items such as "part, section, chapter" and "article, clause" together with the content of each item, and are preset in the basic analysis rules stored in the style setting rule storage unit 36. As shown in FIG. 2, the style setting rule storage unit 36 stores an analysis table 360 (holding analysis rules 1, 2, and 3 and the exception analysis rules), used for extracting blocks and analyzing within blocks, and a style setting rule 3611. The analysis table 360 includes a registration table 370, described later.

The structure style application unit 44 then applies the structure style definition stored in the structure style definition storage unit 41 to the hierarchical levels and document-structure elements analyzed by the in-block paragraph analysis unit 34, and generates the structured document.

The structure style definition stored in the structure style definition storage unit 41, and the analysis rules and style setting rules stored in the style setting rule storage unit 36 of the in-block paragraph analysis unit 34, are each read on the operator's instruction. Alternatively, the structure style definition, the analysis rules, and the style setting rules may be written in a single file; when that file is read, the document conversion program 30 separates it into the structure style definition and the analysis rules and stores each in its storage unit.
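
The single-file variant can be illustrated as follows. The source does not specify a file format, so the JSON layout, the key names `structure_styles` and `analysis_rules`, and the function `split_settings` are all hypothetical choices for this sketch.

```python
import json
from typing import Any, Dict, Tuple


def split_settings(text: str) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Split one combined settings file into its two parts."""
    data = json.loads(text)
    # Hypothetical top-level keys; each part would go to its own storage unit.
    return data["structure_styles"], data["analysis_rules"]


SAMPLE = """
{
  "structure_styles": {"Use-cautions": "Cautions for use"},
  "analysis_rules": {"ignored_chars": "<>"}
}
"""

styles, rules = split_settings(SAMPLE)
```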

The present embodiment takes the case where the structured document is an SGML document; the applied structure style analysis unit 37 adds an SGML tag to each document-structure element of the input computerized document 10.

The in-block paragraph analysis unit 34 stores the basic in-paragraph analysis rules 1 and 2 and the exception rules in the style setting rule storage unit 36 in advance and, as described later, decides whether to apply the basic analysis rules 1 and 2 (hereinafter, the basic analysis rules) or the exception rules to the paragraphs in a block.

<Heading analysis unit: analysis rule 1>
In the following example, the computerized document 10 shown in FIG. 3 is the input to the document conversion program 30 and is converted into a predetermined structured document 20. FIG. 3 shows the output image when this computerized document 10 is displayed in the document creation and editing program 50, and FIG. 4 shows part of the DTD (Document Type Definition). The computerized document 10 was generated or edited with the document creation and editing program 50.

The example presents analysis rule 1. While the heading analysis unit 33 of the document conversion program 30 analyzes the format styles, the typesetting format, and the wording and characters that make up the paragraphs of the computerized document 10 to be converted, this rule determines the corresponding document-structure element solely from the character string that makes up a paragraph.

The headings "<Cautions for use>" and "<Important basic precautions>" in the computerized document 10 shown in FIG. 3 correspond to the elements defined in the DTD of FIG. 4 as
<!ELEMENT Use-cautions - - ( variablelabel?, (%detailandlows )* ) >
<!ELEMENT Important-precautions - - ( variablelabel?, (%detailandlows )* ) >

In this example, no matter what typesetting format is used to present "<Cautions for use>" or "<Important basic precautions>" in the input computerized document 10, and even if the notation differs, for instance with other brackets as in "『Cautions for use』", the character string that remains after deleting the one character before and after the wording is just "Cautions for use" or "Important basic precautions" and contains no other wording. Focusing on this fact allows the corresponding element in the DTD to be determined uniquely.

In other words, this analysis rule exploits the fact that the layout of the described content is fixed. A paragraph whose document-structure element can be determined by such a rule is hereinafter called a heading part.

The analysis rule that determines a heading part defines the character strings, character types, and so on that are to be excluded when the character string of a paragraph is analyzed. In the example above, "<" and ">" are the characters and character types that the heading analysis unit 33 excludes from recognition.
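
A minimal sketch of this heading rule, under stated assumptions, is given here. The heading strings are English glosses of the Japanese headings 使用注意 and 重要な基本的注意; the set `IGNORED_CHARS` extends the "<" and ">" named in the source with a few bracket characters as an illustration; and the function name `match_heading` is hypothetical.

```python
from typing import Optional

# Characters excluded from recognition (only "<" and ">" come from the source;
# the full-width and corner brackets are assumed additions).
IGNORED_CHARS = "<>＜＞『』「」"

# Heading wording mapped to the DTD element names quoted in the text.
HEADING_ELEMENTS = {
    "Cautions for use": "Use-cautions",
    "Important basic precautions": "Important-precautions",
}


def match_heading(paragraph: str) -> Optional[str]:
    """Return the DTD element for a heading paragraph, or None if it is not one."""
    core = "".join(ch for ch in paragraph if ch not in IGNORED_CHARS).strip()
    # The paragraph counts as a heading only if nothing but the wording remains.
    return HEADING_ELEMENTS.get(core)
```

Because the lookup requires an exact match on what remains, a body sentence that merely mentions the heading wording is not mistaken for a heading.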

An example of how structure styles are applied in place of the format styles of the computerized document 10, in order to convert the input computerized document 10 into the target structured document 20, is described below.

FIG. 5 shows an example of a structure style definition 410 that gives the correspondence between document-structure elements and style names. In FIG. 5, for the element tags "Use-cautions" and "Important-precautions" under element name 411, the structure style names 412 to be associated with them are defined as "Cautions for use" and "Important basic precautions".

Next, specific wording is searched for. In this case, when preset wording such as "Cautions for use" or "Important basic precautions" is detected, the predetermined structure style is automatically applied to the part of the document structure where it was detected. If the computerized document 10 was created by a document creation means (the document creation and editing program 50) that has the concepts of text styles and paragraph styles, the structure style can be applied as a paragraph style.

The structure style definition 410 treats elements such as "Cautions for use" and "Important basic precautions" as heading parts and defines "item" and "content" at the hierarchical level below a heading part. Below that, the hierarchy is divided into levels L1 to L3; the element names "low1subitem" to "low3subitem" are set for the respective levels, and an item ("item") and content ("detail") are set for each.
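
One way to hold such a definition in memory is sketched below. The element names come from the text; the `StructureStyle` record, the "parent/role" path notation, and the style names given to levels L1 to L3 are assumptions, since FIG. 5 is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StructureStyle:
    element: str     # element name from the DTD
    style_name: str  # structure style name applied to the paragraph
    level: int       # 0 = heading part, 1..3 = levels L1..L3


def build_definition() -> List[StructureStyle]:
    styles = [
        StructureStyle("Use-cautions", "Cautions for use", 0),
        StructureStyle("Important-precautions", "Important basic precautions", 0),
    ]
    for level in (1, 2, 3):
        parent = f"low{level}subitem"
        for role in ("item", "detail"):
            # Hypothetical style names such as "L1 item"; the real names are in FIG. 5.
            styles.append(StructureStyle(f"{parent}/{role}", f"L{level} {role}", level))
    return styles


definition = build_definition()
```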

Then, based on analysis rule 1, the style name 412 of the structure style definition 410 is determined and the structure style is applied. The result of applying the structure styles to the computerized document 10 is output in SGML format and stored in the disk device 5. If the computerized document 10 is not output in SGML format, it can be output as tagged text and converted into the structured document 20 in XML, SGML, or a similar format with a suitable tool or conversion program.

At this point the structure style "Cautions for use" or "Important basic precautions" has been applied to the paragraph containing "<", ">", and the like. Using the style name itself in place of the wording in the paragraph removes "<" and ">", which are unnecessary noise in XML and SGML, and makes it possible to convert to the proper wording required by the DTD.

The paragraphs from a heading part to which analysis rule 1 of the heading analysis unit 33 has applied a structure style, up to the paragraph immediately before the next heading part, form the group of paragraphs belonging to the preceding heading part. This group is called a block.
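
The block definition amounts to grouping paragraphs by the heading that precedes them. The sketch below assumes the heading test is supplied as a callable and that any paragraphs before the first heading form a leading block of their own; the source does not say how such paragraphs are handled.

```python
from typing import Callable, List


def split_into_blocks(
    paragraphs: List[str], is_heading: Callable[[str], bool]
) -> List[List[str]]:
    """Group paragraphs into blocks, each starting at a heading part."""
    blocks: List[List[str]] = []
    for paragraph in paragraphs:
        if is_heading(paragraph) or not blocks:
            blocks.append([paragraph])    # a heading opens a new block
        else:
            blocks[-1].append(paragraph)  # otherwise extend the current block
    return blocks


sample = ["<A>", "1. first", "body text", "<B>", "more text"]
blocks = split_into_blocks(sample, lambda p: p.startswith("<"))
```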

The heading analysis unit 33 and the block extraction unit 32 execute S1 shown in FIG. 13, described later, and the structure styles are applied to the heading parts. As a result, in the computerized document 10 of FIG. 3, "Cautions for use" and "Important basic precautions" are extracted as heading parts. From the structure style definition of FIG. 5, the style name 412 "Cautions for use", corresponding to element name 411 "Use-cautions", and the style name 412 "Important basic precautions", corresponding to element name 411 "Important-precautions", are applied; the structure style names that were set and the heading parts of the structured document 20 are as shown in FIG. 6. As described later, for the paragraphs below each heading part, the document conversion program 30 applies document-structure elements such as item or content at each level of the hierarchy, and the paragraphs of the structured document 20 and the structure styles applied to them are also as shown in FIG. 6.

In the structured document 20 generated from the computerized document 10 in this way and written in XML/SGML format, each "structure style" name corresponds to an element defined in the DTD. Converting the "structure style" names into the XML/SGML element tags prescribed by the DTD produces the desired XML/SGML document shown in FIG. 7.
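
The final renaming step can be sketched as a lookup from style name to element tag. This is an illustration only: the flat "item" and "detail" styles, the `to_sgml` function, and the output layout are assumptions (FIG. 7 is not reproduced here), and no escaping of markup characters in the text is attempted.

```python
from typing import Dict, List, Optional, Tuple

STYLE_TO_ELEMENT: Dict[str, str] = {
    "Cautions for use": "Use-cautions",
    "Important basic precautions": "Important-precautions",
    "item": "item",
    "detail": "detail",
}
HEADING_STYLES = {"Cautions for use", "Important basic precautions"}


def to_sgml(styled: List[Tuple[str, str]]) -> str:
    """Turn (style name, text) pairs into tagged lines."""
    out: List[str] = []
    open_heading: Optional[str] = None
    for style, text in styled:
        element = STYLE_TO_ELEMENT[style]
        if style in HEADING_STYLES:
            if open_heading:
                out.append(f"</{open_heading}>")
            # The heading wording is dropped: the element name carries it.
            out.append(f"<{element}>")
            open_heading = element
        else:
            out.append(f"<{element}>{text}</{element}>")
    if open_heading:
        out.append(f"</{open_heading}>")
    return "\n".join(out)


result = to_sgml([
    ("Cautions for use", "<Cautions for use>"),
    ("item", "1. Storage"),
    ("detail", "Keep dry."),
])
```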

The tool or program that performs this conversion is outside the scope of the present invention and may be implemented as an ordinary conversion tool or program as appropriate.

<In-block paragraph analysis unit and applied structure style analysis unit 37: analysis rule 2>
Next, the processing of the in-block paragraph analysis unit 34 shown in FIG. 1 is described. The processing of the in-block paragraph analysis unit 34 and the applied structure style analysis unit 37 corresponds to S2 to S20 shown in FIG. 13, and this logic is referred to as analysis rule 2.

For the group of paragraphs extracted by the block extraction unit 32, the in-block paragraph analysis unit 34 looks at the character positions and constituent character types within each paragraph, judges the paragraph's hierarchical level in the document structure based on the preset analysis rule 2, and determines the corresponding document-structure element.

Each paragraph in a block that the block extraction unit 32 extracted with analysis rule 1 is classified as either an "item" or "content" by the in-block paragraph analysis unit 34, which analyzes what the paragraph says. The unit focuses on the relationship between a paragraph's content and its written format, and analyzes it as follows.

In this embodiment, an "item" is defined as an element that forms a hierarchical paragraph within the extracted block, and "content" is defined as an element that holds the document text subordinate to an "item".

First, the general relationship between "item" and "content" in the electronic document 10, the input document, can be expressed as follows. The format used to write an "item" has these characteristics:
(A) When several items are listed, a specific character is placed at the beginning of the paragraph:
(1) when the order of the items is to be specified, a character indicating order (an Arabic numeral, a letter of the alphabet, an iroha kana, and so on; hereinafter an "ordinal") is placed;
(2) when items are simply written side by side, a character that marks each item (any graphic character such as a middle dot, an asterisk, a reference mark "※", "注" (note), or "★"; hereinafter an "item clarifier") is placed;
(3) whether the specific character is enclosed in parentheses, followed by a period, and so on is arbitrary.
Next comes a character string that contains no punctuation (hereinafter a "sentence"), and at the end of the paragraph:
(4) if the paragraph begins with an ordinal, no specific character is placed at the end of the paragraph;
(5) if the paragraph begins with an item clarifier, either no specific character is placed at the end of the paragraph or a punctuation mark (full stop, period, colon, semicolon, and so on) is placed there.

Next, the format used to write "content" is one of the following:
(B) a sentence only;
(C) a sentence with a full stop or period marking its end at the end of the paragraph;
(D) when several pieces of content are listed, an ordinal at the beginning of the paragraph, followed by a sentence, with a full stop or period marking the end of the sentence at the end of the paragraph.
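The format characteristics (A) to (D) can be sketched as a small classifier. This is a minimal illustration, not the patented logic itself: the marker sets in `ORDINAL` and `CLARIFIER` are assumptions chosen for the example (the description presets them in the structure style definition), and the function name `classify_format` is hypothetical.

```python
import re

# Assumed marker sets for illustration only.
ORDINAL = re.compile(r"^(?:[(（]\d+[)）]|\d+[.．)）]|[①-⑳])")
CLARIFIER = re.compile(r"^(?:[(（]?注[)）]|[・*※★○◇△])")
FULL_STOP = ("。", ".", "．")
OTHER_PUNCT = (":", ";", "：", "；")


def classify_format(paragraph):
    """Return 'A', 'B', 'C' or 'D' for one paragraph, or None if no format fits."""
    text = paragraph.strip()
    if ORDINAL.match(text):
        # Ordinal + sentence is an item (A-4) unless a full stop closes it (D).
        return "D" if text.endswith(FULL_STOP) else "A"
    if CLARIFIER.match(text):
        # Item clarifier + sentence, with or without closing punctuation (A-5).
        return "A"
    if text.endswith(FULL_STOP):
        return "C"  # sentence closed by a full stop or period
    if text.endswith(OTHER_PUNCT):
        return None  # a bare line ending in a colon fits none of (A)-(D)
    return "B"  # sentence only
```

For example, `classify_format("1. TEXT")` gives `"A"` while `classify_format("1) TEXT。")` gives `"D"`, because only the second paragraph is closed by a full stop.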

Hereinafter, the ordinals, item clarifiers, and punctuation marks that the in-block paragraph analysis unit 34 uses to tell whether a paragraph is an item or content are called paragraph identifiers. These paragraph identifiers are set in advance as part of the structure style definition.

The in-block paragraph analysis unit 34 uses the characteristics of these paragraph identifiers to analyze the character string pattern within each paragraph.

FIG. 8 shows an example of an analysis table 360 in which paragraph identifiers, set in advance by analyzing the paragraphs of a document, are registered. The analysis table 360 consists of a registration table 370, which holds the results of analyzing paragraphs, and a basic rule specification 361, which stores the correspondence between style setting rules and character string patterns.

Within the analysis table 360, the contents of the basic rule specification 361 are set in advance by analyzing the target documents. The contents of the registration table 370, by contrast, can either be registered in advance by analyzing the target documents or be registered while the analysis is in progress.

The description below uses the method in which character identifiers and identification target character strings are registered during analysis.

The registration table 370 holds the results of analyzing the paragraphs in a block in two storage tables: a character string identifier definition 371, which registers leading and line-end (sentence-end) character types, and a character identifier setting 372, which registers the concrete character strings for each character type (character identifier) extracted from the target paragraphs.

The character string identifier definition 371 consists of a leading character identifier 3711, which registers character types that identify the beginning of a sentence, and a line-end character identifier 3712, which registers character types that identify the end of a line (end of a sentence). In the example of FIG. 8, analysis of the paragraphs has set the character types "1" (meaning an ordinal) and "・" and "注" (meaning item clarifiers) in the leading character identifier 3711, and the character type "。" (full stop) in the line-end character identifier 3712.

In the character identifier setting 372, the identifiers set in the character string identifier definition 371 are stored in the character identifier 3721 in order of appearance, and the characters extracted from the paragraphs are registered for each identifier. Character strings other than character identifiers are registered simply as "TEXT" rather than as the concrete strings.

The character type "1" in the leading character identifier 3711 is an ordinal. In FIG. 8, "1.", "(1)", "1)", and circled numbers are registered in the identification target character string 3722 as the ordinals extracted from the paragraphs.

Likewise, for "・" in the leading character identifier 3711, the item clarifiers "・", "○", "◇", "△", and so on are registered in the identification target character string 3722, and for "注" in the leading character identifier 3711, the item clarifiers "(注)", "注)", and "※" are registered.

In the basic rule specification 361, a character string pattern 3612 is set for each style setting rule 3611, which combines the hierarchical level of a paragraph within the block with the type "item" or "content". The correspondence between the style setting rules 3611 and the character string patterns is set in advance.

The style setting rules 3611 are set in advance as shown in FIG. 9. Each consists of a style setting rule name 3612, which identifies the paragraph's hierarchical level and whether it is an "item" or "content", and style setting rule contents 3613, which store the definition for each style setting rule name 3612. In the figure, "+0" marks a paragraph regarded as being at the same level (relative level +0), and "+1" marks a paragraph subordinate to the current paragraph (relative level +1). The style setting rule name 3611 "+0 content" denotes content of the current paragraph, "+1 item" denotes an "item" in a paragraph subordinate to the current paragraph, and "+1 content" denotes the content of an item subordinate to the current paragraph.

In the basic rule specification 361 of FIG. 8, two character string patterns 3612 are set for "+0 content", which denotes content of the current paragraph: "文" (sentence), the character identifier for a sentence without a full stop, and "文。", the character identifier "文" followed by the character identifier "。", for a sentence that ends with a full stop. For "+1 item", an item in a paragraph subordinate to the current paragraph, the patterns are "1文" (ordinal + sentence), "・文。" (item clarifier + sentence + full stop), and "・文" (item clarifier + sentence). For "+1 content", the content of a paragraph subordinate to the current paragraph, the pattern is "1文。" (ordinal + sentence + full stop). In addition, "注文。" (the item clarifier "注" + sentence + full stop) is set as a "+0 content" pattern, that is, as content of the current paragraph.
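The lookup described for FIG. 8 can be sketched as two small tables and a pattern builder. The identifier strings and the pattern-to-rule pairs are the ones quoted above, with "文" standing for a sentence; the names `IDENTIFIERS`, `BASIC_RULES`, `string_pattern`, and `style_rule` are hypothetical. As a simplifying assumption, the registered strings are matched literally, so "2." or a circled number would need additional entries.

```python
# Character identifier -> identification target character strings (from FIG. 8).
IDENTIFIERS = {
    "1": ["(1)", "1.", "1)"],
    "・": ["・", "○", "◇", "△"],
    "注": ["(注)", "注)", "※"],
}
LINE_END = "。"  # line-end character identifier

# Basic rule specification 361: character string pattern -> style setting rule.
BASIC_RULES = {
    "文": "+0 content",
    "文。": "+0 content",
    "1文": "+1 item",
    "・文。": "+1 item",
    "・文": "+1 item",
    "1文。": "+1 content",
    "注文。": "+0 content",
}


def string_pattern(paragraph):
    """Reduce a paragraph to its character string pattern, e.g. '1. TEXT' -> '1文'."""
    text = paragraph.strip()
    head = ""
    for identifier, strings in IDENTIFIERS.items():
        if any(text.startswith(s) for s in strings):
            head = identifier
            break
    tail = LINE_END if text.endswith(LINE_END) else ""
    return head + "文" + tail


def style_rule(paragraph):
    """Return the style setting rule for a paragraph, or None if no pattern matches."""
    return BASIC_RULES.get(string_pattern(paragraph))
```

Keeping the rules in a plain dictionary mirrors the statement that the pattern-to-rule correspondence is preset data rather than fixed logic, so a different document type would only swap the tables.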

The relationship between the style setting rule names 3611 and the character string patterns 3612 can be set as appropriate for each document or each block.

＜Paragraph item/content identification and hierarchical level determination＞
Next, the processing in which the in-block paragraph analysis unit 34 and the applied structure style analysis unit 37 use the analysis table 360 and the style setting rules 3611 to identify each paragraph as an item or content and to determine its hierarchical level is described.

In outline, the group of paragraphs that follows a heading, which delimits a block extracted by the block extraction unit 32, is first identified by the paragraph identifiers each paragraph contains, and then the hierarchical level of each paragraph is determined.

Processing first differs depending on whether the paragraph in question carries an ordinal or item clarifier. If it does, the ordinals and item clarifiers of format (A) and those of format (D) are registered separately in the registration table 370, by character type and in order of appearance. An ordinal or item clarifier that is identified but already registered in the registration table 370 is not registered again.

Registration in the registration table 370 follows the registration conditions (registration rules) below.

A: Even for the same kind of ordinal, such as Arabic numerals, "1", "1.", "(1)", circled numbers, and so on are all treated as different character types and registered in the identification target character string 3722 of the registration table 370.

B: If the paragraph's ordinal or item clarifier appears for the first time, it is registered in the identification target character string 3722.

C: If this is the first registration in the registration table 370, the character string is registered in the identification target character string 3722 so that this ordinal or item clarifier sits at the highest hierarchical level among the ordinals and item clarifiers.

D: If another ordinal or item clarifier is already registered, the new one is registered in the identification target character string 3722 one hierarchical level below it.

E: If the paragraph's ordinal or item clarifier has already appeared, one of two processes is performed. The first obtains the hierarchical level of the matching ordinal or item clarifier, attaches an applied mark to that level to indicate the level of the current paragraph, and cancels any applied mark attached to another level. The second obtains the hierarchical level of the matching ordinal or item clarifier and, at the same time, deletes the registrations of all ordinals and item clarifiers below it. Instead of using applied marks, the level may be stored in the applied hierarchical level storage unit 43.

The former process (canceling an applied mark attached to another level) assumes that ordinals and item clarifiers recur regularly in their order of registration, or that the order of registration corresponds exactly to the hierarchical levels.

The latter process (deleting the registrations of ordinals and item clarifiers below the matching one) interprets the reappearance of a higher-level ordinal or item clarifier as a sign that the author wants to redefine the ordering and item markers below that paragraph, and responds accordingly.

Either process can be selected as appropriate. The description below adopts the latter.
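Registration rules A to E, with the latter process chosen for rule E, can be sketched as an ordered list whose index is the hierarchical level. The class name `RegistrationTable` and the helper `marker_kind` are hypothetical. The helper also encodes an assumption the text does not state explicitly: "1." and "2." count as the same character type, while "1." and "(1)" do not (rule A).

```python
import re


def marker_kind(marker):
    """Normalise a marker so that '1.' and '2.' share one kind (assumption)."""
    return re.sub(r"\d+", "N", marker)


class RegistrationTable:
    """Ordered marker kinds; index 0 is the highest hierarchical level."""

    def __init__(self):
        self.levels = []

    def register(self, kind):
        """Return the level for kind, registering or truncating as needed."""
        if kind in self.levels:
            # Rule E, latter process: reuse the existing level and delete
            # every marker kind registered below it.
            level = self.levels.index(kind)
            del self.levels[level + 1:]
            return level
        # Rules B to D: a first appearance goes one level below the lowest
        # registered kind, or to the top level when the table is empty.
        self.levels.append(kind)
        return len(self.levels) - 1
```

With this sketch, registering "1.", "(1)", "・" and then "(2)" returns levels 0, 1, 2, 1 and leaves only the first two kinds registered, so a later "○" is free to take level 2 again.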

Next, the in-block paragraph analysis unit 34 handles the registration table 370 in one of the following ways:
・Keep it only while the current block is being processed, and clear it when processing of that block ends.
・Keep it while the target document is being processed, and apply the same registered contents to headings with the same structure style.
・Keep it as a library that is valid for the group of documents using the same DTD, and apply the same registered contents to headings with the same structure style.

Which of these is used depends on how consistently ordinals and item clarifiers were used when the existing documents were written; it can be decided in advance or made selectable. When consistency is high, the same registration table 370 can be shared across the document group; when usage is highly arbitrary, the table can be used only for the block being processed.

The description below assumes that the registration table 370 is kept only while the current block is being processed.

Next, the rules (logic) for determining the hierarchical level are described. This processing corresponds to step S5 in FIG. 13.
・When the paragraph is at the same hierarchical level as the immediately preceding paragraph, this is written "+0".
・When the paragraph is one level below the immediately preceding paragraph, this is written "+1".
・When the paragraph is n levels above the immediately preceding paragraph, this is written "-n".

(1) For format (A) or (D), regardless of whether the paragraph was identified as an item or content, determine whether the ordinal or item clarifier is to be registered in the registration table 370 and determine the hierarchical level.

If this is the first registration in the registration table 370, no hierarchical level can be obtained, so the level is decided by determination rules (2) onward below.

If registered ordinals or item clarifiers exist but the one being registered appears for the first time: +1.
If registered ordinals or item clarifiers exist and the one being registered matches the lowest of them: +0.
If registered ordinals or item clarifiers exist and the one being registered matches the one n levels above the lowest: -n. In this case, the registered ordinals and item clarifiers below that level are deleted.

If the hierarchical level has been decided by the rule above, determination rules (2) to (5) below are skipped.

(2) If the paragraph is identified as an "item" with format (A), the hierarchical level is +1.
(3) If the paragraph is identified as "content" with format (B), the hierarchical level is +0. However, if the immediately preceding paragraph is also content, the character string patterns of the two paragraphs are compared, and if they differ the hierarchical level is +1.
(4) If the paragraph is identified as "content" with format (C), the hierarchical level is +0. However, if the immediately preceding paragraph is also content, the character string patterns of the two paragraphs are compared, and if they differ the hierarchical level is +1.
(5) If the paragraph is identified as "content" with format (D), the hierarchical level is +1.
Once the hierarchical level has been determined by this procedure, the structure style to be given to the paragraph follows uniquely from the structure style applied to the immediately preceding paragraph.
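Determination rules (1) to (5) can be condensed into one function that returns the level change relative to the preceding paragraph. This is a sketch under stated assumptions: the function name `relative_level` and its parameters are hypothetical, `registered` is a plain list of marker kinds ordered from the highest level down, and the caller is assumed to have already classified the paragraph's format and compared its pattern with the previous one.

```python
def relative_level(fmt, marker, registered,
                   prev_is_content=False, same_pattern_as_prev=True):
    """Return +1, 0 or -n for a paragraph of format 'A', 'B', 'C' or 'D'."""
    if fmt in ("A", "D"):
        if marker in registered:
            # Rule (1): the marker matches a registered one n levels above
            # the lowest; lower registrations are deleted.
            index = registered.index(marker)
            n = len(registered) - 1 - index
            del registered[index + 1:]
            return -n
        # Rule (1) gives +1 for a marker that is new to a non-empty table.
        # A first registration falls through to rules (2) and (5), which
        # also give +1.
        registered.append(marker)
        return 1
    # Rules (3) and (4): content without a leading marker stays at +0 unless
    # the preceding paragraph is content with a different pattern.
    if prev_is_content and not same_pattern_as_prev:
        return 1
    return 0
```

For instance, starting from an empty list, the marker sequence "N.", "(N)", "(N)", "N." yields +1, +1, 0, -1.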

Viewed block by block, the structure style at the head of the block has already been decided, so the structure styles of the paragraphs that make up the rest of the block can be decided one after another from the top.

＜2. Implementation example of paragraph item/content identification and hierarchical level determination＞
An implementation example of the analysis above is described below.

As shown in FIG. 8, the paragraph identifiers obtained by analyzing the target document are divided into leading identifiers 3711 and trailing identifiers 3712, and for each, the concrete character codes of the ordinals, item clarifiers, and punctuation marks that may appear are registered as identification target character strings 3722.

In addition, the combinations of a sentence and these paragraph identifiers that may appear are listed as the paragraph identification settings, and each combination is associated, as a style setting rule 3611, with the item/content identification result and hierarchical level determination to use when that combination is detected. This information can be held in advance as a database (hereinafter the basic analysis rule DB; not shown), which allows analysis rules to be applied automatically or selectively.

Such analysis rules naturally differ by document type; the target documents of each type are analyzed and the rules extracted from them in advance. Which aspects of character position and constituent character type within a paragraph yield usable analysis rules also differs from document to document.

Even within a single document, the rules may differ for particular blocks or paragraphs, so a rule's scope may be a document group, a document, a block, or a paragraph. In every case, the present invention looks at character positions and constituent character types within a paragraph, identifies the paragraph as an item or content, determines its hierarchical level, and decides the corresponding document-structure element.

FIG. 11 is a style setting table 420 that shows the basic analysis rules obtained by analyzing the electronic document 10' shown in FIG. 10, expressed as the relationship between character string patterns and the style names (see FIG. 5) that correspond to the character strings. In FIG. 10, "TEXT" stands for arbitrary text.

In FIG. 10, the first line is ＜XXXX＞, so analysis rule 1 of the block extraction unit 32 identifies it as a heading, and lines 1 to 9, up to the next heading, are extracted as one block.

Next, the in-block paragraph analysis unit 34 carries out the processing described above for identifying paragraphs as items or content. First, it extracts paragraph identifiers from each paragraph in the block according to the registration conditions above and registers them in the registration table 370.

As shown in FIG. 12, the extracted paragraph identifiers are registered in order of appearance in the identification target character string 3722 of the registration table 370. The character types corresponding to the extracted paragraph identifiers are registered in order of appearance in the leading character identifier 3711 and the line-end character identifier 3712.

In the registration table 370, the identification target character string 3722 corresponding to a single character identifier 3721 can hold more than one character string.

For the block in FIG. 10, analysis under analysis rule 2 starts from line 2. The full stop "。" is extracted from line 2 as a line-end character identifier, the ordinal "1." from line 3, the item clarifier "注)" from line 5, the ordinal "1)" from line 6, and the item clarifier "○" from line 8. The character string "TEXT", which makes up the content, is extracted from every line. As shown in FIG. 12, these extracted paragraph identifiers are registered in order of appearance in the identification target character string 3722 of the registration table 370. In the leading character identifier 3711 of FIG. 8, the ordinal "1" and the item clarifiers "注" and "・" are registered in order of appearance, and the full stop "。" is registered in the line-end character identifier 3712.

Next, the applied structure style analysis unit 37 determines the hierarchical structure of the block of interest in the electronic document 10' using the paragraph identifiers in the registration table 370, and decides which style name to apply. It first determines the character string pattern of line 2, the line after the heading. Line 2 (the second paragraph) is checked for paragraph identifiers from the beginning of the paragraph: it has no leading paragraph identifier and consists of the character string "TEXT" and the full stop "。". Because the character string "文" (sentence) is at the top of the character identifiers 3721 in the registration table 370, the pattern of line 2 matches "文。", which is set as a character string pattern for the style setting rules 3611 in the analysis table 360. This pattern corresponds to the style setting rule "+0 content" defined in the basic rule specification 361, so line 2 is judged to be a character string pattern that represents the content of the heading.

At this point, the fact that the highest hierarchical level (L0) was applied to the character string is recorded in the applied hierarchical level storage unit 43 of the structure style storage unit 40. Because the paragraph is "content" at the highest hierarchical level, "content" at the highest level is selected from the structure style definition 410 of FIG. 5 and the style name "content" is applied to the paragraph. The fact that the top-level "content" was applied to the first paragraph in the block is then recorded in the style setting table 420 held in the applied style storage unit 42 shown in FIG. 11 (the gray portion in the figure). The style setting table 420 of FIG. 11 records, for each paragraph 422 in the block, the applied style name 421 and the character string pattern 423; this example also shows the actual character string of each paragraph to aid understanding, but that is not required.

Next, the character string pattern of line 3 (the third paragraph) is determined. Line 3 consists of the ordinal "1." and the character string "TEXT", so its pattern is "1文". This pattern corresponds to the style setting rule 3611 "+1 item" defined in the analysis table 360. The relative level of line 3 is therefore +1, which makes it the second hierarchical level in the block, and the paragraph is determined to be an "item" at hierarchical level L1.

At this point, the fact that the second hierarchical level (L1) was applied to the character string pattern "1文" is recorded in the applied hierarchical level storage unit 43 of the structure style storage unit 40. Because the paragraph is an "item" at the second hierarchical level, the "item" of the second level is selected from the structure style definition 410 of FIG. 5 and the style name "L1 item" is applied to the paragraph. The style setting table 420 shown in FIG. 11 records that the second-level "L1 item" was applied to the second paragraph in the block (the gray portion in the figure).

In the same way, for each subsequent line (paragraph), the character string pattern is determined from the analysis results in the registration table 370 and the definitions in the basic rule specification 361. The resulting pattern is compared with the character string patterns of the style setting rules 3611 in the analysis table 360, the style setting rule whose pattern matches is obtained from the analysis table 360, and the hierarchical level of the paragraph is decided relative to the immediately preceding paragraph. The style name corresponding to the decided level is then applied to the paragraph, and the applied style name is recorded in the style setting table 420.

＜Description of the processing flow＞
The flowchart of FIG. 13 shows an example of the processing executed by the document conversion program 30 described above. The processing in FIG. 13 is executed repeatedly, once each time a heading is extracted.

図１３において、Ｓ１では電子化文書１０を読み込んで、最初の見出し部を後述のように抽出し、見出し部に対応する構造スタイルを設定する（見出し解析部３３）。 In FIG. 13, in S1, the digitized document 10 is read, the first heading part is extracted as described later, and a structural style corresponding to the heading part is set (heading analysis part 33).

Ｓ２では、ブロックの開始であるので、見出し部からの階層の深さを示す変数である階層レベルＬを０にリセットし、ブロック内で適用する項目のモードを示す変数である項目モードＭを０にリセットする。なお、項目モードＭは、Ｍ＝０が項目であることを示し、Ｍ＝１が内容であることを示している。 In S2, since this is the start of a block, the hierarchy level L, a variable indicating the depth of the hierarchy from the heading part, is reset to 0, and the item mode M, a variable indicating the mode of the item applied in the block, is also reset to 0. For the item mode M, M = 0 indicates an item and M = 1 indicates content.

Ｓ３では、Ｓ１で抽出した見出しから次の見出しまでの１ブロックを段落毎に読み込んで、対象段落の文字列パターンを上述したように解析し、登録テーブル３７０に段落識別子を登録する（文字列パターン解析部３５）。 In S3, one block from the headline extracted in S1 to the next headline is read for each paragraph, the character string pattern of the target paragraph is analyzed as described above, and the paragraph identifier is registered in the registration table 370 (character string pattern). Analysis unit 35).

Ｓ４では、読み込んだ文字列が見出し部を含んでいるかを判定する。見出し部であれば、次のブロックであるので、処理を終了する。一方、見出し部でなければ、Ｓ５の処理に進んで、ブロック内段落解析部３４と適用構造スタイル解析部３７の処理を行う。 In S4, it is determined whether the read character string includes a heading part. If it is a heading part, it is the next block, so the processing is terminated. On the other hand, if it is not the headline part, the process proceeds to S5, where the in-block paragraph analysis part 34 and the applied structure style analysis part 37 are processed.

Ｓ５では、基本解析ルールに基づいて、Ｓ３で解析した文字列パターンについて、スタイル設定ルールを決定する。 In S5, a style setting rule is determined for the character string pattern analyzed in S3 based on the basic analysis rule.

そして、Ｓ６以降の処理は、決定したスタイル設定ルールに応じて次の処理が異なり、スタイル設定ルールが「＋０内容」の場合にはＳ７に進み、「＋１項目」の場合にはＳ１１へ進み、「＋１内容」の場合にはＳ１６に進む。 The processing after S6 differs depending on the determined style setting rule. If the style setting rule is “+0 content”, the process proceeds to S7. If the style setting rule is “+1 item”, the process proceeds to S11. In the case of “+1 content”, the process proceeds to S16.

同一段落内の内容を示す「＋０内容」の場合、Ｓ７ではスタイル設定テーブル４２０の文字列パターンを参照して、現在対象としている文字列パターンと、前段落の文字列パターンが一致しているか否かを判定する（文字列パターン一致判定部３９）。前段落と同一の文字列パターンであれば、段落内の階層レベルが同一の内容を示す文字列パターンであるので、Ｓ１０へ進んで前段落と同一の構造スタイルを当該段落に適用する。一方、文字列パターンが前段落と一致しない場合Ｓ８に進む。 In the case of “+0 content”, which indicates content within the same paragraph, S7 refers to the character string patterns in the style setting table 420 and determines whether the character string pattern currently being processed matches that of the previous paragraph (character string pattern match determination unit 39). If the pattern is the same as that of the previous paragraph, it indicates content at the same hierarchical level within the paragraph, so the process proceeds to S10 and the same structural style as the previous paragraph is applied to the current paragraph. If the pattern does not match that of the previous paragraph, the process proceeds to S8.

Ｓ８では、前段落が「項目」であったか否かを項目モードＭの値に基づいて判定する。項目モードＭ＝０であれば前段落が「項目」であるので、Ｓ９に進んで階層レベルＬに１を加算（Ｌ＝Ｌ＋１）して段落内の階層レベルを１つ上げた後にＳ１０へ進む。一方、項目モードＭ＝１の場合には、前段落が「内容」であるのでそのままＳ１０に進む。 In S8, it is determined based on the value of the item mode M whether or not the previous paragraph was “item”. If the item mode M = 0, the previous paragraph is “item”, so the process proceeds to S9, 1 is added to the hierarchy level L (L = L + 1), and the hierarchy level in the paragraph is increased by 1, and then the process proceeds to S10. . On the other hand, when the item mode M = 1, since the previous paragraph is “content”, the process proceeds to S10.

Ｓ１０では前段落の構造スタイル定義を現在の対象段落に適用する。さらに前段落が項目か内容の何れであるかを示す項目モードＭを０にセットする。つまり前段落が「項目」であると設定する。これは、前段落が同一段落内の「内容」であっても、同一段落内の内容同士の場合、階層レベルの上下関係はないので、次の段落の判定に前段落が「項目」であったことを次の段落へ引き継ぐ。そして再びＳ３に戻って同様の処理を繰り返す。 In S10, the structural style definition of the previous paragraph is applied to the current target paragraph. In addition, the item mode M, which indicates whether the previous paragraph is an item or content, is set to 0; that is, the previous paragraph is treated as an “item”. The reason is that even if the previous paragraph is “content” within the same paragraph, contents within the same paragraph have no hierarchical relationship to one another, so the fact that the previous paragraph was an “item” is carried over to the determination of the next paragraph. The process then returns to S3 and repeats.

次に、Ｓ６の判定で、スタイル設定ルールが「＋１項目」の場合、Ｓ１１に進んで現在対象としている段落の先頭文字種が、前段落の先頭文字種と一致するか否かを判定する（段落先頭識別子一致判定部３８）。 Next, when the style setting rule is “+1 item” in the determination of S6, the process proceeds to S11 to determine whether or not the first character type of the currently targeted paragraph matches the first character type of the previous paragraph (paragraph head). Identifier match determination unit 38).

Ｓ１１では、前段落と同一の先頭文字種であれば、現在の対象段落の「項目」と判定してＳ１５に進む。一方、先頭文字種が前段落と異なる場合にはＳ１２へ進む。 In S11, if it is the same first character type as the previous paragraph, it is determined as the “item” of the current target paragraph, and the process proceeds to S15. On the other hand, if the first character type is different from the previous paragraph, the process proceeds to S12.

Ｓ１２では、適用済階層レベル記憶部４３を参照して、対象段落の先頭文字種が、同一ブロック内で既に発生した先頭文字種と一致するか否かを判定する。現在の対象段落の先頭文字種が既に抽出された先頭文字種と一致した場合には、Ｓ１４に進んで一致した先頭文字種の階層レベルと同一の階層レベルをセットする。一方、一致する先頭文字種がない場合には、Ｓ１３で階層レベルＬに１を加算（Ｌ＝Ｌ＋１）して段落内の階層レベルを１つ上げた後にＳ１５へ進む。 In S12, the applied hierarchy level storage unit 43 is referred to and it is determined whether or not the first character type of the target paragraph matches the first character type that has already occurred in the same block. If the first character type of the current target paragraph matches the extracted first character type, the process proceeds to S14 to set the same hierarchical level as the hierarchical level of the matched first character type. On the other hand, if there is no matching first character type, 1 is added to the hierarchy level L in S13 (L = L + 1), the hierarchy level in the paragraph is increased by 1, and the process proceeds to S15.

Ｓ１５では対象段落の構造スタイル定義を、現在の階層レベルＬの「項目」に決定し、該当するスタイル名に対応する構造スタイル定義を構造スタイル定義４１０から読み込んで当該段落に適用する。さらに前段落が項目か内容の何れであるかを示す項目モードＭを０にセットして「項目」であることを設定する。そして、再びＳ３に戻って同様の処理を繰り返す。 In S15, the structural style definition of the target paragraph is determined to be the “item” of the current hierarchical level L, and the structural style definition corresponding to that style name is read from the structural style definition 410 and applied to the paragraph. In addition, the item mode M, which indicates whether the previous paragraph is an item or content, is set to 0 to record that it is an “item”, in line with the definition given for S2 (M = 0 for item). The process then returns to S3 and repeats.

次に、Ｓ６の判定で、スタイル設定ルールが「＋１内容」の場合、Ｓ１６に進んで現在対象としている段落の先頭文字種が、前段落の先頭文字種と一致するか否かを判定する（段落先頭識別子一致判定部３８）。 Next, if the style setting rule is “+1 content” in the determination of S6, the process proceeds to S16 to determine whether or not the first character type of the currently targeted paragraph matches the first character type of the previous paragraph (paragraph head). Identifier match determination unit 38).

Ｓ１６では、前段落と同一の先頭文字種であれば、現在の対象段落の「内容」と判定してＳ２０に進む。一方、先頭文字種が前段落と異なる場合にはＳ１７へ進む。 In S16, if it is the same first character type as the previous paragraph, it is determined as “content” of the current target paragraph, and the process proceeds to S20. On the other hand, if the first character type is different from the previous paragraph, the process proceeds to S17.

Ｓ１７では、適用済階層レベル記憶部４３を参照して、対象段落の先頭文字種が、同一ブロック内で既に発生した先頭文字種と一致するか否かを判定する。現在の対象段落の先頭文字種が既に抽出された先頭文字種と一致した場合には、Ｓ１９に進んで一致した先頭文字種の階層レベルと同一の階層レベルをセットする。一方、一致する先頭文字種がない場合には、Ｓ１８で階層レベルＬに１を加算（Ｌ＝Ｌ＋１）して段落内の階層レベルを１つ上げた後にＳ２０へ進む。 In S17, with reference to the applied hierarchy level storage unit 43, it is determined whether or not the first character type of the target paragraph matches the first character type already generated in the same block. If the first character type of the current target paragraph matches the extracted first character type, the process advances to S19 to set the same hierarchical level as the hierarchical level of the matched first character type. On the other hand, if there is no matching first character type, 1 is added to the hierarchy level L in S18 (L = L + 1), and the hierarchy level in the paragraph is increased by 1, and the process proceeds to S20.

Ｓ２０では対象段落の構造スタイル定義を、現在の階層レベルＬの「内容」に決定し、該当するスタイル名に対応する構造スタイル定義を構造スタイル定義４１０から読み込んで当該段落に適用する。さらに前段落が項目か内容の何れであるかを示す項目モードＭを１にセットして「内容」であることを設定する。そして、再びＳ３に戻って同様の処理を繰り返す。 In S20, the structural style definition of the target paragraph is determined to be the “content” of the current hierarchical level L, and the structural style definition corresponding to that style name is read from the structural style definition 410 and applied to the paragraph. In addition, the item mode M, which indicates whether the previous paragraph is an item or content, is set to 1 to record that it is “content”, in line with the definition given for S2 (M = 1 for content). The process then returns to S3 and repeats.

以上の処理により、解析ルール１によって電子化文書１０は、複数のブロックに分割され、解析ルール２により各ブロック内の段落を解析し、段落内の階層構造を決定する。そして、項目と内容の判別を行って階層レベルに応じた構造スタイルを各段落毎に設定することで、目的とする構造化文書２０を自動的に得ることが可能となる。 Through the above processing, the digitized document 10 is divided into a plurality of blocks by the analysis rule 1, and the paragraphs in each block are analyzed by the analysis rule 2 to determine the hierarchical structure in the paragraphs. Then, by discriminating items and contents and setting a structural style corresponding to the hierarchical level for each paragraph, it is possible to automatically obtain the target structured document 20.
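
The item and content branches of the flow above (S11 to S15 and S16 to S20) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the document conversion program 30: the names `BlockState` and `apply_rule`, the use of plain strings such as `"N."` for leading character types, and the style name format `L<level> item` are all illustrative, and the “+0 content” branch (S7 to S10) is left out. The item mode follows the convention stated for S2 (M = 0 for item, M = 1 for content).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

ITEM, CONTENT = 0, 1  # item mode M as defined at S2 (assumed encoding)


@dataclass
class BlockState:
    """Per-block state; a fresh instance corresponds to the reset in S2."""
    level: int = 0                      # hierarchy level L
    mode: int = ITEM                    # item mode M
    prev_head: Optional[str] = None     # leading character type of the previous paragraph
    seen: Dict[str, int] = field(default_factory=dict)  # stand-in for storage unit 43


def apply_rule(state: BlockState, rule: str, head_type: str) -> str:
    """Handle a "+1 item" or "+1 content" paragraph and return its style name."""
    if rule not in ("+1 item", "+1 content"):
        raise ValueError("only the +1 item / +1 content branches are sketched")
    if head_type == state.prev_head:
        pass                                   # S11/S16: same type, keep the level
    elif head_type in state.seen:
        state.level = state.seen[head_type]    # S12->S14 / S17->S19: reuse the level
    else:
        state.level += 1                       # S13/S18: new type, one level deeper
    state.seen[head_type] = state.level
    state.prev_head = head_type
    kind = "item" if rule == "+1 item" else "content"
    state.mode = ITEM if kind == "item" else CONTENT   # S15/S20
    return f"L{state.level} {kind}"


# Hypothetical block: "1." then "(1)", "(2)", then "2."
state = BlockState()
styles = [apply_rule(state, "+1 item", t) for t in ["N.", "(N)", "(N)", "N."]]
print(styles)
```

Keeping the levels in a per-block dictionary is one simple way to model "a leading character type that has already occurred in the same block"; the actual structure of the applied hierarchy level storage unit 43 is not specified here.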

つまり、本発明では、解析テーブル３６０の基本ルール指定部３６１でスタイル設定ルール３６１１と文字列パターン３６１２の関係を定義しておくだけで、多種多様な文書を構造化文書に変換することができる。例えば、前記従来例で述べたように、印刷用組版体裁がメーカによって様々な電子化文書１０を構造化文書２０へ変換する場合でも、段落識別子を解析し、登録テーブル３７０で階層レベルを解析することにより、基本ルール指定３６１で定義された解析ルールに基づいて、構造スタイルを自動的に適用できるのである。 In other words, in the present invention a wide variety of documents can be converted into structured documents simply by defining the relationship between the style setting rules 3611 and the character string patterns 3612 in the basic rule specifying unit 361 of the analysis table 360. For example, as described for the conventional example, even when digitized documents 10 whose print typesetting style varies from manufacturer to manufacturer are converted into structured documents 20, the structural styles can be applied automatically, based on the analysis rules defined in the basic rule specification 361, by analyzing the paragraph identifiers and analyzing the hierarchical levels with the registration table 370.

例えば、前記従来例のように固定的なルールによって構造化文書２０へ変換する場合では、最初にどの文字列を、どの階層レベルに対応させるか、という定義を行う必要があり、一例として、項目について「１．」を階層レベルＬ＝１、「（１）」を階層レベルＬ＝２、「１）」を階層レベルＬ＝３、「・」を階層レベルＬ＝４と定義する。 For example, when converting to the structured document 20 with fixed rules as in the conventional example, it is first necessary to define which character string corresponds to which hierarchical level. As an example, for items, “1.” is defined as hierarchical level L = 1, “(1)” as L = 2, “1)” as L = 3, and “・” as L = 4.

この従来例では、図１４で示すように、項目が定義通りの文書では、定義に従って階層レベルを設定することができる。ところが、図１５で示すように、項目の出現順が異なる場合では、前記従来例の場合、出現順に係わらず定義通りの階層レベルを割り当てるため、体裁解析では階層レベルＬ＝２となる「１）文」が、定義に沿って階層レベルＬ＝３と判定される。このため、前記従来例では構造化文書２０を自動的に生成することはできず、人手による修正が必要となる。 In this conventional example, as shown in FIG. 14, the hierarchical levels can be set according to the definition when the items in the document follow the definition. However, when the items appear in a different order, as shown in FIG. 15, the conventional example assigns the defined level regardless of the order of appearance, so “1) sentence”, which the appearance analysis would place at hierarchical level L = 2, is judged to be at L = 3 according to the definition. For this reason, the conventional example cannot generate the structured document 20 automatically, and manual correction is required.

さらに、図１６のように、項目の文字列として定義されていない項目「[1]」については、前記従来例では階層レベルの判定を行うことができず、項目の文字列を新たに設定し直す必要がある。 Further, as shown in FIG. 16, for the item “[1]” that is not defined as the item character string, the hierarchical level cannot be determined in the conventional example, and the item character string is newly set. I need to fix it.

これに対して、本発明では、項目や内容の先頭文字列を決定しておくのではなく、先頭文字種（先頭文字識別子）と行末文字種（行末文字識別子）として段落内から抽出する登録条件（Ａ〜Ｅ）を決めておき、各段落内を解析して文字種毎に出現順で登録テーブル３７０へ記憶し、抽出した段落識別子と文字列のパターンから文字列パターンを決定する。そして、決定した文字列パターンと一致する基本ルール指定３６１のスタイル設定ルール３６１１を検索し、該当するスタイル設定ルール３６１１に対応する要素を適用する。 In contrast, the present invention does not fix the leading character strings of items and contents in advance. Instead, it defines registration conditions (A to E) for extracting a leading character type (leading character identifier) and an end-of-line character type (end-of-line character identifier) from within a paragraph, analyzes each paragraph, stores the results in the registration table 370 in order of appearance for each character type, and determines the character string pattern from the extracted paragraph identifiers and the pattern of the character string. It then searches for the style setting rule 3611 of the basic rule specification 361 that matches the determined character string pattern and applies the element corresponding to that rule.

これにより、本発明では、図１４〜図１６の何れの体裁スタイルの文書（電子化文書１０）についても、段落内で解析した文字種の位置と文字列の組み合わせを、予め設定したスタイル設定ルール３６１１に対応付けることで、項目の出現順に階層レベルを決定して、体裁スタイルに対応した構造スタイルを自動的に適用することができるのである。 As a result, according to the present invention, the style setting rule 3611 in which the position of the character type analyzed in the paragraph and the combination of the character string are set in advance for any of the style-style documents (digitized documents 10) in FIGS. By associating with each other, it is possible to determine the hierarchical level in the order of appearance of the items and automatically apply the structural style corresponding to the appearance style.
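
The contrast between the fixed definition of the conventional example and level assignment in order of appearance can be illustrated with a short Python sketch. It is a simplification under stated assumptions: `marker_type`, `fixed_levels`, and `appearance_levels` are hypothetical names, a leading character type is approximated by replacing digits in the marker with `N`, and the registration conditions A to E and the registration table 370 are not modeled.

```python
import re
from typing import Dict, List, Optional

# Fixed definition from the conventional example: "1." -> 1, "(1)" -> 2, "1)" -> 3, "・" -> 4
FIXED_LEVELS: Dict[str, int] = {"N.": 1, "(N)": 2, "N)": 3, "・": 4}


def marker_type(marker: str) -> str:
    """Approximate a leading character type by masking digits (assumption)."""
    return re.sub(r"\d+", "N", marker)


def fixed_levels(markers: List[str]) -> List[Optional[int]]:
    """Conventional approach: look up each marker type; None if it is undefined."""
    return [FIXED_LEVELS.get(marker_type(m)) for m in markers]


def appearance_levels(markers: List[str]) -> List[int]:
    """Order-of-appearance approach: a new type goes one level deeper,
    a type seen earlier in the block returns to its recorded level."""
    seen: Dict[str, int] = {}
    level = 0
    out: List[int] = []
    for m in markers:
        t = marker_type(m)
        if t in seen:
            level = seen[t]
        else:
            level += 1
            seen[t] = level
        out.append(level)
    return out


doc = ["1.", "1)", "2)", "(1)", "2."]   # "1)" appears before "(1)", as in the FIG. 15 case
print(fixed_levels(doc))        # fixed definition puts "1)" at level 3
print(appearance_levels(doc))   # order of appearance puts "1)" at level 2
print(fixed_levels(["[1]"]))    # an undefined marker cannot be judged
```

In this sketch the undefined marker `[1]` simply receives the next deeper level under the order-of-appearance approach, whereas the fixed table has no entry for it.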

＜例外ルール＞ <Exception rules>
上記解析ルール２の階層レベル判定は同じ文書群内の文書であっても、全ての文書に対して妥当とは言えない場合が生じる。これは構造スタイルが文書の論理的な構造や構成のみでなく、意味的な構造や構成にも由来して定義可能なことと関連し、執筆者の意図や記述の意味的内容によって、項目／内容の識別や階層レベル判定を基本解析ルールとは異ならせたいことがあり得るためである。そこで基本解析ルールに対して例外ルールが必要となる。 The hierarchy level determination of analysis rule 2 above may not be appropriate for every document, even among documents in the same document group. This is related to the fact that a structural style can be defined not only from the logical structure and organization of a document but also from its semantic structure and organization: depending on the author's intention and the semantic content of the description, it may be desirable for the item/content identification or the hierarchy level determination to differ from the basic analysis rules. Exception rules to the basic analysis rules are therefore required.

例えば、図１７の対象文書（電子化文書１０）に対して、上記基本解析ルールを用いた場合に適用する構造スタイルは図１８となる。ここで望む構造化文書２０の結果は図１９に示すもので、「使用注意」ブロックの第一段落「使用注意(次の・・・)」は内容ではなく、一階層下のＬ１項目としたい。第二段落から第五段落まではＬ１項目となるが、これはＬ１内容としたい。「重要な基本的注意」ブロックの第一から第三段落はＬ１項目となるが、これはＬ１内容としたい。 For example, FIG. 18 shows the structural styles applied to the target document (digitized document 10) of FIG. 17 when the basic analysis rules above are used. The desired result for the structured document 20 is the one shown in FIG. 19: the first paragraph of the “use caution” block, “use caution (next ...)”, should be an L1 item one level down rather than content. The second to fifth paragraphs come out as L1 items but should be L1 content. Likewise, the first to third paragraphs of the “important basic caution” block come out as L1 items but should be L1 content.

こうした項目／内容の識別や階層レベル判定は例外ルールを作成して対応する必要が生じる。ただし、こうした例外ルールは当該文書の「使用注意」、「重要な基本的注意」のブロックに対してのみ有効で、他の文書、他のブロックは基本解析ルールを用いても何ら問題はない。 Such item / content identification and hierarchy level determination need to be dealt with by creating an exception rule. However, these exception rules are effective only for the “use caution” and “important basic caution” blocks of the document, and there is no problem even if the basic analysis rules are used for other documents and other blocks.

従って、この様な事例は対象文書単位や対象ブロック単位に、基本解析ルールを適用するか、例外ルールを適用するかを選択的に設定可能とすることで解決できる。 Therefore, such a case can be solved by selectively setting whether the basic analysis rule or the exception rule is applied to the target document unit or the target block unit.

＜例外ルールの実装例＞ <Example of exception rule implementation>
まず、図２０のように抽出したブロック毎に個々の解析ルールを適用可能とするために、各解析ルールにユニークな名称を与える。例えば、文書内のブロックが８個の場合、各ブロックに対応する基本解析ルールの名称をＢ１〜Ｂ８と命名する。そして基本解析ルールＢ２に対する例外ルールをＢ２１（図２１の（Ａ））、例外ルールが複数存在する場合には図２１（Ｂ）のようにＢ２２、Ｂ２３、・・・と命名する。 First, so that an individual analysis rule can be applied to each extracted block as in FIG. 20, each analysis rule is given a unique name. For example, when a document contains eight blocks, the basic analysis rules corresponding to the blocks are named B1 to B8. The exception rule for basic analysis rule B2 is named B21 (FIG. 21(A)), and when there are several exception rules they are named B22, B23, ... as shown in FIG. 21(B).

次に、図２２に示す様に対象文書単位にどのブロックにどの例外ルールを採用するかをテーブル上で設定する。このテーブル上の設定内容をページルールと呼称する。そして、ページルールにもユニークな名称を与える。なお、ページルールを格納するテーブルは、例えば、スタイル設定ルール記憶部３６に格納しても良いし、基本解析ルールＤＢに格納してもよい。 Next, as shown in FIG. 22, which exception rule is adopted for which block is set on the table for each target document. The setting contents on this table are called page rules. A unique name is given to the page rule. The table for storing the page rules may be stored in the style setting rule storage unit 36 or in the basic analysis rule DB, for example.

よって、図２３の通り対象文書毎に適用ページルールを選択・決定することで、基本解析ルールと複数の例外ルールを適宜使い分けて、所望の構造スタイルを自動的に適用することができる。 Therefore, by selecting and determining the application page rule for each target document as shown in FIG. 23, the desired structural style can be automatically applied by appropriately using the basic analysis rule and the plurality of exception rules.
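
The page rule table can be pictured as a per-document mapping from blocks to rule names. The Python sketch below is only an illustration of that idea: the page rule names `P1` to `P3`, the dictionary layout, and the function `rules_for` are assumptions, while the rule names B1 to B8 and B21, B22 follow the naming described above.

```python
from typing import Dict, List

# Basic analysis rules for a document with eight blocks
BASIC_RULES: List[str] = [f"B{i}" for i in range(1, 9)]

# Each page rule lists only the blocks whose basic rule is replaced by an
# exception rule; the page rule names P1..P3 are hypothetical.
PAGE_RULES: Dict[str, Dict[str, str]] = {
    "P1": {},               # basic analysis rules everywhere
    "P2": {"B2": "B21"},    # block 2 uses exception rule B21
    "P3": {"B2": "B22"},    # block 2 uses a different exception rule
}


def rules_for(page_rule: str) -> List[str]:
    """Return the rule name applied to each block under the chosen page rule."""
    overrides = PAGE_RULES[page_rule]
    return [overrides.get(name, name) for name in BASIC_RULES]


print(rules_for("P2"))
```

Selecting a page rule per target document then amounts to choosing one key of `PAGE_RULES`; blocks without an override fall back to their basic analysis rule.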

上記のように複数のページルールを適用する例としては、次のような例があげられる。 Examples of applying a plurality of page rules as described above include the following examples.

＜応用例１＞ <Application example 1>
構造スタイルを手操作で適用した文書を読み込み、既存の階層レベル判定解析ルールと比較し、不一致内容を抽出する方法への応用が可能である。 The invention can be applied to a method that reads a document to which structural styles have been applied manually, compares it with the existing hierarchy level determination analysis rules, and extracts the mismatches.

例外ルールは、前述の様に基本解析ルールによる構造スタイルの適用結果と、適用が望まれる構造スタイルを比較することによって抽出するが、この比較作業を容易にしようとするものである。 As described above, exception rules are extracted by comparing the structural styles applied by the basic analysis rules with the structural styles that are actually desired; this application example is intended to make that comparison easier.

例外ルール抽出方法 Exception rule extraction method
文書変換プログラム３０を、文書作成編集プログラム５０のアドインソフトウェアとして組み込んだ場合について、説明する。この例では、文書作成編集プログラム５０が、本発明の文書変換プログラム３０を内包する形となる。 The case where the document conversion program 30 is incorporated as add-in software of the document creation/editing program 50 is described. In this example, the document creation/editing program 50 contains the document conversion program 30 of the present invention.

文書作成手段で対象文書（電子化文書１０）を開き、体裁スタイルに替えて構造スタイルを適用する操作を手操作で実行する。こうして作成した文書を、予め基本解析ルールを登録してある文書作成手段（文書作成編集プログラム５０）で開き、基本解析ルールを適用する。ここでこの文書作成手段には、例えばプラグインソフトなどで、ブロック下の各段落に適用されている構造スタイルと、基本解析ルールにより適用が決定された構造スタイルが、同じ場合は何もすることなく、異なる場合は当該段落を特定色に変更するなど、当該段落を明示できるように変更する機能を組み込んでおくものとする。 The target document (digitized document 10) is opened with the document creation means, and the operation of applying structural styles in place of appearance styles is performed manually. The document created in this way is then opened with a document creation means (document creation/editing program 50) in which the basic analysis rules have been registered in advance, and the basic analysis rules are applied. This document creation means is assumed to incorporate, for example as plug-in software, a function that compares the structural style already applied to each paragraph under a block with the structural style determined by the basic analysis rules; it does nothing when the two are the same and, when they differ, marks the paragraph so that it stands out, for example by changing it to a specific color.

これにより基本解析ルールを適用できない段落、表示色が異なる段落を目視・抽出し、この段落に適用されている構造スタイルを解析することで、この例外ルールを対応させる基本解析ルールと、項目／内容の別、階層レベルを知ることができ、これを例外ルールとして基本解析ルールＤＢへ登録する。 Paragraphs to which the basic analysis rules cannot be applied, that is, paragraphs shown in a different color, can then be found by eye and extracted. Analyzing the structural style applied to such a paragraph reveals the basic analysis rule to which the exception rule should correspond, whether the paragraph is an item or content, and its hierarchical level; this is registered in the basic analysis rule DB as an exception rule.

＜応用例２＞ <Application example 2>
また、他の応用例としては以下のようなものが挙げられる。 Other application examples include the following.

構造スタイルを手操作で適用した文書を読み込み、既存の階層レベル判定解析ルールと比較し、不一致内容を抽出し、例外ルールとして自動登録する方法への応用が可能である。 It can be applied to a method in which a document to which a structural style is manually applied is read, compared with existing hierarchy level determination analysis rules, mismatched contents are extracted, and automatically registered as an exception rule.

例外ルール自動生成・登録方法 Automatic exception rule generation and registration method
まず、対象文書に対する構造スタイル適用を基本解析ルールによってのみ実行する。その結果を文書作成手段（文書作成編集プログラム５０）でディスプレイ装置８に表示・確認し、適用された構造スタイルでは不都合な箇所を抽出し、所望の構造スタイルを手操作で適用する。ここで新たに適用した構造スタイルの項目／内容の別と階層レベルを例外ルールとすれば良いことになる。 First, structural styles are applied to the target document using only the basic analysis rules. The result is displayed and checked on the display device 8 with the document creation means (document creation/editing program 50), places where the applied structural style is unsuitable are picked out, and the desired structural style is applied manually. The item/content distinction and the hierarchical level of the newly applied structural style can then be taken as the exception rule.

文書作成手段が校正履歴機能を持っている場合は、この校正履歴から旧適用スタイルと新適用スタイルを取得し、例外ルールを対応させる基本解析ルールと、項目／内容の別と階層レベルを知ることができ、自動的に例外ルール生成ができる。校正履歴機能がない場合は、上記情報を分析し、手操作で基本解析ルールＤＢに例外ルールとして登録する。 If the document creation means has a proofreading history function, the previously applied style and the newly applied style can be obtained from the proofreading history. This reveals the basic analysis rule to which the exception rule should correspond, the item/content distinction, and the hierarchical level, so the exception rule can be generated automatically. If there is no proofreading history function, the above information is analyzed and registered manually in the basic analysis rule DB as an exception rule.
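
The comparison behind application examples 1 and 2 can be sketched as a diff between the automatically applied style and the manually chosen style of each paragraph. This Python sketch is illustrative only: the `Paragraph` record, the function `extract_exceptions`, and the sample block names, patterns, and styles are made up for the example, and how the old and new styles are actually read from a proofreading history depends on the document creation means.

```python
from typing import Dict, List, NamedTuple, Tuple


class Paragraph(NamedTuple):
    block: str          # block (heading) the paragraph belongs to
    pattern: str        # character string pattern of the paragraph
    auto_style: str     # style applied by the basic analysis rules
    manual_style: str   # style chosen by the operator


def extract_exceptions(paragraphs: List[Paragraph]) -> Dict[Tuple[str, str], str]:
    """Collect (block, pattern) -> desired style for every mismatching paragraph."""
    exceptions: Dict[Tuple[str, str], str] = {}
    for p in paragraphs:
        if p.auto_style != p.manual_style:
            exceptions[(p.block, p.pattern)] = p.manual_style
    return exceptions


# Made-up sample data for illustration
sample = [
    Paragraph("use caution", "・文。", "L1 item", "L1 content"),
    Paragraph("use caution", "文。", "L1 content", "L1 content"),
    Paragraph("dosage", "・文。", "L1 item", "L1 item"),
]
candidates = extract_exceptions(sample)
print(candidates)
```

Each entry in `candidates` is a candidate exception rule for one block and one character string pattern; registering it in the basic analysis rule DB is outside the scope of this sketch.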

＜解析ルール３＞ <Analysis rule 3>
上記解析ルール１、２及び例外解析ルールに加えて、文書作成手段上で対象文書の体裁スタイル、組版体裁、段落を構成する文言や文字などの解析過程において、当該段落および当該段落の前に位置する段落内の文字位置と構成文字種に着目し、当該段落の文書構造上の階層レベル判定を行い文書構造上の該当要素を決定する解析ルールについて、以下に説明する。なお、この場合も、文書変換プログラム３０を、文書作成編集プログラム５０のアドインソフトウェアとして組み込んだものとして説明する。 In addition to analysis rules 1 and 2 and the exception analysis rules above, the following describes an analysis rule that, in the course of analyzing the appearance style, typesetting style, and the wording and characters making up the paragraphs of the target document on the document creation means, focuses on the character positions and constituent character types in the current paragraph and in the paragraph preceding it, determines the hierarchical level of the current paragraph in the document structure, and decides the corresponding element of the document structure. Here too, the document conversion program 30 is described as being incorporated as add-in software of the document creation/editing program 50.

まず、上記基本解析ルールを用い、且つ当該段落および当該段落の前に位置する段落内の文字位置と構成文字種に着目し、当該段落の項目／内容の識別結果と階層レベル判定をテーブル化する。 First, using the above basic analysis rule, paying attention to the character position in the paragraph and the paragraph located before the paragraph and the constituent character type, the identification result of the item / content of the paragraph and the hierarchy level determination are tabulated.

階層レベル判定のテーブルは、図２４で示すように、前段落と当該段落の二次元表（ルールテーブル）とし、その交差位置に当該段落の項目／内容識別結果と階層レベル判定を掲載してある。このテーブルで「SAME」は当該段落の項目／内容識別結果と階層レベルを前段落の階層レベルと同じにすることを意味する。 As shown in FIG. 24, the hierarchy level determination table is a two-dimensional table (rule table) of the previous paragraph and the paragraph, and the item / content identification result and the hierarchy level determination of the paragraph are placed at the intersection. . In this table, “SAME” means that the item / content identification result and the hierarchy level of the paragraph are the same as the hierarchy level of the previous paragraph.

例えば、前段落が「見出し」、当該段落が「文。」の文字列パターンと識別できた場合、図２４の該当交差位置には「＋０内容」と定義されており、「文。」は「＋０内容」と判定することを示している。これは当然のことながら、先に説明した図８の解析テーブル３６０から得られる結果と同じ結果となる。 For example, if the preceding paragraph can be identified as a character string pattern of “Heading” and the paragraph is “Sentence.”, “+0 content” is defined at the corresponding intersection position in FIG. +0 content ”. Naturally, this is the same result as obtained from the analysis table 360 of FIG. 8 described above.

一方、基本ルールでは、前段落が「見出し」、当該段落が「文」の時には「＋０内容」と判定すると定義してあるが、これを「＋１項目」と判定する例外ルールを登録したい。 On the other hand, the basic rule defines the result as “+0 content” when the previous paragraph is “heading” and the current paragraph is “sentence”, but one may wish to register an exception rule that judges this case as “+1 item”.
あるいは、基本ルールでは、前段落が「見出し」、当該段落が「・文。」の時には「＋１項目」と判定すると定義してあるが、これを「＋１内容」と判定する例外ルールを登録したいなど、例外ルールへの対応も要求される。 Alternatively, the basic rule defines the result as “+1 item” when the previous paragraph is “heading” and the current paragraph is “・sentence.”, but one may wish to register an exception rule that judges this case as “+1 content”. Support for such exception rules is therefore also required.

図２５、図２６は上記「例外ルール」として説明したような例外への対応を示したもので、例外ルールを適用するための変更箇所を網掛けで示して図２５のように複数の例外ルールへも対応できる。いずれも特定のブロックに関する例外ルールへの対応例である。 FIGS. 25 and 26 show how exceptions such as those described above under “Exception rules” are handled. The cells changed in order to apply an exception rule are shaded, and multiple exception rules can also be handled, as in FIG. 25. Both figures are examples of handling exception rules for a specific block.

先の「例外ルール」方式では、ブロック内で発生した文と段落識別子の全ての文字列パターンに対して例外ルールを自由に設定できるが、同一ブロック内では同一文字列パターンに対して、一つの例外ルールのみを適用するという特徴がある。 In the previous “exception rule” method, exception rules can be freely set for all character string patterns of sentence and paragraph identifiers generated in the block. There is a feature that only exception rules are applied.

一方、解析ルール３では、前段落に依存するという限定があるが、同一ブロック内であっても、同一文字列パターンに対して異なる例外ルールが設定できるという特徴がある。また、同一文字列パターンであるが、前段落の文字列パターンによってルールの変更ができるという利点もある。 Analysis rule 3, on the other hand, is restricted in that it depends on the previous paragraph, but it has the feature that different exception rules can be set for the same character string pattern even within the same block. It also has the advantage that, for one and the same character string pattern, the rule can be changed according to the character string pattern of the previous paragraph.
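
The two-dimensional rule table of analysis rule 3 can be modeled as a lookup keyed by the pair (previous paragraph pattern, current paragraph pattern), with exception entries overriding individual cells. The Python sketch below is a hedged illustration: only the three cells mentioned in the text are filled in, the remaining contents of FIG. 24 are not reproduced, `"SAME"` handling is omitted, and the names `BASIC_TABLE` and `lookup` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]  # (previous paragraph pattern, current paragraph pattern)

# Only the cells cited in the text; the rest of the table is not modeled.
BASIC_TABLE: Dict[Key, str] = {
    ("heading", "文。"): "+0 content",
    ("heading", "文"): "+0 content",
    ("heading", "・文。"): "+1 item",
}


def lookup(prev: str, cur: str,
           exceptions: Optional[Dict[Key, str]] = None) -> Optional[str]:
    """Return the style setting rule for a cell, preferring an exception entry."""
    key = (prev, cur)
    if exceptions is not None and key in exceptions:
        return exceptions[key]
    return BASIC_TABLE.get(key)


# Exception for one block: "文" after a heading becomes "+1 item",
# while "文" after "・文。" becomes "+1 content".
block_exceptions = {
    ("heading", "文"): "+1 item",
    ("・文。", "文"): "+1 content",
}
print(lookup("heading", "文"))
print(lookup("heading", "文", block_exceptions))
print(lookup("・文。", "文", block_exceptions))
```

Because the key includes the previous paragraph's pattern, the same pattern `文` can map to different rules within one block, which is the property the text attributes to analysis rule 3.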

＜応用例３＞ <Application example 3>
ここでは、電子化文書１０に構造スタイルを自動的に適用した構造化文書２０へ変換した結果を、ユーザの計算機上で表示するものである。 In this example, the result of converting the digitized document 10 into the structured document 20, with structural styles applied automatically, is displayed on the user's computer.

１．文書作成手段上で構造スタイルを自動適用した結果、文書構造が可視化できるように、構造を可視的な体裁に還元する体裁情報（文字サイズ、文字色、書体など）を含むように構造スタイルを定義する。 1. The structural styles are defined to include appearance information (character size, character color, typeface, etc.) that reduces the structure to a visible appearance, so that the document structure can be visualized once the structural styles have been applied automatically on the document creation means.

２．文書作成手段上で、手操作で構造スタイルを適用する作業、構造スタイル適用結果を目視確認する作業、解析ルールの抽出作業などを行う場合、文書構造が明示されていれば作業の助けとなる。 2. When performing a task of manually applying a structural style, a task of visually confirming the result of applying the structural style, a task of extracting an analysis rule, etc. on the document creation means, it is helpful if the document structure is clearly specified.

以上のような２つの観点から、ＸＭＬやＳＧＭＬ、或いはＤＴＤそのものを扱うツール類では文書構造を明示する目的で、表示画面を分割して木構造と対応するテキストを並べて表示したり、テキストをクリックすると要素名や属性値を表示したりするように工夫されていることは周知の通りである。 From the two viewpoints above, it is well known that tools handling XML, SGML, or DTDs themselves are designed to make the document structure explicit, for example by splitting the display screen to show the tree structure and the corresponding text side by side, or by displaying element names and attribute values when the text is clicked.

しかし、こうした従来の方法では文書構造とテキストの対応関係を視覚的に直結して認識することはできず、対応関係を思考上で形成しなければならない。 With these conventional methods, however, the correspondence between the document structure and the text cannot be recognized directly by sight; the reader has to construct the correspondence mentally.

一方、本発明における構造スタイルの適用は、これをタグ形式で出力する際に、その構造スタイル名を利用することに目的がある。 On the other hand, the application of the structural style in the present invention is intended to use the structural style name when outputting the structural style in the tag format.

従って、対象とする段落に所定の構造スタイルを適用した結果として、当該段落が文書作成手段上でどの様な組版体裁で表示されるようにスタイルの内容を定義するかは、ＸＭＬやＳＧＭＬデータを変換生成すること自体には関与しない。スタイル内容は自由に定義することが可能である。 Therefore, how the content of a style is defined, that is, in what typesetting style a paragraph is displayed on the document creation means as a result of applying a given structural style to it, has no bearing on the conversion and generation of XML or SGML data itself. The style content can be defined freely.

そこで、文書作成手段上で構造スタイル適用結果を目視確認する便宜から、各要素や各階層関係などに対応して、組版体裁情報（文字サイズ、文字色、書体など）を決めておけば、構造スタイル適用後には、文書構造を可視的な組版体裁として表現でき、文書構造を可視化することができる。 Therefore, for the convenience of visually checking the result of applying the structural style on the document creation means, if the typesetting style information (character size, character color, typeface, etc.) is determined corresponding to each element and each hierarchical relationship, the structure After applying the style, the document structure can be expressed as a visible typesetting style, and the document structure can be visualized.

図２７は、前記図３の電子化文書１０に構造スタイルを適用した結果を文書作成手段（文書作成編集プログラム５０）でディスプレイ装置８に表示した場合の従来例を示す。 FIG. 27 shows a conventional example when the result of applying the structural style to the digitized document 10 of FIG. 3 is displayed on the display device 8 by the document creation means (document creation / edit program 50).

オペレータなどは、この画面を見ただけでは、即座に文書構造を認識することはできず、それぞれ図中左欄に示す各段落に適用した構造スタイルと対比しながら見る必要がある。これでは自動生成結果の確認作業や解析ルールの抽出作業には極めて不便である。 An operator or the like cannot immediately recognize the document structure just by looking at this screen, and must compare it with the structural style applied to each paragraph shown in the left column of the figure. This is extremely inconvenient for checking automatically generated results and extracting analysis rules.

これに対して、本発明による構造化文書２０の表示は図２８のようになる。図２８は、前記図３の電子化文書１０に構造スタイルを適用した結果（構造化文書２０）を文書作成手段（文書作成編集プログラム５０）でディスプレイ装置８に表示した本発明の一例を示す。 In contrast, the display of the structured document 20 according to the present invention is as shown in FIG. FIG. 28 shows an example of the present invention in which the result of applying the structural style (structured document 20) to the digitized document 10 of FIG. 3 is displayed on the display device 8 by the document creation means (document creation editing program 50).

各段落に適用する構造スタイルとして要素別にフォント、文字サイズ、文字色、インデント量などを違えて設定しておくことにより、それぞれ左欄に示す各段落に適用した構造スタイルと対比しながら見ることなく、画面上の文書自体を見ることによって文書構造を認識することが可能となる。 By setting different fonts, font sizes, font colors, indentation, etc. for each element as the structural style applied to each paragraph, you can compare them with the structural styles applied to each paragraph shown in the left column. The document structure can be recognized by looking at the document itself on the screen.

文書構造を可視的な組版体裁として認識できるため、自動生成結果の確認作業や解析ルールの抽出作業において生産性を向上させることができる。文書変換プログラム３０のＸＭＬ／ＳＧＭＬ自動変換生成においては、文書作成手段のスタイル設定機能による「スタイル名」のみ利用し、スタイル設定機能による組版体裁はＸＭＬ／ＳＧＭＬ自動変換生成自体には関与しない。このスタイル設定機能による組版体裁を、文書構造を可視化する手段として利用するのである。 Since the document structure can be recognized as a visual typesetting style, productivity can be improved in the confirmation work of the automatically generated result and the analysis rule extraction work. In the XML / SGML automatic conversion generation of the document conversion program 30, only the “style name” by the style setting function of the document creation means is used, and the typesetting style by the style setting function is not involved in the XML / SGML automatic conversion generation itself. The typesetting style by the style setting function is used as a means for visualizing the document structure.

ここで留意すべきは、印刷やＰＤＦ出力を目的とする場合に与える組版体裁指定と、文書構造を可視化するための組版体裁指定は全く異なり、従来例では、文書構造の可視化を目的に組版体裁指定を与える作業は、ＤＴＤを正確に理解した者が手操作で実施しなければならない。 It should be noted here that the typesetting style designation given for printing or PDF output is entirely different from the typesetting style designation used to visualize the document structure. In the conventional example, the work of assigning typesetting style designations for the purpose of visualizing the document structure must be performed manually by a person who understands the DTD accurately.

これに対して本発明では、文書構造を可視化するための組版体裁指定が、構造スタイルの自動適用時に同時に行われるという特徴があり、人手による作業をなくして、効率よく構造スタイルを可視化することが可能となる。 In contrast, the present invention is characterized in that the typesetting style designation for visualizing the document structure is made at the same time as the structural styles are automatically applied, so that the structural styles can be visualized efficiently without manual work.

なお、ディスプレイ装置８で表示する際には、図２８のようにすることなく、構造スタイルを可視化したテキストのみを表示するだけでも、構造スタイルとテキストの対応関係は視覚的に直結して表現される。これにより、図２７に示した従来例のように、構造スタイルと文書構造の対応関係を思考する必要がなく、前述のような文書作成手段上での諸作業を効率よく行うことが可能となる。 When displaying on the display device 8, the layout of FIG. 28 is not required: even if only the text with the visualized structural styles is displayed, the correspondence between the structural styles and the text is expressed in a directly visual way. Thus, unlike the conventional example shown in FIG. 27, there is no need to work out the correspondence between the structural styles and the document structure, and the various operations on the document creation means described above can be performed efficiently.

なお、上記実施形態において、入力した電子化文書１０を所定のタグを付した構造化文書２０に変換する文書変換プログラム３０は、文書変換プログラム３０と並列的にメモリ上に格納しても良いし、上述のように文書変換プログラム３０を文書作成編集プログラム５０のアドインソフトウェアとして実行するようにしてもよい。あるいは、文書変換プログラム３０を他のサーバ上で実行し、クライアントから受信した電子化文書１０を、所定の解析ルールにより構造化文書２０へ変換しても良い。 In the above embodiment, the document conversion program 30 that converts the input digitized document 10 into the structured document 20 with a predetermined tag may be stored in the memory in parallel with the document conversion program 30. As described above, the document conversion program 30 may be executed as add-in software of the document creation / editing program 50. Alternatively, the document conversion program 30 may be executed on another server, and the digitized document 10 received from the client may be converted into the structured document 20 according to a predetermined analysis rule.

また、上記実施形態では文書作成編集プログラム５０で出力した電子化文書１０を入力文書とする一例を示したが、ＯＣＲにより認識した電子化文書を入力することもできる。 In the above embodiment, an example in which the digitized document 10 output by the document creation / editing program 50 is used as an input document is shown. However, a digitized document recognized by OCR can also be input.

また、図７に示した構造スタイルを得るために図５に示した構造スタイルと、図９のスタイル設定ルールを設定した例を示したが、変換する文書の構造スタイルに応じて、構造スタイルやスタイル設定ルールは適宜変更することができる。 In addition, an example has been shown in which the structural styles shown in FIG. 5 and the style setting rules of FIG. 9 are set in order to obtain the structural styles shown in FIG. 7. However, the structural styles and the style setting rules can be changed as appropriate according to the structural styles of the document to be converted.

以上のように、本発明によれば、多様な組版体裁を含む電子化文書を自動的に所定の構造化文書へ変換する組版装置や、文書変換プログラムに適用することができる。 As described above, the present invention can be applied to typesetting apparatuses and document conversion programs that automatically convert digitized documents containing various typesetting styles into predetermined structured documents.

FIG. 1: 本発明の実施形態を示す計算機のシステム構成を示すブロック図。 Block diagram showing the system configuration of a computer according to an embodiment of the present invention.
FIG. 2: 同じくスタイル設定ルール記憶部の構成を示すブロック図。 Block diagram likewise showing the configuration of the style setting rule storage unit.
FIG. 3: 電子化文書の一例を示す説明図。 Explanatory diagram showing an example of a digitized document.
FIG. 4: 電子化文書の構造を示す説明図。 Explanatory diagram showing the structure of the digitized document.
FIG. 5: 文書構造上の要素とスタイル名の対応関係を示す構造スタイル定義４１０の説明図。 Explanatory diagram of the structural style definition 410, showing the correspondence between elements of the document structure and style names.
FIG. 6: 文書変換プログラムで変換した構造化文書の一例を示し、適用した設定スタイル名と文書内容の関係を示す説明図。 Explanatory diagram showing an example of a structured document converted by the document conversion program, and the relationship between the applied setting style names and the document content.
FIG. 7: 文書変換プログラムで変換した構造化文書を所定のＸＭＬ／ＳＧＭＬ文書に変換した一例を示す説明図。 Explanatory diagram showing an example in which the structured document converted by the document conversion program is converted into a predetermined XML/SGML document.
FIG. 8: 段落内を解析して段落識別子を登録した解析テーブルの一例を示す説明図。 Explanatory diagram showing an example of an analysis table in which paragraph identifiers obtained by analyzing the inside of paragraphs are registered.
FIG. 9: スタイル設定ルールの一例を示す説明図。 Explanatory diagram showing an example of the style setting rules.
FIG. 10: 文書変換プログラムで変換する電子化文書の項目の一例を示す説明図。 Explanatory diagram showing an example of the items of a digitized document to be converted by the document conversion program.
FIG. 11: スタイル設定テーブルの一例を示す説明図で、段落毎の設定スタイル名と文字列パターンの関係を示す。 Explanatory diagram showing an example of the style setting table, and the relationship between the setting style name and the character string pattern for each paragraph.
FIG. 12: 解析後の登録テーブルの内容を示す説明図。 Explanatory diagram showing the contents of the registration table after analysis.
FIG. 13: 解析処理の一例を示すフローチャート。 Flowchart showing an example of the analysis processing.
FIG. 14: 本発明と従来例の対比を示し、項目に対して設定した階層レベルの関係を示す。 Comparison between the present invention and a conventional example, showing the relationship of the hierarchical levels set for the items.
FIG. 15: 本発明と従来例の他の対比を示し、項目に対して設定した階層レベルの関係を示す。 Another comparison between the present invention and a conventional example, showing the relationship of the hierarchical levels set for the items.
FIG. 16: 本発明と従来例のさらに他の対比を示し、項目に対して設定した階層レベルの関係を示す。 Still another comparison between the present invention and a conventional example, showing the relationship of the hierarchical levels set for the items.
FIG. 17: 例外ルールを適用する電子化文書の一例を示し、段落内容と文字列パターンの関係を示す説明図。 Explanatory diagram showing an example of a digitized document to which exception rules are applied, and the relationship between paragraph content and character string patterns.
FIG. 18: 図１７の電子化文書に対して基本解析ルールを適用した場合の設定スタイル名の一例を示す説明図。 Explanatory diagram showing an example of the setting style names obtained when the basic analysis rules are applied to the digitized document of FIG. 17.
FIG. 19: 図１７の電子化文書に対して、目的とする設定スタイル名と文書内容の関係を示す説明図。 Explanatory diagram showing the relationship between the intended setting style names and the document content for the digitized document of FIG. 17.
FIG. 20: 文書内のブロック毎に適用するブロックルールの一例を示す説明図。 Explanatory diagram showing an example of the block rules applied to each block in a document.
FIG. 21: ブロック毎の例外ルールの一例を示す説明図で、（Ａ）は基本解析ルールＢ２の例外ルール、（Ｂ）は同じく基本解析ルールＢ２の他の例外ルール、（Ｃ）は基本解析ルールＢ５の例外ルールを示す。 Explanatory diagram showing examples of exception rules for each block: (A) is an exception rule for basic analysis rule B2, (B) is another exception rule for basic analysis rule B2, and (C) is an exception rule for basic analysis rule B5.
FIG. 22: ブロックに適用する例外ルールを設定したテーブルの説明図。 Explanatory diagram of a table in which the exception rules applied to blocks are set.
FIG. 23: 対象文書毎に適用ページルールを設定したテーブルの一例を示す説明図。 Explanatory diagram showing an example of a table in which the applicable page rules are set for each target document.
FIG. 24: 二次元ルールテーブルの一例を示す説明図。 Explanatory diagram showing an example of a two-dimensional rule table.
FIG. 25: 二次元ルールテーブルの一例を示し、例外ルールを適用したブロックの説明図。 Explanatory diagram showing an example of the two-dimensional rule table, for a block to which an exception rule is applied.
FIG. 26: 二次元ルールテーブルの一例を示し、他の例外ルールを適用したブロックの説明図。 Explanatory diagram showing an example of the two-dimensional rule table, for a block to which another exception rule is applied.
FIG. 27: 図３の電子化文書に構造スタイルを適用した結果を従来例により表示した場合の画面イメージ。 Screen image of the result of applying structural styles to the digitized document of FIG. 3, displayed according to the conventional example.
FIG. 28: 図３の電子化文書に構造スタイルを適用した結果を本発明により表示した場合の画面イメージ。 Screen image of the result of applying structural styles to the digitized document of FIG. 3, displayed according to the present invention.

符号の説明 Explanation of symbols

１ 計算機 Computer
２ コントローラ Controller
３ メモリ Memory
４ インターフェース部 Interface unit
５ ディスク装置 Disk device
８ ディスプレイ装置 Display device
３０ 文書変換プログラム Document conversion program
３１ 文書読み込み部 Document reading unit
３２ ブロック抽出部 Block extraction unit
３４ ブロック内段落解析部 In-block paragraph analysis unit
３７ 適用スタイル解析部 Applied style analysis unit

Claims

電子化文書を読み込んで、前記電子化文書中の体裁スタイルを所定の構造スタイルに変換する電子化文書の変換方法であって、
前記電子化文書を読み込む手順と、
前記電子化文書について所定のスキーマの文書構造の要素に対応する構造スタイルを予め設定した構造スタイル定義を読み込む手順と、
前記電子化文書に対して適用する予め設定された基本解析ルールを読み込む手順と、
前記読み込んだ電子化文書の文字列のみから前記基本解析ルールに基づいて段落を抽出し、抽出した段落について文書構造上の前記要素を決定する手順と、
前記段落内の文字位置と構成文字種から前記基本解析ルールに基づいて当該段落内の文書構造上の階層レベルを判定し、前記判定した階層レベルに応じた文書構造上の前記要素を決定する手順と、
前記決定した段落の要素と、前記決定した段落内の要素について、前記構造スタイル定義に設定された構造スタイルを体裁スタイルに代えてそれぞれ適用する手順と、
を含むことを特徴とする電子化文書の変換方法。 1. A method for converting a digitized document, in which a digitized document is read and the appearance styles in the digitized document are converted into predetermined structural styles, the method comprising:
a step of reading the digitized document;
a step of reading a structural style definition in which structural styles corresponding to elements of the document structure of a predetermined schema are preset for the digitized document;
a step of reading a preset basic analysis rule to be applied to the digitized document;
a step of extracting paragraphs, based on the basic analysis rule, solely from the character strings of the read digitized document, and determining the element on the document structure for each extracted paragraph;
a step of judging the hierarchical level on the document structure within the paragraph, based on the basic analysis rule, from the character positions and constituent character types in the paragraph, and determining the element on the document structure according to the judged hierarchical level; and
a step of applying, to the determined element of the paragraph and to the determined elements within the paragraph, the structural styles set in the structural style definition in place of the appearance styles.
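
The steps of this claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the rule patterns, the level numbers, and the style names (`item1`, `body`, and so on) are hypothetical, and paragraphs are assumed to be separated by line breaks.

```python
import re

# Hypothetical basic analysis rules: (pattern on the paragraph head, level).
# Each pattern tests character position (start of paragraph) and character
# type (digit, parenthesis, letter); no formatting information is used.
BASIC_RULES = [
    (re.compile(r"^\d+\.\s"), 1),    # "1. "  -> level 1
    (re.compile(r"^\(\d+\)\s"), 2),  # "(1) " -> level 2
    (re.compile(r"^[a-z]\)\s"), 3),  # "a) "  -> level 3
]

# Hypothetical structural style definition: level -> style name.
STYLE_DEFINITION = {0: "body", 1: "item1", 2: "item2", 3: "item3"}


def extract_paragraphs(text):
    """Split raw text into non-empty paragraphs (one per line in this sketch)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def judge_level(paragraph):
    """Return the level of the first matching rule, or 0 for body text."""
    for pattern, level in BASIC_RULES:
        if pattern.match(paragraph):
            return level
    return 0


def apply_structural_styles(text):
    """Return (style_name, paragraph) pairs for the whole document."""
    return [(STYLE_DEFINITION[judge_level(p)], p) for p in extract_paragraphs(text)]


sample = "1. Warnings\n(1) Adults\na) Elderly\nConsult a doctor.\n"
styled = apply_structural_styles(sample)
```

In this sketch the appearance styles of the input play no role: only the characters at the head of each paragraph decide which structural style is assigned.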

前記基本解析ルールは、前記電子化文書または電子化文書内の段落あるいは段落群について予め設定され、文書単位または段落単位あるいは段落群単位で適用することを特徴とする請求項１に記載の電子化文書の変換方法。 2. The method for converting a digitized document according to claim 1, wherein the basic analysis rule is preset for the digitized document, or for a paragraph or a group of paragraphs within the digitized document, and is applied per document, per paragraph, or per paragraph group.

前記基本解析ルールは、予め設定した階層レベルに関する例外ルールを含み、
前記階層レベルに応じた文書構造上の前記要素を決定する手順は、
前記予め設定した階層レベルのときには、前記例外ルールに基づいて文書構造上の前記要素を決定することを特徴とする請求項１に記載の電子化文書の変換方法。 3. The method for converting a digitized document according to claim 1, wherein the basic analysis rule includes an exception rule for a preset hierarchical level, and
the step of determining the element on the document structure according to the hierarchical level
determines the element on the document structure based on the exception rule when the hierarchical level is the preset hierarchical level.
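
One way to picture an exception rule that overrides the basic rule at a preset level is the following Python sketch. The patterns, levels, and element names are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical basic rule: head pattern -> level, and level -> element.
LEVEL_PATTERNS = [
    (re.compile(r"^\d+\.\s"), 1),
    (re.compile(r"^\(\d+\)\s"), 2),
]
BASIC_ELEMENTS = {0: "para", 1: "item", 2: "subitem"}

# Hypothetical exception rule: at the preset level 2, use "note" instead.
EXCEPTION_ELEMENTS = {2: "note"}


def decide_element(paragraph):
    """Judge the level, then prefer the exception rule if one is set for it."""
    level = next((lvl for pat, lvl in LEVEL_PATTERNS if pat.match(paragraph)), 0)
    if level in EXCEPTION_ELEMENTS:
        return EXCEPTION_ELEMENTS[level]
    return BASIC_ELEMENTS[level]


elements = [decide_element(p) for p in ["1. Dosage", "(1) Adults", "Plain text"]]
```

Here only paragraphs judged to be at level 2 are affected; the other levels fall through to the basic rule.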

前記例外ルールは、構造スタイルを適用した文書を読み込んで、前記基本解析ルールと一致しない内容を当該例外ルールとして抽出したことを特徴とする請求項３に記載の電子化文書の変換方法。 4. The method for converting a digitized document according to claim 3, wherein the exception rule is obtained by reading a document to which structural styles have been applied and extracting, as the exception rule, content that does not match the basic analysis rule.
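
The extraction of exception candidates can be sketched as a comparison between the styles already applied to a document and the styles the basic rule would assign. The basic rule and the style names below are hypothetical stand-ins.

```python
def basic_rule(text):
    """Hypothetical basic rule: a leading digit means an item, otherwise body."""
    return "item1" if text[:1].isdigit() else "body"


def extract_exceptions(styled_paragraphs):
    """Collect paragraphs whose applied style differs from the basic rule's result."""
    exceptions = []
    for applied_style, text in styled_paragraphs:
        expected = basic_rule(text)
        if applied_style != expected:
            exceptions.append(
                {"text": text, "basic": expected, "applied": applied_style}
            )
    return exceptions


# A document to which structural styles have already been applied (assumed data).
styled_document = [
    ("item1", "1. Dosage"),
    ("note", "2. See the attached table"),
    ("body", "Take after meals."),
]
exception_candidates = extract_exceptions(styled_document)
```

Each mismatch records both the style the basic rule would give and the style actually applied, which is the information needed to register a new exception rule.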

前記階層レベルに応じた文書構造上の前記要素を決定する手順は、
当該段落および当該段落の前に位置する段落内の文字位置と構成文字種に基づいて当該段落の文書構造上の階層レベル判定を行い、前記文書構造上の要素を決定することを特徴とする請求項１に記載の電子化文書の変換方法。 5. The method for converting a digitized document according to claim 1, wherein the step of determining the element on the document structure according to the hierarchical level
judges the hierarchical level on the document structure of the paragraph based on the character positions and constituent character types in the paragraph and in the paragraph preceding it, and determines the element on the document structure.
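
A level judgment that depends on preceding paragraphs can be sketched as follows. This is one possible reading, assumed for illustration: the marker type at the head of each paragraph is compared with the marker types of the paragraphs before it, so the same marker can land on different levels in different documents. The marker names and patterns are hypothetical.

```python
import re

# Hypothetical marker types recognized at the head of a paragraph.
MARKERS = [
    ("digit_dot", re.compile(r"^\d+\.")),
    ("paren_digit", re.compile(r"^\(\d+\)")),
]


def marker_type(paragraph):
    """Return the name of the marker at the head of the paragraph, or None."""
    for name, pattern in MARKERS:
        if pattern.match(paragraph):
            return name
    return None


def judge_levels(paragraphs):
    """Assign levels using the marker types of the preceding paragraphs."""
    open_markers = []  # marker types of the enclosing items, outermost first
    levels = []
    for paragraph in paragraphs:
        kind = marker_type(paragraph)
        if kind is None:
            levels.append(0)  # body text; enclosing context is kept
            continue
        if kind in open_markers:
            # Marker type already seen: return to the level where it was opened.
            del open_markers[open_markers.index(kind) + 1:]
        else:
            # New marker type: it opens one level below the preceding item.
            open_markers.append(kind)
        levels.append(len(open_markers))
    return levels


levels = judge_levels(
    ["(1) Scope", "1. Adults", "2. Children", "(2) Dosage", "Take after meals.", "1. Adults"]
)
```

In this example "1. Adults" is judged as level 2 because it follows a "(1)" paragraph, whereas a rule that looked at the paragraph alone would have to fix the level of "1." in advance.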

前記構造スタイルを適用した電子化文書を表示する手順をさらに含み、
前記構造スタイルを体裁スタイルに代えてそれぞれ適用する手順は、
前記文書構造を可視的な体裁に還元する体裁情報を含むように構造スタイルを設定することを特徴とする請求項１に記載の電子化文書の変換方法。 6. The method for converting a digitized document according to claim 1, further comprising a step of displaying the digitized document to which the structural styles have been applied, wherein
the step of applying the structural styles in place of the appearance styles
sets the structural styles so that they include appearance information that renders the document structure as a visible appearance.

電子化文書を読み込んで、前記電子化文書中の体裁スタイルを所定の構造スタイルに変換する処理を計算機に実行させるプログラムであって、
前記電子化文書を読み込む処理と、
前記電子化文書について所定のスキーマの文書構造の要素に対応する構造スタイルを予め設定した構造スタイル定義を読み込む処理と、
前記電子化文書に対して適用する予め設定された基本解析ルールを読み込む処理と、
前記読み込んだ電子化文書の文字列のみから前記基本解析ルールに基づいて段落を抽出し、抽出した段落について文書構造上の前記要素を決定する処理と、
前記段落内の文字位置と構成文字種から前記基本解析ルールに基づいて当該段落内の文書構造上の階層レベルを判定し、前記判定した階層レベルに応じた文書構造上の前記要素を決定する処理と、
前記決定した段落の要素と、前記決定した段落内の要素について、前記構造スタイル定義に設定された構造スタイルを体裁スタイルに代えてそれぞれ適用する処理と、
を計算機に実行させるためのプログラム。 7. A program that causes a computer to execute processing of reading a digitized document and converting the appearance styles in the digitized document into predetermined structural styles, the program causing the computer to execute:
a process of reading the digitized document;
a process of reading a structural style definition in which structural styles corresponding to elements of the document structure of a predetermined schema are preset for the digitized document;
a process of reading a preset basic analysis rule to be applied to the digitized document;
a process of extracting paragraphs, based on the basic analysis rule, solely from the character strings of the read digitized document, and determining the element on the document structure for each extracted paragraph;
a process of judging the hierarchical level on the document structure within the paragraph, based on the basic analysis rule, from the character positions and constituent character types in the paragraph, and determining the element on the document structure according to the judged hierarchical level; and
a process of applying, to the determined element of the paragraph and to the determined elements within the paragraph, the structural styles set in the structural style definition in place of the appearance styles.