JP2011128988A

JP2011128988A - Document management device, method of controlling the same, and program

Info

Publication number: JP2011128988A
Application number: JP2009288412A
Authority: JP
Inventors: Toru Ishizaki; 透石嵜
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2009-12-18
Filing date: 2009-12-18
Publication date: 2011-06-30

Abstract

PROBLEM TO BE SOLVED: To make heading information easy to generate and delete so that document data can be accessed at high speed even when it is encoded in binary XML.

SOLUTION: For the document portion whose reading is requested, the encoding rules in effect at the head of the read portion are extracted. Restoration information is generated from which the extracted encoding rules can be restored when reading starts from the head of the read portion. The generated restoration information is stored in association with a reference to the read portion.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a document processing technique for managing encoded documents.

Conventionally, database systems and the like store index information for frequently accessed data in order to enable high-speed access to the document data they manage. The index information records a pointer to a file and an offset within the file, so the target data can be accessed at high speed without searching through the document.

Meanwhile, the W3C has standardized an XML document encoding scheme called Efficient XML Interchange (EXI) (Non-Patent Document 1) for the purpose of making XML processing more efficient. In this scheme, data that was conventionally encoded as characters is encoded in a less redundant representation. EXI starts encoding an XML document with predetermined encoding rules and, during encoding, updates the rules to more efficient ones according to the content of the document. Such a scheme is called a learning type and achieves better encoding efficiency than encoding with fixed rules. However, it imposes the constraint that decoding must process the document from the beginning in order to reproduce the rules. Consequently, a document encoded with EXI cannot be accessed at high speed through conventional index information.

For this reason, EXI provides an element called self-contained. A portion described with this element is encoded starting from the predetermined encoding rules, regardless of the document data described before it. If the portion to be indexed is described inside a self-contained element, decoding can start from the middle of the document, and high-speed access through index information becomes possible.

Efficient XML Interchange (EXI) Format 1.0 - http://www.w3.org/TR/exi

However, such index information is meaningless to the user, and creating it excessively may make its storage size a problem. In addition, processing the index information itself becomes excessive, which reduces the benefit of high-speed access. It is therefore common to keep updating the index so that only the necessary index information remains, using access frequency or the like as a parameter.

For document data encoded with EXI, however, updating index information is a heavy burden. The description of the self-contained elements must be rewritten, which involves decoding and re-encoding the encoded data, and the burden grows in proportion to the size of the document data handled.

Accordingly, an object of the present invention is to make it easy to update index information for an encoded document.

According to one aspect of the present invention, there is provided a document management apparatus comprising: extraction means for extracting encoding rules for a read portion of an encoded document, depending on the content from the start position of the read portion; generation means for generating restoration information that enables the encoding rules extracted by the extraction means to be restored when a client starts reading from the start position of the read portion; and storage means for storing the restoration information generated by the generation means in association with reference information for the read portion.

According to the present invention, the encoding rules need not be changed before and after indexing, so re-encoding is unnecessary and index information can be updated easily.

FIG. 1 is a block diagram showing the configuration of the document management apparatus according to the embodiment.
FIG. 2 is a diagram showing an example of data processed by the document management apparatus according to the embodiment.
FIG. 3 is a diagram showing an example of a character string table.
FIG. 4 is a diagram showing an example of data processed by the document management apparatus according to the embodiment.
FIG. 5 is a flowchart showing the operation of the document management apparatus according to the embodiment.
FIG. 6 is a diagram showing an example of data processed by the document management apparatus according to the embodiment.
FIG. 7 is a flowchart showing the operation of the document management apparatus according to the embodiment.
FIG. 8 is a diagram showing an example of data processed by the document management apparatus according to the embodiment.
FIG. 9 is a diagram showing an example of data processed by the document management apparatus according to the embodiment.
FIG. 10 is a flowchart showing the operation of the document management apparatus according to the embodiment.
FIG. 11 is a diagram showing an example of data processed by the document management apparatus according to the embodiment.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings. The present invention is not limited to the following embodiments, which merely show specific examples advantageous for implementing the invention.

<Embodiment 1>
The configuration of the document management apparatus of this embodiment will be described with reference to the block diagram of FIG. 1. The document management apparatus of this embodiment is realized by, for example, a server on a network. It may be realized by a single computer apparatus, or its functions may be distributed over a plurality of computer apparatuses as necessary. When it consists of a plurality of computer apparatuses, they can be connected by a Local Area Network (LAN) or the like so that they can communicate with one another.

In FIG. 1, reference numeral 101 denotes a Central Processing Unit (CPU) that controls the entire document management apparatus 100. Reference numeral 102 denotes a Read Only Memory (ROM) that stores programs and parameters that do not need to be changed. Reference numeral 103 denotes a Random Access Memory (RAM) that temporarily stores programs and data supplied from an external device or the like. Reference numeral 104 denotes an external storage device fixedly installed in the document management apparatus 100. Examples of the external storage device include a hard disk, a memory card, a flexible disk (FD), an optical disk such as a Compact Disk (CD), a magnetic card, an optical card, and an IC card. Reference numeral 105 denotes an interface with input devices, such as a pointing device and a keyboard 109, that accept user operations and input data. Reference numeral 106 denotes a network interface for connecting to a network line such as the Internet 103. Reference numeral 107 denotes a system bus that connects the units 101 to 106 so that they can communicate with one another.

This embodiment describes processing in which the document management apparatus 100, a server for a plurality of binary XML documents, generates heading information for speeding up read access from clients. The document management apparatus 100 manages encoded documents such as the one shown in FIG. 2. This document concerns explanatory material for a product: the catalog element describes catalog information suited to the use of the product, and the description element describes detailed information.

A client does not necessarily need all the information in this document and can request and download only the necessary portion from the server. For example, a sales representative requests from the server only the product material that suits how the customer uses the product, then downloads it to the customer's printer and prints it. To realize this, the document management apparatus 100, acting as the server, manages data access from clients to the encoded documents and, in response to a read request for an encoded document under its management, extracts the requested read portion from the encoded document. It then transmits the data of the read portion as a reply message, as described later.

To shorten the download time, the server generates index information for a request once that request has occurred. For the second and subsequent requests, it locates the target document portion at high speed by referring to the index information.

The managed documents, on the other hand, are encoded in binary XML as shown in FIG. 2 in order to reduce their size. The binary XML used in this embodiment performs learning-type encoding using character string tables and grammar rules.

As shown in FIG. 3, a character string table registers the character string data appearing in the document in order of appearance. When the same character string appears two or more times, the second and subsequent occurrences are assigned the table index, which reduces the size of the encoded data. In this embodiment, separate character string tables are used for names (name) and for values (value). In FIG. 2, the value "ID0001" appears twice, and the second "ID0001" is encoded with the character string table index "0". The character string table is created temporarily during encoding and is not written into the encoded document. Therefore, when decoding, the encoded data must be read from the beginning in order to reproduce the character string table.
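The behaviour of the character string table can be illustrated with a minimal Python sketch. It is a simplification and not the actual binary XML bit format: the token names `literal` and `index`, the function names, and the sample value `"A4"` are assumptions made for illustration. It shows why a decoder that starts in the middle of the stream cannot resolve an index.

```python
def encode_values(values):
    """Encode strings, replacing repeated ones with their table index."""
    table = []        # string table, built on the fly and never written out
    index_of = {}     # string -> index in the table
    tokens = []
    for v in values:
        if v in index_of:
            tokens.append(("index", index_of[v]))   # second and later occurrences
        else:
            index_of[v] = len(table)
            table.append(v)
            tokens.append(("literal", v))           # first occurrence
    return tokens, table


def decode_values(tokens):
    """Rebuild the table while reading; requires starting at the first token."""
    table = []
    result = []
    for kind, payload in tokens:
        if kind == "literal":
            table.append(payload)
            result.append(payload)
        else:
            result.append(table[payload])           # fails if the table is incomplete
    return result


tokens, table = encode_values(["ID0001", "A4", "ID0001"])
```

Decoding `tokens` from the first token reproduces the input, whereas `decode_values(tokens[2:])` raises `IndexError` because the table entry for index 0 was never rebuilt.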

Grammar rules are used to encode the structure of elements and attributes. Predetermined grammar rules are prepared first, and more efficient grammar rules are learned as the document is encoded. FIG. 4 shows how the grammar rules are learned while the encoding shown in FIG. 2 is performed. The learning of the grammar rules is described below with reference to FIG. 4.

First, learning takes place when the product start element of FIG. 2 is encoded. At the beginning of the document, the default Document Grammar 401 of FIG. 4 is used. Because the element appears for the first time, SE(*) in the rules is applied, and the value 0 and "product" are encoded. The actual encoded data is a variable-length bit string so that the data size is small. The grammar 402 for the product start element shown in FIG. 4 is then generated.

Next, learning takes place when the catalog element is encoded. Since the catalog element is a child of the product element, the grammar for the product start element is applied. Because the element appears for the first time, SE(*) in the rules is applied, and the value (0,2) and "catalog" are encoded. The grammar 405 for the catalog start element shown in FIG. 4 is then generated. In addition, an SE(catalog) rule is added to the grammar for the product start element so that a repeatedly appearing document structure can be encoded with a short data length. As shown in 403, the rule is added at the head, and the codes of the existing rules are incremented.

Next, learning takes place when the serialNum attribute is encoded. Since serialNum is an attribute of the catalog element, the grammar for the catalog start element is applied. The applied grammars are nested and are held in a stack structure, which is called the rule stack (an ordered sequence of grammar rules). Because the attribute appears for the first time, AT(*) in the rules is applied, and the value (0,1) and "serialNum" are encoded. A rule is added for an attribute in the same way as for a start element: as shown in 406, an AT(serialNum) rule is added to the grammar for the catalog start element.

Next, learning takes place when the typeA start element is encoded. The grammar for the catalog start element is applied here as well. Because the element appears for the first time, SE(*) in the rules is applied. However, the encoded data is not (0,2) of the default (405) but (1,2) of the grammar after the addition (406), together with "typeA". As shown in 407, an SE(typeA) rule is added to the grammar for the catalog start element.

Next, learning takes place when the typeB start element is encoded. The grammar for the catalog start element is applied here as well. Because the element appears for the first time, SE(*) in the rules is applied. However, the encoded data is not (0,2) of the default (405) but (2,2) of the grammar after the addition (407), together with "typeB". As shown in 408, an SE(typeB) rule is added to the grammar for the catalog start element.

Next, learning takes place when the description start element is encoded. Since the description element is a child of the product element, the grammar for the product start element is applied; this grammar is obtained from the rule stack. Because the element appears for the first time, SE(*) in the rules is applied. However, the encoded data is not (0,2) of the default (402) but (1,2), as shown in 403, together with "description". As shown in 404, an SE(description) rule is added to the grammar for the product start element.

Thereafter, the grammar for the description start element is generated in the same way as those for the product start element and the catalog start element.

As described above, learning proceeds according to the document being encoded, and the encoding rules change. For the typeA start element that appears for the second time in FIG. 2, the newly added rule SE(typeA) is applied instead of SE(*). When decoding, the encoded data must be read in order from the beginning so that these grammar rules and their order of application can be reproduced.
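The learning steps walked through above can be sketched as follows. This is a simplified model rather than the EXI grammar machinery: the `Grammar` class, its method names, and the built-in productions at positions 0 and 3 (`EE`, `CH`) are assumptions; only the positions of AT(*) at (n,1) and SE(*) at (n,2) are taken from the codes quoted in the text.

```python
class Grammar:
    """Grammar for one start element: learned rules first, built-in rules after."""

    def __init__(self, name):
        self.name = name
        self.learned = []                             # most recently learned rule first
        self.builtin = ["EE", "AT(*)", "SE(*)", "CH"]  # slots 0 and 3 are assumed

    def event_code(self, production):
        if production in self.learned:
            return (self.learned.index(production),)   # short code for a learned rule
        # Built-in rules sit behind all learned rules, so their first part grows.
        return (len(self.learned), self.builtin.index(production))

    def learn(self, production):
        self.learned.insert(0, production)             # added at the head


def encode_event(grammar, kind, name):
    """Return (event code, literal name or None) and learn on first appearance."""
    specific = f"{kind}({name})"
    if specific in grammar.learned:
        return grammar.event_code(specific), None
    code = grammar.event_code(f"{kind}(*)")
    grammar.learn(specific)
    return code, name


catalog = Grammar("catalog")
steps = [
    encode_event(catalog, "AT", "serialNum"),
    encode_event(catalog, "SE", "typeA"),
    encode_event(catalog, "SE", "typeB"),
    encode_event(catalog, "SE", "typeA"),   # second appearance uses the learned rule
]
```

With this model the first three events yield the codes (0,1), (1,2), and (2,2) quoted in the description, and the second typeA needs no literal name because SE(typeA) has already been learned.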

In this embodiment, therefore, the encoding rules in effect for a document portion are extracted depending on the content from the start position of the document portion for which a read request has occurred. The processing flow is shown in FIG. 5 and is described below.

First, in S501, it is checked whether heading information for the encoded document targeted by the received read request is stored, for example, in the external storage device 104. If heading information exists, reply information is obtained from it in S502. If not, in S503 the encoded document is parsed from the beginning up to the position of the document portion specified in the received request.

If the requested document portion is found in S504, the character string tables, grammar rules, and rule stack at the start position of the document portion are extracted in S505. In this embodiment, the document portion to be read is the first typeA element, and the information shown in FIG. 6 is extracted.

The extracted information is encoded in some format so that it can be stored. This embodiment describes a method of expressing it in XML and encoding it into binary XML, although the present invention is not limited to a particular encoding scheme. Encoding into binary XML has several advantages: no additional processor is needed to interpret the stored data, the data size is small, and the data is easy to combine with the binary XML data sent in the reply. When the extracted information is encoded, all of it may be encoded, but the data size can be reduced by encoding the conversion procedure from the predetermined encoding rules, as shown in FIG. 13. The processing flow is shown in FIG. 7, and the result expressed in XML is shown in FIG. 8. The flow of FIG. 7 is described below.

First, in S701, the character string tables at the start position of the requested read portion are expressed in XML. As shown in FIG. 8, a character string table is described as a table element, with an id attribute indicating the table type: "name" for the name table and "value" for the value table. Each character string in a table is described as an entry element inside the table element, in index order. The index of the first entry element is described in the startIndex attribute of the table element.

Next, in S702, the rule stack at the start position of the requested read portion is expressed in XML. Each item of the rule stack is described as a stack element using the grammar rule identifiers expressed in S703. In this embodiment, the items are described in order from the bottom of the stack to the top.

Next, in S703, the grammar rules at the start position of the requested read portion are expressed in XML. Each grammar rule is described as an addition to the predetermined grammar rules. In this embodiment, a grammar for the product start element and a grammar for the catalog start element have been added, so their grammar rule identifiers are "product" and "catalog", respectively, and are described in the name attribute of a grammar element. Each rule added to a grammar is described as an add element, with "SE" for an element rule and "AT" for an attribute rule. In this embodiment, the added rules are described in order from bottom to top.

Finally, in S704, the result is encoded into binary XML. The encoding rules thus converted into binary XML serve as restoration information that allows a client to restore the encoding rules when it starts reading from the start position of the read portion.
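Steps S701 to S703 can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names `table`, `id`, `startIndex`, `entry`, `stack`, `grammar`, `name`, and `add` come from the description; the root element name `restore`, the `type` attribute carrying "SE"/"AT", the function name, and the sample state are assumptions, since the exact layout of FIG. 8 is not reproduced here. The binary XML encoding of S704 is outside this sketch.

```python
import xml.etree.ElementTree as ET


def build_restore_xml(tables, rule_stack, learned):
    """Express string tables, rule stack, and learned rules as XML text."""
    root = ET.Element("restore")                       # root name is an assumption
    # S701: one table element per string table, entries in index order.
    for table_id, (start_index, entries) in tables.items():
        t = ET.SubElement(root, "table",
                          {"id": table_id, "startIndex": str(start_index)})
        for text in entries:
            ET.SubElement(t, "entry").text = text
    # S702: rule stack, bottom to top, using grammar rule identifiers.
    for grammar_id in rule_stack:
        ET.SubElement(root, "stack").text = grammar_id
    # S703: rules added to each grammar relative to the predetermined rules.
    for grammar_id, rules in learned.items():
        g = ET.SubElement(root, "grammar", {"name": grammar_id})
        for kind, name in rules:                       # kind is "SE" or "AT"
            ET.SubElement(g, "add", {"type": kind}).text = name
    return ET.tostring(root, encoding="unicode")


# Hypothetical state at the start of the first typeA element.
restore_xml = build_restore_xml(
    tables={"name": (0, ["product", "catalog", "serialNum"]),
            "value": (0, ["ID0001"])},
    rule_stack=["product", "catalog"],
    learned={"product": [("SE", "catalog")],
             "catalog": [("AT", "serialNum")]},
)
```

Only the rules added to the predetermined grammars are written, which corresponds to the size-saving approach of encoding the conversion procedure rather than the full rule set.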

The description now returns to the flow of FIG. 5. After the processing flow of FIG. 7, in S506 the size of the encoding rules converted into binary XML in S704 (the restoration information) is compared with the size of the read portion. When the read portion is small, it may be more efficient to store the read portion itself in the index information. If the comparison shows that the extracted encoding rules are larger than the read portion, then in S507 the read portion is re-encoded and stored in the heading information as reply information. The heading information is stored, for example, in the external storage device 104.

If the encoding rules converted into binary XML are smaller, then in S508 they are stored in the heading information as reply information, in association with reference information for the read portion. As shown in FIG. 9, the encoding rules converted into binary XML are stored as restoration information under the identifier restore1.xml.
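The size comparison of S506 to S508 amounts to a small decision function, sketched below. The `HeadingEntry` type, its field names, and the `reencode` callback are hypothetical; the description also does not say what happens when the two sizes are equal, so treating that case as S508 is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class HeadingEntry:
    """One row of the heading information table (field names are assumed)."""
    reencoded_portion: Optional[bytes] = None    # S507: self-standing copy
    restore_info: Optional[bytes] = None         # S508: binary XML restoration info
    reference: Optional[Tuple[int, int]] = None  # S508: (offset, length) of the portion


def make_heading_entry(restore_info: bytes, portion_offset: int,
                       portion_length: int,
                       reencode: Callable[[], bytes]) -> HeadingEntry:
    # S506: compare the restoration info size with the read portion size.
    if len(restore_info) > portion_length:
        # S507: the portion is small, so store a re-encoded copy of it instead.
        return HeadingEntry(reencoded_portion=reencode())
    # S508: store the restoration info with a reference to the read portion.
    return HeadingEntry(restore_info=restore_info,
                        reference=(portion_offset, portion_length))
```

Passing `reencode` as a callback means the re-encoding cost is paid only on the S507 branch.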

In S509, a reply message is transmitted using the reply information. If the reply information is the re-encoded read portion stored in S507, it is placed directly in the reply message and transmitted. If it is the information stored in S508, the read portion is extracted based on its reference information and is placed in the reply message together with the restoration information. The data may be combined using a general archive format such as ZIP; alternatively, if a user-definable field is available, as in the EXI document header, the restoration information may be written there.
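The ZIP option mentioned above could look like the following sketch, which uses Python's standard `zipfile` module. The member name `portion.exi` and both function names are hypothetical; `restore1.xml` is the identifier given for the restoration information in FIG. 9.

```python
import io
import zipfile
from typing import Optional, Tuple


def build_reply(read_portion: bytes, restore_info: Optional[bytes] = None) -> bytes:
    """Bundle the read portion and, if present, the restoration info (S509)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("portion.exi", read_portion)       # member name is assumed
        if restore_info is not None:
            zf.writestr("restore1.xml", restore_info)
    return buf.getvalue()


def read_reply(reply: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Client side: split the reply back into read portion and restoration info."""
    with zipfile.ZipFile(io.BytesIO(reply)) as zf:
        portion = zf.read("portion.exi")
        restore = zf.read("restore1.xml") if "restore1.xml" in zf.namelist() else None
    return portion, restore
```

A reply built from a re-encoded read portion (the S507 case) simply omits the restoration member, so the client can tell the two cases apart by its presence.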

Once the heading information for the read portion has been created in S508 by the above processing, the restoration information and reference information can simply be obtained from the heading information in S502 when the same read request is received again.

If the reply message contains restoration information, the side that issued the read request restores the encoding rules from it and then reads the read portion contained in the reply message.

FIG. 9 shows the heading information table, which is updated over time. The update method is not limited in the present invention, but the following method is one possibility. The table has a reference flag for each entry, and the flag is turned on whenever the entry is referenced. A routine checks the reference flags at regular intervals; if an entry's flag is off, the entry is judged to be heading information with a low access frequency and is deleted from the heading information table. At each check, the routine then turns all reference flags off.

The heading information table can also be updated as follows. A reference time is recorded in the table for each entry, and when the heading information table reaches or exceeds a certain size, entries with old reference times are deleted.
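Both update policies described above can be sketched in one small class. The class and method names are hypothetical, and two details not fixed by the description are chosen here as assumptions: a newly stored entry starts with its reference flag on, and table size is measured as the number of entries.

```python
class HeadingTable:
    """Heading information table with reference flags and reference times."""

    def __init__(self):
        self.entries = {}   # request key -> {"info", "referenced", "last_access"}

    def put(self, key, info, now):
        # Assumption: a fresh entry counts as referenced until the next check.
        self.entries[key] = {"info": info, "referenced": True, "last_access": now}

    def get(self, key, now):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry["referenced"] = True      # turn the flag on at reference time
        entry["last_access"] = now      # record the reference time
        return entry["info"]

    def sweep(self):
        """Periodic check: delete unreferenced entries, then clear all flags."""
        for key in [k for k, e in self.entries.items() if not e["referenced"]]:
            del self.entries[key]
        for entry in self.entries.values():
            entry["referenced"] = False

    def trim(self, max_entries):
        """Size-based policy: delete entries with the oldest reference times."""
        while len(self.entries) > max_entries:
            oldest = min(self.entries, key=lambda k: self.entries[k]["last_access"])
            del self.entries[oldest]
```

Neither `sweep` nor `trim` touches an encoded document; only table rows are removed.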

In either case, only the table information is rewritten; the encoded documents are not touched.

According to the embodiment described above, heading information that speeds up read access to binary XML can be generated in a way that allows it to be updated efficiently.

<Embodiment 2>
Embodiment 2 shows an example in which the restoration information extracted in Embodiment 1 is reduced to a smaller size. The encoding rules extracted in Embodiment 1 include not only the rules used in the document portion for which the read request occurred, but also rules that are used only after the read portion ends and rules that are never used. For example, "ID0001" in the character string table is used only outside the typeA element, and "product", "catalog", and "serialNum" are never used. This information is redundant as restoration information and should be removed. FIG. 10 shows the flow of the reduction processing, and FIG. 11 shows the restoration information after reduction. The flow of FIG. 10 is described below.

When the start position of the document portion is found in S504, a reference flag is set up for each entry of the character string tables generated by the parsing and is turned off in S1001. In S1002, the encoded document is parsed further in order to detect the end position of the read portion. In S1003, it is checked whether the parsed data contains encoded data that uses an index of a character string table. If it does, the reference flag of that index is turned on in S1004. When the end position of the read portion is found in S1005, the entries whose reference flags are on are extracted from the tables in S1006, and the processing of S505 is performed.
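The flag-based reduction of S1001 to S1006 can be sketched as follows for a single character string table. The token representation (`literal` / `index` pairs) and the function name are assumptions made for illustration; indexes at or beyond the table length are taken to refer to strings first registered inside the read portion itself, which do not need to be restored.

```python
def prune_string_table(table, portion_tokens):
    """Keep only the entries that the read portion actually references."""
    referenced = [False] * len(table)                  # S1001: all flags off
    for kind, payload in portion_tokens:               # S1002: scan the read portion
        if kind == "index" and payload < len(table):   # S1003: index into the table?
            referenced[payload] = True                 # S1004: turn the flag on
    # S1006: extract flagged entries, keeping their original indexes.
    kept = {i: s for i, s in enumerate(table) if referenced[i]}
    next_free_index = len(table)
    return kept, next_free_index


# No pre-existing entry is referenced: nothing is kept, only the next free index.
kept, next_free = prune_string_table(
    ["ID0001"], [("literal", "x"), ("index", 1)])
```

When `kept` is empty, only `next_free` needs to be recorded, so that strings registered inside the read portion receive the same indexes they had during the original encoding.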

In S701, because only the entries actually used are expressed when a character string table is described in XML, the indexes may become discontinuous. In that case, the discontinuity is expressed either by describing empty entry elements or by describing the empty range of entries. In this embodiment, none of the entries in the character string tables is used within the document portion, so no discontinuity is expressed; instead, as shown in FIG. 11, the next free index number of the character string table is described in the startIndex attribute.

Compared with Embodiment 1, Embodiment 2 incurs overhead for extracting only the necessary restoration information, but it can be effective when the read portion is small. The methods of Embodiment 1 and Embodiment 2 may therefore be switched according to the size of the read portion.
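
The switching could look like the following Python sketch. The description gives no concrete threshold, so the 4096-byte default, the function name, and the returned labels are purely hypothetical.

```python
def choose_extraction_method(portion_size_bytes, threshold_bytes=4096):
    """Pick the reduced extraction (Embodiment 2) for small read portions."""
    # threshold_bytes is a hypothetical tuning parameter, not a value from the text.
    if portion_size_bytes < threshold_bytes:
        return "embodiment2"  # reduce the restoration information
    return "embodiment1"  # keep the full encoding rules, no reduction overhead
```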

According to Embodiment 2 described above, header information can be generated with more compact restoration information.

(Other embodiments)
The present invention can also be realized by the following processing: software (a program) that realizes the functions of the above-described embodiments is supplied to a system or apparatus via a network or any of various storage media, and a computer (or a CPU, MPU, or the like) of that system or apparatus reads out and executes the program.

Claims

A document management apparatus comprising:
extraction means for extracting an encoding rule for a read portion of an encoded document, depending on the content from the start position of the read portion;
generation means for generating restoration information that makes the encoding rule extracted by the extraction means restorable when a client starts reading from the start position of the read portion; and
storage means for storing the restoration information generated by the generation means in association with reference information to the read portion.

The document management apparatus according to claim 1, wherein the storage means stores the restoration information generated by the generation means as header information in association with the reference information to the read portion, and
when the read request is received again, the restoration information and the reference information are acquired from the header information.

The document management apparatus according to claim 1, wherein the extraction means includes means for extracting a character string table, grammar rules, and an ordered sequence of the grammar rules at the start position of the read portion.

The document management apparatus according to claim 1, further comprising means for re-encoding the read portion when the size of the restoration information is larger than the size of the read portion,
wherein the storage means stores the re-encoded read portion instead of the restoration information.

The document management apparatus according to claim 2, further comprising means for deleting entries with a low access frequency from the header information.

The document management apparatus according to claim 1, wherein the encoded document is XML.

The document management apparatus according to claim 1, further comprising means for deleting unused rules from the encoding rule extracted by the extraction means in order to reduce the size of the restoration information generated by the generation means.

The document management apparatus according to claim 1, wherein the generation means generates, as the restoration information, a conversion procedure from a predetermined encoding rule.

The document management apparatus according to claim 1, wherein the start position of the read portion is designated by a read request for the encoded document from the client,
the apparatus further comprising transmission means for transmitting the restoration information stored in the storage means and the reference information associated with that restoration information to the client.

A method for controlling a document management apparatus, comprising:
an extraction step in which extraction means extracts an encoding rule for a read portion of an encoded document, depending on the content from the start position of the read portion;
a generation step in which generation means generates restoration information that makes the encoding rule extracted in the extraction step restorable when a client starts reading from the start position of the read portion; and
a saving step in which saving means saves the restoration information generated in the generation step in storage means in association with reference information to the read portion.

A program for causing a computer to function as each means of the document management apparatus according to any one of claims 1 to 9.