JP4562749B2

JP4562749B2 - Document compression storage method and apparatus

Info

Publication number: JP4562749B2
Application number: JP2007132482A
Authority: JP
Inventors: 宏二伊藤
Original assignee: Digital Works Inc
Current assignee: Digital Works Inc
Priority date: 2007-05-18
Filing date: 2007-05-18
Publication date: 2010-10-13
Anticipated expiration: 2021-11-21
Also published as: JP2007293874A

Description

The present invention relates to a method for storing document data in a database. In particular, it relates to a storage method suited to processing documents written in Extensible Markup Language (XML).

Various methods have been proposed for storing XML data in a database. A typical example is the relational database mapping type. This approach expresses the hierarchical structure of XML as relations in a relational database and maps the values and attributes of XML elements to table fields. It is said to suit cases where the data structure is fixed; when the data structure changes, the tables must be redefined. In addition, as the hierarchical structure of the XML documents becomes more complex, it becomes difficult to map them to relational tables.

Another approach is the object-oriented database type. This approach manages an XML document as a DOM tree and offers good operability for XML documents with variable hierarchical structures. However, the DOM objects must be expanded in memory before the data can be accessed. Especially when a large number of XML documents are handled, the load on memory is heavy, which causes problems in performance and memory resource management.

A further approach, called the hierarchical database type, manages the hierarchical structure of XML documents in a hierarchical database. However, this approach has difficulty coping with changes in the hierarchical structure of the XML documents. Like the relational database mapping type, it forces the data storage structure in the database to change whenever the structure of the XML document set changes, so it is poorly suited to handling data of diverse structures.

Furthermore, approaches other than the relational database mapping type have the problem that processing for complex statistics and analysis is difficult.

The problem to be solved by the present invention is to provide a compressed storage method for documents that is suited to handling diverse documents such as XML documents, that adapts flexibly to changes in document structure, and that makes relatively complex statistical and analytical processing easy to perform.

To solve the above problem, the compressed document storage method according to the present invention first generates a well-formed schema in which the actual data is deleted from the actual-data nodes of the document structure and those nodes are turned into element identifier nodes. A node identifier and a unique element identifier are then given to each part of the document structure that corresponds to a node of the well-formed schema, and the actual data of that part is stored in memory with a document identifier attached, in association with the node identifier and the unique element identifier. The well-formed schema stores the information about each of these parts of the document structure, with the actual data removed, as a data structure expressed by node identifiers and unique element identifiers. Further, a compression result index (CRX) that defines the relationship among element identifiers, node identifiers, and document identifiers is generated and stored in memory, and the set of corresponding pairs of well-formed schema element identifiers and compression result index (CRX) entries is stored in memory as a compression result set (CRS). In this process, for parts of the document structure that are common to a plurality of documents, a common identifier is assigned for each of the element identifier, the node identifier, and the document identifier.

In the present invention, the actual data is deleted from the actual-data nodes of the document structure, and a well-formed schema whose nodes are element identifier nodes is generated to match the document structure. Therefore, even if the document structure changes, a well-formed schema matching the changed structure can always be prepared. Each part of the document structure corresponding to a node of the well-formed schema is given a node identifier and a unique element identifier, and the actual data of each part is given a document identifier and stored in memory in association with the node identifier and the unique element identifier. The node identifier, the unique element identifier, and the document identifier serve as keys when the stored data is searched. The other stored data structures are: the well-formed schema, which stores the information about each part of the document structure, with the actual data removed, as a data structure expressed by node identifiers and unique element identifiers; the compression result index (CRX), which defines the relationship among element identifiers, node identifiers, and document identifiers; and the compression result set (CRS), which represents the set of corresponding pairs of well-formed schema element identifiers and compression result index (CRX) entries. Using these stored data structures, document restoration, data retrieval, aggregation, and similar operations can be performed at high speed without difficulty.

When the document is a document set containing a plurality of document units, the method of the present invention includes a step of cutting a single document out of the document set as a unit document and generating a unit document well-formed schema in which the actual data is deleted from the actual-data nodes of the unit document and those nodes are turned into element identifier nodes. The unit document well-formed schemas of the plurality of unit documents are merged to generate a document-set well-formed schema, and each part of the document structure corresponding to a node of the document-set well-formed schema is given a node object identifier and a unique element identifier. The actual data of each part of the document structure is stored in memory with a document identifier attached, in association with the node identifier and the unique element identifier. The document-set well-formed schema stores the information about each part of the document structure, with the actual data removed, as a data structure expressed by node identifiers and unique element identifiers. Further, a compression result index (CRX) that defines the relationship among element identifiers, node identifiers, and document identifiers is generated and stored in memory, and the set of corresponding pairs of well-formed schema element identifiers and compression result index (CRX) entries is stored in memory as a compression result set (CRS). In this case as well, for parts of the document structure that are common to a plurality of documents, a common identifier is assigned for each of the element identifier, the node identifier, and the document identifier.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing the whole of a compressed document storage system that implements the method of the present invention. As the data to be stored, the input unit IN of the system holds a document set D containing a plurality of unit documents d. In this example, each document d is an XML document digitized in the XML language.

FIG. 2 shows an example of an XML document set consisting of two unit documents. The example consists of two purchase orders, document 1 and document 2, and each unit document is labeled with ID=1 or ID=2 as its purchase order number. This label forms the start tag of the document. The heading "purchase order file" (発注書綴), which indicates the content of the documents, serves as the root tag.

The XML document set shown in FIG. 2 is converted into a document-set DOM as shown in FIG. 3. This document-set DOM has tag nodes T, attribute nodes ATTR, and text nodes CDATA, and contains all the information shown in FIG. 2. Unit document DOMs are cut out of the document-set DOM shown in FIG. 3. The unit document DOM for document 1 is shown in FIG. 4(a), and the unit document DOM for document 2 is shown in FIG. 4(b).

A unit document well-formed schema is generated from each unit document DOM shown in FIG. 4. The well-formed schema for document 1 is shown in FIG. 5(a), and the well-formed schema for document 2 is shown in FIG. 5(b). Here, ID is an abbreviation of identifier, and EID denotes an element identifier. As FIGS. 5(a) and 5(b) show, the well-formed schema has a node corresponding to each node of the unit document DOM shown in FIGS. 4(a) and 4(b). It is obtained by deleting from the unit document DOM the actual data, such as the company name, the address data, the telephone number, the order date, the person in charge, and the order contents, and placing an element identifier node EID at each of those positions.
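The data-stripping step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the tuple representation, the function name `to_schema`, the `"EID"` placeholder string, and the sample tag names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def to_schema(elem):
    """Return a skeleton of `elem`: tag nodes (T) are kept, while attribute
    values (ATTR) and character data (CDATA) are replaced by an "EID"
    placeholder that would later receive an element identifier."""
    children = []
    for name in elem.attrib:                 # keep attribute name, drop its value
        children.append(("ATTR", name, "EID"))
    if elem.text and elem.text.strip():      # drop character data
        children.append(("CDATA", "EID"))
    for child in elem:                       # recurse into child tags
        children.append(to_schema(child))
    return ("T", elem.tag, children)

# Hypothetical unit document; the tag names are illustrative only.
doc = ET.fromstring(
    '<order ID="1"><company><name>Digital Store</name></company></order>'
)
schema = to_schema(doc)
```

Two documents with the same layout but different values produce identical skeletons, which is what allows their schemas to be merged.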

The unit document well-formed schema created in this way is merged with the document-set well-formed schema read from the XML database. If the document-set well-formed schema is initially empty, the document-set well-formed schema after the well-formed schema of document 1 has been merged has the same structure as the unit document well-formed schema for document 1. Likewise, if the document-set well-formed schema is initially empty, the document-set well-formed schema after the well-formed schema of document 2 has been merged has the same structure as the unit document well-formed schema for document 2. The difference is that, in the document-set well-formed schema, each element identifier node is given an identification number as its identifier. FIG. 6(a) shows the document-set well-formed schema after the unit document well-formed schema of document 1 has been merged, and FIG. 6(b) shows it after the unit document well-formed schema of document 2 has been merged. This processing is performed in the conversion process TR shown in FIG. 1. The document-set well-formed schema generated in this way is stored in memory as part of the XML database, as shown in FIG. 1.
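The merge can be illustrated with a small Python sketch. It assumes, purely for illustration, that the document-set schema is held as a dictionary from a path string to an element identifier; the patent stores the schema as a tree, and the function name and paths below are hypothetical.

```python
def merge_unit_schema(doc_set_schema, unit_paths):
    """Merge the data-node paths of one unit document into the document-set
    schema. A path that is already known keeps its EID; a new path receives
    the next unused number."""
    for path in unit_paths:
        if path not in doc_set_schema:
            doc_set_schema[path] = len(doc_set_schema)
    return doc_set_schema

doc_set = {}  # initially empty document-set schema

# Document 1 (hypothetical paths): three data-bearing nodes.
merge_unit_schema(doc_set, ["/order/@ID", "/order/company/name",
                            "/order/item[1]/name"])

# Document 2 shares those nodes and adds one more order line.
merge_unit_schema(doc_set, ["/order/@ID", "/order/company/name",
                            "/order/item[1]/name", "/order/item[2]/name"])
```

Identifiers assigned when document 1 was merged stay unchanged when document 2 arrives, so data already stored under those identifiers remains valid.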

Next, reference identifiers are recorded in the unit document well-formed schema. This is done by visiting the nodes of the unit document well-formed schema in order, obtaining the element identifier from the corresponding node of the document-set well-formed schema, and writing that element identifier into the matching node of the unit document well-formed schema. If the document-set well-formed schema is initially empty, the unit document well-formed schema after the element identifiers have been recorded has the same structure as the document-set well-formed schema, for both document 1 and document 2. The unit document well-formed schema after this processing is therefore not illustrated.

As shown in FIG. 1, in the system of the present invention the compression result index CRX is stored in the XML database. The compression result index CRX consists of a node identifier (ID) list and node structures. In principle a node identifier is assigned for each unit document, but for parts common to several unit documents, the node identifier assigned in the earlier unit document is reused in later unit documents. A compression result index CRX is generated for each element identifier. That is, in the unit well-formed schema corresponding to the document-set well-formed schema of FIG. 6(a), one index CRX is created for the area of the purchase order attribute ID, to which element identifier 0 is assigned, another index CRX is created for the area of the company name CDATA, to which element identifier 1 is assigned, and so on.

FIG. 7(a) shows an example of the index CRX for the area of element identifier 0 (EID=0) after document 1 has been registered. Since only document 1 is registered, the registration count is 1 and the only pointer to a node is [0]. The only node identifier at this point is 0, so the node structure shows node ID=0. The node structure contains a pointer to the key value, a pointer to the document identifier, a pointer to the left node, and a pointer to the right node. Since only document 1 is registered, the key value obtained by following the key-value pointer is "1". FIG. 7(b) shows the compression result index CRX for the area of element identifier 1 (EID=1). In this area, the key value is the orderer's name. Similarly, FIGS. 7(c) to 7(n) show the indexes CRX for the areas EID=2 to EID=13.

FIG. 8(a) shows an example of the index CRX for the area of element identifier 0 (EID=0) after documents 1 and 2 have been registered. Since documents 1 and 2 are registered, the registration count is 2 and the pointers to nodes are [0] and [1]. The node identifiers are 0 and 1, so two node structures are generated: one with node ID=0 and one with node ID=1. As when only document 1 was registered, each node structure contains a pointer to the key value, a pointer to the document identifier, a pointer to the left node, and a pointer to the right node. Since documents 1 and 2 are registered, the key value obtained by following the key-value pointer is "1" or "2", depending on which node structure is consulted.

In document 2, the orderer's name, postal code, and other orderer-related data are the same as in document 1, so the indexes CRX for EID=1 to EID=6 can be identical to those shown in FIGS. 7(b) to 7(g). The order date is also the same in both documents, so the index CRX for EID=7 in document 2 is identical to that of document 1. For document 2, therefore, the index CRX corresponding to FIG. 7(h) can be shared with document 1.

FIG. 8(b) shows the index CRX for the area holding the person-in-charge data, which corresponds to element identifier EID=8. FIGS. 8(c) to 8(e) show the indexes CRX for the areas of element identifiers EID=10 to EID=12, respectively. For the area EID=9, the index CRX of document 2 can be identical to that of document 1. FIGS. 8(f) to 8(j) show the indexes CRX for the areas of data added by document 2.

In the example shown in FIG. 8, the node structures and the way the nodes grow follow the binary tree technique. However, the technique used to build the compression result index CRX is not limited to a binary tree; any search structure, such as a B-tree, a B+ tree, a PAT tree, or a hash table, can be used.
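The node structure and its binary-tree growth can be sketched as below. This is an illustrative sketch under stated assumptions, not the layout of FIGS. 7 and 8: the names `Node`, `CRXArea`, and `register` are invented for the example, list positions stand in for the pointers, and the document identifier list is kept inline rather than in a separate file.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    key: str                                          # stands in for "pointer to key value"
    doc_ids: List[int] = field(default_factory=list)  # "pointer to document identifier"
    left: Optional[int] = None                        # NID of left child, if any
    right: Optional[int] = None                       # NID of right child, if any

class CRXArea:
    """Index for one element identifier; a node's list position is its NID."""
    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def register(self, key: str, did: int) -> int:
        nid = self._find_or_add(key)
        self.nodes[nid].doc_ids.append(did)
        return nid

    def _find_or_add(self, key: str) -> int:
        if not self.nodes:                            # first key becomes the root
            self.nodes.append(Node(key))
            return 0
        nid = 0
        while True:
            node = self.nodes[nid]
            if key == node.key:                       # key already stored: reuse NID
                return nid
            side = "left" if key < node.key else "right"
            child = getattr(node, side)
            if child is None:                         # attach a new leaf
                self.nodes.append(Node(key))
                setattr(node, side, len(self.nodes) - 1)
                return len(self.nodes) - 1
            nid = child

# EID=0 area (purchase order ID) after registering documents 1 and 2.
area = CRXArea()
nid_doc1 = area.register("1", did=1)
nid_doc2 = area.register("2", did=2)
```

After the two registrations the area holds two node structures, with node 1 attached as the right child of node 0.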

To update the compression result index CRX, the nodes of the document DOM and the nodes of the unit document well-formed schema are visited in step with each other. For each node, the attribute value of the XML element recorded in the document DOM node (the actual data of ATTR) or its element value (the actual data of CDATA) is obtained, and the CRX is searched with this value as the key. If the key value is not found, a node is created with a new node identifier and the key value is registered in that node. If the key value is found, the node holding it is consulted and its node identifier is obtained. This node identifier serves as the reference identifier for the key value. In this way, a set of pairs (EID, NID) is obtained, each consisting of an element identifier EID of the unit document well-formed schema and a node identifier NID of the compression result index CRX.
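The update procedure can be sketched in Python as follows. The sketch uses a hash table (a Python `dict`) for each EID area, which is one of the search structures the description allows; the function name `register_document` and all sample values are made up for illustration.

```python
def register_document(crx, doc_values):
    """crx: dict mapping EID -> {key value: NID}.
    doc_values: dict mapping EID -> actual data of one unit document.
    Returns the document's (EID, NID) pairs, sorted by EID."""
    pairs = []
    for eid, value in doc_values.items():
        area = crx.setdefault(eid, {})
        # Reuse the NID if the key value is already stored in this area;
        # otherwise assign the next unused NID.
        nid = area.setdefault(value, len(area))
        pairs.append((eid, nid))
    return sorted(pairs)

crx = {}
# Hypothetical data: EID 0 = order ID, 1 = company name, 8 = person in charge.
pairs_doc1 = register_document(crx, {0: "1", 1: "Digital Store", 8: "Sato"})
pairs_doc2 = register_document(crx, {0: "2", 1: "Digital Store", 8: "Suzuki"})
```

Because the company name is the same in both documents, document 2 reuses NID 0 in the EID=1 area and no second copy of the string is stored.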

FIGS. 9 and 10 show, for each element identifier, the pairs of element identifier EID and node identifier NID in the illustrated example. FIG. 9 shows the pairs at the stage where document 1 has been registered, and FIG. 10 shows them at the stage where document 2 has been registered. The pairs obtained in this way are sorted in ascending order with the element identifier EID as the sort key and then stored in the compression result set CRS in the memory shown in FIG. 1. FIGS. 11 and 12 show the structure of the compression result set CRS after storage: FIG. 11 at the stage where document 1 has been registered, and FIG. 12 at the stage where document 2 has been registered.

When the pairs of element identifier EID and node identifier NID are stored in the compression result set CRS as described above, a document identifier DID is assigned to the set of pairs. The assigned document identifier DID is recorded in the document identifier management file. FIGS. 11 and 12 show the relationship between the document management file containing the document identifiers DID and the compression result set CRS. In the document management file, the entry referenced by a document identifier DID stores the start address of the area of the compression result set CRS that holds that document's set of (EID, NID) pairs. There are two ways to store data in the compression result set CRS: splitting the file vertically so that each (EID, NID) pair position has its own file, or storing all pair sets in a single file. The former suits searches that extract only particular (EID, NID) pairs at high speed and, in applications, is effective for speeding up multidimensional aggregation. The latter always refers to the whole record and therefore suits document-by-document processing. This example adopts the latter, because the emphasis is on a transactional database management system that registers, updates, searches, and aggregates at high speed in units of documents.
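The single-file layout can be sketched as follows, with a Python list standing in for the file. The class name `CRS`, its method names, and the stored pair count are assumptions for illustration; the description only states that the management file holds the start address of each document's pair set.

```python
class CRS:
    """All (EID, NID) pair sets in one flat sequence, plus a management
    table that maps each DID to the location of its pair set."""
    def __init__(self):
        self.pairs = []       # the single "file" of (EID, NID) pairs
        self.doc_index = []   # position == DID, value == (start offset, pair count)

    def store(self, pairs):
        did = len(self.doc_index)                    # assign the next DID
        self.doc_index.append((len(self.pairs), len(pairs)))
        self.pairs.extend(sorted(pairs))             # ascending order of EID
        return did

    def load(self, did):
        start, count = self.doc_index[did]
        return self.pairs[start:start + count]

crs = CRS()
did_a = crs.store([(1, 0), (0, 0), (8, 0)])   # hypothetical pairs of one document
did_b = crs.store([(0, 1), (1, 0), (8, 1)])   # hypothetical pairs of another
```

Loading a document needs one table lookup and one contiguous read, which is why this layout favours document-by-document processing.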

To enable fast searching, the document identifiers DID described above are stored in a document identifier list file. The document identifiers are stored in this file as one document identifier list per element identifier. As shown in FIGS. 7 and 8, each node structure of the compression result index CRX has a pointer to the document identifier, and the start address of the storage area of the corresponding document identifier list is set in that pointer.

FIG. 13(a) shows the relationship between the data stored in the compression result index CRX and the document identifier list file for the area of element identifier EID=1, which corresponds to the orderer's name, at the stage after document 2 has been registered. The document identifier list file is linked to the document-identifier pointer of the node structure. FIG. 13(b) is a similar diagram for the area of element identifier EID=8. Here the data stored in the compression result index CRX has two node structures, corresponding to two orderers. The document-identifier pointer of each node structure is linked to the corresponding document identifier list in the document identifier list file. FIG. 13(c) is the diagram for the area of element identifier EID=11. The key-value pointer of each node structure is linked to the location where the actual data describing the order contents is stored. FIGS. 13(d) and 13(e) are similar diagrams for the areas of element identifiers EID=15 and EID=16, respectively.

The procedure for retrieving and reproducing XML documents from an XML document database holding data stored by the procedure described above is set out below for various search conditions.
(1) No search condition (all XML documents are output)
Output condition: purchase order (発注書)
Meaning: root node (the XML document at and below the purchase order is output)
Procedure:
(a) The document-set well-formed schema shown in FIG. 6 is loaded into memory, and a mapping (working copy) of it is generated.
(b) The pointers to the pair sets of the document identifiers DID in the document identifier management file shown in FIG. 12 are consulted in order, and the pair sets are obtained in order from the compression result set CRS.
(c) The nodes of the document-set well-formed schema are traversed in order from the root node (T: purchase order), and the element identifier EID of each node is obtained.
(d) The pair set is searched with the obtained element identifier EID as the search condition. If the search succeeds, a match mark and the node identifier NID are attached to the nodes of the mapping of the well-formed schema, working upward from the node's own position through all of its ancestor nodes. If a parent node already carries a match mark, every ancestor above it is necessarily marked as well, so processing moves on to the next step. If the search fails, nothing is done.
(e) Steps (b) and (c) above are performed for all nodes of the document-set well-formed schema.
(f) The nodes of the well-formed schema are traversed in order from the root node, and for each node that carries a match mark, the corresponding data is output according to the node's level and type. For example, a tag is output for T, an attribute value for ATTR, and character data for CDATA. For ATTR and CDATA, the original data is obtained from the node identifier NID in the area of the corresponding element identifier EID of the compression result index CRX and then output.
(g) The same processing is performed for all document identifiers DID.
FIGS. 14(a) and 14(b) show the document structures reproduced on the mapping of the well-formed schema after documents 1 and 2 have been registered. FIG. 15 shows the document set reproduced from these document structures.
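The restoration steps above can be condensed into a short Python sketch. It simplifies the procedure in several assumed ways: the schema is a dictionary from EID to a tag path, each CRX area is a list of values indexed by NID, match marks are implicit in the creation of parent elements, and repeated sibling tags are not handled. All names and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# EID -> path of tags; a final "@name" component denotes an attribute.
schema = {0: ("order", "@ID"), 1: ("order", "company", "name")}
# EID -> stored key values; list position == NID.
crx_values = {0: ["1", "2"], 1: ["Digital Store"]}

def restore(pairs):
    """Rebuild one document from its (EID, NID) pairs."""
    root = None
    for eid, nid in pairs:
        path = schema[eid]
        value = crx_values[eid][nid]          # original data recovered via NID
        tags = [p for p in path if not p.startswith("@")]
        if root is None:
            root = ET.Element(tags[0])
        elem = root
        for tag in tags[1:]:                  # create ancestors on demand
            child = elem.find(tag)
            if child is None:
                child = ET.SubElement(elem, tag)
            elem = child
        if path[-1].startswith("@"):
            elem.set(path[-1][1:], value)     # ATTR: attribute value
        else:
            elem.text = value                 # CDATA: character data
    return ET.tostring(root, encoding="unicode")

xml_doc2 = restore([(0, 1), (1, 0)])
```

Only the schema nodes that appear in the document's pair set are emitted, so documents with fewer elements than the merged schema are restored without empty tags.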

(2) Outputting a partial document with a search condition on CDATA (single condition)
Search condition: /発注書/発注企業/社名="デジタル商店" (/purchase order/ordering company/company name = "Digital Store")
Output condition: /発注書/発注企業 (/purchase order/ordering company)
Meaning: XML documents whose company name CDATA is "Digital Store" are output with [T: ordering company] as the root tag.

Procedure:
(a) Load the document set well-formed schema (see FIG. 8) into memory and create a mapping of it.
(b) Search the document set well-formed schema and obtain the EID set of the node /Purchase order/Ordering company/Company name. If the node does not exist, the EID set is {} (the empty set) and no document matches.
(c) Here the EID set of the node is {1}, so EID = 1 is taken from the EID set and the nodes in the EID = 1 area of the CRX are searched using "Digital store" as the search key. If a matching node exists, its pointer to the document IDs is followed to obtain an address in the document ID list file, and the document ID list is read from that file (see the EID = 1 area in FIG. 17). If no matching node exists, the document ID list is the empty set. When the EID set has two or more elements, a document ID list is obtained for each element in the same way, and the union of all document ID lists gives the final DID set.
(d) Initialize the mapping of the document set well-formed schema to corresponding mark = not applicable and NID = no value.
(e) For each DID in the DID set in turn, add the corresponding marks and NID values to the mapping of the document set well-formed schema. This processing is the same as steps (b) to (e) of (1) above.
(f) Position at the node /Purchase order/Ordering company in the mapping of the document set well-formed schema to which the corresponding marks and NID values have been added. Trace the nodes below it in order, and if a node carries the corresponding mark, output data according to its hierarchy level and node type: a tag for T, "attribute name = attribute value" for ATTR, and character data for CDATA. For ATTR and CDATA, the original data value is obtained from the NID in the EID area of the CRX and output.
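The document ID lookup in steps (b) and (c) can be sketched as follows. The function name `find_dids` and the three dictionaries and lists are assumptions made for illustration: `schema_eids` stands in for the schema search of step (b), `crx` maps each EID area to its values and their pointers, and `doc_id_lists` stands in for the document ID list file addressed by those pointers.

```python
from typing import Dict, List, Set


def find_dids(
    path: str,
    value: str,
    schema_eids: Dict[str, Set[int]],
    crx: Dict[int, Dict[str, int]],
    doc_id_lists: List[List[int]],
) -> Set[int]:
    """Return the DID set for the condition `path = value`."""
    dids: Set[int] = set()
    # Step (b): an unknown path yields an empty EID set, hence no documents.
    for eid in schema_eids.get(path, set()):
        # Step (c): search the EID area of the CRX with the value as key.
        pointer = crx.get(eid, {}).get(value)
        if pointer is not None:
            # Follow the pointer into the document ID list file and
            # take the union over all EIDs of the path.
            dids.update(doc_id_lists[pointer])
    return dids


# Hypothetical contents mirroring the single-condition example.
schema_eids = {"/PurchaseOrder/OrderingCompany/CompanyName": {1}}
crx = {1: {"Digital store": 0}}
doc_id_lists = [[1, 2]]

matching = find_dids(
    "/PurchaseOrder/OrderingCompany/CompanyName", "Digital store",
    schema_eids, crx, doc_id_lists,
)
```

The search touches only the CRX area and the document ID list, so no document has to be reconstructed before the DID set is known.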

As described above, the XML document can be restored by tracing the mapping of the document set well-formed schema downward from the node /Purchase order/Ordering company. FIG. 16 shows the restored XML document set.
(3) Outputting a partial document with search conditions on CDATA (multiple conditions)
Search condition: /Purchase order/Ordering company/Company name = "Digital store" AND /Purchase order/Order/Order contents/@CLASS = "1"
Output condition: /Purchase order/Order/Order contents
Meaning: XML documents whose company name CDATA is "Digital store" and whose order contents ATTR: CLASS is "1" are output with [T: Order contents] as the root tag.

Procedure:
(a) By the same procedure as in (2), obtain the DID set of the first search condition, S1 = {1, 2} (see the EID = 1 area in FIG. 17).
(b) By the same procedure as in (2), obtain the DID set of the second search condition, S2 = {1, 2} (see the EID = 10 area in FIG. 17).
(c) Take the logical product (intersection) of S1 and S2 to obtain the final DID set = {1, 2}.
(d) When there are two or more search condition expressions, repeat steps (a) to (c) in the same way to obtain the final document identifier DID set.
(e) By the same procedure as in (2), add the corresponding marks and NID values to the mapping of the document set well-formed schema.
(f) Position at the node /Purchase order/Order/Order contents in the mapping of the document set well-formed schema to which the corresponding marks and NID values have been added. Trace the nodes below it in order, and if a node carries the corresponding mark, output data according to its hierarchy level and node type: a tag for T, "attribute name = attribute value" for ATTR, and character data for CDATA. For ATTR and CDATA, the original data value is obtained from the NID in the EID area of the CRX and output.

As described above, the XML document can be restored by tracing the mapping of the document set well-formed schema downward from the node /Purchase order/Order/Order contents. The restored XML document set is shown in FIG. 17.

As logical operators, logical product (AND), logical sum (OR), logical difference (NOT), and parentheses "()" for controlling the precedence of logical operations can be used. As is clear from the above description, however complex the search condition, the DID set can be obtained at high speed by combining the CRX and the document ID list file. Once the DID set has been obtained, the XML documents below any level of the hierarchy can be reproduced by combining the document set well-formed schema, the CRS, and the CRX.
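Combining per-condition DID sets with these operators can be sketched as set algebra. The nested-tuple expression format, the `evaluate` function, and the sample index are assumptions made for this example; the nesting of tuples plays the role of parentheses. NOT is interpreted here as the difference from the set of all registered DIDs, which is one plausible reading of "logical difference".

```python
from typing import Callable, Dict, Set, Tuple


def evaluate(expr: tuple, lookup: Callable[[str, str], Set[int]], all_dids: Set[int]) -> Set[int]:
    """Evaluate a nested search expression into a DID set."""
    op = expr[0]
    if op == "COND":
        _, path, value = expr
        return lookup(path, value)          # DID set of a single condition
    if op == "NOT":
        return all_dids - evaluate(expr[1], lookup, all_dids)
    left = evaluate(expr[1], lookup, all_dids)
    right = evaluate(expr[2], lookup, all_dids)
    if op == "AND":
        return left & right                 # logical product: intersection
    if op == "OR":
        return left | right                 # logical sum: union
    raise ValueError(f"unknown operator: {op}")


# Hypothetical per-condition results for three registered documents.
NAME = "/PurchaseOrder/OrderingCompany/CompanyName"
CLASS = "/PurchaseOrder/Order/OrderContents/@CLASS"
index: Dict[Tuple[str, str], Set[int]] = {
    (NAME, "Digital store"): {1, 2},
    (CLASS, "1"): {2, 3},
}


def lookup(path: str, value: str) -> Set[int]:
    return index.get((path, value), set())


all_dids = {1, 2, 3}
a = ("COND", NAME, "Digital store")
b = ("COND", CLASS, "1")
both = evaluate(("AND", a, b), lookup, all_dids)
either = evaluate(("OR", a, b), lookup, all_dids)
only_a = evaluate(("AND", a, ("NOT", b)), lookup, all_dids)
```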
Next, the processing procedure for searching the XML document database and aggregating the XML documents found is described.

Aggregation obtains its final result in three stages: acquiring the DID set by search, acquiring the set of aggregation key item value tuples and the set of aggregation target item value tuples for the matching documents, and the aggregation processing itself. In the following examples the search procedure is omitted and the description concentrates on the later stages. The symbols used are the same as above.
(1) Aggregating all XML documents
Search condition: none
Aggregation key item: /Purchase order/Ordering company/Company name
Aggregation target item: /Purchase order/Order/Quantity
Output condition: list format
/Purchase order/Ordering company/Company name, /Purchase order/Order/Quantity
Meaning: [T: Quantity] is aggregated using [T: Company name] as the aggregation key, and an XML document is output.

Procedure:
(a) Obtain the document identifier DID set by the same procedure as described above. In this case, DID set = all documents = {1, 2}.
(b) Obtain the EID sets of /Purchase order/Ordering company/Company name and /Purchase order/Order/Quantity. The EID set of /Purchase order/Ordering company/Company name is {1}, and the EID set of /Purchase order/Order/Quantity is {13, 18}.
(c) For each DID in the document identifier DID set, obtain the set of (EID, NID) pairs from the mapping of the document set well-formed schema. When DID = 1, the pair set {(EID, NID)} of /Purchase order/Ordering company/Company name is {(1, 0)} and the pair set {(EID, NID)} of /Purchase order/Order/Quantity is {(13, 0)}. When DID = 2, the pair set {(EID, NID)} of /Purchase order/Ordering company/Company name is {(1, 0)} and the pair set {(EID, NID)} of /Purchase order/Order/Quantity is {(13, 0), (18, 0)}.
(d) The NIDs of the pair set of /Purchase order/Ordering company/Company name are used as the aggregation key, and the pair set {(EID, NID)} of /Purchase order/Order/Quantity is converted to the original data using the CRX and then to binary numeric values. Once the input {aggregation key value tuple, aggregation target item value tuple} for multidimensional aggregation has been obtained, a multidimensional aggregation process based on any algorithm can compute {aggregation key value tuple, sum tuple, data count tuple, maximum tuple, minimum tuple}. For each input {aggregation key value tuple, aggregation target item value tuple}, the multidimensional aggregation process adds each aggregation target item value, replaces the maximum and the minimum where needed, and records the sum, the total number of input data items, the maximum, and the minimum for each aggregation target item, separately for each aggregation key value tuple. When DID = 1, {aggregation key value tuple, aggregation target item value tuple} = {(0), (1)}, which gives {aggregation key value tuple, sum tuple, data count tuple, maximum tuple, minimum tuple} = {(0), (1), (1), (1), (1)}. When DID = 2, {aggregation key value tuple, aggregation target item value tuple} = {(0), ({1, 1})}, which gives {aggregation key value tuple, sum tuple, data count tuple, maximum tuple, minimum tuple} = {(0), (2), (2), (1), (1)}.
(e) From the data recorded by the multidimensional aggregation process, {aggregation key value tuple, sum tuple, data count tuple, maximum tuple, minimum tuple} = {(0), (2), (2), (1), (1)}, the aggregation key value and the sum are taken in order. The aggregation key value, which is still an NID at this point, is restored to the original data using the CRX, the sum is converted to decimal text data, and the result is output as an XML document set.
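Steps (d) and (e) leave the aggregation algorithm open, so the following is only one possible sketch. The `aggregate` function, the row format, and the sample values are assumptions and do not reproduce the figures of the worked example. Keys stay as NID tuples during aggregation and are decoded to original text only at output time, as step (e) describes.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[int, ...]            # tuple of NIDs of the aggregation key items
Values = Tuple[float, ...]       # tuple of numeric aggregation target values


def aggregate(rows: Iterable[Tuple[Key, Values]]) -> Dict[Key, List[dict]]:
    """Record sum, count, max and min per key and per target item."""
    stats: Dict[Key, List[dict]] = {}
    for key, values in rows:
        if key not in stats:
            stats[key] = [{"sum": v, "count": 1, "max": v, "min": v} for v in values]
            continue
        for slot, v in zip(stats[key], values):
            slot["sum"] += v
            slot["count"] += 1
            slot["max"] = max(slot["max"], v)   # replace the maximum if larger
            slot["min"] = min(slot["min"], v)   # replace the minimum if smaller
    return stats


# Hypothetical input: key = (NID of company name,), value = (quantity,).
rows = [((0,), (3,)), ((0,), (5,)), ((1,), (2,))]
stats = aggregate(rows)

# Hypothetical CRX area of the key item: NID -> original company name.
company_names = ["Digital store", "Analog store"]
report = [
    (company_names[key[0]], str(cols[0]["sum"]))   # decode key, sum as decimal text
    for key, cols in sorted(stats.items())
]
```

Aggregating on small integer NIDs rather than on the original strings keeps key comparison cheap; the string lookup happens once per output row.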

As described above, multidimensional aggregation can be executed efficiently with the data of any node specified as the aggregation key and as the aggregation target item. FIG. 18 shows an image of the output XML document set.
(2) Aggregating with a search condition on CDATA
Search condition: /Purchase order/Order/Order contents/@CLASS = "1"
Aggregation key item: /Purchase order/Ordering company/Company name
Aggregation target item: /Purchase order/Order/Quantity
Output condition: list format
/Purchase order/Ordering company/Company name, /Purchase order/Order/Quantity
Meaning: For the DID set that matches the search condition ATTR: CLASS = "2" of T: Order contents, T: Quantity is aggregated using T: Company name as the aggregation key, and an XML document is output.

Procedure:
(a) Obtain the DID set by the same procedure as described above. The search procedure is exactly the same as before. In this case, DID set = {2}.
(b) The subsequent processing is exactly the same as in (1).
FIG. 19 shows an image of the output XML document set. The same processing applies when the aggregation key tuple has two or more elements and when the aggregation target item tuple has two or more elements.
Next, the procedure for changing the XML document structure is described.

As is clear from the above, the XML document structure is merged into and mapped onto the document set well-formed schema structure. For the element values decomposed element by element, the EID and NID are stored in the CRS and the original data value is stored in the CRX, and the whole is indexed by document ID.
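The storage layout summarized here can be sketched as a small dictionary-encoding store. The `Store` class and its field names are assumptions for illustration, and each structure is simplified to an in-memory dictionary: `schema` stands for the document set well-formed schema, `crx` for the compression result index, `crs` for the compression result set, and `doc_id_lists` for the document ID list file.

```python
from typing import Dict, Iterable, List, Tuple


class Store:
    def __init__(self) -> None:
        self.schema: Dict[str, int] = {}                 # path -> EID
        self.crx: Dict[int, Dict[str, int]] = {}         # EID -> {value: NID}
        self.crs: Dict[int, List[Tuple[int, int]]] = {}  # DID -> [(EID, NID)]
        self.doc_id_lists: Dict[Tuple[int, int], List[int]] = {}

    def register(self, did: int, items: Iterable[Tuple[str, str]]) -> None:
        """Register one document given as (path, value) items."""
        pairs = []
        for path, value in items:
            # A path seen before keeps its EID; a new one gets the next integer.
            eid = self.schema.setdefault(path, len(self.schema))
            # Likewise for values within the area of that EID.
            values = self.crx.setdefault(eid, {})
            nid = values.setdefault(value, len(values))
            pairs.append((eid, nid))
            dids = self.doc_id_lists.setdefault((eid, nid), [])
            if did not in dids:
                dids.append(did)
        self.crs[did] = sorted(pairs)                    # ascending EID order


NAME = "/PurchaseOrder/OrderingCompany/CompanyName"
QTY = "/PurchaseOrder/Order/Quantity"
store = Store()
store.register(1, [(NAME, "Digital store"), (QTY, "1")])
store.register(2, [(NAME, "Digital store"), (QTY, "2")])
```

After both registrations the string "Digital store" is stored once, and each document is reduced to a short list of integer pairs.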

The actual data is stored and managed by the compression result set CRS and the compression result index CRX. This data carries no information on the hierarchical structure of the XML documents; that information is held exclusively by the document set well-formed schema structure.

With such a structure, a change to the XML document structure of the XML document set stored in the XML document database is completed by changing the document set well-formed schema structure. Changing the structure of the XML document set stored in the XML-DB is therefore extremely easy.
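As a rough illustration of why such a change is cheap, consider renaming a tag. The sketch below assumes, for the example only, that the schema can be represented as a mapping from path to EID and that the stored data is a list of (EID, NID) pairs per document; `rename_subtree` is a hypothetical helper, and other kinds of structural change are not covered.

```python
from typing import Dict, List, Tuple


def rename_subtree(schema: Dict[str, int], old_prefix: str, new_prefix: str) -> Dict[str, int]:
    """Return a schema in which `old_prefix` and everything below it is renamed."""
    renamed = {}
    for path, eid in schema.items():
        if path == old_prefix or path.startswith(old_prefix + "/"):
            path = new_prefix + path[len(old_prefix):]
        renamed[path] = eid            # the EID itself never changes
    return renamed


schema = {
    "/PurchaseOrder/OrderingCompany/CompanyName": 1,
    "/PurchaseOrder/Order/Quantity": 13,
}
crs: Dict[int, List[Tuple[int, int]]] = {1: [(1, 0), (13, 0)]}
crs_before = {did: list(pairs) for did, pairs in crs.items()}

new_schema = rename_subtree(schema, "/PurchaseOrder/OrderingCompany", "/PurchaseOrder/Buyer")
```

Because the stored pairs refer to EIDs rather than to tag names, `crs` is identical before and after the rename; only the schema was touched.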

As described above, when handling documents whose structure is variable and has many uncertain elements, such as XML documents, the present invention makes it possible to manage the XML documents without changing the storage structure of the actual data even when the document structure changes, and to obtain an XML database that supports high-speed search and complex statistics and analysis.

A system diagram showing, in blocks, a specific example of the method of the present invention.
A diagram of a purchase order showing an example of an XML document set on which the method of the present invention can be carried out.
A diagram of a purchase order showing an example of an XML document set on which the method of the present invention can be carried out.
A conceptual diagram showing an example of the document set DOM generated from the document set of FIG. 2.
A conceptual diagram showing an example of the document set DOM generated from the document set of FIG. 2.
A conceptual diagram of the unit document DOM for document 1 cut out from the document set DOM shown in FIG. 3.
A conceptual diagram of the unit document DOM for document 2 cut out from the document set DOM shown in FIG. 3.
A conceptual diagram of the unit document well-formed schema for document 1 generated from the document set DOM shown in FIG. 4a.
A conceptual diagram of the unit document well-formed schema for document 2 generated from the document set DOM shown in FIG. 4b.
A conceptual diagram of the document set well-formed schema with the unit well-formed schema of document 1 merged.
A conceptual diagram of the document set well-formed schema with the unit well-formed schema of document 2 merged.
Conceptual diagrams of the compression result index CRX after registration of document 1, one diagram for each of the areas of element identifier EID = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13.
Conceptual diagrams of the compression result index CRX after registration of document 2, one diagram for each of the areas of element identifier EID = 0, 8, 10, 11, 12, 14, 15, 16, 17, and 18.
A conceptual diagram showing the pairs of element identifier EID and node identifier NID after registration of document 1 for each element identifier EID area.
A conceptual diagram showing the pairs of element identifier EID and node identifier NID after registration of document 2 for each element identifier EID area.
A conceptual diagram showing the pairs of element identifier EID and node identifier NID after registration of document 2 for each element identifier EID area.
A conceptual diagram showing the compression result set CRS after registration of document 1.
A conceptual diagram showing the compression result set CRS after registration of document 1.
Conceptual diagrams showing the compression result index CRX and the document identifier list file after registration of document 2, one diagram for each of the areas of element identifier EID = 1, 8, 11, 15, and 16.
A diagram showing the state in which the structure of the XML document has been reconstructed on the mapping of the document set well-formed schema for document 1.
A diagram showing the state in which the structure of the XML document has been reconstructed on the mapping of the document set well-formed schema for document 2.
A diagram showing the reproduced XML document set.
A diagram showing the reproduced XML document set.
A diagram showing a partially restored XML document.
A diagram showing another example of a partially restored XML document.
A diagram showing an example of a document output with items specified.
A diagram showing an example of a document output with conditions specified.

Claims

A method for providing, on a computer, a data structure for compressing a plurality of document sets, each represented by a plurality of types of structured document structures composed of tag nodes (T), attribute nodes (ATTR) and/or text nodes (CDATA), and for searching them at high speed, the method comprising:
using, as the computer, a computer comprising an input unit that inputs documents, a conversion unit that converts the input documents, and a storage unit that stores the input documents or the documents converted by the conversion unit;
performing, in the conversion unit, a series of processes of
assigning element identifier nodes for identifying, as elements, the attribute nodes (ATTR) or text nodes (CDATA) belonging to each tag node (T) constituting the plurality of types of document structures that make up the document set,
assigning element IDs to the element identifier nodes as sequential integers such that an element identifier node that has already appeared is assigned the same ID,
assigning actual data node IDs, as sequential integers, to the actual data stored in the attribute node or text node corresponding to each element identifier node to which an element ID has been assigned, such that an actual data value that has already appeared is assigned the same ID,
forming a list of numerical data pairs stored in an array structure whose elements are pairs of an element ID and the corresponding actual data node ID, and sorting the list in ascending order of element ID,
assigning document IDs to the plurality of documents and, for document-by-document search without a search condition and for aggregation processing, managing a document ID management file having a pointer to the list for each document ID, and
forming, for each pair of element ID and actual data node ID in the plurality of documents, for each element ID and for each item of actual data, the document IDs of the documents belonging to the actual data node ID as a document ID list file;
storing the data structure obtained as a result in the storage unit of the computer for searching;
when the document set is searched with a search condition, giving a pair of an attribute node or text node and actual data as a search item;
determining whether the actual data given as the search item matches by following the item values belonging to the set of actual data node IDs belonging to the element ID that corresponds to the attribute node or text node given as the search item, and, on a match, extracting the document IDs of the documents corresponding to the actual data node ID; and
when searching the documents corresponding to a document ID without a search condition, and when aggregating, performing the processing by following the document ID management file.