JP5506629B2

JP5506629B2 - Quasi-frequent structure pattern mining apparatus, frequent structure pattern mining apparatus, method and program thereof

Info

Publication number: JP5506629B2
Application number: JP2010234233A
Authority: JP
Inventors: 潤鈴木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-10-19
Filing date: 2010-10-19
Publication date: 2014-05-28
Anticipated expiration: 2030-10-19
Also published as: JP2012088880A

Description

The present invention belongs to the field of data mining, a technique for extracting useful rules from a set of data, mainly a large set of data (a database). In particular, it relates to a quasi-frequent structure pattern mining apparatus and a frequent structure pattern mining apparatus with increased processing speed and reduced memory requirements, and to methods and programs therefor.

Some of the real data handled on a computer is expressed as discrete structure data, that is, as labeled graphs, at the stage where it is converted into a form that a computer can handle easily. FIG. 6 shows examples of discrete structure data that can be described by such labeled graphs: for example, the grammatical and semantic structure of text (FIGS. 6(a) and 6(b)), the structure of a protein relative to a gene sequence, and a social network describing relationships between people and things (FIG. 6(d)).

Data mining extracts meaningful, useful rules from a discrete structure database, that is, a collection of such discrete structure data. A pattern that appears repeatedly in the database is likely to carry statistically useful information. Here, therefore, patterns that appear repeatedly in the data are taken to be the meaningful, useful rules.

FIG. 7 shows a typical example of structure pattern mining: frequent patterns extracted from a product purchase history database, using an appearance frequency of two as the threshold. Techniques that extract useful information from large amounts of data in this way are highly valuable as a foundation for many kinds of information processing. In particular, the larger the data, the harder manual extraction becomes, so techniques that extract useful information automatically by computer become more useful.
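The idea of extracting patterns that reach a frequency threshold can be illustrated with a deliberately simplified sketch. It treats each purchase record as a set of items and counts item combinations, which is much simpler than mining labeled graphs. The data, the function name `frequent_itemsets`, the `max_size` limit, and the choice to count a pattern once per record and keep it when its count is at least the threshold are all assumptions made for illustration; they are not taken from FIG. 7.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, threshold, max_size=2):
    """Count item combinations (up to max_size items) and keep the frequent ones.

    Each combination is counted at most once per transaction.
    Assumption: "frequent" means count >= threshold.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {pattern: c for pattern, c in counts.items() if c >= threshold}


# Hypothetical purchase histories, one list per customer visit.
purchases = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["eggs", "tea"],
]
frequent = frequent_itemsets(purchases, threshold=2)
```

With this toy input, the single items bread, milk, and eggs and the pair (bread, milk) each occur twice and are kept; everything else occurs once and is dropped.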

Structure pattern mining by computer generally presupposes extracting frequent structure patterns from large-scale data. As noted above, manual extraction can be relatively easy when the data is small; automatic extraction is meaningful in situations where manual work would be prohibitive in cost or time.

Known structure pattern mining methods with reasonably good scalability include gSpan (Non-Patent Document 1), FREQT (Non-Patent Document 2), and PrefixSpan (Non-Patent Document 3).

As a general matter in information processing, methodologies that greatly reduce processing time by distributed parallel processing on multiple processors have long existed. In recent years, a general-purpose computational model for distributed parallel processing that is widely applicable in information processing (Non-Patent Document 4) has been proposed and is used for real problems such as information retrieval.

Non-Patent Document 1: Xifeng Yan and Jiawei Han. 2002. gSpan: Graph-Based Substructure Pattern Mining. Proc. of ICDM, pp. 721-724.
Non-Patent Document 2: Mohammed J. Zaki. 2002. Efficiently Mining Frequent Trees in a Forest. Proc. of SIG-KDD.
Non-Patent Document 3: Jian Pei, Jiawei Han, Behzad M. Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei C. Hsu. 2001. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. Proc. of ICDE.
Non-Patent Document 4: J. Dean and S. Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107-113.

Conventional structure pattern mining methods, such as those of Non-Patent Documents 1, 2, and 3, assume execution on a single processor and consist of sequentially executed recursive processing. Because of this structure, no method suited to distributed parallel processing existed for conventional structure pattern mining apparatuses. Conventional sequential processing struggles to keep up with the recent growth in data scale, so a method that greatly reduces processing time through a different methodology, such as distributed parallel processing, is needed.

The present invention has been made in view of this problem, and its object is to provide a quasi-frequent structure pattern mining apparatus, a frequent structure pattern mining apparatus, methods therefor, and programs that are suited to parallel processing on multiple processors.

The quasi-frequent structure pattern mining apparatus of the present invention comprises a data preprocessing unit, a local frequent pattern extraction unit, a local frequent pattern totaling unit, and a frequent pattern output unit. The data preprocessing unit takes a discrete structure data set D as input and generates partial discrete structure data sets D_i, obtained by dividing D into a predetermined first division number P_1 of parts, and local thresholds T_i, obtained by dividing a preset threshold T into the same P_1 parts. The local frequent pattern extraction unit extracts, by distributed parallel processing, the local partial structure patterns that appear in each partial discrete structure data set D_i with a frequency exceeding the local threshold T_i, and distributes those patterns by hashing into a second division number Q_1 of subsets for output. The local frequent pattern totaling unit identifies the partial structure patterns common to all of the subsets distributed into the Q_1 groups as frequent patterns and generates a frequent pattern set F_q; it identifies the remaining partial structure patterns as incomplete patterns and generates an incomplete pattern set S_q. The frequent pattern output unit outputs the frequent pattern set F_q to the outside.

In addition to the quasi-frequent structure pattern mining apparatus described above, the frequent structure pattern mining apparatus of the present invention comprises an incomplete pattern re-extraction unit, an incomplete pattern recounting unit, and a frequent pattern output unit. The discrete structure data set D is divided into a predetermined third division number P_2 of partial discrete structure data sets D_k'. The incomplete pattern re-extraction unit re-extracts, by distributed parallel processing over the D_k', the partial structure patterns of the incomplete pattern set S_q that appear in each D_k', together with their frequency information, and distributes those patterns by hashing into a fourth division number Q_2 of sets for output. The incomplete pattern recounting unit acquires the sets distributed into the Q_2 groups, totals the partial structure patterns contained in each acquired set S_m = {S_{1,m}, …, S_{P_2,m}} and their appearance frequencies, and takes the partial structure patterns whose appearance frequency exceeds the threshold T as a new frequent pattern set F_m'. The frequent pattern output unit outputs, as the frequent pattern set, the combination of the frequent pattern set F_q output by the quasi-frequent structure pattern mining apparatus and the frequent pattern set F_m'.
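The recounting pass for incomplete patterns can be sketched as follows. This is a minimal, sequential illustration under stated assumptions: records and patterns are plain strings, "a pattern appears in a record" is modelled as substring containment, and the names `recount_incomplete` and `contains` are hypothetical. A real implementation would test subgraph containment and would run each partition on its own processor.

```python
from collections import Counter


def recount_incomplete(partitions, candidates, threshold,
                       contains=lambda record, pattern: pattern in record):
    """Count each candidate pattern over all partitions and keep those
    whose total frequency exceeds the global threshold T."""
    totals = Counter()
    for part in partitions:          # one partition D_k' per worker in practice
        for record in part:
            for pattern in candidates:
                if contains(record, pattern):
                    totals[pattern] += 1
    return {p: c for p, c in totals.items() if c > threshold}


# Hypothetical data: two partitions of string records, three candidate patterns.
parts = [["ABCD", "XBC"], ["ABC", "BCD"]]
new_frequent = recount_incomplete(parts, ["ABC", "BC", "XY"], threshold=2)
```

Here "BC" occurs in four records and is kept, "ABC" occurs in exactly two and does not exceed the threshold of 2, and "XY" never occurs.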

The quasi-frequent structure pattern mining apparatus of the present invention divides the discrete structure data set to be processed into parts of roughly equal fixed size and divides the threshold for extracting frequent patterns in the same way. It then outputs, as frequent patterns, the partial structure patterns common to all of the resulting subsets. Processing can therefore be shared among multiple processors in units of the subsets into which the data set was divided, so data mining can be performed efficiently. The memory required per processor can also be reduced.

The frequent structure pattern mining apparatus of the present invention re-extracts, for each partial discrete structure data set, the partial structure patterns identified as incomplete patterns by the quasi-frequent structure pattern mining apparatus, so frequent patterns can be extracted with neither omissions nor excess.

FIG. 1 is a diagram showing an example of the functional configuration of the quasi-frequent structure pattern mining apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the quasi-frequent structure pattern mining apparatus 100.
FIG. 3 is a diagram showing an example of the functional configuration of the frequent structure pattern mining apparatus 200 of the present invention.
FIG. 4 is a diagram showing the operation flow of the frequent structure pattern mining apparatus 200.
FIG. 5 is a diagram showing the result of comparing the data mining execution time of the frequent structure pattern mining method of the present invention with that of a conventional method.
FIG. 6 is a diagram showing examples of discrete structure data that can be described by labeled graphs.
FIG. 7 is a diagram showing a typical example of conventional structure pattern mining.
FIG. 8 is a diagram showing a typical conventional Map-Reduce computation model.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated. Before the embodiments are described, the idea behind the invention is explained.

[Concept of the invention]
The present invention proposes a structure pattern mining method that operates within the framework of Map-Reduce, a distributed parallel computation model, and uses distributed parallel computation to mine structure patterns efficiently from large-scale data.

FIG. 8 shows a typical conventional Map-Reduce model, using a document as the input data. Each square in the input data represents one sentence. In distributed parallel computation under the Map-Reduce model, the input data is divided almost evenly, which equalizes the amount of computation for each part and keeps the maximum amount of memory needed to process each part roughly constant.
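The Map-Reduce model described here can be sketched in a few lines as a single-process toy. The example counts words in a document split into sentences. All function names (`split_evenly`, `map_phase`, `shuffle`, `reduce_phase`) are illustrative assumptions, not part of any framework's API, and a real system would run the map and reduce calls on separate processors.

```python
from collections import defaultdict


def split_evenly(items, n):
    """Divide the input into n parts of nearly equal size."""
    return [items[i::n] for i in range(n)]


def map_phase(sentences):
    """Map: emit a (key, value) pair for every word in one split."""
    return [(word, 1) for s in sentences for word in s.split()]


def shuffle(pairs_per_split):
    """Group the values of all splits by key."""
    grouped = defaultdict(list)
    for pairs in pairs_per_split:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


document = ["a b", "b c", "a a", "c"]          # one string per sentence
splits = split_evenly(document, 2)
word_counts = reduce_phase(shuffle(map_phase(s) for s in splits))
```

Because every split has about the same number of sentences, each map call does a comparable amount of work and needs a comparable amount of memory.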

However, if the input data is simply divided and processed within the Map-Reduce framework, it becomes difficult to collect the total frequency of each pattern over the entire data set. The present invention therefore performs two operations when dividing and processing the input data: it divides the preset threshold, and it retains the patterns for which frequency information is still insufficient to decide whether they are frequent (the incomplete pattern set). With these two operations, the frequent patterns and their frequency information can be obtained accurately even in a distributed computing environment based on the Map-Reduce computation model.

FIG. 1 shows an example of the functional configuration of the quasi-frequent structure pattern mining apparatus 100 of the present invention. Here, the quasi-frequent structure patterns are the set of patterns consisting of the frequent structure patterns output by the apparatus together with the incomplete patterns. FIG. 2 shows the operation flow. The quasi-frequent structure pattern mining apparatus 100 comprises a data preprocessing unit 10, a local frequent pattern extraction unit 20, and a local frequent pattern totaling unit 30. The local frequent pattern extraction unit 20 consists of P_1 local frequent pattern extraction units 20_1 to 20_{P_1} and performs distributed parallel processing. The function of each unit is realized by loading a predetermined program into a computer comprising, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The data preprocessing unit 10 takes the discrete structure data set D as input and generates the partial discrete structure data sets D_i, obtained by dividing D into the predetermined first division number P_1 of parts, and the local thresholds T_i, obtained by dividing the preset threshold T into the same P_1 parts (step S10).

The local frequent pattern extraction unit 20 extracts, by distributed parallel processing, the local partial structure patterns that appear in each partial discrete structure data set D_i with a frequency exceeding the local threshold T_i, and distributes those patterns by hashing into the second division number Q_1 of subsets L_{i,1}, …, L_{i,Q_1} for output (step S20).

The local frequent pattern totaling unit 30 identifies the partial structure patterns common to all of the subsets L_{i,1}, …, L_{i,Q_1} distributed into the Q_1 groups as frequent patterns and generates the frequent pattern set F_q; it identifies the remaining partial structure patterns as incomplete patterns and generates the incomplete pattern set S_q (step S30).

The frequent pattern output unit 40 outputs the frequent pattern set F_q to the outside (step S40).

The quasi-frequent structure pattern mining apparatus 100 divides the discrete structure data set to be processed into parts of roughly equal fixed size and divides the threshold for extracting frequent patterns in the same way. Processing can therefore be shared among multiple processors in units of the partial discrete structure data sets D_i into which D was divided, so data mining can be performed at high speed. The memory required per processor can also be reduced.

The operation of each functional unit is described in more detail below. First, the target of processing and the input data of the quasi-frequent structure pattern mining apparatus 100 are described.

The quasi-frequent structure pattern mining apparatus 100 enumerates the frequently occurring partial structure patterns (frequent patterns) in a discrete structure database, which is a set of discrete structure data described as graphs. Specifically, as shown in FIG. 6, it extracts, for example, frequently occurring purchase patterns from a product purchase history database as frequent patterns.

All input data has been converted into labeled graphs. Let D denote the set of converted data, consisting of N graphs. Let T denote the frequency threshold for a frequent pattern, which is assumed to be set in advance. The threshold T is a positive integer, 1 or greater.

The first data division number P_1 and the third data division number P_2, by which the discrete structure data set D is divided into partial discrete structure data sets, are assumed to be set in advance. So are the second pattern division number Q_1 and the fourth pattern division number Q_2, which determine into how many groups the extracted local partial structure patterns (those appearing in a partial discrete structure data set D_i with a frequency exceeding the local threshold T_i) are distributed by hashing. P_1, P_2, Q_1, and Q_2 are the values that determine the degree of parallelism of the distributed parallel processing. They may all be the same value or may differ.

Because the number of data items is the upper limit on the number of parts, the data division numbers necessarily satisfy 1 < P_1, P_2 ≤ N. P_1 = 1 or P_2 = 1 would mean that no parallel processing is performed, which is outside the scope of the invention. P_1 and P_2 are generally set to values that divide the N data items into parts of moderate size. The upper limits of the pattern division numbers Q_1 and Q_2, on the other hand, cannot be known in advance, so suitable values are simply chosen.

[Data preprocessing unit]
The data preprocessing unit 10 divides the input discrete structure data set D according to the first data division number P_1. The i-th resulting partial discrete structure data set is written D_i. When the graphs have roughly the same number of nodes and edges, D is divided so that each part contains the same number of data items. Conversely, when the numbers of nodes and edges differ considerably from graph to graph, dividing D so that the numbers of nodes and edges are even across the parts usually distributes resources better. Basically, D may be divided by either of these methods.
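The two division strategies can be sketched as follows. This is only an illustration under assumptions: each graph is represented as a list of edges, "size" is taken to be the edge count, and the balanced split uses a simple greedy heuristic (largest graph first, always into the currently lightest part). The text does not prescribe a particular balancing algorithm, and the names `split_by_count` and `split_by_size` are hypothetical.

```python
import heapq


def split_by_count(graphs, p):
    """Divide so that every part holds (nearly) the same number of graphs."""
    return [graphs[i::p] for i in range(p)]


def split_by_size(graphs, p, size=len):
    """Divide so that the total size (here: edge count) per part is balanced.

    Greedy heuristic: place the largest remaining graph into the part
    with the smallest load so far.
    """
    heap = [(0, i) for i in range(p)]          # (current load, part index)
    parts = [[] for _ in range(p)]
    for g in sorted(graphs, key=size, reverse=True):
        load, i = heapq.heappop(heap)
        parts[i].append(g)
        heapq.heappush(heap, (load + size(g), i))
    return parts


# Hypothetical graphs given as edge lists with 5, 4, 3, 3, 2 and 1 edges.
graphs = [[(k, k + 1) for k in range(n)] for n in (5, 4, 3, 3, 2, 1)]
by_count = split_by_count(graphs, 2)
by_size = split_by_size(graphs, 2)
loads = [sum(len(g) for g in part) for part in by_size]
```

For this input the size-balanced split gives both parts a total of 9 edges, whereas the count-based split gives each part three graphs regardless of their sizes.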

Next, the preset threshold T is divided into P_1 local thresholds T_i, one for each part. The local threshold T_i is the threshold used to extract local frequent patterns in the partial discrete structure data set D_i. The i-th local threshold is written T_i, and its value is a positive integer.

The threshold T and the local thresholds T_i satisfy the relationship of Equation (1).

$$\sum_{i=1}^{P_1} T_i \le T \tag{1}$$

That is, the P_1 local thresholds T_i are set so that the sum of the local thresholds T_i (i = 1, …, P_1) does not exceed the threshold T. Specifically, when the partial discrete structure data sets D_i contain unequal numbers of samples, the simplest approach is to set T_i based on the number of data items |D_i| in each part, as shown in Equation (2). Here |D| denotes the number of data items in the discrete structure data set D (|D| = N).

Ideally, however, the left-hand and right-hand sides should be as close in value as possible; that is, equality should hold. Therefore, given Equation (3),

an extension that applies Equation (4) for i from 1 to r is also conceivable.

In other words, the local threshold T_i is set so as not to exceed the following value: divide the threshold T by the number of data items |D| in the discrete structure data set D, multiply by the number of data items |D_i| in the partial discrete structure data set D_i, truncate the fractional part, and add 1.
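One way to read this rule is sketched below. The equations themselves are not reproduced in this text, so the sketch is an interpretation of the surrounding prose, not the patent's formulas: each part first receives the truncated proportional share floor(T · |D_i| / |D|), and the remainder r = T minus the sum of those shares is handed out one unit at a time to the first r parts, so that the local thresholds add up to exactly T. The function name `local_thresholds` and the value T = 50 are assumptions.

```python
def local_thresholds(total_threshold, part_sizes):
    """Split a global threshold T into local thresholds T_i.

    base_i = floor(T * |D_i| / |D|); the first r parts get base_i + 1,
    where r = T - sum(base_i), so that sum(T_i) == T.
    """
    n = sum(part_sizes)                                   # |D|
    base = [total_threshold * size // n for size in part_sizes]
    r = total_threshold - sum(base)                       # 0 <= r < number of parts
    return [b + 1 if i < r else b for i, b in enumerate(base)]


# Hypothetical global threshold T = 50 over parts of 13, 13, 13, 13, 12, 12, 12, 12 items.
sizes = [13] * 4 + [12] * 4
thresholds = local_thresholds(50, sizes)
```

Every T_i produced this way is at most the truncated share plus 1, and the sum never exceeds T. Note that the truncated share can be 0 when T is small relative to the number of parts, whereas the text requires each T_i to be a positive integer; a caller would need to handle that case separately.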

For example, let the number of samples be N = 100 and the data division number be P_1 = 8. If the data is divided so that |D_i| = 13 for i = 1 to 4 and |D_i| = 12 for i = 5 to 8, Equation (5) holds.
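The part sizes in this example follow from an ordinary even split with the remainder spread over the first parts. The short sketch below reproduces them; the function name `balanced_part_sizes` is an assumption made for illustration.

```python
def balanced_part_sizes(n, p):
    """Sizes of p parts that differ by at most one and sum to n."""
    quotient, remainder = divmod(n, p)
    return [quotient + 1 if i < remainder else quotient for i in range(p)]


part_sizes = balanced_part_sizes(100, 8)
```

Since 100 = 8 × 12 + 4, the first four parts hold 13 items and the remaining four hold 12, for a total of 4 × 13 + 4 × 12 = 100.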

The above processing generates P_1 pairs of a partial discrete structure data set D_i and a local threshold T_i, {(D_1, T_1), …, (D_{P_1}, T_{P_1})}. When P_1 ≠ P_2, the partial discrete structure data sets D_k' (k = 1, …, P_2), obtained by dividing the discrete structure data set D into P_2 parts in the same way, may be generated at the same time.

[Local frequent pattern extraction unit]
The local frequent pattern extraction unit 20 comprises P_1 local frequent pattern extraction units 20_1 to 20_{P_1}, each of which receives one of the P_1 pairs {(D_1, T_1), …, (D_{P_1}, T_{P_1})} of a partial discrete structure data set D_i and a local threshold T_i generated by the data preprocessing unit 10.

The i-th local frequent pattern extraction unit 20_i acquires the pair (D_i, T_i) of a partial discrete structure data set and its local threshold and obtains the local frequent patterns using a well-known structure pattern mining method. Here, "extracting local frequent patterns" means extracting the partial structure patterns whose frequency within the partial discrete structure data set D_i exceeds the local threshold T_i.

Because extracting these partial structure patterns requires no information to be shared among the local frequent pattern extraction units 20_i, the processing can be fully distributed and parallelized in units of the partial discrete structure data sets D_i. This processing corresponds to the Map processing of the Map-Reduce computation model (the Map function in FIG. 8).

Conventional structure pattern mining techniques designed for a single computer can be applied to the extraction of local frequent patterns. This embodiment, for example, uses gSpan (Non-Patent Document 1). When the graph structures in the database are limited to tree structures, as in dependency parses or XML documents, FREQT (Non-Patent Document 2), an efficient structure pattern mining method for trees, may be used instead.

When the data is limited to sequence structures, as in documents or product purchase histories, PrefixSpan (Non-Patent Document 3), which extracts frequent patterns efficiently by specializing in sequence structures, can also be used. All three of these methods are efficient structure pattern mining methods based on depth-first search. gSpan is essentially an algorithm that can efficiently enumerate graph structures and subsumes the other two frameworks.

Each local frequent pattern extraction unit 20_i outputs, as partial structure patterns, the local frequent patterns that appear in its input partial discrete structure data set D_i with a frequency exceeding the local threshold T_i. In doing so, it uses a hash function to distribute the patterns into Q_1 subsets L_{i,1}, …, L_{i,Q_1}, according to the preset second division number Q_1. The same processing is performed for all i = 1, …, P_1. This distribution follows the MapReduce computation model. The hash function may be any function that returns an integer from 1 to Q_1 for each pattern. With this distribution, identical patterns obtained by different local frequent pattern extraction units are always passed to the same local frequent pattern totaling unit. For example, suppose the same pattern "ABC" is obtained by the i-th local frequent pattern extraction unit 20_i and the j-th local frequent pattern extraction unit 20_j (i ≠ j). If the hash value of ABC is k, the pattern is placed in the subsets L_{i,k} and L_{j,k}, and both are ultimately passed to the same local frequent pattern totaling unit 30_k. Each partial structure pattern is output together with its frequency information.
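The distribution step can be sketched as follows, assuming patterns are serialized as strings. The text allows any hash function that returns an integer from 1 to Q_1; this sketch uses SHA-256 from the standard library because Python's built-in `hash()` for strings varies between processes, which would break the guarantee that all workers agree on a bucket. The names `bucket_of` and `partition_patterns` are hypothetical.

```python
import hashlib


def bucket_of(pattern, q):
    """Map a pattern string to a stable bucket number in 1..q."""
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % q + 1


def partition_patterns(local_patterns, q):
    """Distribute one unit's {pattern: frequency} output into q subsets."""
    buckets = {j: {} for j in range(1, q + 1)}
    for pattern, freq in local_patterns.items():
        buckets[bucket_of(pattern, q)][pattern] = freq
    return buckets


# Hypothetical outputs of two extraction units, distributed with Q_1 = 4.
subsets_i = partition_patterns({"ABC": 3, "AB": 5}, 4)
subsets_j = partition_patterns({"ABC": 2}, 4)
k = bucket_of("ABC", 4)
```

Because the bucket depends only on the pattern itself, "ABC" lands in bucket k for both units, so a single totaling unit sees every local frequency reported for it.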

[Local frequent pattern totaling unit]
The local frequent pattern totaling unit 30 comprises local frequent pattern totaling units 30_j, equal in number to the groups into which the local frequent pattern extraction units 20_i distribute patterns, a frequent pattern set generation unit 31, and an incomplete pattern set generation unit 32. In the example of FIG. 1, only the first local frequent pattern totaling unit 30_1 and the Q_1-th local frequent pattern totaling unit 30_{Q_1} are shown; the second through (Q_1 − 1)-th units are omitted.

The j-th (j = 1, …, Q_1) local frequent pattern totaling unit 30_j acquires the j-th subsets of partial structure patterns L_{1,j}, …, L_{P_1,j} output by the local frequent pattern extraction units 20_i and determines whether each partial structure pattern in the acquired subsets is a frequent pattern. This processing corresponds to the Reduce processing of the Map-Reduce computation model (the reduce function in FIG. 8).

Because the subsets L_{1,j}, …, L_{P₁,j} input to each local frequent pattern totaling unit 30_j are processed completely independently on a per-pattern basis, distributed parallel processing is possible in units of subsets.

The local frequent pattern totaling unit 30_j classifies the partial structure patterns contained in the acquired subsets L_{1,j}, …, L_{P₁,j} into two types: frequent patterns and incomplete patterns. The local frequent pattern totaling unit 30 may also consist of a single local frequent pattern totaling unit 30₁. In that case, the second division number Q₁ is 1 and the local frequent pattern totaling unit 30 does not perform distributed parallel processing, so the configuration of the quasi-frequent structure pattern mining apparatus 100 can be simplified.

[Frequent pattern]
A frequent pattern is an output of the quasi-frequent structure pattern mining apparatus 100. A frequent pattern is a partial structure pattern that exceeded the local threshold T_i (i = 1, …, P₁) in all of the local frequent pattern extraction units 20_i, which process the P₁ data divisions independently. In other words, a partial structure pattern contained in all of the acquired subsets L_{1,j}, …, L_{P₁,j} is determined to be a frequent pattern.

The frequent patterns F_j identified by the local frequent pattern totaling unit 30_j are input to the frequent pattern set generation unit 31, which assembles them into the frequent pattern set F_q.

[Incomplete pattern]
An incomplete pattern is a partial structure pattern that exceeds the local threshold T_i in some of the subsets L_{1,j}, …, L_{P₁,j} but is not contained in all of them.
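The decision between frequent and incomplete patterns in one totaling unit 30_j can be sketched as follows. This is a minimal sketch under stated assumptions: each subset L_{i,j} is modeled as a dict from pattern to local frequency that contains only patterns which exceeded T_i, the name `classify_bucket` is hypothetical, and storing the sum of the observed local frequencies as the "frequency information" is an assumption, since the text does not specify how that information is aggregated.

```python
def classify_bucket(subsets: list) -> tuple:
    """Split the patterns of one bucket j into frequent and incomplete patterns.

    subsets[i] models L_{i+1,j}: {pattern: local frequency} for the patterns
    that exceeded the local threshold in partition i+1.
    """
    frequent, incomplete = {}, {}
    all_patterns = set().union(*(s.keys() for s in subsets))
    for pattern in all_patterns:
        seen = [s[pattern] for s in subsets if pattern in s]
        # Present in every subset -> frequent; present in only some -> incomplete.
        target = frequent if len(seen) == len(subsets) else incomplete
        # Assumption: keep the sum of the local frequencies observed so far.
        target[pattern] = sum(seen)
    return frequent, incomplete


# Hypothetical bucket j collected from P1 = 3 extraction units.
subsets_j = [{"ABC": 5, "AB": 9}, {"ABC": 4}, {"ABC": 6, "AB": 3}]
frequent, incomplete = classify_bucket(subsets_j)
```

Here "ABC" appears in all three subsets and is classified as frequent, while "AB" is missing from the second subset and is therefore incomplete; its true total frequency is unknown until the second partition is examined again.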

The important point here is that a partial structure pattern that failed to exceed the local threshold T_i in every one of the P₁ independently processed local frequent pattern extraction units 20 can never be a frequent pattern. This can be easily proved from equation (1).

Let C_i denote the appearance frequency of a partial structure pattern C in the partial discrete structure data D_i. If C_i did not exceed the local threshold T_i in any of the partial discrete structure data, then C_i < T_i for every i, and equation (6) holds.

Therefore, such a partial structure pattern C never exceeds the threshold T and cannot be a frequent pattern. The P₁ local frequent pattern extraction units 20_i thus exclude only partial structure patterns that can never become frequent patterns.
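The exclusion argument can be checked with a short derivation. This is a sketch under stated assumptions and not necessarily the form of equations (1) and (6): frequencies are integers, D_1, …, D_{P_1} partition D, and each local threshold obeys the bound given in the claims, T_i ≤ ⌊T·|D_i|/|D|⌋ + 1.

```latex
C_i < T_i
\;\Longrightarrow\;
C_i \le T_i - 1 \le \left\lfloor \frac{T\,|D_i|}{|D|} \right\rfloor
\qquad (i = 1, \dots, P_1),
\\[6pt]
\sum_{i=1}^{P_1} C_i
\;\le\; \sum_{i=1}^{P_1} \left\lfloor \frac{T\,|D_i|}{|D|} \right\rfloor
\;\le\; \frac{T}{|D|} \sum_{i=1}^{P_1} |D_i|
\;=\; T .
```

Under these assumptions the total frequency of C over the whole of D is at most T, so C does not exceed the threshold T.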

The partial structure patterns determined to be incomplete patterns are assembled by the incomplete pattern set generation unit 32 into the incomplete pattern set, which is a set of pairs of an incomplete pattern and its frequency information.

[Frequent pattern output unit]
The frequent pattern output unit 40 outputs the frequent pattern set F_q generated by the frequent pattern set generation unit 31 to the outside.

FIG. 3 shows an example of the functional configuration of the frequent structure pattern mining apparatus 200 of the present invention, and FIG. 4 shows its operation flow. The frequent structure pattern mining apparatus 200 comprises the quasi-frequent structure pattern mining apparatus 100 described in Embodiment 1, an incomplete pattern re-extraction unit 50, an incomplete pattern recounting unit 60, and a frequent pattern output unit 40′.

The frequent structure pattern mining apparatus 200 re-extracts, for each partial discrete structure data set, the partial structure patterns that the quasi-frequent structure pattern mining apparatus 100 classified as incomplete patterns, so that the frequent patterns are extracted with neither omission nor excess.

In the following, only the parts that differ from the quasi-frequent structure pattern mining apparatus 100 are described.

[Incomplete pattern re-extraction unit]
The incomplete pattern re-extraction unit 50 comprises P₂ incomplete pattern re-extraction units 50₁ to 50_{P₂}, each of which receives partial discrete structure data D_k′, obtained by dividing the discrete structure data set D into a third data division number P₂ of parts, and the incomplete pattern set S_q = {S₁, …, S_{Q₁}} generated by the incomplete pattern set generation unit 32 of the quasi-frequent structure pattern mining apparatus 100.

P₂ may be equal to P₁. The division of the discrete structure data D into P₂ parts may be performed inside the data preprocessing unit 10 of the quasi-frequent structure pattern mining apparatus 100, or on the frequent structure pattern mining apparatus 200 side.

The k-th (k = 1, …, P₂) incomplete pattern re-extraction unit 50_k receives the k-th partial discrete structure data D_k′ and the incomplete pattern set S_q = {S₁, …, S_{Q₁}} as input, and re-extracts the partial structure patterns in the incomplete pattern set S_q that may become frequent patterns.

Like the local frequent pattern extraction unit 20 described above, the incomplete pattern re-extraction unit 50 can be run as distributed parallel processing. This processing corresponds to the Map step (the map function in FIG. 8) of the MapReduce computation model.

Because the incomplete pattern re-extraction unit 50 performs almost the same processing as the local frequent pattern extraction unit 20 described above, the existing structure pattern mining techniques mentioned above can be used. The difference is that pruning during mining is based on whether a candidate is a partial structure pattern contained in the incomplete pattern set S_q.

The incomplete pattern re-extraction unit 50 outputs, as re-extracted incomplete patterns, the partial structure patterns that appear in the partial discrete structure data set D_k′ and are also contained in the incomplete pattern set S_q, together with the frequency information for those partial structure patterns. The partial structure patterns output from the incomplete pattern re-extraction unit 50 are therefore always a subset of the incomplete patterns.
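The re-extraction for one partition can be sketched as follows. This is a toy model and an assumption, not the structure pattern mining technique itself: a record is a plain string, a pattern is a contiguous substring, frequency is the number of records containing the pattern, and the name `reextract` is hypothetical. A real miner would enumerate subtree or subgraph patterns and use membership in S_q to prune the search rather than testing each candidate directly.

```python
def reextract(partition: list, incomplete_set: set) -> dict:
    """Count, within one partition D_k', only the patterns listed in S_q.

    Toy model (assumption): records are strings, patterns are contiguous
    substrings, and frequency is the number of records containing the pattern.
    """
    counts = {}
    for pattern in incomplete_set:
        freq = sum(1 for record in partition if pattern in record)
        if freq > 0:  # output only patterns that actually appear in D_k'
            counts[pattern] = freq
    return counts


# Hypothetical partition D_k' and incomplete pattern set S_q.
d_k = ["ABCD", "XABY", "BCD"]
s_q = {"AB", "BC", "ZZ"}
reextracted = reextract(d_k, s_q)
```

No local threshold is applied at this stage: every occurrence of a pattern in S_q is counted, so that the later recount sees the complete frequency over all partitions.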

The re-extracted incomplete patterns are distributed by hashing into a preset fourth division number Q₂ of sets S_{k,1}, …, S_{k,Q₂} and output to the incomplete pattern recounting unit 60. This distribution is based on the MapReduce computation model.

[Incomplete pattern recounting unit]
The incomplete pattern recounting unit 60 comprises as many incomplete pattern recounting units 60_m as the number of sets into which the incomplete pattern re-extraction unit 50 distributes its output, and a new frequent pattern set generation unit 61. In the example of FIG. 3, only the first incomplete pattern recounting unit 60₁ and the Q₂-th incomplete pattern recounting unit 60_{Q₂} are shown; the second through (Q₂−1)-th incomplete pattern recounting units are omitted. For convenience of drawing, the labels F₂ and F_{Q₂−1} of the partial structure patterns input from the incomplete pattern recounting units to the new frequent pattern set generation unit 61 are also omitted.

The m-th (m = 1, …, Q₂) incomplete pattern recounting unit 60_m acquires the m-th set of re-extracted partial structure patterns, that is, incomplete patterns, from each of the P₂ incomplete pattern re-extraction units 50. The new frequent pattern set generation unit 61 takes the partial structure patterns contained in the acquired set S_m = {S_{1,m}, …, S_{P₂,m}} whose frequency information exceeds the threshold T and generates from them a new frequent pattern set F_m′.
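The recount for one bucket m can be sketched as follows. This is a minimal sketch, assuming each S_{k,m} is a dict from pattern to the frequency counted in partition k and that the frequency compared against T is the sum over all P₂ partitions, as the wording of the claims ("totals ... their appearance frequencies") suggests; the name `recount` is hypothetical.

```python
from collections import Counter


def recount(sets_m: list, threshold: int) -> dict:
    """Sum frequencies over S_{1,m}..S_{P2,m} and keep patterns exceeding T."""
    totals = Counter()
    for s in sets_m:
        totals.update(s)  # adds the frequencies pattern by pattern
    return {p: f for p, f in totals.items() if f > threshold}


# Hypothetical bucket m collected from P2 = 3 re-extraction units.
s_m = [{"AB": 2, "BC": 2}, {"AB": 4}, {"BC": 1, "AB": 1}]
new_frequent = recount(s_m, threshold=5)
```

In this example "AB" totals 7 and becomes a new frequent pattern, while "BC" totals 3 and is discarded.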

The frequent pattern output unit 40′ outputs, as the final frequent pattern set F, the combination of the frequent pattern set F_q = {F₁, …, F_{Q₁}} output by the quasi-frequent structure pattern mining apparatus 100 and the new frequent pattern set F_m′ = {F₁′, …, F_{Q₂}′} output by the new frequent pattern set generation unit 61. Owing to the way the processing of the present invention is organized, the sets F_q and F_m′ are guaranteed to contain no overlapping patterns.
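The final combination can be sketched as follows. This is an illustrative sketch: both sets are modeled as dicts from pattern to total frequency, the name `merge_frequent` is hypothetical, and the overlap check is an added sanity check that should never trigger, because F_m′ is drawn only from incomplete patterns while F_q contains none of them.

```python
def merge_frequent(f_q: dict, f_m_new: dict) -> dict:
    """Combine F_q and F_m' into the final frequent pattern set F."""
    overlap = f_q.keys() & f_m_new.keys()
    if overlap:
        # The two sets are expected to be disjoint by construction.
        raise ValueError(f"unexpected duplicate patterns: {sorted(overlap)}")
    return {**f_q, **f_m_new}


# Hypothetical outputs of apparatus 100 and of generation unit 61.
final = merge_frequent({"ABC": 15}, {"AB": 7})
```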

[Evaluation results]
To confirm the effectiveness of the frequent structure pattern mining apparatus of the present invention, the execution time of pattern mining on a single processor was compared with the execution time of the method of the present invention. The results are shown in FIG. 5.

The horizontal axis of FIG. 5 shows the number of sentences in millions, and the vertical axis shows the execution time in seconds. The single-processor run (broken line) used one CPU, and the parallel distributed processing of the present invention (solid line) used 200 CPUs of the same type. The CPUs were Intel XEON 2.96 GHz.

For 3 million sentences, pattern mining takes about 100 seconds with the present invention and about 1300 seconds on the single processor, so the present invention requires less than one tenth of the execution time. In addition, because the present invention divides the data into parts of nearly fixed size, the upper bound on the computational cost of processing each part can be kept constant. The computation time therefore depends on the number of data divisions. FIG. 5 shows that the execution time is almost linear in the total amount of data used.

As described above, the frequent structure pattern mining apparatus of the present invention can perform data mining at high speed.

When the processing means of the above apparatuses are implemented by a computer, the processing content of the functions each apparatus should have is described by a program. By executing this program on the computer, the processing means of each apparatus are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

A quasi-frequent structure pattern mining apparatus comprising:
a data preprocessing unit that receives a discrete structure data set D as input and generates partial discrete structure data sets D_i, obtained by dividing the discrete structure data set D into a predetermined first division number P₁ of parts, and local thresholds T_i, obtained by dividing a preset threshold T into the first division number P₁ of parts;
a local frequent pattern extraction unit that, for each of the partial discrete structure data sets D_i, extracts by distributed parallel processing the local partial structure patterns that appear with a frequency exceeding the local threshold T_i, and distributes the partial structure patterns by hashing into a second division number Q₁ of subsets for output;
a local frequent pattern totaling unit that determines the partial structure patterns common to all of the subsets distributed into the second division number Q₁ and output to be frequent patterns and generates a frequent pattern set F_q, and determines the other partial structure patterns to be incomplete patterns and generates an incomplete pattern set S_q; and
a frequent pattern output unit that outputs the frequent pattern set F_q to the outside.

A frequent structure pattern mining apparatus comprising:
the quasi-frequent structure pattern mining apparatus according to claim 1;
an incomplete pattern re-extraction unit that re-extracts, by distributed parallel processing over the number of partial discrete structure data D_k′, the partial structure patterns that appear in the partial discrete structure data sets D_k′, obtained by dividing the discrete structure data set D into a predetermined third division number P₂ of parts, and that are contained in the incomplete pattern set S_q, together with the frequency information for those partial structure patterns, and distributes the partial structure patterns by hashing into a fourth division number Q₂ of sets for output;
an incomplete pattern recounting unit that acquires the sets distributed into the fourth division number Q₂ and output, totals the partial structure patterns contained in the acquired set S_m = {S_{1,m}, …, S_{P₂,m}} and their appearance frequencies, and takes the partial structure patterns whose appearance frequency exceeds the threshold T as a new frequent pattern set F_m′; and
a frequent pattern output unit that outputs, as a frequent pattern set, the combination of the frequent pattern set F_q output by the quasi-frequent structure pattern mining apparatus and the frequent pattern set F_m′.

The quasi-frequent structure pattern mining apparatus or frequent structure pattern mining apparatus according to claim 1 or 2, wherein
the local frequent pattern totaling unit executes its processing in parallel for each of the second division number Q₁ of subsets, and the incomplete pattern recounting unit executes its processing in parallel for each of the fourth division number Q₂ of sets.

The quasi-frequent structure pattern mining apparatus or frequent structure pattern mining apparatus according to any one of claims 1 to 3, wherein
the local threshold T_i is set so as not to exceed the value obtained by dividing the threshold T by the number of data items in the discrete structure data D, multiplying the result by the number of data items in the partial discrete structure data set D_i, truncating the fractional part to obtain an integer, and adding 1 to that integer.

A quasi-frequent structure pattern mining method comprising:
a data preprocessing step of receiving a discrete structure data set D as input and generating partial discrete structure data D_i, obtained by dividing the discrete structure data set D into a predetermined first division number P₁ of parts, and local thresholds T_i, obtained by dividing a preset threshold T into the division number P₁ of parts;
a local frequent pattern extraction step of, for each of the partial discrete structure data sets D_i, extracting by distributed parallel processing the local partial structure patterns that appear with a frequency exceeding the local threshold T_i, and distributing the partial structure patterns by hashing into a second division number Q₁ of subsets for output;
a local frequent pattern totaling step of determining the partial structure patterns common to all of the subsets distributed into the second division number Q₁ and output to be frequent patterns and generating a frequent pattern set F_q, and determining the other partial structure patterns to be incomplete patterns and generating an incomplete pattern set S_q; and
a frequent pattern output step of outputting the frequent pattern set F_q to the outside.

A frequent structure pattern mining method comprising:
the quasi-frequent structure pattern mining method according to claim 5;
an incomplete pattern re-extraction step of re-extracting, by distributed parallel processing over the number of partial discrete structure data D_k′, the partial structure patterns that appear in the partial discrete structure data sets D_k′, obtained by dividing the discrete structure data set D into a predetermined third division number P₂ of parts, and that are contained in the incomplete pattern set S_q, together with the frequency information for those partial structure patterns, and distributing the partial structure patterns by hashing into a fourth division number Q₂ of sets for output;
an incomplete pattern recounting step of acquiring the sets distributed into the fourth division number Q₂ and output, totaling the partial structure patterns contained in the acquired set S_m = {S_{1,m}, …, S_{P₂,m}} and their appearance frequencies, and taking the partial structure patterns whose appearance frequency exceeds the threshold T as a new frequent pattern set F_m′; and
a frequent pattern output step of outputting, as a frequent pattern set, the combination of the frequent pattern set F_q generated by the quasi-frequent structure pattern mining method and the frequent pattern set F_m′.

The quasi-frequent structure pattern mining method or frequent structure pattern mining method according to claim 5 or 6, wherein
the local frequent pattern totaling step executes its processing in parallel for each of the second division number Q₁ of partial structure sets, and the incomplete pattern recounting step executes its processing in parallel for each of the fourth division number Q₂ of sets.

The quasi-frequent structure pattern mining method or frequent structure pattern mining method according to any one of claims 5 to 7, wherein
the local threshold T_i is set so as not to exceed the value obtained by dividing the threshold T by the number of data items in the discrete structure data D, multiplying the result by the number of data items in the partial discrete structure data set D_i, truncating the fractional part to obtain an integer, and adding 1 to that integer.

A program for causing a computer to function as the quasi-frequent structure pattern mining apparatus or frequent structure pattern mining apparatus according to any one of claims 1 to 4.