JP5478526B2

JP5478526B2 - Data analysis and machine learning processing apparatus, method and program

Info

Publication number: JP5478526B2
Application number: JP2011019173A
Authority: JP
Inventors: 佳史福本; 真鬼塚
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-01-31
Filing date: 2011-01-31
Publication date: 2014-04-23
Anticipated expiration: 2031-01-31
Also published as: JP2012160014A

Description

The present invention relates to a data analysis and machine learning processing apparatus, method, and program, and in particular to a data analysis and machine learning processing apparatus, method, and program for improving the efficiency of machine learning processing used in the analysis of large-scale data.

MapReduce is one technique for analyzing large-scale data (see, for example, Non-Patent Document 1).

FIG. 26 shows the flow of MapReduce processing. MapReduce uses multiple computers connected to one another by a network, and is a distributed processing framework in which processing proceeds as shown in FIG. 26.

When MapReduce processing starts, a user-defined Mapper is first launched on each computer. Each computer reads the distributed data stored in advance in the distributed file system (HDFS), and Map processing is performed.

In Map processing, the user-defined Map function is applied once per row, in order from the top, to the distributed data assigned to each computer. As its result, the Map function outputs a set of records in Key-Value format as intermediate data. The Key and Value types are defined freely by the user, subject to certain constraints.

Next, a user-defined Reducer is launched on each computer, and the intermediate data is moved between computers over the network so that records with the same Key are collected on a single computer. This is called Shuffle processing.

Intermediate-data records with the same Key end up sorted, and their Values are passed as an iterator to the user-defined Reduce function. As its result, the Reduce function outputs a set of records in Key-Value format.
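The Map, Shuffle, and Reduce flow described above can be illustrated with a minimal single-process sketch. It only mimics the data flow and is not Hadoop code; `run_mapreduce`, `word_count_map`, and `word_count_reduce` are hypothetical names chosen for this illustration, and word counting is an assumed example workload.

```python
from collections import defaultdict


def run_mapreduce(rows, map_fn, reduce_fn):
    """Run Map, Shuffle, and Reduce in one process (illustration only)."""
    # Map: apply the user-defined function once per input row.
    intermediate = []
    for row in rows:
        intermediate.extend(map_fn(row))

    # Shuffle: collect the values of records that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: keys are visited in sorted order and each key's values
    # are handed to the user-defined function as an iterator.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, iter(groups[key])))
    return output


def word_count_map(row):
    # Emit one (word, 1) record per word in the row.
    for word in row.split():
        yield word, 1


def word_count_reduce(key, values):
    # Sum all counts that were shuffled to this key.
    yield key, sum(values)


result = run_mapreduce(["a b a", "b c"], word_count_map, word_count_reduce)
print(result)
```

In a real cluster the three phases run on different machines and the shuffle moves data over the network; here they are ordinary function calls so that the record flow is easy to follow.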

FIG. 27 shows the processing flow of a general application that uses MapReduce.

Step 100) An initial settings file is input, and a job is generated to start MapReduce.

Step 110) The generated job is given the Mapper and Reducer to use, the number of Reducers, and the other settings needed for the MapReduce processing itself. It is also given the parameters used inside the user-defined Mapper, Reducer, and so on, and parameters that can be changed at run time, for example by parsing command-line arguments.

Step 120) MapReduce processing is performed. Details are described with reference to FIG. 28.

Step 130) User-defined processing is performed on the results output by the MapReduce processing.

Next, the MapReduce processing in step 120 of FIG. 27 is described. MapReduce is a distributed processing framework that runs on a cluster in which multiple nodes (computers) are connected to one another by a network.

FIG. 28 shows the flow of general MapReduce processing.

Step 200) When MapReduce processing starts, a user-defined Mapper is first launched on each node. Each node reads the distributed data stored in advance in the distributed file system, and the Map processing defined by the user is performed. The user-defined Map function is applied once per row, in order from the top, to the distributed data assigned to each node, and intermediate data in Key-Value format is output.

Step 210) Only when the user has explicitly specified a class to use for Combine, a Combiner is launched on each node as soon as its Map processing finishes. Taking the intermediate data output by each Map process as input, Combine processing (a local Reduce) merges intermediate data that share a key, and outputs intermediate data in the form of a Key paired with a list of Values.

Step 220) A user-defined Reducer is launched on each node, and Shuffle processing moves the intermediate data between nodes over the network so that records with the same key are collected on a single node. During the Shuffle, the intermediate data is sorted by Key, and a list of Values is output for each key.

Step 230) Reduce processing is performed on each piece of shuffled intermediate data.
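The effect of the optional Combine step in steps 200 to 230 can be sketched as follows. This is a single-process illustration under assumed data, not Hadoop code: the two-node layout, the word-count workload, and the names `map_words`, `combine`, and `shuffle` are all hypothetical.

```python
from collections import defaultdict


def map_words(row):
    # Step 200: emit one (word, 1) record per word.
    return [(word, 1) for word in row.split()]


def combine(records):
    # Step 210: local Reduce on one node, merging records that share a key.
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return sorted(totals.items())


def shuffle(per_node_records):
    # Step 220: gather the values for each key from every node, sorted by key.
    grouped = defaultdict(list)
    for records in per_node_records:
        for key, value in records:
            grouped[key].append(value)
    return {key: grouped[key] for key in sorted(grouped)}


# Assumed input: two nodes, each holding one row of text.
node_inputs = [["x y x x"], ["y x"]]
mapped = [[rec for row in rows for rec in map_words(row)] for rows in node_inputs]
combined = [combine(records) for records in mapped]

# Number of intermediate records that would cross the network.
records_without_combiner = sum(len(records) for records in mapped)
records_with_combiner = sum(len(records) for records in combined)

# Step 230: Reduce each key's list of values.
reduced = {key: sum(values) for key, values in shuffle(combined).items()}
print(records_without_combiner, records_with_combiner, reduced)
```

Because summation is associative, combining locally before the Shuffle gives the same final counts while moving fewer records. A Combine step is only safe for reductions with that property.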

A known technique takes the same data as input and performs multiple MapReduce processes (Grep processes), attaching tags to the keys of the intermediate data and reducing the intermediate data by integrating jobs in advance and executing multiple jobs simultaneously (see, for example, Non-Patent Document 2).

MRShare: Sharing Across Multiple Queries in MapReduce [Tomasz Nykiel, Michalis Potamias, Chaitanya Mishra, George Kollios, Nick Koudas, VLDB2010, September 2010]

MapReduce: Simplified Data Processing on Large Clusters [Jeffrey Dean, Sanjay Ghemawat, OSDI2004] http://static.***usercontent.com/external content/untrusted dlcp/labs.***.com/ja//papers/mapreduce-osdi04.pdf

In machine learning processing, the accuracy of the result can vary greatly depending on the setting values given to the machine learning algorithm in advance, so the best result cannot be obtained unless the processing is repeated multiple times while the setting values are adjusted. In particular, when MapReduce is used to process large-scale data, the machine learning library "Mahout", for example, gives little consideration to tuning setting values. Long-running processing must therefore be executed multiple times to obtain a good result, which is very inefficient.

The technique of Non-Patent Document 1 can reduce the Shuffle cost across multiple runs by sharing the data-reading part of each process and thereby reducing the intermediate data. However, because it does not modify Hadoop itself, which provides MapReduce, it cannot share processing at a fine granularity (function by function).

The present invention has been made in view of the above points. Its object is to provide a data analysis and machine learning processing apparatus, method, and program that solve the problem that fine-grained processing cannot be shared, make it possible to share parameter-independent overlapping portions function by function, and thereby reduce duplicated processing.

In order to solve the above problems, the present invention (Claim 1) is a data analysis and machine learning processing apparatus for parallel distributed processing of large-scale data, having Map means for performing a job that maps a key-value pair to generate intermediate data, Reduce means for performing a job that reduces the intermediate data to a smaller set of values, and a distributed file system that stores given teacher data, the apparatus comprising:
parameter detection means for, when a job is given to the Map means or the Reduce means, detecting parameter information including the parameter values of the job, their usage positions, and their usage-source functions, and storing the parameter information in the distributed file system;
job integration means for, when a new job is given to the Map means and the Reduce means, detecting new parameter information including the parameter values of the new job, their usage positions, and their usage-source functions, comparing the new parameter information with the parameter information stored in the distributed file system, and integrating the jobs in their overlapping portions that do not depend on parameters; and
shared execution means for causing the Map means or the Reduce means to execute the integrated job while sharing the overlapping portions.

In the present invention (Claim 2), the job integration means of Claim 1 includes:
delay queue control means for, when a new job is added for the Map means or the Reduce means, accumulating the job in a delay queue before the Map means or the Reduce means is executed; and
means for comparing the parameters of the newly added job with those of the jobs already stored in the delay queue, integrating the jobs if any use the same Map means name or Reduce means name, and moving the integrated job to the normal queue when a predetermined queue retention time has elapsed.

In the present invention (Claim 3), the job integration means of Claim 1 or 2 includes
means for comparing the parameter information stored in the distributed file system with the new parameter information and, if there is an overlapping portion that does not depend on the parameters, adding to one representative job the differences from the parameters of the other jobs.

In the present invention (Claim 4), the shared execution means of Claim 1 includes
means for causing the Map means or the Reduce means to execute while sharing the overlapping portions of the jobs integrated by the job integration means, and for branching the processing when a processing portion that depends on a parameter is reached.

The present invention (Claim 5) is a data analysis and machine learning processing method in an apparatus that performs parallel distributed processing of large-scale data and has Map means for performing a job that maps a key-value pair to generate intermediate data, Reduce means for performing a job that reduces the intermediate data to a smaller set of values, and a distributed file system that stores given teacher data, the method comprising:
a parameter detection step in which, when a job is given to the Map means or the Reduce means, parameter detection means detects parameter information including the parameter values of the job, their usage positions, and their usage-source functions, and stores the parameter information in the distributed file system;
a job integration step in which, when a new job is given to the Map means and the Reduce means, job integration means detects new parameter information including the parameter values of the new job, their usage positions, and their usage-source functions, compares the new parameter information with the parameter information stored in the distributed file system, and integrates the jobs in their overlapping portions that do not depend on parameters; and
a shared execution step in which shared execution means causes the Map means or the Reduce means to execute the integrated job while sharing the overlapping portions.

In the present invention (Claim 6), the job integration step of Claim 5 performs:
a delay queue control step of, when a new job is added for the Map means or the Reduce means, accumulating the job in a delay queue before the Map means or the Reduce means is executed; and
a step of comparing the parameters of the newly added job with those of the jobs already stored in the delay queue, integrating the jobs if any use the same Map means name or Reduce means name, and moving the integrated job to the normal queue when a predetermined queue retention time has elapsed.

In the present invention (Claim 7), the job integration step of Claim 5 or 6
compares the parameter information stored in the distributed file system with the new parameter information and, if there is an overlapping portion that does not depend on the parameters, adds to one representative job the differences from the parameters of the other jobs.

In the present invention (Claim 8), the shared execution step of Claim 5
causes the Map means or the Reduce means to execute while sharing the overlapping portions of the jobs integrated by the job integration means, and branches the processing when a processing portion that depends on a parameter is reached.

The present invention (Claim 9) is a data analysis and machine learning processing program for causing a computer to function as each of the means constituting the data analysis and machine learning processing apparatus according to any one of Claims 1 to 4.

As described above, to address the conventional problem that fine-grained processing cannot be shared, the present invention detects the order in which parameters are used and the functions that use them during MapReduce processing. When processes with multiple patterns of parameters are subsequently executed at the same time, it uses that information to integrate, function by function, the overlapping portions that do not depend on parameters. This reduces duplicated processing and speeds up MapReduce processing. In other words, the present invention requires at least two MapReduce runs: the first run only detects parameter information, without integrating jobs, and from the second run onward multiple jobs are integrated (processing is shared) to reduce the amount of processing.

FIG. 1 is a system configuration diagram in the first embodiment of the present invention.
FIG. 2 is a flowchart of the overall operation in the first embodiment of the present invention.
FIG. 3 is a flowchart of the job integration and sharing processing in the first embodiment of the present invention.
FIG. 4 is a diagram for explaining job integration in the first embodiment of the present invention.
FIG. 5 is a flowchart (S300) of the processing that detects parameter information in the second embodiment of the present invention.
FIG. 6 is a flowchart (S420) of the processing that detects parameter information in the second embodiment of the present invention.
FIG. 7 is a conceptual image of parameter detection in the second embodiment of the present invention.
FIG. 8 is a flowchart (S620) of the processing when a job is added to the delay queue in the third embodiment of the present invention.
FIG. 9 is a conceptual image of the integration processing in the third embodiment of the present invention.
FIG. 10 is a flowchart (S630) of the integrated-job MapReduce processing in the third embodiment of the present invention.
FIG. 11 is a diagram for explaining the integration processing in the third embodiment of the present invention.
FIG. 12 is an example of the integration processing in the third embodiment of the present invention.
FIG. 13 is a diagram showing the behavior of each node at the start of MapReduce processing by Hadoop in one example of the present invention.
FIG. 14 is a flowchart of the general processing of a client node.
FIG. 15 is a flowchart (S1150) of the processing when a job is added in one example of the present invention.
FIG. 16 is a flowchart of the processing when a job is added to the delay queue in one example of the present invention.
FIG. 17 is a flowchart (S1430) of the processing of the job extraction thread of the delay queue in one example of the present invention.
FIG. 18 is a flowchart of the processing of the job scheduler of the JobTracker node in one example of the present invention.
FIG. 19 is a flowchart of the processing of the task acquisition loop of the TaskTracker in one example of the present invention.
FIG. 20 is a flowchart (S1711, S1731) of the processing at task start in one example of the present invention.
FIG. 21 is a flowchart (S1820) of the processing of the Mapper for parameter detection in one example of the present invention.
FIG. 22 is a flowchart of the processing when a parameter is read from the Configuration class for detection in one example of the present invention.
FIG. 23 is a flowchart (S1820) of the processing of the Reducer for detection in one example of the present invention.
FIG. 24 is a flowchart (S1820) of the processing of the Mapper for integrated jobs in one example of the present invention.
FIG. 25 is a flowchart (S1820) of the processing of the Reducer for integrated jobs in one example of the present invention.
FIG. 26 is a diagram showing the flow of MapReduce processing.
FIG. 27 is a flowchart of general processing that uses MapReduce.
FIG. 28 is a flowchart of general MapReduce processing.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
FIG. 1 shows the system configuration in the first embodiment of the present invention.

As in FIG. 26, the present invention is configured as a system that uses multiple computers connected to one another by a network and runs a distributed processing framework in which processing proceeds in the same flow as in FIG. 26.

In the present embodiment, when the control unit (Controller) 50 instructs the aggregation data creation unit (Mapper) 20 and the aggregation unit (Reducer) 40 to perform detection during MapReduce processing, the Mapper 20 and the Reducer 40 detect information about parameters and store it in the distributed file system 10.

At job start, the Controller 50 instructs the Mapper 20 and the Reducer 40 to detect parameter information. Using a delay queue, it also integrates multiple jobs that share the same Mapper 20, Reducer 40, and input data. In addition, at job start it uses the parameter information of the integrated job to determine how much of the processing of the Mapper 20 and the Reducer 40 can be shared, and it notifies the Mapper 20 and the Reducer 40 of the sharing range so determined.
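One way to picture how a sharing range could be derived from detected parameter information is the sketch below. It is a simplified illustration, not the patent's algorithm: it assumes the Mapper is a linear pipeline of named functions, that every function after the first parameter-dependent one consumes its output, and that the record layout (`function`, `name`) and all function names are hypothetical.

```python
def varying_parameters(param_sets):
    """Return the names of parameters whose values differ between the jobs."""
    names = set()
    for params in param_sets:
        names.update(params)
    return {
        name
        for name in names
        if len({repr(params.get(name)) for params in param_sets}) > 1
    }


def shared_range(pipeline, usages, param_sets):
    """Return the leading functions that no varying parameter reaches."""
    varying = varying_parameters(param_sets)
    dependent = {u["function"] for u in usages if u["name"] in varying}
    shared = []
    for function_name in pipeline:
        if function_name in dependent:
            break  # from here on, each job needs its own branch
        shared.append(function_name)
    return shared


# Assumed detection result from an earlier run: which function read which parameter.
usages = [
    {"function": "normalize", "name": "scale"},
    {"function": "cluster", "name": "k"},
]
pipeline = ["parse", "normalize", "cluster", "emit"]

# Two jobs to be integrated: same "scale", different "k".
param_sets = [{"scale": 1.0, "k": 2}, {"scale": 1.0, "k": 5}]
shared = shared_range(pipeline, usages, param_sets)
print(shared)
```

In this sketch a parameter that is read but has the same value in every integrated job does not end the shared range, which is why `normalize` remains shareable.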

When instructed by the Controller 50, the Mapper 20 and the Reducer 40 detect parameter information. In response to a sharing instruction from the Controller 50, they also perform aggregation data creation (Mapping) and aggregation (Reducing) so that the shareable range is executed once and the processing thereafter branches according to each parameter.
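A minimal sketch of "execute the shareable range once, then branch per parameter" follows. The workload (parsing a comma-separated row, then counting values above a threshold), the job identifiers, and tagging output keys with the job identifier are assumptions made for illustration; the counters exist only to make the saved work visible.

```python
call_counts = {"parse": 0, "threshold": 0}


def parse_row(row):
    # Parameter-independent part: identical for every integrated job.
    call_counts["parse"] += 1
    return [float(x) for x in row.split(",")]


def count_at_least(values, threshold):
    # Parameter-dependent part: must run once per parameter set.
    call_counts["threshold"] += 1
    return sum(1 for v in values if v >= threshold)


def integrated_map(row, job_params):
    values = parse_row(row)  # shared range, executed once per row
    for job_id, params in job_params.items():
        # Branch: one output record per integrated job, tagged with its id.
        yield (job_id, "count"), count_at_least(values, params["threshold"])


job_params = {"job1": {"threshold": 0.5}, "job2": {"threshold": 2.0}}
rows = ["0.1,0.7,2.5", "3.0,0.4"]
output = [record for row in rows for record in integrated_map(row, job_params)]
print(output, call_counts)
```

Running the two jobs separately would parse every row twice. In the integrated form the parse count equals the number of rows, while the parameter-dependent function still runs once per row and job.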

FIG. 2 is a flowchart of the overall operation in the first embodiment of the present invention.

Step 300) When data is input, the Mapper 20 and the Reducer 40 perform MapReduce processing. In the course of that processing, the control unit (Controller) 50 instructs the Mapper 20 and the Reducer 40 to detect the values of the parameters used, the positions where those parameters are used, and the usage-source functions. The Mapper 20 and the Reducer 40 perform the detection as instructed and store the results in the distributed file system 10 or locally. This processing corresponds to the "parameter detection means" of Claim 1.

Step 310) In the course of MapReduce processing, the Controller 50 checks the parameters given to the job against the parameter information stored in the distributed file system 10 in step 300 and decides whether intermediate data should be saved as a cache. If so, it notifies the Mapper 20 and the Reducer 40 of the sharing range and instructs them to save the intermediate data in the distributed file system 10 or locally. The Mapper 20 and the Reducer 40 then save the intermediate data as instructed by the Controller 50. The saved intermediate data is output separately from the normal output.

Next, the processing of step 310 is described in detail.

FIG. 3 is a flowchart of processing that uses MapReduce with the amount of processing reduced by integrating and sharing jobs, according to the first embodiment of the present invention. FIG. 4 shows the job integration method: the upper part shows the conventional use of the queue, and the lower part shows the use of the queues in the present invention.

Step 600) As in step 100, a MapReduce job is generated.

Step 610) As in step 110, parameters are set in the generated job.

Step 620) The job with its parameters set is added to the delay queue. At that time, as shown in FIG. 4, the parameters of the newly added job are compared with those of each job already in the delay queue, and if a queued job has the same input data, Mapper name, and Reduce name, the new job is integrated into it. In the example of FIG. 4, if the input data of the newly added "job 7" is the same as that of "job 6" or "jobs 4 and 5" already in the delay queue, job 7 is integrated with that job in the delay queue. This processing corresponds to the "job integration means" of Claim 1. Jobs for which the user-specified time has elapsed since they were added to the delay queue are output from the delay queue in order and returned to the normal queue.

Step 630) The job integrated in step 620 is processed while the overlapping portions that do not depend on parameters are shared. This yields the same results as executing the individual jobs before integration. This processing corresponds to the "shared execution means" of Claim 1.

Step 640) As in step 130, MapReduce processing is performed.

Step 650) The same processing as in step 140 is performed.
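The delay-queue handling of step 620 can be sketched as below. This is an illustration under stated assumptions rather than the patent's implementation: `Job`, `DelayQueue`, and their fields are hypothetical names, the current time is passed in explicitly to keep the example deterministic, and the retention time of an integrated job is assumed to run from when the first of its jobs entered the queue.

```python
from dataclasses import dataclass


@dataclass
class Job:
    input_path: str
    mapper: str
    reducer: str
    param_sets: list          # one dict of parameters per original job
    enqueued_at: float = 0.0


class DelayQueue:
    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.jobs = []

    def add(self, job, now):
        """Integrate the job into a matching queued job, or enqueue it."""
        for queued in self.jobs:
            same_target = (queued.input_path, queued.mapper, queued.reducer) == (
                job.input_path, job.mapper, job.reducer)
            if same_target:
                # Same input data, Mapper, and Reducer: keep one job and
                # append the other job's parameters to it.
                queued.param_sets.extend(job.param_sets)
                return queued
        job.enqueued_at = now
        self.jobs.append(job)
        return job

    def release(self, now):
        """Return jobs whose retention time has elapsed (for the normal queue)."""
        ready = [j for j in self.jobs if now - j.enqueued_at >= self.hold_seconds]
        self.jobs = [j for j in self.jobs if now - j.enqueued_at < self.hold_seconds]
        return ready


queue = DelayQueue(hold_seconds=10.0)
queue.add(Job("/data/a", "KMeansMapper", "KMeansReducer", [{"k": 2}]), now=0.0)
queue.add(Job("/data/a", "KMeansMapper", "KMeansReducer", [{"k": 3}]), now=4.0)
queue.add(Job("/data/b", "KMeansMapper", "KMeansReducer", [{"k": 2}]), now=5.0)
queued_count = len(queue.jobs)
released = queue.release(now=10.0)
print(queued_count, released)
```

A longer retention time gives more jobs a chance to be integrated but delays the first job, so the value is a trade-off left to the user, consistent with the user-specified time mentioned in step 620.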

As described above, according to the present embodiment, information about parameters is detected during MapReduce processing and stored in the distributed file system 10. When a new MapReduce job is added, MapReduce processing is not started immediately; instead, the job is accumulated in the delay queue. By detecting that multiple MapReduce jobs with the same input data, Mapper 20, and Reducer 40 but different parameter values have been submitted together, and integrating those jobs, the processing of the overlapping portions can be reduced.

[Second Embodiment]
The present embodiment describes in detail the parameter information detection method (parameter detection means) of step 300 in the first embodiment.

FIG. 5 is a flowchart of processing that uses MapReduce to detect parameter information in the second embodiment of the present invention.

Step 400) As in step 100, a MapReduce job is generated.

Step 410) As in step 110, parameters are given to the job.

Step 420) MapReduce processing similar to that of FIG. 27 is performed on the input data assigned to each node. In the course of that processing, as instructed by the Controller 50, the Mapper 20 and the Reducer 40 detect the values of the parameters used, their usage positions, and their usage-source functions, and store them in the distributed file system 10.

Step 430) The same processing as in step 130 is performed.

The present invention requires at least two MapReduce runs. The first run only detects parameter information, without integrating jobs. In the second run, multiple jobs are integrated (processing is shared), which reduces the amount of processing.

Step 420 above is described in detail below.

FIG. 6 is a flowchart of the MapReduce processing that detects parameter information in the second embodiment of the present invention, and FIG. 7 shows a conceptual image of parameter detection.

ステップ５００）各ノードに割り当てられた入力データに対し、Mapper２０はユーザが定義した任意のMap処理を行う。各ノードが自らに割り当てられた分散データを先頭から順に１行に対して１回、ユーザが定義したMap関数が適用される。Controller５０は、その間Mapper２０においてパラメータが利用される際に、当該Mapper２０が利用したMapperクラス名、関数、パラメータ名、パラメータ値を抽出して分散ファイルシステム１０に保存し、重複範囲の特定に利用する。 Step 500) The mapper 20 performs an arbitrary map process defined by the user for the input data assigned to each node. A map function defined by the user is applied once to each row of distributed data assigned to each node in order from the top. The controller 50 extracts mapper class names, functions, parameter names, and parameter values used by the mapper 20 and stores them in the distributed file system 10 when parameters are used in the mapper 20 during that time, and uses them for specifying overlapping ranges.

ステップ５１０） Combiner６０は、Map処理により生成されたKey-Valueの形式の中間データのうち、キーが共通のものを収集し（Key１つに対してリスト状のValue形式）、ユーザが定義した任意のCombine処理を行い、中間データ記憶部３０に格納する。また、パラメータが利用される際に、利用したCombineクラス名、関数、パラメータ名、パラメータ値を分散ファイルシステム１０に保存し、以降の重複範囲の特定に利用する。 Step 510) Of the Key-Value intermediate data generated by the Map processing, the Combiner 60 collects the records that share a key (a list of Values for each Key), performs an arbitrary user-defined Combine process, and stores the result in the intermediate data storage unit 30. In addition, whenever a parameter is used, the Combine class name, function, parameter name, and parameter value are saved in the distributed file system 10 and later used to identify the overlapping range.

ステップ５２０） Shuffle部７０は、ステップ５１０で中間データ記憶部３０に出力された中間データのShuffle処理を行う。 Step 520) The Shuffle unit 70 performs a Shuffle process on the intermediate data output to the intermediate data storage unit 30 in Step 510.

ステップ５３０） Reducer４０は、ステップ５２０の処理で得られたKey-Value形式の中間データに対して、ユーザが定義した任意のReduce処理を行う。その間、パラメータが利用される際に、利用したReduceクラス名、関数、パラメータ名、パラメータ値を分散ファイルシステム１０に格納する。 Step 530) The Reducer 40 performs an arbitrary Reduce process defined by the user on the intermediate data in the Key-Value format obtained in the process of Step 520. Meanwhile, when the parameter is used, the used Reduce class name, function, parameter name, and parameter value are stored in the distributed file system 10.

上記のように、本実施の形態では、MapReduce処理に使われるMapperクラス、Reducerクラスの名前、事前に与えるパラメータ名とパラメータ値等が、MapReduce処理中のどのような順序で、どの関数内で利用されるかを検出し、その情報を分散ファイルシステム１０に保存することで、以後の処理において利用できるようにする。 As described above, this embodiment detects in what order and inside which functions the names of the Mapper and Reducer classes used in the MapReduce processing, and the parameter names and values supplied in advance, are used during the MapReduce processing. That information is saved in the distributed file system 10 so that it can be used in subsequent processing.

［第３の実施の形態］ [Third Embodiment]

本実施の形態では、複数処理を統合し（ジョブ統合手段）、重複処理を共有化する（共有実行手段）処理について説明する。 In the present embodiment, a process for integrating a plurality of processes (job integration unit) and sharing a duplicate process (a sharing execution unit) will be described.

図８は、本発明の第３の実施の形態における遅延キュー追加時の処理のフローチャートである。 FIG. 8 is a flowchart of processing when a delay queue is added according to the third embodiment of this invention.

ステップ７００） Controller５０は、分散ファイルシステム１０に格納されている既存のジョブの持つパラメータと新たに追加されたジョブのパラメータを比較して、入力データが同じであるかどうか、また、処理に利用するMapper２０とReducer４０が同じかどうかを判定し、同じジョブがあれば、統合可能なジョブとする。 Step 700) The controller 50 compares the parameters of the existing job stored in the distributed file system 10 with the parameters of the newly added job, and checks whether the input data is the same and uses it for processing. It is determined whether the Mapper 20 and the Reducer 40 are the same, and if there is the same job, the job can be integrated.

ステップ７１０）既存のジョブの持つパラメータと新たに追加されたジョブのパラメータを比較し、代表する一つのジョブに対して、その他のジョブのパラメータとの差分を追加することで、パラメータの合成を行い、遅延キューに追加する。 Step 710) The parameters of the existing job are compared with the parameters of the newly added job, and for one representative job, the difference between the parameters of the other jobs is added and the parameters are synthesized. Add to the delay queue.

図９は、本発明の第３の実施の形態における統合処理のイメージを示す。 FIG. 9 shows an image of integration processing in the third embodiment of the present invention.

同図に示すように、「ジョブ１のConfiguration」と「ジョブ２のConfiguration」の内容(name:Value)を比較すると、"input.dir"、"map.class"、"reduce.class"、"param2"が同じである。これらを１つに合成し、「ジョブ１，２のConfiguration」に示すように、異なる"param1"のみを追記する。 As shown in the figure, comparing the contents (name: value) of "Configuration of job 1" and "Configuration of job 2" shows that "input.dir", "map.class", "reduce.class", and "param2" are identical. These are merged into one, and only the differing "param1" is appended, as shown in "Configuration of jobs 1 and 2".
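
The merge illustrated in FIG. 9 can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the implementation described in the specification: the class `ConfigMerger`, its `merge` method, the `job<id>.` key prefix used to store differing values, and the sample values are all hypothetical, and a plain `Map` stands in for Hadoop's Configuration class.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper: a plain Map stands in for a job's Configuration.
class ConfigMerger {
    /**
     * Copies the representative job's configuration and appends every entry
     * of the other job whose value differs, under the key "job<id>.<name>".
     */
    static Map<String, String> merge(Map<String, String> representative,
                                     Map<String, String> other,
                                     int otherJobId) {
        Map<String, String> merged = new LinkedHashMap<>(representative);
        for (Map.Entry<String, String> entry : other.entrySet()) {
            String repValue = representative.get(entry.getKey());
            if (!Objects.equals(repValue, entry.getValue())) {
                // Only the difference is appended; identical entries are kept once.
                merged.put("job" + otherJobId + "." + entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Sample values are made up for illustration.
        Map<String, String> job1 = new LinkedHashMap<>();
        job1.put("input.dir", "/data/in");
        job1.put("map.class", "Map.ClassA");
        job1.put("reduce.class", "Red.ClassB");
        job1.put("param1", "10");
        job1.put("param2", "0.5");

        Map<String, String> job2 = new LinkedHashMap<>(job1);
        job2.put("param1", "20"); // the only differing parameter

        System.out.println(merge(job1, job2, 2));
    }
}
```

Keeping the representative job's entries untouched and storing only the differences means the merged configuration grows with the number of differing parameters rather than with the number of jobs.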

統合されたジョブを統合前のものが相互に影響しないように、かつ、統合前のジョブ間で重複するパラメータに依存しない部分を共有化しながら実行する。そのために、Mapなどの関数別にどのジョブを共有するのかを示す共有グループを生成し、それに従って必要な数だけ統合前の元のMapperやReducerを生成し、それらに処理を実行させ、処理結果を共有する。 The integrated job is executed so that the original jobs do not affect one another, while the overlapping parts that do not depend on parameters are shared among them. To this end, a sharing group indicating which jobs share processing is generated for each function such as Map. The required number of original Mappers and Reducers are then instantiated according to these groups, made to execute the processing, and the results are shared.

図１０は、本発明の第３の実施の形態における統合ジョブMapReduce処理のフローチャートである。当該処理は、図３のステップ６３０に対応する。 FIG. 10 is a flowchart of the integrated job MapReduce process according to the third embodiment of the present invention. This process corresponds to step 630 in FIG.

ステップ８００） Controller５０は、遅延キューに蓄積されたジョブの中で、指定した遅延時間が経過している統合されたジョブを取得する（図４の下段の「保持時間切れによる取り出し」を参照）。その際、ジョブが持つパラメータを元に、予め第２の実施の形態において分散ファイルシステム１０に保存していた各パラメータがMapper２０、Reducer４０など、どの位置の関数で利用されるかの情報を参照することで、関数毎に処理の共有グループを生成してパラメータとして追加しておく。 Step 800) The controller 50 obtains an integrated job in which the specified delay time has elapsed among the jobs accumulated in the delay queue (see “Retrieving by holding time out” in the lower part of FIG. 4). At that time, based on the parameters of the job, reference is made to information on which function, such as Mapper 20 and Reducer 40, each parameter previously stored in the distributed file system 10 in the second embodiment is used. Thus, a shared group of processes is generated for each function and added as a parameter.

詳しくは、図１１に示すように、 Specifically, as shown in FIG. 11:

（１）まず、Mapperクラス名"Map.ClassA"、Reducerクラス名"Red.ClassB"、差分パラメータ"param1"をもとに分散ファイルシステム１０に蓄積した情報を参照する。 (1) First, information stored in the distributed file system 10 is referred to based on the Mapper class name "Map.ClassA", the Reducer class name "Red.ClassB", and the difference parameter "param1".

（２）差分パラメータがMapReduce処理でどの位置で利用されるのかを判定する。図１１の例では、"param1"はReduce処理で使われることがわかる。 (2) It is determined at which position in the MapReduce processing the difference parameter is used. In the example of FIG. 11, "param1" is found to be used in the Reduce processing.

（３）パラメータに依存しない、重複する処理のあるジョブをMap、Reduce単位でグルーピングする。図１１の例では、Map処理では、"１，２"がグルーピングされる。なお、このとき、それより以前の共有化グループも考慮するものとする。また、Reduce処理では、"１"、"２"となりグループ化されない。 (3) Jobs whose processing overlaps and does not depend on parameters are grouped per Map and per Reduce. In the example of FIG. 11, jobs "1, 2" are grouped for the Map processing. Sharing groups from earlier stages are also taken into account at this point. For the Reduce processing, the jobs remain "1" and "2" and are not grouped.

（４）共有化グループの情報をConfigurationに追加する。 (4) Add shared group information to Configuration.
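
Steps (1) to (4) can be sketched as a small grouping routine. The following is only an illustration under assumptions: `SharingGroups` and `compute` are hypothetical names, the detection results are passed in as an in-memory map instead of being read from the distributed file system, and the rule that earlier sharing groups are taken into account is modeled by carrying a per-job signature from one stage to the next.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; class and method names are not from the specification.
class SharingGroups {
    /**
     * @param paramsUsedByStage stage name -> parameters read in that stage,
     *                          iterated in execution order (e.g. a LinkedHashMap)
     * @param jobParams         job number -> that job's parameter values
     * @return stage name -> sharing groups (lists of job numbers)
     */
    static Map<String, List<List<Integer>>> compute(
            Map<String, Set<String>> paramsUsedByStage,
            Map<Integer, Map<String, String>> jobParams) {
        Map<String, List<List<Integer>>> groupsByStage = new LinkedHashMap<>();
        List<Integer> jobs = new ArrayList<>(new TreeSet<>(jobParams.keySet()));
        // Running signature per job: the values of every parameter read so far.
        // Carrying it forward keeps jobs apart once an earlier stage split them.
        Map<Integer, String> signature = new HashMap<>();
        for (Integer job : jobs) {
            signature.put(job, "");
        }
        for (Map.Entry<String, Set<String>> stage : paramsUsedByStage.entrySet()) {
            Map<String, List<Integer>> bySignature = new LinkedHashMap<>();
            for (Integer job : jobs) {
                StringBuilder sig = new StringBuilder(signature.get(job));
                for (String param : new TreeSet<>(stage.getValue())) {
                    sig.append('|').append(param).append('=')
                       .append(jobParams.get(job).get(param));
                }
                signature.put(job, sig.toString());
                bySignature.computeIfAbsent(sig.toString(), k -> new ArrayList<>()).add(job);
            }
            groupsByStage.put(stage.getKey(), new ArrayList<>(bySignature.values()));
        }
        return groupsByStage;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> stages = new LinkedHashMap<>();
        stages.put("map", Collections.emptySet());              // Map reads no differing parameter
        stages.put("reduce", Collections.singleton("param1"));  // Reduce reads param1

        Map<Integer, Map<String, String>> jobs = new HashMap<>();
        jobs.put(1, Collections.singletonMap("param1", "10"));
        jobs.put(2, Collections.singletonMap("param1", "20"));

        // prints {map=[[1, 2]], reduce=[[1], [2]]}
        System.out.println(compute(stages, jobs));
    }
}
```

With the inputs of the FIG. 11 example, jobs 1 and 2 share the Map stage and are separated in the Reduce stage, which matches the grouping described in (3).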

ステップ８１０） Controller５０は、Map処理の共有グループの数だけ元のMapperを生成し、各Mapper２０に、ステップ２００と同様のMap処理を行わせる。 Step 810) The controller 50 generates the original mapper for the number of shared groups for map processing, and causes each mapper 20 to perform the same map processing as in step 200.

ステップ８２０） Controller５０は、ステップ８１０と同様に、必要数のCombiner６０を生成し、各Combiner６０にステップ２１０と同様の処理を行わせる。 Step 820) The controller 50 generates the required number of combiners 60 as in step 810, and causes each combiner 60 to perform the same processing as in step 210.

ステップ８３０） Controller５０は、ステップ８１０と同様に、必要数のReducer４０を生成し、各Reducer４０にステップ２３０と同様の処理を行わせる。 Step 830) As in step 810, the controller 50 generates the required number of reducers 40 and causes each reducer 40 to perform the same processing as in step 230.

図１２は、本発明の第３の実施の形態における統合処理の例である。同図に示すように、３パターンのジョブ"１"、"２"、"３"を統合する場合に、Mapperの共有化グループは"２，３"であり、Reducerの共有化グループがない場合に、Mapperでは、共有化グループ"２，３"のジョブとジョブ"１"をパラレルに実行し、中間データを出力する際にタグを付与する。Reducerは共有化グループがないので、ジョブ"２"、ジョブ"３"を順に行い、それぞれ処理結果を出力する。 FIG. 12 is an example of integration processing in the third embodiment of the present invention. As shown in the figure, when 3 patterns of jobs “1”, “2”, “3” are integrated, the shared group of Mapper is “2, 3”, and there is no shared group of Reducer In addition, in Mapper, the job of the sharing group “2, 3” and the job “1” are executed in parallel, and a tag is assigned when outputting intermediate data. Reducer does not have a shared group, so it performs job “2” and job “3” in order, and outputs the processing results.

このように、本実施の形態では、MapReduceのジョブが新しく追加されたときに、すぐに処理を開始しないよう、遅延評価キューを利用し、キューに入力データやMapperクラス、Reduceクラスが同じでも、複数のパターンの異なるパラメータを持つジョブが与えられたとき、各ジョブの持つパラメータ値を比較し、代表する一つのジョブに対して、その他のジョブのパラメータとの差分を追加し、統合する。そして、第２の実施の形態で分散ファイルシステム１０に格納された情報を利用して、パラメータに依存しない重複する関数を抽出し、統合されたジョブを重複部分を共有しながら実行し、パラメータ依存となった時点で処理を分岐させることによって重複部分の処理量を削減する。 In this way, in this embodiment, a delay queue is used so that processing does not start immediately when a new MapReduce job is added. When jobs that have the same input data, Mapper class, and Reducer class but different parameter patterns are placed in the queue, the parameter values of the jobs are compared, and the differences from the other jobs' parameters are added to one representative job to integrate them. Then, using the information stored in the distributed file system 10 in the second embodiment, functions that overlap and do not depend on parameters are extracted, the integrated job is executed while the overlapping parts are shared, and processing branches at the point where it becomes parameter-dependent. This reduces the amount of processing spent on the overlapping parts.

以下、本発明をオープンソース分散システムの「Hadoop」に適用した例を示す。 Hereinafter, an example in which the present invention is applied to “Hadoop” of an open source distributed system will be described.

以下では、Hadoopにおいて、「パラメータ」は具体的にはConfigurationクラスを指す。Configurationクラスは、Hadoopによる一連のMapReduce処理の流れの中で、Hadoopそのもののパラメータ、ユーザが定義したMapper、Reducerを指定するパラメータ、MapperやReducer等で利用される変数等のパラメータなど、あらゆるパラメータを区別なく格納、管理するためのクラスであり、MapReduceに関わる様々なクラスがConfigurationクラスを継承あるいはConfigurationクラスを保持する形で利用する。パラメータは、「名前：値」の形でConfigurationクラスに格納され、必要とする型に応じたメソッドにパラメータの名前を与えると、その型の値が応答される。HDFSやローカルファイルに出力するために、XML形式に変換するメソッドを持つ。 Hereinafter, in Hadoop, “parameter” specifically refers to the Configuration class. The Configuration class includes all parameters such as parameters of Hadoop itself, parameters defined by the user-defined Mapper and Reducer, variables such as variables used by Mapper and Reducer, etc. in the series of MapReduce processing flow by Hadoop. It is a class for storing and managing without distinction, and various classes related to MapReduce use it by inheriting the Configuration class or holding the Configuration class. Parameters are stored in the Configuration class in the form of “name: value”. When a parameter name is given to a method corresponding to a required type, a value of that type is returned. Has a method to convert to XML format for output to HDFS and local files.

「ジョブ（jobまたはJobConf）」は、Hadoopにおいて、具体的にはJobクラス、もしくは、JobConfクラスを指す。HadoopによるMapReduce処理を開始する際、ジョブを生成し、ジョブが提供する各種メソッドを用いて処理に利用するMapperクラスやReduceクラスなどを与え、開始メソッドによってMapReduce処理を開始することができる。Configurationクラスを内包し、与えられたパラメータは基本的に全てConfigurationに格納する。 “Job (job or JobConf)” specifically refers to the Job class or the JobConf class in Hadoop. When starting MapReduce processing with Hadoop, you can create a job, give the Mapper class or Reduce class used for processing using various methods provided by the job, and start MapReduce processing with the start method. The Configuration class is included, and all the given parameters are stored in the Configuration.

なお、後述する「JobTracker」は前述の実施の形態における「制御部（Controller）５０」に対応する。 Note that “JobTracker” to be described later corresponds to “control unit (Controller) 50” in the above-described embodiment.

図１３は、HadoopによるMapReduce処理開始時の各ノードの動作を示す。 FIG. 13 shows the operation of each node at the start of MapReduce processing by Hadoop.

同図に示すシステムは、JobClient１００２、MapReduceアプリケーション１００１を有するクライアントノード１０００、JobTracker１０１１を有するJobTrackerノード１０１０、TaskTracker１０２１、Map or Reduceタスク１０２２を有するTaskTrackerノード１０２０から構成される。 The system shown in the figure consists of a client node 1000 having a JobClient 1002 and a MapReduce application 1001, a JobTracker node 1010 having a JobTracker 1011, and a TaskTracker node 1020 having a TaskTracker 1021 and a Map or Reduce task 1022.

まず、一般的なクライアントノード１０００の動作を説明する。 First, the operation of a general client node 1000 will be described.

図１４は、クライアントノードの一般的な処理のフローチャートである。 FIG. 14 is a flowchart of general processing of the client node.

ステップ１１００） Hadoopに含まれるJobクラスのインスタンスの生成時に、XML形式の初期パラメータが読み込まれ、Configurationクラスとしてジョブに与えられる。 Step 1100) When an instance of a Job class included in Hadoop is generated, an initial parameter in XML format is read and given to the job as a Configuration class.

ステップ１１１０） Mapper・Reducer・Key・ValueクラスなどのHadoopが必要とするパラメータや、処理の途中に利用する変数の初期値などのパラメータをジョブに与える（ジョブが保持するConfigurationにパラメータを与える）。 Step 1110) Parameters required by Hadoop, such as the Mapper, Reducer, Key, and Value classes, and parameters such as the initial values of variables used during processing are given to the job (that is, to the Configuration held by the job).

ステップ１１２０）入力データの場所や実行時にユーザが変更可能なパラメータをコマンドライン引数の解析などによって取得し、ジョブに与える（ジョブが保持するConfigurationにパラメータを与える）。 Step 1120) The location of the input data and the parameters that can be changed by the user at the time of execution are acquired by analyzing the command line arguments and the like are given to the job (the parameter is given to the configuration held by the job).

ステップ１１３０）クライアントノード１０００は、JobClient１００２を利用してJobTracker１０１１が動作しているサーバから新規のジョブIDを取得し、ジョブに与える。 Step 1130) The client node 1000 uses the JobClient 1002 to acquire a new job ID from the server on which the JobTracker 1011 is running and gives it to the job.

ステップ１１４０）ステップ１１３０で取得したジョブIDを元にした分散ファイルシステム１０３０上の所定の場所（パス）に、ステップ１１００，１１１０、１１２０によって生成され、設定値が付与されたジョブをXMLとして出力する（また、入力データを示すファイルも所定の場所に出力する）。 Step 1140) The job generated by Steps 1100, 1110, and 1120 and assigned with a set value is output as XML to a predetermined location (path) on the distributed file system 1030 based on the job ID acquired in Step 1130. (A file indicating input data is also output to a predetermined location).

ステップ１１５０）クライアントノード１０００は、JobClient１００２を利用して新しいジョブの追加をJobTracker１０１１に通知する。 Step 1150) The client node 1000 notifies the JobTracker 1011 of the addition of a new job using the JobClient 1002.

ステップ１１６０） JobTracker１０１１において追加されたジョブが実行中ジョブとなり、キューに蓄積される。適宜キューを元にMapReduce処理が実行され、処理のログを標準出力に表示しながら終了まで待機する。 Step 1160) The job added in JobTracker 1011 becomes an executing job and is accumulated in the queue. MapReduce processing is executed appropriately based on the queue, and waits until the end while displaying the processing log on the standard output.

ステップ１１７０） MapReduce処理の結果を取得し、ユーザが定義した任意の処理を行う。 Step 1170) The result of the MapReduce process is acquired, and an arbitrary process defined by the user is performed.

上記のステップ１１５０のジョブ追加時の一般的な処理としては、ジョブIDに基づいてジョブの初期化を行い、実行ジョブ（JobInProgress）を生成し、実行ジョブをキューに格納していた。 As a general process at the time of adding a job in step 1150, the job is initialized based on the job ID, an execution job (JobInProgress) is generated, and the execution job is stored in the queue.

これに対し、本発明は、以下のような処理を行う。 In contrast, the present invention performs the following processing.

図１５は、本発明の一実施例のジョブ追加時の処理のフローチャート（S1150）である。 FIG. 15 is a flowchart (S1150) of processing when adding a job according to an embodiment of the present invention.

ステップ１３００）ジョブIDを元にしてジョブの初期化を行い、実行ジョブ（JobInProgress）を生成する。 Step 1300) The job is initialized based on the job ID, and an execution job (JobInProgress) is generated.

ステップ１３１０）ジョブが保持するConfigurationを参照し、パラメータ検知設定が有効か判断する。有効である場合はステップ１３２０に、無効である場合はステップ１３３０に移行する。 Step 1310) Referring to the configuration held by the job, it is determined whether the parameter detection setting is valid. If it is valid, step 1320 follows. If it is invalid, step 1330 follows.

ステップ１３２０）設定済みのMapper、Reducerは別の設定名として退避し、パラメータ検知用のMapper、Reducerに入れ替える。また、入れ替え済みのConfigurationの内容を分散ファイルシステム１０３０上に保存されているXMLにも反映する。 Step 1320) The Mapper and Reducer that have already been set are saved under different setting names and replaced with the Mapper and Reducer for parameter detection. The contents of the updated Configuration are also reflected in the XML saved on the distributed file system 1030.

ステップ１３３０）ユーザによる明示的なキャッシュ有効指定がある場合は、ステップ１３３１に移行し、また、Mapper・Reducerクラス名、入力データ名、パラメータ名を用いて分散ファイルシステム１０３０上に蓄積されたパラメータの利用クラス、関数の情報を検索し、Mapperがパラメータに依存しない場合や、依存していても同じパラメータ値をユーザが指定する回数以上利用している場合には、ステップ１３３１に移行する。それ以外はステップ１３４０に移行する。 Step 1330) If there is an explicit cache validity designation by the user, the process proceeds to Step 1331, and parameters stored in the distributed file system 1030 using the Mapper / Reducer class name, input data name, and parameter name are transferred. If the usage class / function information is searched and Mapper does not depend on the parameter, or if the same parameter value is used more than the number of times specified by the user, the process proceeds to step 1331. Otherwise, the process proceeds to step 1340.

ステップ１３３１）実行ジョブを通常のキューではなく、遅延キューに追加し、処理を終了する。 Step 1331) The execution job is added to the delay queue instead of the normal queue, and the process ends.

ステップ１３４０）実行ジョブを通常のキューに追加する。 Step 1340) Add an execution job to the normal queue.
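
The branch in steps 1330 to 1340 reduces to a small decision function. The sketch below is an assumption-laden illustration: `QueueRouter`, `Target`, and the four arguments are hypothetical names, and the lookups against the distributed file system that would produce the arguments are not shown.

```java
// Hypothetical sketch of the queue-routing decision; names are illustrative.
class QueueRouter {
    enum Target { NORMAL_QUEUE, DELAY_QUEUE }

    /**
     * @param cacheExplicitlyEnabled    the user explicitly enabled caching
     * @param mapperDependsOnParameters the detected usage shows the Mapper reads a parameter
     * @param sameValueUseCount         how often the same parameter value has been used
     * @param userThreshold             the user-specified minimum number of uses
     */
    static Target route(boolean cacheExplicitlyEnabled,
                        boolean mapperDependsOnParameters,
                        int sameValueUseCount,
                        int userThreshold) {
        if (cacheExplicitlyEnabled) {
            return Target.DELAY_QUEUE;
        }
        if (!mapperDependsOnParameters) {
            return Target.DELAY_QUEUE;
        }
        if (sameValueUseCount >= userThreshold) {
            return Target.DELAY_QUEUE;
        }
        return Target.NORMAL_QUEUE;
    }

    public static void main(String[] args) {
        System.out.println(route(false, false, 0, 3)); // prints DELAY_QUEUE
        System.out.println(route(false, true, 1, 3));  // prints NORMAL_QUEUE
    }
}
```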

次に、上記のステップ１３３１の遅延キューに実行ジョブを追加する処理を説明する。 Next, processing for adding an execution job to the delay queue in step 1331 will be described.

図１６は、本発明の一実施例の遅延キューにジョブを追加するときの処理のフローチャートである。 FIG. 16 is a flowchart of processing when a job is added to the delay queue according to the embodiment of this invention.

ステップ１４００） TaskTrackerノード１０２０は、JobTracker１０１１から追加された実行中ジョブを取得し、そのジョブが保持している設定を取り出す。 Step 1400) The TaskTracker node 1020 acquires the added in-progress job from the JobTracker 1011 and retrieves the settings held by that job.

ステップ１４１０）追加されたジョブの設定（SX00の出力）を参照し、遅延キュー内にある全てのジョブの設定と比較して、入力データ、Mapperクラス、Combinerクラス、Reducerクラスが共通であるジョブがあるかどうか判定し、統合可能であれば、そのジョブをキューから一旦除去する。 Step 1410) Referring to the setting of the added job (SX00 output) and comparing it with the settings of all jobs in the delay queue, the jobs with common input data, Mapper class, Combiner class, and Reducer class If it can be integrated, the job is once removed from the queue.

ステップ１４２０）ステップ１４００とステップ１４１０の出力の設定値を比較し、値の違うものがある場合は、SX10の出力に対して差分としてその値を付加する。 Step 1420) The set values of the outputs of Step 1400 and Step 1410 are compared, and if there is a different value, that value is added as a difference to the output of SX10.

ステップ１４３０）ジョブにユーザが指定した待機時間を付与し、遅延キューに追加する。 Step 1430) A waiting time specified by the user is given to the job and added to the delay queue.
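
The integrability check in step 1410 can be sketched as a comparison of a few identifying settings. This is a hedged illustration, not the implementation in the specification: `IntegrationCheck` and its methods are hypothetical, the key names "input.dir", "map.class", and "reduce.class" follow the FIG. 9 description while "combine.class" is an assumed name, and a `List` of `Map`s stands in for the delay queue of jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch; a Map stands in for each job's settings.
class IntegrationCheck {
    // "combine.class" is an assumed key name; the others appear in the FIG. 9 description.
    static final List<String> IDENTITY_KEYS =
            Arrays.asList("input.dir", "map.class", "combine.class", "reduce.class");

    /** Two jobs are integrable when input data, Mapper, Combiner and Reducer all match. */
    static boolean integrable(Map<String, String> a, Map<String, String> b) {
        for (String key : IDENTITY_KEYS) {
            if (!Objects.equals(a.get(key), b.get(key))) {
                return false;
            }
        }
        return true;
    }

    /** Removes and returns the first queued job integrable with the added one, or null. */
    static Map<String, String> takeIntegrable(List<Map<String, String>> delayQueue,
                                              Map<String, String> added) {
        Iterator<Map<String, String>> it = delayQueue.iterator();
        while (it.hasNext()) {
            Map<String, String> queued = it.next();
            if (integrable(queued, added)) {
                it.remove(); // temporarily taken out of the queue, as in step 1410
                return queued;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> queued = new HashMap<>();
        queued.put("input.dir", "/data/in");
        queued.put("map.class", "Map.ClassA");
        queued.put("reduce.class", "Red.ClassB");
        queued.put("param1", "10");

        Map<String, String> added = new HashMap<>(queued);
        added.put("param1", "20");

        List<Map<String, String>> delayQueue = new ArrayList<>();
        delayQueue.add(queued);
        System.out.println(takeIntegrable(delayQueue, added) != null); // prints true
    }
}
```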

上記のステップ１４３０により遅延キューに追加された遅延キューのジョブの取り出し方法について説明する。 A method of taking out the job of the delay queue added to the delay queue in the above step 1430 will be described.

図１７は、本発明の一実施例の遅延キューのジョブ取り出しスレッドの処理のフローチャートである。 FIG. 17 is a flowchart of the process of the job fetch thread of the delay queue according to the embodiment of this invention.

ステップ１５００） JobTrackerノード１０１０のジョブスケジューラを利用して、遅延キューの中に待機時間がすでに経過しているジョブがあるかどうか判定し、ある場合はそのジョブを取り出す。 Step 1500) Using the job scheduler of the JobTracker node 1010, it is determined whether there is a job whose waiting time has already elapsed in the delay queue, and if there is, the job is taken out.

ステップ１５１０）設定済みのMapper、Reducerは別の設定名として退避し、パラメータ検知用のMapper、Reducerに入れ替える。 Step 1510) The set mapper and reducer are saved as different setting names and replaced with the parameter detection mapper and reducer.

ステップ１５２０）分散ファイルシステム１０３０上に保存されているパラメータの利用されるクラス、順序、関数の情報を、ステップ１５１０の出力が保持しているConfigurationの内容（クラス名）を利用して参照し、統合可能な関数とジョブ番号をConfigurationに与える。 Step 1520) Refers to the class, order, and function information of the parameters stored on the distributed file system 1030 using the contents (class name) of the Configuration held in the output of Step 1510, and Functions that can be integrated and job numbers are given to Configuration.

ステップ１５３０）ステップ１５２０の出力を通常のキューに追加する。 Step 1530) Add the output of step 1520 to the normal queue.

ステップ１５４０）ユーザが指定した時間だけスレッドを待機させた後、遅延キュースレッドの終了フラグが有効か否かを判定する。有効である場合は処理を終了する。 Step 1540) After waiting the thread for the time specified by the user, it is determined whether the end flag of the delay queue thread is valid. If it is valid, the process is terminated.
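
One way to model the wait-time handling of steps 1500 to 1540 is the standard `java.util.concurrent.DelayQueue`, whose `poll()` returns only elements whose delay has expired. The specification does not say which data structure is used, so this choice is an assumption, and `DelayQueueDrainer`, `DelayedJob`, and `drainExpired` are hypothetical names. The sketch shows a single pass of the take-out thread; the sleep and the end-flag check of step 1540 would wrap it in a loop.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of one pass of the delay-queue take-out thread.
class DelayQueueDrainer {
    /** Moves every job whose wait time has elapsed to the normal queue. */
    static int drainExpired(DelayQueue<DelayedJob> delayQueue, Queue<String> normalQueue) {
        int moved = 0;
        DelayedJob job;
        // poll() returns null when no element's delay has expired yet.
        while ((job = delayQueue.poll()) != null) {
            normalQueue.add(job.jobId);
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        DelayQueue<DelayedJob> delayQueue = new DelayQueue<>();
        delayQueue.add(new DelayedJob("job-a", 0));      // already expired
        delayQueue.add(new DelayedJob("job-b", 60_000)); // still waiting
        Queue<String> normalQueue = new ArrayDeque<>();
        System.out.println(drainExpired(delayQueue, normalQueue)); // prints 1
    }
}

// A queued job together with the user-specified wait time.
class DelayedJob implements Delayed {
    final String jobId;
    private final long readyAtNanos;

    DelayedJob(String jobId, long waitMillis) {
        this.jobId = jobId;
        this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(waitMillis);
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                            other.getDelay(TimeUnit.NANOSECONDS));
    }
}
```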

次に、JobTrackerノード１０１０内のジョブスケジューラの処理について説明する。 Next, processing of the job scheduler in the JobTracker node 1010 will be described.

図１８は、本発明の一実施例のJobTrackerのジョブスケジューラの処理のフローチャートである。 FIG. 18 is a flowchart of processing of the job scheduler of JobTracker according to the embodiment of this invention.

ステップ１６００）ジョブスケジューラは、キューの中に実行ジョブがあるかどうかを判定する。 Step 1600) The job scheduler determines whether there is an execution job in the queue.

ステップ１６１０）キューから実行ジョブを取得する。 Step 1610) An execution job is acquired from the queue.

ステップ１６２０）実行ジョブを元にして分散ファイルシステム１０３０の適切な場所からジョブのリソースを取得し、必要な数のMapタスク、Reduceタスクを生成する。 Step 1620) Based on the execution job, job resources are acquired from an appropriate location of the distributed file system 1030, and a necessary number of Map tasks and Reduce tasks are generated.

ステップ１６３０）終了判定を行い、終了フラグがない場合はステップ１６００に移行し、終了フラグがある場合は当該処理を終了する。 Step 1630) An end determination is made. If there is no end flag, the process proceeds to Step 1600. If there is an end flag, the process ends.

次に、TaskTrackerノード１０２０のTaskTracker１０２１の処理を説明する。 Next, the process of the TaskTracker 1021 of the TaskTracker node 1020 will be described.

図１９は、本発明の一実施例のTaskTrackerのタスク取得ループ部分の処理のフローチャートである。 FIG. 19 is a flowchart of the processing of the task acquisition loop part of TaskTracker according to an embodiment of the present invention.

ステップ１７００） TaskTracker１０２１は、JobTracker１０１１に対して、自らの生存を確認させるためにハートビート（Heartbeat）を送出し、JobTracker１０１１からそれに対する応答を受け取る。 Step 1700) The TaskTracker 1021 sends a heartbeat to the JobTracker 1011 to show that it is alive, and receives the response from the JobTracker 1011.

ステップ１７１０） TaskTracker１０２１は、ハートビートに対する応答にタスクが含まれているかを判定する。含まれている場合はステップ１７２０に移行し、含まれていない場合はステップ１７００に移行する。 Step 1710) The TaskTracker 1021 determines whether or not a task is included in the response to the heartbeat. If it is included, the process proceeds to step 1720. If it is not included, the process proceeds to step 1700.

ステップ１７２０）タスクがMapタスクであるかを判定し、Mapタスクである場合はステップ１７１１に移行し、それ以外である場合はステップ１７３０に移行する。 Step 1720) It is determined whether the task is a Map task. If it is a Map task, the process proceeds to Step 1711. Otherwise, the process proceeds to Step 1730.

ステップ１７１１） Mapタスク１０２２において与えられたMapタスクを実行し、ステップ１７００に移行する。 Step 1711) The Map task given in the Map task 1022 is executed, and the process proceeds to Step 1700.

ステップ１７３０）タスクはReduceタスクであるかを判定し、Reduceタスクである場合は、ステップ１７３１に移行する。それ以外の場合は、ステップ１７４０に移行する。 Step 1730) It is determined whether the task is a Reduce task, and if it is a Reduce task, the process proceeds to Step 1731. Otherwise, the process proceeds to step 1740.

ステップ１７３１） Reduceタスク１０２２でReduceタスクを実行する。 Step 1731) The Reduce task is executed by the Reduce task 1022.

ステップ１７４０）終了フラグの有無を調べ、終了フラグが立っていれば処理を終了し、立っていなければステップ１７００に移行する。 Step 1740) The presence / absence of an end flag is checked. If the end flag is set, the process is ended. If it is not set, the process proceeds to Step 1700.

次に、上記のステップ１７１１、１７３１のMapタスク、Reduceタスクの開始時の処理を説明する。 Next, the processing at the start of the Map task and the Reduce task in steps 1711 and 1731 above will be described.

図２０は、本発明の一実施例のタスク開始時の処理のフローチャートである。 FIG. 20 is a flowchart of processing at the start of a task according to an embodiment of the present invention.

ステップ１８００） Map or Reduceタスク１０２２は、パラメータ(Configuration)、ジョブの進捗状況（Status）、入出力などを含むContextクラスをタスクに応じて生成する。 Step 1800) The Map or Reduce task 1022 generates, according to the task, a Context class that includes the parameters (Configuration), the job progress status (Status), input/output, and so on.

ステップ１８１０） MapperクラスまたはReducerクラスに応じたクラスを生成する。 Step 1810) Generate a class corresponding to the Mapper class or the Reducer class.

ステップ１８２０）生成したContextクラスをMapperクラスまたはReduceクラスに与え、実行する。 Step 1820) The generated Context class is given to the Mapper class or Reduce class and executed.

上記のステップ１８２０におけるMapperの処理を説明する。 The mapper process in step 1820 will be described.

通常、Mapperでは、事前処理（Setup）、Map処理、終了処理（Cleanup）に関しては、ユーザがMapperクラスを継承したクラスを自由にプログラミングすることになる。Setup、Map、Cleanupの各関数内でパラメータを利用する場合は、全てContextクラスが保持しているConfigurationクラスにアクセスして取得する。これに対して、本発明は、図２０のステップ１８２０について、以下のような処理を行う。 Normally, in Mapper, the user can freely program a class that inherits the Mapper class for pre-processing (Setup), Map processing, and termination processing (Cleanup). When using parameters in each function of Setup, Map, and Cleanup, all are accessed by accessing the Configuration class held by the Context class. On the other hand, the present invention performs the following processing for step 1820 in FIG.

図２１は、本発明の一実施例のパラメータ検知用Mapperの処理のフローチャートである。 FIG. 21 is a flowchart of the process of the parameter detection mapper according to the embodiment of this invention.

ステップ２１００） Mapタスク１０２２は、Contextクラスが保持するConfigurationクラスを取り出し、それをパラメータが利用されたクラス、関数、順序を検知するためのConfigurationに入れ替える。 Step 2100) The Map task 1022 takes out the Configuration class held by the Context class, and replaces it with the Configuration for detecting the class, function, and order in which the parameters are used.

ステップ２１１０） Mapperクラスの名前をConfigurationから取得し、通常のMapperクラスを生成する。 Step 2110) The name of the Mapper class is acquired from the Configuration, and a normal Mapper class is generated.

ステップ２１２０）ユーザの定義した任意の事前処理のみを行う（検知が容易な関数名）。ここで、Mapperの中の関数名をそれぞれMapper内、Reducer内のものであることが明示的にわかるような関数名（検知が容易な関数名）にしておくものとする。 Step 2120) Only arbitrary pre-processing defined by the user is performed (function name easy to detect). Here, the function names in Mapper are assumed to be function names (function names that are easy to detect) that can be clearly identified as those in Mapper and Reducer, respectively.

ステップ２１３０）入力の次の１行に対してユーザの定義した任意のMap処理のみを行う（検知が容易な関数名）。 Step 2130) Only the arbitrary Map processing defined by the user is performed on the next line of the input (function name that is easy to detect).

ステップ２１４０）ステップ２１３０で読み込んだ行がデータの最終行であるかを判定し、最終行である場合はステップ２１５０に移行し、最終行でない場合はステップ２１３０に移行する。 Step 2140) It is determined whether the line read in step 2130 is the last line of the data. If it is the last line, the process proceeds to step 2150; otherwise, the process returns to step 2130.

ステップ２１５０）ユーザの定義した任意の終了処理のみを行う（検知が容易な関数名）。 Step 2150) Only an arbitrary end process defined by the user is performed (function name easy to detect).

次に、ステップ１８００で生成された検知用コンテキストに対してパラメータを要求する場合の処理を説明する。 Next, a process for requesting a parameter to the detection context generated in step 1800 will be described.

図２２は、本発明の一実施例の検知用Configurationクラスのパラメータ呼び出し時の処理のフローチャートである。 FIG. 22 is a flowchart of processing when a parameter of the Configuration class for detection of the embodiment of the present invention is called.

ステップ２２００） TaskTracker１０２１からの指示により、Java（登録商標）の機能であるStackTraceを用いて当該関数を呼び出した関数を辿り、呼び出し元のMapperクラス名と関数名を検知する。 Step 2200) In accordance with an instruction from TaskTracker 1021, the function that called the function is traced using StackTrace, which is a Java (registered trademark) function, and the Mapper class name and function name of the caller are detected.

ステップ２２１０）分散ファイルシステム１０３０上にステップ２２００の出力を、呼び出し元クラス名、順序（数値）、関数名、パラメータ名をディレクトリ名、値をファイル名、頻度を内容としてファイルを生成し、分散ファイルシステム１０３０に出力する。 Step 2210) Based on the output of step 2200, a file is created on the distributed file system 1030 with the caller class name, the order (a number), the function name, and the parameter name as directory names, the value as the file name, and the frequency as the file content.

ステップ２２２０）保持しているプロパティクラスから、パラメータ名をキーとして取得し、呼び出されたメソッドに応じて適切な型に変換する。 Step 2220) The value is looked up in the held properties object using the parameter name as the key, and converted to the appropriate type according to the method that was called.
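
The caller detection of steps 2200 to 2220 can be sketched with the standard `Throwable.getStackTrace()` and `StackTraceElement` API. This is a simplified illustration under assumptions: `DetectingConfiguration` and `SampleMapper` are hypothetical classes, the usage records are kept in an in-memory list instead of being written as directories and files on the distributed file system, and only an `int` getter is shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the detection Configuration.
class DetectingConfiguration {
    private final Map<String, String> properties = new HashMap<>();
    private final List<String> usageLog = new ArrayList<>();

    void set(String name, String value) {
        properties.put(name, value);
    }

    /** Records who asked for the parameter, then returns it converted to int. */
    int getInt(String name, int defaultValue) {
        String value = properties.get(name);
        recordUsage(name, value);
        return value == null ? defaultValue : Integer.parseInt(value);
    }

    List<String> usageLog() {
        return usageLog;
    }

    private void recordUsage(String name, String value) {
        String self = DetectingConfiguration.class.getName();
        // Walk up the stack until the first frame outside this class: the caller.
        for (StackTraceElement frame : new Throwable().getStackTrace()) {
            if (!frame.getClassName().equals(self)) {
                // Record format: order:class#function:parameter=value
                usageLog.add(usageLog.size() + ":" + frame.getClassName() + "#"
                        + frame.getMethodName() + ":" + name + "=" + value);
                return;
            }
        }
    }

    public static void main(String[] args) {
        DetectingConfiguration conf = new DetectingConfiguration();
        conf.set("param1", "5");
        new SampleMapper().setup(conf);
        System.out.println(conf.usageLog()); // prints [0:SampleMapper#setup:param1=5]
    }
}

// Hypothetical user-defined Mapper whose setup function reads a parameter.
class SampleMapper {
    int threshold;

    void setup(DetectingConfiguration conf) {
        threshold = conf.getInt("param1", 0);
    }
}
```

Scanning for the first frame outside the configuration class, rather than using a fixed stack index, keeps the sketch independent of how many internal helper calls sit between the getter and the recording code.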

次に、図２０のステップ１８２０の検出用Reducer処理について説明する。 Next, the detection Reducer processing in step 1820 of FIG. 20 will be described.

図２３は、本発明の一実施例の検知用Reducerの処理のフローチャートである。 FIG. 23 is a flowchart of the processing of the detection reducer according to the embodiment of the present invention.

ステップ２４００） Reduceタスク１０２２は、TaskTracker１０２１からReducerタスクで利用するconfigurationを検知用のものに入れ替える指示があると、Contextクラスが保持するConfigurationクラスを取り出し、それをパラメータが利用されたクラス、関数、順序を検知するためのConfigurationに入れ替える。 Step 2400) When the Reduce task 1022 is instructed by the TaskTracker 1021 to replace the configuration used in the Reducer task with that for detection, the Reduce class 1022 takes out the Configuration class held by the Context class, and uses the parameters for the class, function, and order. Replace with Configuration to detect.

ステップ２４１０） Reducerクラスの名前をConfigurationから取得し、通常のReduceクラスを生成する。 Step 2410) The name of the Reducer class is acquired from the Configuration, and a normal Reduce class is generated.

ステップ２４２０）ユーザが定義した任意の事前処理のみを行う。ここで、Reducerの中の関数名をReducer内のものであることが明示的にわかるような関数名（検知が容易な関数名）にしておくものとする。 Step 2420) Only any pre-processing defined by the user is performed. Here, it is assumed that the function name in the Reducer is set to a function name (function name that can be easily detected) that clearly indicates that the function is in the Reducer.

ステップ２４３０）割り当てられた中間データ（Key、Value）に対してReduce処理を行う（検知が容易な関数名）。 Step 2430) Reduce processing is performed on the assigned intermediate data (Key, Value) (function name that is easy to detect).

ステップ２４４０）ユーザの定義した任意の終了処理のみを行う（検知が容易な関数名）。 Step 2440) Only an arbitrary end process defined by the user is performed (function name easy to detect).

次に、図２０のステップ１８２０において、Map or Reduceタスク１０２２が統合ジョブを行うMapperの処理を説明する。 Next, the Mapper processing performed in step 1820 of FIG. 20 when the Map or Reduce task 1022 runs an integrated job will be described.

図２４は、本発明の一実施例の統合ジョブ処理を行うMapperの処理のフローチャートである。 FIG. 24 is a flowchart of mapper processing for performing integrated job processing according to an embodiment of the present invention.

ステップ２５００） TaskTrackerノード１０２０のTaskTracker１０２１において、Contextクラスが持つConfigurationクラスを参照し、そこに格納されている共有化可能な関数の情報により、事前処理、Map処理、終了処理の各段階での共有化グループを表すリストを生成し、Mapperタスク１０２２に出力する。 Step 2500) In the TaskTracker 1021 of the TaskTracker node 1020, the Configuration class of the Context class is referenced, and sharing at each stage of the pre-processing, the map processing, and the termination processing is performed based on the information on the functions that can be shared. A list representing the group is generated and output to the mapper task 1022.

ステップ２５１０） Mapperタスク１０２２は、ステップ２５００の出力より共有化グループの数だけ本来のMapperを生成する。 Step 2510) The mapper task 1022 generates the original mapper by the number of shared groups from the output of step 2500.

ステップ２５２０）本来のMapperそれぞれに事前処理を行わせる。 Step 2520) Each of the original mappers performs pre-processing.

ステップ２５３０）ステップ２５１０の出力に対して、入力の次の１行に対してユーザの定義した任意のMap処理のみを行わせる。 Step 2530) For the output of Step 2510, only the arbitrary Map processing defined by the user is performed for the next line of the input.

ステップ２５４０）ステップ２５３０で読み込んだ行がデータの最終行であるかどうかを判定し、最終行である場合はステップ２５５０に移行し、最終行でない場合はステップ２５３０に移行する。 Step 2540) It is determined whether or not the line read in Step 2530 is the last line of the data. If it is the last line, the process proceeds to Step 2550, and if not, the process proceeds to Step 2530.

ステップ２５５０）ユーザの定義した任意の終了処理のみを行わせる。 Step 2550) Only an arbitrary end process defined by the user is performed.
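
The per-line loop of steps 2510 to 2540 can be sketched as follows. It is a minimal illustration under assumptions: `IntegratedMapper`, `Group`, and `run` are hypothetical names, a `Function<String, String>` stands in for an original Mapper, the pre-processing and end processing of steps 2520 and 2550 are omitted, and the tag is modeled by prefixing each output with the job numbers of the sharing group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: one original mapper per sharing group, not per job.
class IntegratedMapper {
    /** A sharing group: the jobs it serves and the map function they share. */
    static final class Group {
        final List<Integer> jobIds;
        final Function<String, String> mapFunction;

        Group(List<Integer> jobIds, Function<String, String> mapFunction) {
            this.jobIds = jobIds;
            this.mapFunction = mapFunction;
        }
    }

    /** Reads each input line once and applies every group's mapper to it. */
    static List<String> run(List<String> inputLines, List<Group> groups) {
        List<String> taggedOutput = new ArrayList<>();
        for (String line : inputLines) {
            for (Group group : groups) {
                // The tag tells later stages which jobs the record belongs to.
                taggedOutput.add(group.jobIds + "\t" + group.mapFunction.apply(line));
            }
        }
        return taggedOutput;
    }

    public static void main(String[] args) {
        List<Group> groups = Arrays.asList(
                new Group(Arrays.asList(2, 3), s -> s.toUpperCase()), // shared by jobs 2 and 3
                new Group(Arrays.asList(1), s -> s));                 // job 1 on its own
        for (String record : run(Arrays.asList("a", "b"), groups)) {
            System.out.println(record);
        }
    }
}
```

With three jobs in two sharing groups, each line is mapped twice instead of three times, which is where the reduction in processing comes from in this sketch.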

次に、図２０のステップ１８２０において、Map or Reduceタスク１０２２の統合ジョブを説明する。 Next, the Reducer processing performed in step 1820 of FIG. 20 when the Map or Reduce task 1022 runs an integrated job will be described.

図２５は、発明の一実施例の統合ジョブ処理を行うReducerの処理のフローチャートである。 FIG. 25 is a flowchart of a Reducer process that performs integrated job processing according to an embodiment of the present invention.

ステップ２６００） TaskTrackerノード１０２０のTaskTracker１０２１は、Contextクラスが持つConfigurationクラスを参照し、そこに格納されている共有化可能な関数の情報により、事前処理、Reduce処理、終了処理の各段階での共有化グループを表すリストを生成し、Reducerタスク１０２２に出力する。 Step 2600) The TaskTracker 1021 of the TaskTracker node 1020 refers to the Configuration class of the Context class, and shares information at each stage of the pre-processing, Reduce processing, and termination processing based on the information on the functions that can be shared. A list representing the group is generated and output to the Reducer task 1022.

ステップ２６１０）ステップ２６００の出力より共有化グループの数だけ本来のReducerを生成する。 Step 2610) The original Reducers are generated from the output of Step 2600 by the number of shared groups.

ステップ２６２０）本来のReducerそれぞれに事前処理を行わせる。 Step 2620) Have each original reducer perform pre-processing.

ステップ２６３０）ステップ２６１０の出力に対して、入力の次の１行に対してユーザの定義した任意のReduce処理のみを行わせる。 Step 2630) The Reducers generated in Step 2610 are made to perform only the user-defined Reduce processing on the next line of the input.

ステップ２６４０）ステップ２６３０で読み込んだ行がデータの最終行であるかどうかを判定し、最終行である場合はステップ２６５０に移行し、最終行でない場合はステップ２６３０に移行する。 Step 2640) It is determined whether or not the line read in Step 2630 is the last line of the data. If it is the last line, the process proceeds to Step 2650; if it is not, the process returns to Step 2630.

ステップ２６５０）ユーザの定義した任意の終了処理のみを行わせる。 Step 2650) Only an arbitrary end process defined by the user is performed.

なお、上記の図１に示す構成要素の動作をプログラムとして構築し、データ分析・機械学習装置として利用されるコンピュータにインストールして実行させる、または、ネットワークを介して流通させることが可能である。 The operations of the components shown in FIG. 1 can be constructed as a program and installed in a computer used as a data analysis / machine learning device for execution, or distributed via a network.
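The list of sharing groups generated in Step 2600 for the Reducer (and in Step 2500 for the Mapper) can be pictured with the following sketch. The text states only that the list is derived from the stored information about functions that can be shared. The data layout, the name `build_sharing_groups`, and the rule used here (two jobs share a stage when that stage uses the same function name with the same parameter values) are assumptions made for illustration.

```python
STAGES = ("setup", "reduce", "cleanup")


def build_sharing_groups(job_configs):
    """Return {stage: [[job_id, ...], ...]}, a hypothetical layout of the Step 2600 list."""
    result = {}
    for stage in STAGES:
        groups = {}
        for job_id, stages in job_configs.items():
            # Assumed rule: same function name and same parameter values => shareable.
            key = stages[stage]
            groups.setdefault(key, []).append(job_id)
        result[stage] = list(groups.values())
    return result


# Two hypothetical jobs that differ only in a parameter of the Reduce stage.
job_configs = {
    "job_a": {"setup": ("load_model", ()), "reduce": ("sum_counts", (0.1,)), "cleanup": ("flush", ())},
    "job_b": {"setup": ("load_model", ()), "reduce": ("sum_counts", (0.5,)), "cleanup": ("flush", ())},
}
sharing = build_sharing_groups(job_configs)
```

Here the pre-processing and end processing stages each form one shared group, while the Reduce stage is split into two groups because its parameter differs between the jobs.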

また、本発明は、上記の実施の形態及び実施例に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。 The present invention is not limited to the above-described embodiments and examples, and various modifications and applications are possible within the scope of the claims.

１０分散ファイルシステム（HDFS）
２０集計用データ作成部（Mapper）
３０中間データ記憶部
４０集計部（Reducer）
５０制御部（Controller）
６０結合部（Combiner）
７０シャッフル(Shuffle)部
１０００クライアントノード
１００１ MapReduceアプリケーション
１００２ジョブクライアント（JobClient）
１０１０ジョブトラッカ（JobTracker）ノード
１０１１ジョブトラッカ（JobTracker）
１０２０タスクトラッカ（TaskTracker）ノード
１０２１タスクトラッカ（TaskTracker）
１０２２ MapまたはReduceタスク
１０３０分散ファイルシステム(HDFS)
10 Distributed file system (HDFS)
20 Aggregation data creation unit (Mapper)
30 Intermediate data storage unit
40 Aggregation unit (Reducer)
50 Control unit (Controller)
60 Combining unit (Combiner)
70 Shuffle unit
1000 Client node
1001 MapReduce application
1002 Job client (JobClient)
1010 Job tracker (JobTracker) node
1011 Job tracker (JobTracker)
1020 Task tracker (TaskTracker) node
1021 Task tracker (TaskTracker)
1022 Map or Reduce task
1030 Distributed file system (HDFS)

Claims

１つのキーと値の組をマッピングして中間データを生成するジョブを行うマップ（Map）手段と、該中間データを更に小さな値にセットするジョブを行うリデュース（Reduce）手段、及び、与えられた教師データを格納する分散ファイルシステムを有し、大規模なデータを並列分散処理するためのデータ分析及び機械学習処理装置であって、
前記Map手段または前記Reduce手段にジョブが与えられると、該ジョブのパラメータの値、利用位置、利用元関数を含むパラメータ情報を検出し、前記分散ファイルシステムに格納するパラメータ検出手段と、
前記Map手段、前記Reduce手段に新たなジョブが与えられると、該ジョブのパラメータの値、利用位置、及び、利用元関数を含む新パラメータ情報を検出し、該新パラメータ情報と前記分散ファイルシステムに格納されている前記パラメータ情報と比較し、パラメータに依存しない重複部分のジョブを統合するジョブ統合手段と、
前記Map手段または前記Reduce手段を、統合されたジョブを重複部分を共有しながら実行させる共有実行手段と、
を有することを特徴とするデータ分析及び機械学習処理装置。
A data analysis and machine learning processing apparatus for parallel distributed processing of large-scale data, having Map means for performing a job that maps a key-value pair to generate intermediate data, Reduce means for performing a job that reduces the intermediate data to a smaller set of values, and a distributed file system that stores given teacher data, the apparatus comprising:
parameter detection means for, when a job is given to the Map means or the Reduce means, detecting parameter information including the value of each parameter of the job, the position where it is used, and the function that uses it, and storing the parameter information in the distributed file system;
job integration means for, when a new job is given to the Map means and the Reduce means, detecting new parameter information including the value of each parameter of the new job, the position where it is used, and the function that uses it, comparing the new parameter information with the parameter information stored in the distributed file system, and integrating the jobs in their overlapping portions that do not depend on the parameters; and
shared execution means for causing the Map means or the Reduce means to execute the integrated jobs while sharing the overlapping portions.

前記ジョブ統合手段は、
前記Map手段または前記Reduce手段に対して新たなジョブが追加されると、該Map手段、または、該Reduce手段の実行前に、該ジョブを遅延キューに蓄積する遅延キュー制御手段と、
新たに追加されたジョブと既に前記遅延キューに格納されたジョブのパラメータを比較し、利用するMap手段名またはReduce手段名が同じものがある場合は、ジョブを統合し、所定のキュー保持時間が経過すると、通常のキューに統合されたジョブを移行させる手段と、
を含む請求項１記載のデータ分析及び機械学習処理装置。
The data analysis and machine learning processing apparatus according to claim 1, wherein the job integration means includes:
delay queue control means for, when a new job is added for the Map means or the Reduce means, accumulating the job in a delay queue before the Map means or the Reduce means is executed; and
means for comparing the parameters of the newly added job with those of the jobs already stored in the delay queue, integrating the jobs when they use the same Map means name or Reduce means name, and moving the integrated job to a normal queue when a predetermined queue retention time has elapsed.
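The delay queue recited in this claim can be sketched as follows. The class `DelayQueue`, its method names, and the choice to measure the retention time from the arrival of the first job in each group are assumptions for illustration. The claim specifies only that jobs using the same Map or Reduce means name are integrated and moved to the normal queue after a predetermined retention time. Time is passed in explicitly so the behavior is deterministic.

```python
class DelayQueue:
    """Hypothetical delay queue: holds jobs for `hold_seconds`, merging them as they arrive."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.pending = {}       # function name -> {"arrived": time, "params": [...]}
        self.normal_queue = []  # integrated jobs ready for execution

    def submit(self, func_name, params, now):
        entry = self.pending.get(func_name)
        if entry is None:
            # First job using this Map/Reduce name: start a new pending entry.
            self.pending[func_name] = {"arrived": now, "params": [params]}
        else:
            # Same Map/Reduce name already waiting: integrate into that entry.
            entry["params"].append(params)

    def tick(self, now):
        # Move every entry whose retention time has elapsed to the normal queue.
        expired = [name for name, entry in self.pending.items()
                   if now - entry["arrived"] >= self.hold_seconds]
        for name in expired:
            entry = self.pending.pop(name)
            self.normal_queue.append((name, entry["params"]))


queue = DelayQueue(hold_seconds=10)
queue.submit("KMeansMapper", {"k": 3}, now=0)
queue.submit("KMeansMapper", {"k": 5}, now=4)
queue.submit("BayesMapper", {"alpha": 1.0}, now=6)
queue.tick(now=10)
```

After the call to `tick`, the two `KMeansMapper` jobs leave the delay queue as one integrated job, while the `BayesMapper` job is still being held.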

前記ジョブ統合手段は、
前記分散ファイルシステムに格納されている前記パラメータ情報と前記新パラメータ情報を比較して、前記パラメータに依存しない重複部分がある場合は、代表する１つのジョブに対してその他のジョブのパラメータとの差分を追加する手段を含む
請求項１または２記載のデータ分析及び機械学習処理装置。
The data analysis and machine learning processing apparatus according to claim 1 or 2, wherein the job integration means includes means for comparing the parameter information stored in the distributed file system with the new parameter information and, when there is an overlapping portion that does not depend on the parameters, adding to one representative job the differences from the parameters of the other jobs.
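One way to picture "one representative job plus the parameter differences of the other jobs" is the sketch below. The dictionary representation and the function names `integrate` and `expand` are hypothetical, and the sketch assumes that all jobs being integrated define the same set of parameter names.

```python
def integrate(jobs):
    """Merge a list of parameter dicts into (representative, diffs)."""
    representative = dict(jobs[0])
    diffs = []
    for other in jobs[1:]:
        # Keep only the parameters whose values differ from the representative.
        diffs.append({k: v for k, v in other.items() if representative.get(k) != v})
    return representative, diffs


def expand(representative, diffs):
    """Rebuild the full parameter set of every job from the integrated form."""
    return [dict(representative)] + [{**representative, **d} for d in diffs]


jobs = [
    {"input": "/data/train", "k": 3, "iters": 10},
    {"input": "/data/train", "k": 5, "iters": 10},
]
representative, diffs = integrate(jobs)
```

Only the differing parameter (`k` in this example) is stored for the second job; everything else is taken from the representative.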

前記共有実行手段は、
前記ジョブ統合手段で統合されたジョブの重複部分を共有しながら前記Map手段、または、前記Reduce手段を実行させ、パラメータに依存する処理部分に到達した時点で処理を分岐させる手段を含む
請求項１記載のデータ分析及び機械学習処理装置。
The data analysis and machine learning processing apparatus according to claim 1, wherein the shared execution means includes means for causing the Map means or the Reduce means to execute while sharing the overlapping portions of the jobs integrated by the job integration means, and for branching the processing when a processing portion that depends on the parameters is reached.
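The idea of sharing the parameter-independent portion and branching only where a parameter is used can be sketched as follows. The functions `tokenize`, `keep_long_words`, and `shared_map` are hypothetical examples, not taken from the patent. The `counter` argument exists only to make visible that the shared portion runs once per input line.

```python
def tokenize(line):
    """Parameter-independent portion: identical for every integrated job."""
    return line.lower().split()


def keep_long_words(tokens, min_len):
    """Parameter-dependent portion: the result differs per job."""
    return [t for t in tokens if len(t) >= min_len]


def shared_map(line, min_len_by_job, counter):
    # Shared portion: executed once, regardless of how many jobs were integrated.
    tokens = tokenize(line)
    counter["tokenize"] += 1
    # Branch point: from here on, each job continues with its own parameter value.
    return {job: keep_long_words(tokens, n) for job, n in min_len_by_job.items()}


counter = {"tokenize": 0}
out = shared_map("Map Reduce on HDFS", {"job_a": 3, "job_b": 5}, counter)
```

Two jobs receive different outputs from the same line, yet the tokenization step is performed only once.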

１つのキーと値の組をマッピングして中間データを生成するジョブを行うマップ（Map）手段と、該中間データを更に小さな値にセットするジョブを行うリデュース（Reduce）手段、及び、与えられた教師データを格納する分散ファイルシステムを有し、大規模なデータを並列分散処理する装置における、データ分析及び機械学習処理方法であって、
パラメータ検出手段が、前記Map手段または前記Reduce手段にジョブが与えられると、該ジョブのパラメータの値、利用位置、及び、利用元関数を含むパラメータ情報を検出し、前記分散ファイルシステムに格納するパラメータ検出ステップと、
ジョブ統合手段が、前記Map手段、前記Reduce手段に新たなジョブが与えられると、該ジョブのパラメータの値、利用位置、利用元関数を含む新パラメータ情報を検出し、該新パラメータ情報と前記分散ファイルシステムに格納されている前記パラメータ情報と比較し、パラメータに依存しない重複部分のジョブを統合するジョブ統合ステップと、
共有実行手段が、統合されたジョブを重複部分を共有しながら前記Map手段または前記Reduce手段を実行させる共有実行ステップと、
を行うことを特徴とするデータ分析及び機械学習処理方法。
A data analysis and machine learning processing method in an apparatus that performs parallel distributed processing of large-scale data and that has Map means for performing a job that maps a key-value pair to generate intermediate data, Reduce means for performing a job that reduces the intermediate data to a smaller set of values, and a distributed file system that stores given teacher data, the method comprising:
a parameter detection step in which parameter detection means, when a job is given to the Map means or the Reduce means, detects parameter information including the value of each parameter of the job, the position where it is used, and the function that uses it, and stores the parameter information in the distributed file system;
a job integration step in which job integration means, when a new job is given to the Map means and the Reduce means, detects new parameter information including the value of each parameter of the new job, the position where it is used, and the function that uses it, compares the new parameter information with the parameter information stored in the distributed file system, and integrates the jobs in their overlapping portions that do not depend on the parameters; and
a shared execution step in which shared execution means causes the Map means or the Reduce means to execute the integrated jobs while sharing the overlapping portions.

前記ジョブ統合ステップにおいて、
前記Map手段または前記Reduce手段に対して新たなジョブが追加されると、該Map手段、または、該Reduce手段の実行前に、該ジョブを遅延キューに蓄積する遅延キュー制御ステップと、
新たに追加されたジョブと既に前記遅延キューに格納されたジョブのパラメータを比較し、利用するMap手段名またはReduce手段名が同じものがある場合は、ジョブを統合し、所定のキュー保持時間が経過すると、通常のキューに統合されたジョブを移行させるステップと、
を行う請求項５記載のデータ分析及び機械学習処理方法。
The data analysis and machine learning processing method according to claim 5, wherein the job integration step includes:
a delay queue control step of, when a new job is added for the Map means or the Reduce means, accumulating the job in a delay queue before the Map means or the Reduce means is executed; and
a step of comparing the parameters of the newly added job with those of the jobs already stored in the delay queue, integrating the jobs when they use the same Map means name or Reduce means name, and moving the integrated job to a normal queue when a predetermined queue retention time has elapsed.

前記ジョブ統合ステップにおいて、
前記分散ファイルシステムに格納されている前記パラメータ情報と前記新パラメータ情報を比較して、前記パラメータに依存しない重複部分がある場合は、代表する１つのジョブに対してその他のジョブのパラメータとの差分を追加する
請求項５または６記載のデータ分析及び機械学習処理方法。
The data analysis and machine learning processing method according to claim 5 or 6, wherein, in the job integration step, the parameter information stored in the distributed file system is compared with the new parameter information and, when there is an overlapping portion that does not depend on the parameters, the differences from the parameters of the other jobs are added to one representative job.

前記共有実行ステップにおいて、
前記ジョブ統合手段で統合されたジョブの重複部分を共有しながら前記Map手段、または、前記Reduce手段を実行させ、パラメータに依存する処理部分に到達した時点で処理を分岐させる
請求項５記載のデータ分析及び機械学習処理方法。
The data analysis and machine learning processing method according to claim 5, wherein, in the shared execution step, the Map means or the Reduce means is executed while sharing the overlapping portions of the jobs integrated by the job integration means, and the processing is branched when a processing portion that depends on the parameters is reached.

請求項１乃至４のいずれか１項に記載のデータ分析及び機械学習装置を構成する各手段としてコンピュータを機能させるためのデータ分析及び機械学習処理プログラム。 A data analysis and machine learning processing program for causing a computer to function as each means constituting the data analysis and machine learning device according to any one of claims 1 to 4.