JP5965498B2

JP5965498B2 - Parallel processing method and parallel computer system

Info

Publication number: JP5965498B2
Application number: JP2014553925A
Authority: JP
Inventors: 幸二福田; 由子長坂; 拓実仁藤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2012-12-26
Filing date: 2012-12-26
Publication date: 2016-08-03
Anticipated expiration: 2032-12-26
Also published as: WO2014102917A1; JPWO2014102917A1

Description

The present invention relates to a parallel processing method and a parallel computer system, and more particularly to the transmission of processing results.

As a general-purpose method for performing calculations efficiently on a parallel computing device, Non-Patent Document 1 discloses a calculation model called MapReduce, shown in FIG. 2. As described later, MapReduce is taken up here merely as an example for technical explanation, and the invention of the present application does not presuppose the MapReduce calculation model.

As shown in FIG. 2, the MapReduce calculation model consists of three phases: a Map phase, a Sort phase, and a Reduce phase. In the Map phase, the input data is divided into a large number of processing units, read, and input to Map processes. Each Map process performs some calculation or processing on each processing unit and outputs <Key, Value> pairs. In the subsequent Sort phase, the <Key, Value> pairs output in the Map phase are classified (sorted) by Key, and for each Key a set of the corresponding Values is output. In the Reduce phase, the set of Values for each Key is input to a Reduce process, which performs some calculation or processing on that set and outputs the final result.
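The three phases described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the implementation of Non-Patent Document 1; the function names and the word-count application (`word_map`, `word_reduce`) are assumptions chosen only for this example.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Map phase: apply map_fn to each processing unit and collect <Key, Value> pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def sort_phase(pairs):
    """Sort phase: classify pairs by Key, producing one list of Values per Key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped, reduce_fn):
    """Reduce phase: process the Values of each Key independently."""
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Hypothetical application: count how often each word occurs.
def word_map(line):
    return [(word, 1) for word in line.split()]


def word_reduce(key, values):
    return sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    return reduce_phase(sort_phase(map_phase(records, map_fn)), reduce_fn)


result = run_mapreduce(["a b a", "b c"], word_map, word_reduce)
```

Only `word_map` and `word_reduce` are specific to the application; the rest is generic across applications.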

In the Map phase and the Reduce phase, each Map process and each Reduce process has no dependency on the other Map and Reduce processes, so they can be executed in parallel. Therefore, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device composed of a plurality of computation nodes.

FIG. 3 is a schematic diagram showing the MapReduce calculation model operating on a parallel computing device composed of a plurality of computation nodes. As described above, the Map processes in the Map phase and the Reduce processes in the Reduce phase can each be executed in parallel. These processes can therefore be allocated to a plurality of computation nodes and executed in parallel. Note that the Map phase and the Reduce phase do not necessarily have to be performed on the same parallel computing device. In many cases, the total number of Map processes and Reduce processes is larger than the total number of computation nodes. In that case, each computation node is necessarily responsible for a plurality of Map processes and Reduce processes. Consider the case where the total number of Reduce processes (the total number of distinct Keys output in the Map phase) is larger than the total number of computation nodes. In this case, the Sort phase in FIG. 2 is subdivided into two phases: a Shuffle phase (inter-node communication phase) and a Local Sort phase (receiving-side sort phase). In the Shuffle phase, each <Key, Value> pair output in the Map phase is transmitted to the computation node in charge of the Reduce process that is uniquely determined by its Key. When each computation node is in charge of a plurality of Reduce processes (a plurality of Map output Keys), the <Key, Value> pairs sent in the Shuffle phase must be classified (sorted) by Key at the receiving computation node. This classification (sorting) is performed in the Local Sort phase.
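The split into a Shuffle phase and a Local Sort phase can be illustrated with the following sketch. The key-to-node rule `node_for_key` (key modulo the number of nodes) is an assumption made only for this example; the text requires only that the destination node be uniquely determined by the Key.

```python
from collections import defaultdict


def node_for_key(key, num_nodes):
    """Hypothetical rule: the Reduce node is uniquely determined by the Key."""
    return key % num_nodes


def shuffle(map_outputs, num_nodes):
    """Shuffle phase: send each pair to the node in charge of its Key.

    Pairs arrive at each node in no particular Key order.
    """
    received = [[] for _ in range(num_nodes)]
    for pairs in map_outputs:
        for key, value in pairs:
            received[node_for_key(key, num_nodes)].append((key, value))
    return received


def local_sort(received_pairs):
    """Local Sort phase: classify the pairs received by one node by Key."""
    grouped = defaultdict(list)
    for key, value in sorted(received_pairs, key=lambda kv: kv[0]):
        grouped[key].append(value)
    return dict(grouped)


# Output of two Map processes, to be reduced on two nodes.
map_outputs = [[(5, "a"), (2, "b")], [(2, "c"), (7, "d"), (4, "e")]]
received = shuffle(map_outputs, num_nodes=2)
per_node = [local_sort(pairs) for pairs in received]
```

Here node 0 ends up with Keys 2 and 4 and node 1 with Keys 5 and 7. The `sorted` call in `local_sort` is the step whose cost the later sections aim to reduce.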

As described above, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device composed of a plurality of computation nodes. Furthermore, when the MapReduce calculation model is used on such a device, the actual calculation and processing is performed only by the Map processes and the Reduce processes, while the Shuffle phase and the Local Sort phase are the same regardless of what the application does. Therefore, if the Shuffle phase and the Local Sort phase are created in advance as a common framework, a variety of applications can easily be created by changing only the processing performed by the Map and Reduce processes.

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proceedings of OSDI '04: 6th Symposium on Operating Systems Design and Implementation, 2004, pp. 137-149

As described above, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device composed of a plurality of computation nodes. However, as the amount of data to be processed increases, and in particular as the total number of Reduce processes grows relative to the total number of computation nodes, the Local Sort phase (receiving-side sort phase) becomes a heavy process and comes to occupy most of the total processing time. This is because the Map phase, the Reduce phase, and the Shuffle phase take time proportional to the amount of data to be processed, whereas the Local Sort phase, with a simple sorting algorithm, takes time proportional to N × log(N) for a data amount N.

To make matters worse, the data rearrangement performed in the Local Sort phase generally requires random access to a storage device. Typically, the random access memory (RAM) that serves as the temporary storage device of each computation node has a small capacity. When the amount of data to be handled is large, the Local Sort phase therefore cannot rely on RAM alone, which allows fast random access, and must also use storage devices such as flash memory or hard disk drives, which have a large capacity but particularly slow random access. Consequently, once the amount of data to be handled exceeds the capacity of the RAM, the time required for the Local Sort increases even more sharply.

The invention of the present application has been made in view of the above. One of its objects is to provide means for performing the classification (sorting) by Key, or a part of it, in the Map phase and the Shuffle phase, so that the data arriving at the receiving computation node in the Shuffle phase is, as far as possible, already classified (sorted) by Key, and to use this to provide means for shortening the processing time of the Local Sort phase. The above and other objects and novel features of the invention will become apparent from the description of this specification and the accompanying drawings.

The problem to be solved by the invention of the present application arises generally when a plurality of processing units are processed in parallel by a plurality of computation nodes, and MapReduce is taken up here merely as an example for explanation. The invention is therefore applicable not only to the MapReduce calculation model but also to many other cases in which a plurality of processing units are processed in parallel by a plurality of computation nodes. As an example other than MapReduce to which the invention is applicable, FIG. 4 shows a schematic diagram of a graph problem being calculated on a parallel computing device.

Parallel calculation of a graph problem takes as given a graph structure composed of a plurality of vertices and edges connecting them, and performs various calculations on it. FIG. 4 shows an example in which computation nodes 1 to 4 each process the part of the graph assigned to them. A graph problem usually consists of a plurality of calculation steps. In each step, some calculation is performed for every vertex, taking as input the previous-step results of the vertices that have an edge pointing to that vertex. When a graph problem is calculated on a parallel computing device composed of a plurality of computation nodes, and given that each vertex can be calculated independently, it is natural to assign vertices to the computation nodes and process them in parallel. In that case, each time a calculation step ends, the calculation results must be transmitted to the vertices at the far end of the edges, and the receiving computation node must classify (sort) the received data by destination vertex. This is the same processing as the Shuffle phase and the Local Sort phase in the MapReduce calculation model described above. The more vertices there are, that is, the larger the graph problem, the longer the classification (sorting) by destination vertex takes at the receiving computation node, and this is the problem.
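The per-step exchange in the graph case can be sketched as follows. This is an illustration only: the vertex-to-node assignment `owner` and the per-vertex computation (summing the incoming values) are assumptions for this example and are not specified in the text.

```python
from collections import defaultdict


def superstep(edges, values, owner):
    """One calculation step over a directed graph.

    edges:  list of (source_vertex, destination_vertex)
    values: result of the previous step for each vertex
    owner:  hypothetical mapping from vertex to computation node
    """
    # Communication: each result is sent to the node owning the destination
    # vertex. Messages arrive in no particular order.
    inbox = defaultdict(list)
    for src, dst in edges:
        inbox[owner[dst]].append((dst, values[src]))

    new_values = dict(values)
    for node, messages in inbox.items():
        # Receiving-side classification (sorting) by destination vertex.
        by_vertex = defaultdict(list)
        for dst, val in sorted(messages, key=lambda m: m[0]):
            by_vertex[dst].append(val)
        # Hypothetical per-vertex computation: sum of the incoming values.
        for dst, vals in by_vertex.items():
            new_values[dst] = sum(vals)
    return new_values


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
values = {0: 1, 1: 10, 2: 100, 3: 1000}
owner = {0: 0, 1: 0, 2: 1, 3: 1}
next_values = superstep(edges, values, owner)
```

The grouping by destination vertex plays the same role as the Local Sort phase in MapReduce, with the destination vertex acting as the Key.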

The present invention solves the above problem in a parallel computer system having a plurality of computation nodes by dividing the processing target according to a first grouping, placing the parts on the computation nodes and processing them, storing the processing results in a group of storage devices according to a second grouping, and transmitting the stored processing results to the computation nodes according to the first grouping.

According to the present invention, the time required for the classification (sorting) performed at the receiving computation node can be reduced. As a result, parallel calculation can be accelerated.

A block diagram showing the configuration of an information processing system that is an embodiment of the present invention.

A diagram explaining the calculation model called MapReduce disclosed in Non-Patent Document 1.

A conceptual diagram of the MapReduce calculation model disclosed in Non-Patent Document 1 operating on a parallel computing device.

A conceptual diagram of a graph problem operating on a parallel computing device, as an example of the problem solved by the present invention.

A diagram explaining the acceleration of the receiving-side sort (Local Sort) by bucket sort, which was the starting point for conceiving the present invention.

A diagram explaining the acceleration of the receiving-side sort (Local Sort) by bucket sort, which was the starting point for conceiving the present invention.

A diagram conceptually explaining how sharing buckets among the computation nodes reduces the number of storage devices required for bucket sort.

A diagram explaining the operation of the parallel processing method and the parallel computer system according to the first embodiment of the present invention.

A diagram explaining the operation of the parallel processing method and the parallel computer system according to the second embodiment of the present invention.

A diagram explaining the operation of the parallel processing method and the parallel computer system according to the third embodiment of the present invention.

In the following embodiments, the description is divided into a plurality of sections or embodiments where this is necessary for convenience. Unless otherwise stated, these are not unrelated to one another: one is a modification, a detail, or a supplementary explanation of part or all of another. In addition, when the following embodiments refer to the number of elements and the like (including counts, numerical values, amounts, and ranges), the invention is not limited to that specific number, which may be exceeded or undercut, except where explicitly stated or where the number is clearly limited in principle.

Embodiments of the present invention are described in detail below with reference to the drawings. In all the drawings used to describe the embodiments, the same members are in principle denoted by the same reference symbols, and their description is not repeated.

FIG. 1 is a block diagram showing the configuration of an information processing system 101 that is an embodiment of the present invention. The information processing system 101 is a parallel computer system including a plurality of computation nodes CALC_NODE_x (x = 1, 2, 3, ...) and a communication switch COM_SW that mediates communication between them.

Each computation node CALC_NODE_x includes a central processing unit (CPU), a temporary storage device MEM, a storage device STOR, a communication device COM_DEV, and a bus BUS that connects the CPU, the storage device STOR, and the communication device COM_DEV. In each computation node CALC_NODE_x, the CPU reads the necessary input data from the storage device STOR, performs calculation, and, if necessary, uses the communication device COM_DEV to transmit the input data or the calculation results to other computation nodes. During this series of processing, the CPU stores temporarily needed data, intermediate results, and the like in the temporary storage device MEM as required. As described later, one of the computation nodes CALC_NODE_x serves as a retransmission management node.

A random access memory (RAM) is used as the temporary storage device MEM. A flash memory, a phase change memory, or a hard disk drive is used as the storage device STOR. The temporary storage device MEM can therefore be accessed faster than the storage device STOR but has a smaller capacity. In addition, the temporary storage device MEM allows fast random access, whereas the storage device STOR either supports only sequential access and no random access, or has a random access speed far lower than its sequential access speed.

In the parallel computer system shown in FIG. 1, the CPU, the temporary storage device MEM, the storage device STOR, and the communication device COM_DEV of the computation nodes are not necessarily identical, and their size and performance may differ from node to node. The invention disclosed in the present application also covers the case where some computation node does not include all of the CPU, the temporary storage device MEM, the storage device STOR, and the communication device COM_DEV.

FIG. 5(a) and FIG. 5(b) are diagrams explaining the bucket sort method, which was the starting point for conceiving the invention of the present application as a solution to the problem described above. As an example, consider the case where the Keys output in the Map phase are integers greater than or equal to 0. Integer Keys are chosen here only for the sake of explanation. Even if the Keys are not integers, for example character strings, the same method is applicable as long as the assignment of Keys to buckets is determined in advance.

Suppose that the computation node CALC_NODE_1, which handles a part of the parallel calculation, is in charge of Reduce processes for 1000 Key numbers, from 0 to 999. FIG. 5(a) shows the processing flow inside the computation node CALC_NODE_1 in this case. In the Shuffle phase, CALC_NODE_1 receives <Key, Value> pairs with Key numbers 0 to 999. Because nothing is stipulated about the order in which the <Key, Value> pairs arrive, all received pairs must first be stored in the storage device STOR in preparation for the later classification (sorting). After it has received all <Key, Value> pairs, the computation node CALC_NODE_1 reads all the pairs stored in the storage device STOR, classifies (sorts) them by Key, and performs the Reduce processing.

With the method shown in FIG. 5(a), however, the classification (sorting) by Key must be performed over all the <Key, Value> pairs received by the node, so it takes a long time. The inventors therefore also considered preparing a plurality of storage devices and performing a bucket sort, as shown in FIG. 5(b).

In FIG. 5(b), CALC_NODE_1 has ten storage devices, STOR1 to STOR10, which together form a group of buckets. The computation node CALC_NODE_1 distributes the received <Key, Value> pairs among the ten storage devices according to their Key numbers. For example, pairs with Key numbers 0 to 99 are stored in STOR1, pairs with Key numbers 100 to 199 in STOR2, and so on. After the Shuffle phase has ended and all <Key, Value> pairs have been received, the node first reads the pairs stored in the storage device STOR1, classifies (sorts) them by Key, and performs the Reduce processing. When all the pairs in STOR1 have been processed, it reads the pairs stored in the storage device STOR2, classifies (sorts) them by Key, and performs the Reduce processing. STOR3 to STOR10 are then classified (sorted) and reduced in turn in the same way.
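The bucket sort of FIG. 5(b) on a single node can be sketched as follows, with ten in-memory lists standing in for the storage devices STOR1 to STOR10. The function names and the use of `sum` as the Reduce operation are assumptions for this example.

```python
from collections import defaultdict

KEYS_PER_BUCKET = 100
NUM_BUCKETS = 10


def bucket_index(key):
    """Keys 0-99 map to index 0 (STOR1), 100-199 to index 1 (STOR2), and so on."""
    return key // KEYS_PER_BUCKET


def receive(pairs):
    """Shuffle phase on the receiving node: append each pair to its bucket.

    Appending corresponds to a purely sequential write to the storage device.
    """
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in pairs:
        buckets[bucket_index(key)].append((key, value))
    return buckets


def sort_and_reduce(buckets, reduce_fn):
    """Process one bucket at a time: read it, classify by Key, then reduce."""
    results = {}
    for bucket in buckets:
        grouped = defaultdict(list)
        for key, value in sorted(bucket, key=lambda kv: kv[0]):
            grouped[key].append(value)
        for key, values in grouped.items():
            results[key] = reduce_fn(values)
    return results


pairs = [(250, 1), (3, 2), (999, 3), (3, 4), (251, 5)]
buckets = receive(pairs)
totals = sort_and_reduce(buckets, sum)
```

Each call to `sorted` now sees only the pairs of one bucket, roughly a tenth of the data when the Keys are spread evenly.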

With the method shown in FIG. 5(b), the classification (sorting) by Key only has to be performed separately for each of the storage devices STOR1 to STOR10, so the classification (sorting) time can be shortened compared with FIG. 5(a). Access to each of the storage devices (buckets) STOR1 to STOR10 is purely sequential, so storage devices with slow random access can be used. In addition, where a storage device that supports simultaneous sequential access through multiple ports is available, the whole group of buckets can be built from a single storage device by making each access port correspond to a different Key number range (bucket). Although the buckets (storage devices STOR1 to STOR10) are assigned here in ascending order of Key, it is sufficient that the correspondence between Key numbers and storage devices (buckets) is fixed, and the assignment need not follow ascending Key order. The order in which the storage devices (buckets) are processed after the Shuffle phase is also arbitrary.
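The saving can be estimated with a short calculation. The following is a rough sketch under assumptions not stated in the text: a comparison-based sort with cost constant $c$, and $N$ received pairs spread evenly over $B$ buckets.

```latex
T_{\text{single}} \approx c\,N\log N,
\qquad
T_{\text{bucket}} \approx B \cdot c\,\frac{N}{B}\log\frac{N}{B}
  = c\,N\,(\log N - \log B)
```

Under these assumptions the sorting work drops by $c\,N\log B$. A further practical benefit is that each bucket holds only about $N/B$ pairs, which makes it more likely that a bucket fits in the temporary storage device MEM while the storage devices themselves are accessed only sequentially.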

However, the method shown in FIG. 5(b) has the problem that a large number of storage devices (buckets) must be provided in each computation node. To address this, the inventors of the present application arrived at the following configuration.

FIG. 6 shows an example in which the ten computation nodes CALC_NODE_1 to CALC_NODE_10 of the information processing system 101 are each in charge of Reduce processing for 1000 Key numbers. If ten storage devices (buckets) are placed in each computation node under the bucket sort method of FIG. 5(b), the parallel computing device as a whole requires 100 storage devices (buckets). As described above, under the bucket sort method of FIG. 5(b), after the Shuffle phase ends each node classifies (sorts) and reduces its own buckets in order, starting from the bucket that holds the <Key, Value> pairs with the smallest Key numbers. Unless the distribution of the Keys output in the Map phase is extremely uneven, the time CALC_NODE_1 needs to process the bucket for Key numbers 0 to 99 and the time CALC_NODE_2 needs to process the bucket for Key numbers 1000 to 1099 can be expected to be roughly the same. Consequently, CALC_NODE_1 starts processing the bucket for Key numbers 100 to 199 at almost the same time as CALC_NODE_2 starts processing the bucket for Key numbers 1100 to 1199. The inventors found that the number of buckets required for the parallel calculation as a whole can be reduced by sharing, across computation nodes, the buckets that are needed at almost the same time, that is, the buckets whose classification (sorting) and Reduce processing begin at almost the same time.

FIG. 7 is a diagram explaining an example of the parallel processing operation in the information processing system 101. For the purpose of describing the embodiment, FIG. 7 shows an example in which parallel processing is executed by ten computation nodes, CALC_NODE_1 to CALC_NODE_10, each of which is in charge of Reduce processing for 1000 Key numbers. As noted above, however, the use of integer Keys and the way in which Reduce processing and buckets are assigned are arbitrary.

The parallel processing operation shown in FIG. 7 involves ten storage devices (buckets) BSTOR1 to BSTOR10 that temporarily store the <Key, Value> pairs output in the Map phase, ten computation nodes CALC_NODE_1 to CALC_NODE_10 that perform the Reduce processing, and a controller BUCKET_CONT that manages the retransmission of the buckets.

The computation nodes CALC_NODE_1 to CALC_NODE_10 provide the storage devices STOR1 to STOR10 and the functions of performing Map processing, classification (sorting) by Key, and Reduce processing. As shown in FIG. 7, the storage device STOR of the computation node CALC_NODE_1 is designated STOR1, the storage device STOR of the computation node CALC_NODE_2 is designated STOR2, and so on.

The buckets BSTOR1 to BSTOR10, which temporarily store the <Key, Value> pairs, are implemented by also using the storage devices STOR of the computation nodes that execute the calculation in the parallel processing: the storage device STOR of the computation node CALC_NODE_1 serves as BSTOR1, that of CALC_NODE_2 as BSTOR2, that of CALC_NODE_3 as BSTOR3, and so on. This allows the parallel calculation to be executed with few resources. Alternatively, the buckets may be implemented on the storage devices STOR of computation nodes that do not execute the calculation in the parallel processing, for example with the storage device STOR of a computation node CALC_NODE_11 serving as BSTOR1, that of CALC_NODE_12 as BSTOR2, and that of CALC_NODE_13 as BSTOR3. The number of buckets BSTOR1 to BSTOR10 and the number of computation nodes CALC_NODE_1 to CALC_NODE_10 that perform the Reduce processing may also differ. In a configuration where the numbers differ and the buckets are placed on the computation nodes, each computation node hosts an integer number of storage devices (buckets) that is zero or more.

The controller BUCKET_CONT, which manages bucket retransmission, is realized on a retransmission management node, which is one of the calculation nodes CALC_NODE_x. The retransmission management node may double as a calculation node that takes part in the parallel processing, or a separate calculation node that is not involved in the parallel computation may be used.

In the parallel processing method shown in FIG. 7, the Shuffle phase described above (communication between calculation nodes) is further subdivided into two phases: a bucket transmission phase and a bucket retransmission phase. In the bucket transmission phase, each <Key, Value> pair output by the Map processing is not sent directly to the calculation node that performs the Reduce processing for its Key; instead it is sent to the bucket group BSTOR1 to BSTOR10, which is shared by the whole parallel process. The destination bucket is determined by the following rule: if the remainder of the Key number divided by 1000 is 0 to 99, the pair goes to BSTOR1; if the remainder is 100 to 199, it goes to BSTOR2; and so on. In this way, each calculation node assigns one of BSTOR1 to BSTOR10 as the destination of every <Key, Value> pair produced by the Map processing on the calculation nodes CALC_NODE_1 to CALC_NODE_10.
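The destination rule of the bucket transmission phase can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `bucket_for_key` and `send_to_buckets` are hypothetical, and buckets are modeled as in-memory lists rather than storage devices.

```python
NUM_BUCKETS = 10          # BSTOR1..BSTOR10
KEYS_PER_NODE = 1000      # divisor used in the text for the remainder
KEYS_PER_BUCKET = KEYS_PER_NODE // NUM_BUCKETS  # 100 remainders per bucket


def bucket_for_key(key: int) -> int:
    """Return the 1-based bucket number (1 means BSTOR1) for a Key number."""
    remainder = key % KEYS_PER_NODE
    return remainder // KEYS_PER_BUCKET + 1


def send_to_buckets(pairs):
    """Group <Key, Value> pairs by destination bucket (hypothetical helper)."""
    buckets = {b: [] for b in range(1, NUM_BUCKETS + 1)}
    for key, value in pairs:
        buckets[bucket_for_key(key)].append((key, value))
    return buckets
```

For example, Key number 1105 has remainder 105 and is therefore routed to BSTOR2, regardless of which calculation node will eventually reduce it.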

Once all the <Key, Value> pairs output in the Map phase have been stored in the bucket group BSTOR1 to BSTOR10, processing moves to the bucket retransmission phase. In this phase, following instructions from the controller BUCKET_CONT, the <Key, Value> pairs stored in each bucket are retransmitted to their original destination calculation nodes (the calculation nodes that perform the Reduce processing). Specifically, the <Key, Value> pairs stored in bucket BSTOR1 are retransmitted first. Each of the calculation nodes CALC_NODE_1 to CALC_NODE_10 stores the pairs it receives in its own storage device STOR (for example, STOR1 for calculation node CALC_NODE_1). When the retransmission of the pairs stored in bucket BSTOR1 is complete, the controller BUCKET_CONT notifies the calculation nodes CALC_NODE_1 to CALC_NODE_10. Each calculation node then reads the <Key, Value> pairs stored in its storage device (STOR1 to STOR10), classifies (sorts) them by Key, and performs the Reduce processing. When all the calculation nodes have finished the Reduce processing and the storage devices STOR1 to STOR10 are empty, the controller BUCKET_CONT starts retransmitting the next bucket, BSTOR2. Retransmission and Reduce processing then proceed in the same way, one bucket at a time, up to the last bucket BSTOR10.
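The bucket-by-bucket retransmission loop can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the controller, the network, and the storage devices are replaced by plain Python data structures, the names `reduce_node_for_key` and `retransmit_and_reduce` are hypothetical, and each Reduce node is assumed to own 1000 consecutive Key numbers.

```python
from collections import defaultdict

KEYS_PER_NODE = 1000


def reduce_node_for_key(key: int) -> int:
    """1-based Reduce node, assuming 1000 consecutive Key numbers per node."""
    return key // KEYS_PER_NODE + 1


def retransmit_and_reduce(buckets, reduce_fn):
    """Process buckets one at a time, as the controller would direct.

    buckets: dict mapping bucket number -> list of (key, value) pairs.
    reduce_fn: function applied to the list of values of one key.
    """
    results = {}
    for bucket_no in sorted(buckets):            # BSTOR1, then BSTOR2, ...
        # Retransmit this bucket to the original destination nodes (STOR).
        stor = defaultdict(list)
        for key, value in buckets[bucket_no]:
            stor[reduce_node_for_key(key)].append((key, value))
        # After the "retransmission complete" notice, each node sorts the
        # pairs it holds by Key and reduces them; STOR is then empty.
        for node, pairs in stor.items():
            grouped = defaultdict(list)
            for key, value in sorted(pairs, key=lambda kv: kv[0]):
                grouped[key].append(value)
            for key, values in grouped.items():
                results[key] = reduce_fn(values)
    return results
```

Because a Key always maps to the same bucket, all values of a Key arrive in the same round, so each node only ever sorts the contents of one bucket at a time.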

As described above, in the parallel processing method shown in FIG. 7, each calculation node performs the Map processing according to the first grouping, the processing results are sent to the storage devices BSTOR1 to BSTOR10 according to the second grouping, and the results are then retransmitted from the storage devices to the calculation nodes according to the first grouping. As a result, with only 10 buckets BSTOR1 to BSTOR10 for the whole parallel process, the time each calculation node spends on classification (sorting) can be reduced to the same level as when 10 buckets are provided for every calculation node (the configuration of FIG. 5(b)). On the other hand, communication between calculation nodes takes place twice, in the bucket transmission phase and in the bucket retransmission phase, so the amount of communication increases. In general, however, for a data amount N the time required for communication between calculation nodes is proportional to N, whereas the time required for classification (sorting) is proportional to N × log(N). The larger the data amount, therefore, the more a shorter classification (sorting) time contributes to reducing the overall processing time.
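The scaling argument can be written out as a short worked equation. The constants a and c and the bucket count B are assumptions introduced here for illustration; the text states only the proportionalities. The last line assumes the N pairs reach a node as B equally sized batches with disjoint Keys, so each batch can be sorted on its own.

```latex
% a, c: assumed constant factors; B: assumed number of equally sized batches
\begin{align*}
T_{\mathrm{comm}}(N) &= a\,N, &
T_{\mathrm{sort}}(N) &= c\,N \log N, \\
\frac{T_{\mathrm{sort}}(N)}{T_{\mathrm{comm}}(N)} &= \frac{c}{a}\,\log N
  && \text{(grows with } N\text{)}, \\
B \cdot c\,\frac{N}{B}\,\log\frac{N}{B} &= c\,N \log\frac{N}{B}
  && \text{(sorting in } B \text{ batches)}.
\end{align*}
```

Under this simple model the sorting term increasingly dominates the communication term as N grows, which is why reducing the sorting time is worth the extra round of communication for large data sets.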

In the above description, the Keys for which each calculation node performs the Reduce processing were simply assigned in blocks of 1000 in ascending order of Key number, and the Keys were assigned to buckets by the remainder of the Key number divided by 1000. This choice is arbitrary, and other assignments can also be used. To enhance the effect of the present invention, however, it is desirable that, as shown in FIG. 6, the assignment of Keys to calculation nodes for the Reduce processing and the assignment of Keys to buckets be as close to orthogonal to each other as possible. That is, the processing targets included in any one group of the first grouping should be distributed so that at least one of them is included in each group of the second grouping. Moreover, the desirable assignment depends on the distribution of the Keys output in the Map phase and differs from one application and input data set to another. The application designer or user may therefore be allowed to configure both how Keys are assigned to calculation nodes for the Reduce processing and how they are assigned to buckets. It is also possible to omit the controller BUCKET_CONT that manages bucket retransmission and to carry out the state transitions between phases by having the buckets and calculation nodes exchange the necessary information as needed.
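The orthogonality condition on the two groupings can be checked mechanically. The sketch below is an illustration only: `node_group`, `bucket_group`, and `is_orthogonal` are hypothetical names, and the two grouping functions encode the example assignment from the description (blocks of 1000 Keys per node, remainder ranges of 100 per bucket).

```python
def node_group(key: int) -> int:
    """First grouping: Reduce node, 1000 consecutive Key numbers per node."""
    return key // 1000


def bucket_group(key: int) -> int:
    """Second grouping: bucket chosen from the remainder of Key / 1000."""
    return (key % 1000) // 100


def is_orthogonal(keys, first, second) -> bool:
    """True if every first-grouping group has a key in every second group."""
    all_second = {second(k) for k in keys}
    by_first = {}
    for k in keys:
        by_first.setdefault(first(k), set()).add(second(k))
    return all(seen == all_second for seen in by_first.values())
```

With Key numbers 0 to 9999, `is_orthogonal(range(10000), node_group, bucket_group)` holds, whereas using the same function for both groupings does not satisfy the condition.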

As mentioned above, the information processing system 101 may also process graph-structured data. In that case, the graph is divided into as many parts as there are calculation nodes performing the parallel processing, and each part is assigned to a calculation node. Each calculation node processes the group of vertices assigned to it.

FIG. 8 is a diagram illustrating a second example of the parallel processing operation in the information processing system 101. The method of this embodiment applies the idea of radix sort (RadixSort), which achieves a finer sort by performing bucket sort several times. By reusing the same bucket group BSTOR1 to BSTOR10 to perform bucket sort twice, it obtains substantially the same effect as squaring the number of buckets. This embodiment is described using, as an example, the configuration in which the bucket group BSTOR1 to BSTOR10 that temporarily stores the <Key, Value> pairs shares the storage devices STOR of the calculation nodes that execute the parallel computation: the storage device STOR of calculation node CALC_NODE_1 serves as BSTOR1, that of CALC_NODE_2 as BSTOR2, that of CALC_NODE_3 as BSTOR3, and so on.

In the processing method shown in FIG. 8, each <Key, Value> pair output in the Map phase is sent to one of the buckets BSTOR1 to BSTOR10, chosen on the basis of the remainder of the Key number divided by 100. After all of this communication is complete, each calculation node reads the contents of bucket BSTOR1 and sends them again to the buckets BSTOR1 to BSTOR10, this time choosing the destination bucket on the basis of the remainder of the Key number divided by 1000. After the retransmission of bucket BSTOR1 is complete, buckets BSTOR2 to BSTOR10 are retransmitted in order in the same way. After all buckets have been retransmitted, each calculation node again reads the contents of bucket BSTOR1 and retransmits them a second time, now to the original destination calculation nodes (the calculation nodes that perform the Reduce processing), and each calculation node performs LocalSort and Reduce processing. After the processing of bucket BSTOR1 is complete, the second retransmission and the LocalSort and Reduce processing on each calculation node are carried out in order for buckets BSTOR2 to BSTOR10 in the same way.
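The two bucket passes can be sketched in the manner of a least-significant-digit radix sort. The text says only that the buckets are chosen "on the basis of" the remainders modulo 100 and modulo 1000; the concrete digit rules below (tens digit in the first pass, hundreds digit in the second) are an assumption made for this illustration, as are the names `scatter` and `two_pass`.

```python
NUM_BUCKETS = 10


def scatter(pairs, digit_of):
    """One bucket pass: append each pair to the bucket chosen by digit_of."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in pairs:
        buckets[digit_of(key)].append((key, value))
    return buckets


def two_pass(pairs):
    """Reuse the same 10 buckets twice, in the manner of an LSD radix sort."""
    # Pass 1 (assumed rule): tens digit of the remainder modulo 100.
    first = scatter(pairs, lambda k: (k % 100) // 10)
    # Pass 2 (assumed rule): hundreds digit of the remainder modulo 1000.
    # Buckets are read in order BSTOR1..BSTOR10, so pass-1 order is kept.
    in_order = [pair for bucket in first for pair in bucket]
    return scatter(in_order, lambda k: (k % 1000) // 100)
```

Because the second pass preserves the order produced by the first, each of the 10 final buckets is internally ordered into 10 runs, which is the sense in which two passes behave like 100 buckets.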

As described above, the processing method shown in FIG. 8 can reduce the time each calculation node spends on classification (sorting) to the same level as when 100 buckets are provided for every calculation node. The number of bucket transmissions and retransmissions can also be made greater than two. Increasing the number of retransmissions reduces the time each calculation node spends on classification (sorting) by Key but increases the amount of communication between calculation nodes, so the optimum number of bucket retransmissions depends on the application and the input data. The application designer or user may therefore be allowed to set the number of bucket retransmissions.

FIG. 9 is a diagram illustrating a third example of the parallel processing operation in the information processing system 101. In the first embodiment of the present invention shown in FIG. 7, bucket retransmission starts only after the bucket transmission phase is complete, that is, after all the <Key, Value> pairs output in the Map phase have been stored in the buckets BSTOR1 to BSTOR10. The original destination calculation nodes (the calculation nodes that perform the Reduce processing) therefore cannot receive any <Key, Value> pairs until the bucket retransmission phase starts. The inventors considered that, if the original destination calculation nodes could receive <Key, Value> pairs from an early stage, they could classify (sort) the pairs by Key little by little, in parallel with receiving them (online), and the overall processing time might be shortened.

In the processing method shown in FIG. 9, the buckets BSTOR1 to BSTOR10 are each assigned 90 or 91 Key numbers: those whose remainder after division by 1000 is 91 to 181 for BSTOR1, 182 to 272 for BSTOR2, and so on. The figure of 90 or 91 is the number of Keys for which one calculation node performs the Reduce processing (1000) divided by the number of buckets (10) plus 1. In addition, the time needed to complete all the Map processing is estimated in advance, and each time 1/10 of that time elapses, the contents of the buckets BSTOR1 to BSTOR10 are read out in turn and retransmitted to the original destination calculation nodes (the calculation nodes that perform the Reduce processing). In this way, the original destination calculation nodes can receive <Key, Value> pairs from an early stage and can classify (sort) them by Key little by little, in parallel with receiving them (online).
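The split of the 1000 remainders into 11 groups of 90 or 91 can be reproduced with integer ceiling division. This is a sketch: the function name `group_bounds` is hypothetical, and the use of ceiling boundaries is an assumption chosen because it reproduces the ranges quoted in the text (91 to 181, 182 to 272).

```python
NUM_BUCKETS = 10
KEYS_PER_NODE = 1000
NUM_GROUPS = NUM_BUCKETS + 1   # one extra group is sent directly


def group_bounds(group: int) -> range:
    """Remainders (Key mod 1000) in group 0..10; group 0 has no bucket."""
    start = -(-group * KEYS_PER_NODE // NUM_GROUPS)        # ceiling division
    stop = -(-(group + 1) * KEYS_PER_NODE // NUM_GROUPS)
    return range(start, stop)
```

Group 0 covers the remainders that bypass the buckets, group 1 corresponds to BSTOR1, and so on up to group 10 for BSTOR10.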

Specifically, before the Map phase starts, the time Tmap required for all the Map processing is estimated by some means. Since the number of Map processes and the content of the processing are normally known in advance, the processing time of the Map processing as a whole can be estimated. Some error in this estimate does not interfere with the operation described below.

After the Map phase starts, each <Key, Value> pair output in the Map phase is sent to the bucket corresponding to its Key number if such a bucket exists, and to the original destination calculation node (the calculation node that performs the Reduce processing) if it does not. <Key, Value> pairs with Key numbers 0 to 90 are therefore not sent to the bucket group but are transmitted directly to their original destination calculation nodes.

Then, when the time Tmap/10 has elapsed since the start of the Map phase, the contents of bucket BSTOR1 are read out and retransmitted to the original destination calculation nodes (the calculation nodes that perform the Reduce processing). From this point on, in addition to the <Key, Value> pairs with Key numbers 0 to 90, for which no bucket exists, the pairs with Key numbers 91 to 181, which bucket BSTOR1 had been handling, are also transmitted directly to their original destination calculation nodes when the Map processing outputs them. When a further Tmap/10 has elapsed, that is, when (Tmap/10) × 2 has elapsed since the start of the Map phase, the contents of bucket BSTOR2 are read out and retransmitted to the original destination calculation nodes. From this point on, the pairs that bucket BSTOR2 had been handling are likewise transmitted directly to their original destination calculation nodes when the Map processing outputs them. In the same way, each time a further Tmap/10 elapses, the next bucket is retransmitted to the original destination calculation nodes. Thereafter, a <Key, Value> pair output by the Map processing is sent to the bucket corresponding to its Key if that bucket exists and has not yet been retransmitted, and directly to the original destination calculation node otherwise.
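The time-dependent routing rule can be sketched as a pure function of the Key number and the elapsed time. The names `bucket_for_key`, `buckets_flushed`, and `route` are hypothetical, the return strings are illustrative labels, and the 11-way split of the remainders follows the ranges given for FIG. 9 (bucket 0 here stands for "no bucket").

```python
NUM_BUCKETS = 10


def bucket_for_key(key: int) -> int:
    """Bucket 1..10 for a Key number, or 0 when no bucket is assigned."""
    return (key % 1000) * (NUM_BUCKETS + 1) // 1000


def buckets_flushed(elapsed: float, tmap: float) -> int:
    """How many buckets (BSTOR1 first) have been retransmitted so far."""
    return min(NUM_BUCKETS, int(elapsed // (tmap / NUM_BUCKETS)))


def route(key: int, elapsed: float, tmap: float) -> str:
    """Destination of a pair emitted `elapsed` after the Map phase began."""
    bucket = bucket_for_key(key)
    if bucket == 0 or bucket <= buckets_flushed(elapsed, tmap):
        return "direct"                 # straight to the Reduce node
    return f"BSTOR{bucket}"
```

With an estimated Tmap of 100 time units, a pair with Key number 100 goes to BSTOR1 if it is emitted at time 5 but is sent directly to its Reduce node if it is emitted at time 10 or later.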

For simplicity, the above description gives every bucket the same number of Keys and retransmits the buckets at times that divide the total Map processing time Tmap into equal intervals. In practice, however, the number of <Key, Value> pairs that a bucket holds when it is retransmitted then differs from bucket to bucket, which can lower the utilization of the storage devices and the efficiency of the network. This is because bucket BSTOR1 accumulates <Key, Value> pairs only for the time Tmap/10, whereas bucket BSTOR10 accumulates them for the full time Tmap. If the distribution of the Keys output in the Map phase is uniform, bucket BSTOR1 therefore holds, when it is retransmitted, only 1/10 as many <Key, Value> pairs as bucket BSTOR10 does. To make the number of pairs held at retransmission as equal as possible across buckets, more Keys can be assigned to bucket BSTOR1 and fewer to bucket BSTOR10. Alternatively, instead of simply dividing Tmap into equal parts, the interval between retransmitting one bucket and retransmitting the next can be made progressively shorter.
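The imbalance and one possible way to remove it can be made concrete. The sketch below assumes a uniform Key distribution and a constant Map output rate, and it ignores the extra group of Keys that bypasses the buckets; the inverse-proportional weighting is an assumption introduced here, since the text says only that BSTOR1 should get more Keys and BSTOR10 fewer. The function names are hypothetical.

```python
from fractions import Fraction

NUM_BUCKETS = 10


def relative_fill(bucket: int) -> Fraction:
    """Pairs held at flush time relative to BSTOR10, equal key counts."""
    return Fraction(bucket, NUM_BUCKETS)


def balancing_key_shares():
    """Key share per bucket that equalizes fill: proportional to 1/i."""
    weights = [Fraction(1, i) for i in range(1, NUM_BUCKETS + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

Bucket i collects pairs for i × Tmap/10, so a Key share proportional to 1/i makes share × collection time the same for every bucket under these assumptions.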

As described above, in the processing method according to the third embodiment of the present invention shown in FIG. 9, the buckets are retransmitted in order while the Map processing is still running. The original destination calculation nodes (the calculation nodes that perform the Reduce processing) can thus receive <Key, Value> pairs from an early stage and classify (sort) them by Key little by little, in parallel with receiving them (online), which shortens the overall processing time.

101: information processing system; CALC_NODE_x: calculation node; COM_SW: communication switch; MEM: temporary storage device; STOR: storage device; COM_DEV: communication device; BUS: bus; BSTOR1 to BSTOR10: storage devices (buckets); BUCKET_CONT: controller that manages bucket retransmission.

Claims

A parallel processing method in a parallel computer system having a plurality of calculation nodes, the method comprising:
dividing a processing target according to a first grouping, placing the divided parts on the respective calculation nodes, and processing them;
giving each processing result a destination for the second group to which that result belongs;
storing the processing results in storage devices that differ for each destination; and
transmitting the stored processing results to the respective calculation nodes according to the first grouping.

The parallel processing method according to claim 1,
wherein the storage devices in which the processing results are stored are arranged on the respective calculation nodes.

The parallel processing method according to claim 1,
wherein, when the stored processing results are transmitted to the respective calculation nodes according to the first grouping,
the stored processing results are transmitted with a time difference for each group of the second grouping.

The parallel processing method according to claim 1,
wherein the processing targets included in any one group of the first grouping are distributed so that at least one of them is included in each group of the second grouping.

The parallel processing method according to claim 1,
wherein the processing target is graph-structured data.

A parallel computer system having a plurality of calculation nodes, wherein the system:
divides a processing target according to a first grouping, places the divided parts on the respective calculation nodes, and processes them;
gives each processing result a destination for the second group to which that result belongs;
stores the processing results in storage devices that differ for each destination; and
transmits the stored processing results to the respective calculation nodes according to the first grouping.

The parallel computer system according to claim 6,
wherein the storage devices in which the processing results are stored are arranged on the respective calculation nodes.

The parallel computer system according to claim 6,
wherein, when the stored processing results are transmitted to the respective calculation nodes according to the first grouping,
the stored processing results are transmitted with a time difference for each group of the second grouping.

The parallel computer system according to claim 6,
wherein the processing targets included in any one group of the first grouping are distributed so that at least one of them is included in each group of the second grouping.

The parallel computer system according to claim 6,
wherein the processing target is graph-structured data.