JPH0516066B2

Info

Publication number: JPH0516066B2
Application number: JP61076492A
Authority: JP
Inventors: Mamoru Sugie; Mitsugi Yoneyama
Original assignee: Agency of Industrial Science and Technology
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 1986-04-04
Filing date: 1986-04-04
Publication date: 1993-03-03
Also published as: JPS62233873A

Description

[Detailed description of the invention]

[Industrial application field]

The present invention relates to a parallel computer system comprising a plurality of processor elements, and in particular to a parallel computer system suitable for knowledge processing.

[Conventional technology]

Architectures that operate 100 to 10,000 or more processor elements in parallel are regarded as promising for dramatically improving computer performance. In particular, computers intended for knowledge processing generally adopt such an architecture, both because a dramatic improvement over conventional performance is essential and because the programs to be executed are themselves parallel.

Regarding the configuration of parallel computers, as shown in University of Illinois at Urbana-Champaign, DCS Report No. 83-1123 (Cedar Doc. No. 5) (hereinafter referred to as the first prior art), a scheme is known in which the processor elements are divided into clusters, the processor elements within each cluster are interconnected by a network, and the clusters are in turn connected to one another by a network. For parallel computer systems on the scale of a few processors, as shown in "MVS/Extended Architecture Overview, GC28-1348-0, File No. S370-34" (hereinafter referred to as the second prior art), a scheme is also known in which the processor elements share a memory and are coupled through that shared memory.

[Problem to be solved by the invention]

The prior art described above cannot deliver high performance. The performance of a parallel computer is determined by the product of the performance of a single processor element and the number of processor elements operating in parallel. In the second prior art, all processor elements access the same memory, so memory access conflicts occur; at most a few processor elements can be coupled, and high parallelism cannot be obtained. In the first prior art, by contrast, the processor elements are coupled through a network, so they are highly independent and high parallelism can be obtained. However, when a parallel computer system of this structure runs a program that needs to generate and distribute tasks, the overhead associated with task distribution grows and the performance of each individual processor element falls.

That is, tasks must be distributed so that the load on the processor elements is as uniform as possible. If the first prior art is used as is, the amount of load must be measured for each processor element, and the processor element to which a task is distributed must be determined from those measurements. Measuring the load per processor element makes the measurement overhead large.

Furthermore, once the destination processor element has been determined, the processor element distributing the task assembles information such as the identifier of the parent task and the task's environment data into a packet and transfers it. The receiving processor element in another cluster must disassemble this packet, extract the task, and register it as one of its own tasks. If the first prior art is used as is, this packet assembly and disassembly must be performed for every task distribution. Moreover, the processor elements themselves must carry out this processing, so task distribution interferes with their proper work.

Thus, if the first prior art is used as is, the overhead of task distribution is large.

An object of the present invention is to provide a parallel computer system that obtains high parallelism while maintaining high performance of each individual processor element.

[Means to solve the problem]

To achieve the above object, in the present invention a plurality of processor elements are grouped, several at a time, into a plurality of clusters, and these clusters are interconnected by a network.

Each cluster has a cluster controller. The processor elements of a cluster are coupled by a shared memory that the cluster controller of that cluster can access, and the shared memory holds the single task queue of that cluster, in which tasks awaiting execution are registered.

Each processor element of each cluster executes tasks whose execution may generate new tasks. It takes the task to be executed from the task queue of its cluster and executes it, and when a new task is generated as a result, it registers the new task in the same task queue.

The cluster controller of each cluster takes from the task queue of its cluster a task to be distributed for load equalization, assembles a packet containing that task, and, to distribute it to one other cluster, sends the packet over the network to the cluster controller of that other cluster. It also disassembles packets containing distributed tasks that arrive over the network from the cluster controllers of other clusters, and registers each distributed task in the task queue of the cluster to which the cluster controller belongs.
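The division of work described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Task`, `Cluster`, `pe_step`, `send_one`, `receive`, and `spawn_two`, the dictionary packet layout, and the list used as a stand-in for the network are all hypothetical. It shows only that processor elements touch the cluster's single queue, while packet assembly and disassembly belong to the cluster controller.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    task_id: int
    parent_id: Optional[int]  # identifier of the parent task
    env: dict                 # task environment data (hypothetical layout)


@dataclass
class Cluster:
    """One cluster: shared memory holding the cluster's single task queue."""
    name: str
    queue: deque = field(default_factory=deque)

    # Processor element side: only the local queue is touched.
    def pe_step(self, run):
        """Take one task, run it, and register any tasks it generates."""
        if not self.queue:
            return None
        task = self.queue.popleft()
        for child in run(task):
            self.queue.append(child)
        return task

    # Cluster controller side: packets travel between clusters.
    def send_one(self, network, dest):
        """Take a task from the queue, assemble a packet, and send it."""
        task = self.queue.popleft()
        packet = {"dest": dest, "task_id": task.task_id,
                  "parent_id": task.parent_id, "env": task.env}
        network.append(packet)

    def receive(self, packet):
        """Disassemble a packet and register its task in the local queue."""
        self.queue.append(
            Task(packet["task_id"], packet["parent_id"], packet["env"]))


def spawn_two(task):
    # Hypothetical workload: a root task creates two children, others none.
    if task.parent_id is None:
        return [Task(task.task_id * 10 + i, task.task_id, {}) for i in (1, 2)]
    return []


a, b = Cluster("A"), Cluster("B")
network = []                      # stand-in for the inter-cluster network
a.queue.append(Task(1, None, {}))
a.pe_step(spawn_two)              # A's queue now holds tasks 11 and 12
a.send_one(network, "B")          # controller of A ships task 11
b.receive(network.pop(0))         # controller of B registers it locally
```

In this sketch `pe_step` never builds or parses a packet, which mirrors the point that distribution work is kept off the processor elements.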

[Operation]

In each cluster, each processor element takes the task it is to execute from the cluster's single task queue and executes it. If it generates a task during execution, it only needs to register the generated task in that task queue; it does not have to distribute the task to other processor elements in the cluster. Moreover, registering a task in a task queue in shared memory is far faster than distributing a task over a network. In addition, the number of processor elements in each cluster can be far smaller than the total number of processor elements in the system, so the memory access conflicts that are a problem in the second prior art seldom occur.

Furthermore, in the present invention the cluster controller takes the tasks to be distributed from a cluster to other clusters out of that cluster's single task queue, assembles them into packets, and distributes the packets; it also disassembles packets containing tasks distributed from other clusters and registers those tasks in the task queue of its cluster. Processing related to task distribution therefore does not interfere with task execution by the processor elements.

In each cluster, if each processor element takes a task from the single task queue in shared memory whenever the task it is executing ends or is suspended, the load is equalized automatically among the processor elements of that cluster. The processor elements within each cluster therefore incur no overhead for load measurement or distribution.

When the cluster controller distributes tasks from the task queue to other clusters, the destination need only be determined per cluster. The load therefore need only be measured per cluster, which takes far less overhead than measuring the load of every processor element.

In particular, because all tasks of a cluster are registered in the task queue in that cluster's shared memory, the load of the cluster can be found simply by looking at the number of registered tasks.
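The idea that a cluster's load can be read from the length of its task queue can be illustrated as follows. The patent leaves the method of choosing a destination unspecified, so the policy below (pick the least-loaded other cluster, and only if it holds at least two fewer tasks) is purely an assumption made for this sketch, as are the names `cluster_load` and `pick_destination`.

```python
from collections import deque


def cluster_load(queue):
    """Load of a cluster, taken as the number of tasks waiting in its queue."""
    return len(queue)


def pick_destination(queues, source):
    """Return the least-loaded cluster other than source, or None.

    Assumed policy (not from the patent): a transfer is proposed only when
    the destination holds at least two fewer tasks than the source.
    """
    others = {name: q for name, q in queues.items() if name != source}
    if not others:
        return None
    dest = min(others, key=lambda name: cluster_load(others[name]))
    if cluster_load(queues[source]) - cluster_load(others[dest]) >= 2:
        return dest
    return None


# Three clusters with 6, 1 and 3 waiting tasks respectively.
queues = {"c0": deque(range(6)), "c1": deque(range(1)), "c2": deque(range(3))}
dest = pick_destination(queues, "c0")
```

Only one number per cluster is inspected here, regardless of how many processor elements each cluster contains, which is the source of the reduced measurement overhead.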

[Example]

An embodiment of the present invention is described below with reference to FIG. 1.

The parallel computer consists of level 1 clusters 30 (#0 to #n), a main memory 10, and a level 1 network 20. The level 1 clusters 30 are connected by the level 1 network 20, and the level 1 cluster controllers 200 control load distribution among the level 1 clusters 30. Each level 1 cluster 30 consists of level 2 clusters 40 (#0 to #n), a level 2 network 100, and a level 1 cluster controller 200. Each level 2 cluster 40 consists of processor elements 70 (#0 to #l), a shared memory 50, and a level 2 cluster controller 300, and the level 2 cluster controllers 300 control load distribution among the level 2 clusters.

When the cluster controller 300 of a level 2 cluster 40 distributes a task from its cluster to another level 2 cluster, it takes the task to be distributed from the task queue in the shared memory 50 of the source level 2 cluster and assembles a packet containing that task.

When the task is sent to another level 2 cluster 40 belonging to the same level 1 cluster 30 as the source level 2 cluster 40, the packet is sent to the cluster controller 300 of the destination level 2 cluster 40.

Alternatively, when the task is distributed to a level 2 cluster belonging to a level 1 cluster different from the level 1 cluster 30 to which the source level 2 cluster 40 belongs, the cluster controller 300 of the source level 2 cluster 40 sends the packet to the cluster controller 200 of that different level 1 cluster 30.

When the cluster controller 200 of a level 1 cluster 30 receives a packet containing a task to be distributed from the cluster controller 300 of a level 2 cluster 40 belonging to its cluster, it sends the packet to the cluster controller 200 of another level 1 cluster 30.

Likewise, when the cluster controller 200 of a level 1 cluster 30 receives a packet containing a task to be distributed from the cluster controller 200 of a different level 1 cluster 30, it sends the packet to the cluster controller 300 of one of the level 2 clusters 40 belonging to its level 1 cluster 30.

When the cluster controller 300 of a level 2 cluster 40 receives a packet containing a distributed task, either from the cluster controller 300 of another level 2 cluster 40 belonging to the same level 1 cluster 30 or from the cluster controller 200 of the level 1 cluster 30 to which its level 2 cluster 40 belongs, it disassembles the packet and registers the distributed task contained in it in the task queue of the level 2 cluster 40 to which the cluster controller 300 belongs.

To distribute load among level 2 clusters, the cluster controller 300 in each level 2 cluster must compare the loads of the level 2 clusters belonging to the same level 1 cluster and decide to which of them a task from its own cluster should be distributed. The method of making that decision is not related to the gist of the present invention, so a specific description is omitted.

Similarly, to distribute load among level 1 clusters, the cluster controller 200 in each level 1 cluster must compare the loads of the different level 1 clusters and decide whether to distribute a task from its own cluster to another level 1 cluster. The method of making that decision is likewise not related to the gist of the present invention, so a specific description is omitted.

The level 1 cluster controller 200 first takes in a task placed in the main memory 10 and, through a level 2 cluster controller 300, links it to the single task queue in the shared memory 50 of some level 2 cluster, for example #0. Taking in a task means transferring the identifier of the parent task, the task's environment data, a pointer to the program to be executed, and so on.

A processor element 70 takes a task from the single task queue in the shared memory 50 and executes it; as a result it generates child tasks and links them to the task queue in the shared memory 50. A task is taken from the task queue in the shared memory 50 when execution of a task ends or is suspended.

At an appropriate time for distributing tasks of its cluster, the level 2 cluster controller 300 takes a task from the task queue in the shared memory 50, for example the one with the oldest registration time. It then sends the task to be distributed to another level 2 cluster in the same level 1 cluster or to another level 1 cluster, together with the identifier of its parent task, its environment data, and so on, through the network 100, or through the network 100 and the network 20.

When the task is sent to another level 2 cluster belonging to the same level 1 cluster, the level 2 cluster controller 300 in that level 2 cluster registers the task in the task queue in the shared memory of that cluster. This completes distribution of the task to that level 2 cluster.

When the task is sent to another level 1 cluster, the packet is first sent to the level 1 cluster controller 200 of the same level 1 cluster, and that level 1 cluster controller 200 sends the packet to the level 1 cluster controller 200 of the destination level 1 cluster 30. The destination level 1 cluster controller 200 sends the received packet to the level 2 cluster controller 300 of some suitable level 2 cluster 40. Each level 2 cluster controller 300 disassembles the packets sent from other level 2 cluster controllers 300 or from the level 1 cluster controller 200, extracts the distributed task, and links it to the task queue in the shared memory 50 of its level 2 cluster.

As is clear from the above, in this embodiment each processor element in a level 2 cluster takes the task it is to execute from the single task queue of that cluster and executes it, and if it generates a task during execution it only needs to register the generated task in that task queue; it does not have to distribute the task to other processor elements in the cluster. Registering a task in a task queue in shared memory is far faster than distributing a task over a network.

For example, let the number of level 1 clusters be 1, the number of level 2 clusters be 10, the number of processor elements per cluster be 10, the task execution time be T, the time required to register a child task in shared memory be 0.1T, the time required to transfer a task over the network be 10T, and the probability that a task is distributed to another cluster be 0.1. The overall performance P_S is then given by

$$P_S = \frac{1}{T + 0.1T + 0.1 \times 10T} \times 100 = \frac{100}{2.1} \times \frac{1}{T} \approx 48 \times \frac{1}{T} \tag{1}$$

In contrast, if the processor elements within the level 2 clusters #0 to #m were connected by a network, the performance P_o would be

$$P_o = \frac{1}{T + 10T} \times 100 \approx 9.1 \times \frac{1}{T} \tag{2}$$

Therefore, compared with the case where the processor elements within a cluster are also connected by a network, a performance improvement by a factor of about 5.3 (P_S/P_o) is obtained.

Also, the number of processor elements in a level 2 cluster can be far smaller than the total number of processor elements in the system, so the memory access conflicts that are a problem in the second prior art seldom occur.

Furthermore, in this embodiment, tasks to be distributed from a level 2 cluster to other level 2 clusters are taken from the single task queue of that level 2 cluster, and tasks distributed from other level 2 clusters are likewise registered in that task queue, so task management in a level 2 cluster is greatly simplified.

Moreover, in this embodiment the level 1 cluster controllers 200 and the level 2 cluster controllers 300 perform task distribution, so the processor elements need not perform any processing for task distribution and task execution itself proceeds at high speed.

In a level 2 cluster, if each processor element takes a task from the single task queue in shared memory whenever the task it is executing ends or is suspended, the load is equalized automatically among those processor elements.

As can be seen from the above, in the present invention the processor elements of each level 2 cluster take the tasks they are to execute from the cluster's single task queue in the shared memory, and newly generated tasks are registered in that task queue. No special processing for task distribution is therefore needed among the processor elements of the same level 2 cluster, and the load balance among them is secured automatically.

In a level 2 cluster, tasks to be distributed are taken from the cluster's single task queue in the shared memory and transferred to other level 2 clusters, and tasks transferred from other clusters are registered in that task queue. This processing is performed by the cluster controller of the level 2 cluster. The load of the processor elements of the cluster can therefore be determined collectively from the tasks registered in this task queue, which makes load detection easy; and because tasks to be distributed can be taken from this queue, selecting them is easy as well.

Assembling packets containing tasks to be distributed, transferring those packets to other clusters, and disassembling packets containing tasks distributed from elsewhere are all performed by the cluster controller of the level 2 cluster. The processor elements therefore need not perform any processing for task distribution.

In this embodiment, even when tasks in a task queue are distributed to other clusters, the destination need only be determined per cluster. The load therefore need only be measured per cluster, which takes far less overhead than measuring the load of every processor element.

In particular, in this embodiment all tasks of a level 2 cluster are registered in the task queue in that cluster's shared memory, so the load of the cluster can be found simply by looking at the number of registered tasks.
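The arithmetic behind equations (1) and (2) can be checked with a short script. This is an illustrative sketch only: the function name `throughput` and its parameter names are hypothetical, and the cost model (execution time, plus child-task registration time, plus network transfer time weighted by how often a transfer occurs) is read off the two formulas in the embodiment rather than taken from any other source. Equation (2) contains no shared-memory registration term, so that cost is set to zero for the network-only case.

```python
def throughput(n_pe, t_exec, t_register, t_transfer, p_remote):
    """Tasks completed per unit time by n_pe processor elements.

    Each task costs its execution time, the time to register a child task
    in shared memory, and the network transfer time weighted by the
    probability that a transfer is needed.
    """
    per_task = t_exec + t_register + p_remote * t_transfer
    return n_pe / per_task


T = 1.0                  # task execution time (arbitrary unit)
n_pe = 1 * 10 * 10       # 1 level 1 cluster x 10 level 2 clusters x 10 PEs

# Equation (1): shared task queue inside each cluster, 10% of tasks shipped.
p_s = throughput(n_pe, T, 0.1 * T, 10 * T, 0.1)

# Equation (2): every task handed over through the network.
p_o = throughput(n_pe, T, 0.0, 10 * T, 1.0)

print(round(p_s, 1), round(p_o, 1), round(p_s / p_o, 1))
# prints: 47.6 9.1 5.2
```

With unrounded values the ratio comes out at about 5.2; the figure of 5.3 quoted in the text corresponds to dividing the rounded values 48 and 9.1.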

[Effect of the invention]

According to the present invention, the overhead of parallel operation across different clusters can be reduced, so high parallelism can be achieved without degrading the performance of individual processor elements. This improves the overall performance of the parallel computer system.

[Brief explanation of the drawing]

FIG. 1 is a diagram showing the configuration of an embodiment of the present invention.

Claims

[Scope of Claims]

1. A parallel computer system in which a plurality of processor elements are grouped, several at a time, into a plurality of clusters, and these clusters are interconnected by a network, characterized in that:

each cluster has a cluster controller;

the plurality of processor elements of a cluster are coupled by a shared memory accessible by the cluster controller of that cluster, the shared memory having the single task queue of that cluster, in which tasks awaiting execution are registered;

each processor element of each cluster executes tasks whose execution may generate new tasks, takes the task to be executed from the task queue of its cluster and executes it, and, when a new task is generated as a result of executing that task, registers the new task in the same task queue; and

the cluster controller of each cluster takes from the task queue of its cluster a task to be distributed for load equalization, assembles a packet containing that task, and, to distribute it to one other cluster, sends the packet over the network to the cluster controller of that other cluster, and further disassembles packets containing distributed tasks sent over the network from the cluster controllers of other clusters and registers each distributed task in the task queue of the cluster to which the cluster controller belongs.