JP2003162515A - Cluster system

Info

Publication number: JP2003162515A
Application number: JP2001358105A
Authority: JP
Inventors: Kazuhiro Suzuki; 和宏鈴木
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2001-11-22
Filing date: 2001-11-22
Publication date: 2003-06-06

Abstract

PROBLEM TO BE SOLVED: To provide a cluster system that saves power across the whole cluster by operating a plurality of nodes as a single cluster and saving power on nodes that are idle.

SOLUTION: The SCore software 1 calls the apm (advanced power management) command of an OS 2. The apm command instructs a shift to the suspended state by issuing a BIOS (basic input/output system) call to a BIOS 3. The BIOS 3 stops the CPU and the hard disk drive while preserving the execution state in the memory of hardware 4, and shifts the node to the suspended state. To resume and start the node, a network interface 5 that has received a magic packet sends a WOL (Wake on LAN) message to the hardware 4, the hardware 4 issues a resume request command to the BIOS 3, and the BIOS 3 restores the CPU and the hard disk to the execution state.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a cluster system having a plurality of nodes, and more particularly to a cluster system that saves power across the entire cluster when a plurality of nodes are operated as a single cluster.

[0002]

[Description of the Related Art] A cluster is a multicomputer in which a plurality of nodes, each consisting of a processor and memory, are connected by a network, and the nodes are operated together to execute processing toward a common task. This makes it possible to build a cluster system whose processing capability and reliability exceed the limits of a single personal computer (PC).

[0003] Cluster systems can be broadly classified by function into high-availability (HA) clusters and high-performance computing (HPC) clusters. Types that combine several of these functions are also widely used. The software used in such a cluster to operate a plurality of nodes as a single cluster is the cluster system software. An outline of typical cluster system software follows.

(HA clusters) HA clusters include a failover type and a load-balancing type.

- Failover type: Two or more nodes are operated, and if one becomes inoperable for some reason, another node kept on standby as a backup takes over its processing, thereby improving availability. Various products are available as system software for failover clusters.

- Load-balancing type: A cluster system that achieves scalability by multiplexing servers such as WWW and FTP servers. Load is distributed by allocating the IP-level sessions directed at a single load balancer to the multiple service nodes behind it. There are several allocation methods; common configurations are the round-robin type, which allocates processing in turn, and the dynamic load balancer, which monitors the load on the service nodes and the network traffic and allocates processing to a lightly loaded service node. Various products are also available as system software for load-balancing clusters.

(HPC clusters) In an HPC cluster, a plurality of nodes cooperate so that parallel-processing applications can be executed at high speed. If the data transfer bandwidth of the interconnect between nodes is narrow, it becomes a bottleneck and lowers overall processing capability, so the nodes are sometimes connected by a high-speed interface such as Gigabit Ethernet or Myrinet. Libraries such as MPI and PVM are available for writing parallel-processing applications, and they are used together with numerical libraries in academic research. System software for HPC clusters with these characteristics includes "SCore" from the Real World Computing Partnership (a technology research association).

[0004]

[Problems to be Solved by the Invention] FIG. 1 shows the schematic configuration of an HPC cluster. The illustrated cluster consists of a plurality of nodes N1 to N6. FIG. 1(a) shows the state in which the cluster has no application to process: only node N1 is operating, and the other nodes N2 to N6 are idle.

[0005] When an application to be processed by the cluster arrives, the cluster system software submits jobs to nodes N2 to N6, as shown in FIG. 1(b). A single application can be distributed across all the nodes and processed in parallel; alternatively, when two or more different applications are to be processed, the nodes can be divided into two or more groups, with each application allocated to its own group and processed in parallel.

[0006] When application processing finishes on each node, the node shifts from the parallel-processing state of FIG. 1(b) to the idle state of FIG. 1(a).

[0007] On an HPC cluster such as one running the SCore software, an application is likely, in many cases, to be run on the maximum number of nodes managed by the cluster in order to obtain maximum performance. Depending on the application, however, the number of nodes that yields the highest performance may be smaller than the total number of nodes. Also, when the nodes in the cluster are divided into several sub-clusters and multiple applications are launched across those sub-clusters, differences in when the applications finish can leave some nodes idle.

[0008] Even while idle, a node still consumes power, so a large number of idle nodes increases wasted power consumption, and the problem grows as the number of nodes in the cluster increases.

[0009] An object of the present invention is therefore to provide a cluster system that, when operating a plurality of nodes as a single cluster, saves power across the entire cluster by saving power on idle nodes.

[0010]

[Means for Solving the Problems] To solve the above problem, the present invention provides a cluster system that manages the operation of nodes so that an application is processed in a distributed manner by a plurality of nodes, in which a node is placed in a stopped state when it is idle and is started when it is to execute the processing.

[0011] Each node is provided with a function of detecting its own idle state and placing itself in the stopped state.

[0012] The system further includes storage means for storing the operating states of the plurality of nodes, and when a node is to be started, no start signal is issued to that node if it is recorded as already running.

[0013] In addition, power-saving states can be set in multiple levels, and a function is provided that gradually raises the power-saving level as a node's idle time grows longer.

[0014] A process migration function is also provided that gathers jobs running on nodes connected to different peripheral devices onto nodes connected to the same peripheral device.

[0015]

[Embodiments of the Invention] Embodiments of the cluster system according to the present invention are described below, one by one, with reference to the drawings.

[First Embodiment] The operating principle of the cluster is described first. The operation of the plurality of nodes in a cluster was shown in FIG. 1; in a conventional cluster system, each node sits idle when it has no job to process. In the cluster system of this embodiment, by contrast, a node with no job to process is shifted to the suspend state. When an application arrives and jobs are assigned to the nodes that are needed, those nodes are resumed and process the jobs. When a job finishes, the node is shifted back to the suspend state.

[0016] Next, the operating principle within a node is described. FIG. 2 outlines the operation inside one representative node among the plurality of nodes managed by the cluster system. The cluster system shown in the figure uses the SCore software as a concrete example of cluster system software 1.

[0017] In FIG. 2, reference numeral 2 denotes the OS, the basic software that manages the PC system and provides the user operating environment; 3 denotes the BIOS, the basic input/output system software that controls the peripheral devices connected to the PC; 4 denotes the PC hardware; and 5 denotes the PC's network interface. The OS 2, BIOS 3, hardware 4, and network interface 5 all belong to a single node. Linux, for example, may be used as the OS 2. FIG. 2 shows the case in which a LAN card is used as the network interface 5.

[0018] To shift a node managed by SCore 1 to the suspend state, SCore 1 first invokes the apm command of Linux, the OS 2. The apm command instructs the shift to the suspend state through a BIOS call to the BIOS 3. The BIOS 3 stops the CPU and the hard disk while preserving the execution state in the memory of the hardware 4, and shifts the node to the suspend state.

[0019] Conversely, when application processing is assigned and the node must be started, the node is resumed as follows. The LAN card of the network interface 5, on receiving a magic packet, sends a WOL message to the hardware 4. On receiving the WOL message, the hardware 4 issues a resume request command to the BIOS 3, and after the BIOS 3 restores the execution state, control returns to Linux.

[0020] The suspend function used in the cluster system of this embodiment is now described. PCs are generally equipped with a power management function called Advanced Power Management (APM). APM is a PC power management specification standardized jointly by Microsoft and Intel. APM allows the OS to power the machine off or suspend it. It was originally a function for reducing drain on the built-in battery of notebook PCs and the like, but it is now also supported on recent desktop PCs and servers.

[0021] Power management by APM allows a shift to a standby state or a suspend state. The standby state lowers power consumption by stopping the hard disk and the display. The suspend state additionally stops the CPU, keeping the execution state in memory and supplying power to the memory only. The suspend state therefore consumes less power than the standby state.

[0022] Some PCs also provide a hibernation state as a power management function. In this state, the contents of memory, including the execution state, are written to the hard disk so that the power can be turned off completely. Hibernation is a function found mainly on notebook PCs and may not be supported on the machine in use, but as a function related to the hibernation state, a shift to the suspend state is also possible.

[0023] The cluster system of this embodiment uses the suspend function provided in each of the nodes that make up the cluster. When no application processing is assigned to a node, the cluster system software places that node in the suspend state, reducing the power consumption of the system as a whole.

[0024] When the cluster system software allocates an application to the nodes that make up the cluster, the number of nodes needed for the application must be put into the operating state. Because the target nodes have been shifted to the suspend state, they must be returned from the suspend state to the operating state. The resume operation that returns a node to the operating state in the cluster system of this embodiment is described below.

[0025] Resume is a term commonly used for PCs and refers to returning from the standby, suspend, or hibernation state, which starts the node operating again. The events that can trigger a resume differ depending on the peripheral devices built into the machine and on the BIOS. The main events are as follows.

a) The power (suspend) switch is pressed.
b) A timer set to a predefined time triggers suspend or resume.
c) On a machine with a built-in modem card, the modem receives an incoming call.
d) A special packet called a magic packet arrives at a LAN card inserted in the PCI bus (Wake on LAN: WOL).

[0026] Events a) to d) above are the usual triggers for resuming a PC. Of these, the cluster system of this embodiment adopts either c), an incoming modem call, or d), WOL triggered by the arrival of a magic packet.

[0027] The case in which the node has a LAN card, as shown in FIG. 2, and the event is WOL is described here. The magic packet is a special packet for WOL developed by AMD; it contains six bytes of 0xFF followed by the LAN card's MAC address repeated 16 times. Broadcasting this packet on the network can power on or resume the machine.
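The packet layout described above can be sketched in a few lines of Python. This is only an illustration, not code from the patent: the function names are hypothetical, and sending over UDP to port 9 on the limited broadcast address is a common convention assumed here rather than something the text specifies.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return a magic packet: six 0xFF bytes, then the MAC address 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 48-bit MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str,
                      broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the packet over UDP (address and port are assumed defaults)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

The resulting payload is always 102 bytes long (6 + 16 × 6), which is a quick sanity check when debugging a wake-up that does not arrive.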

[0028] WOL is the easiest way to resume a remote machine from user-level software. Using WOL does, however, require a WOL-capable LAN card. When such a LAN card receives a magic packet, it sends a power-on signal to the motherboard, which powers on or resumes the machine.

[0029] Next, the timing for shifting a node to the suspend state in the in-node processing of FIG. 2 is described. In the cluster system, SCore runs on every node in the cluster, and the nodes cooperate to process user jobs in parallel. In the idle state, when there is no user job, SCore executes code such as that shown in FIG. 3.

[0030] In the code that SCore executes in its idle loop, the select() system call underlined in the figure represents the state of waiting for the next job. select() returns 0 if no monitored file descriptor changes within the time given by its "timeout" argument. The select() system call in FIG. 3 watches for the arrival of messages to process, so a return value of 0 means that this node has no job to execute. The number of times select() has returned 0 since it began doing so is therefore counted, and when the count exceeds the configured maximum idle count "IDLE_COUNT_MAX", the node is shifted to the suspend state by APM. The server node that manages job allocation must be kept from shifting to the suspend state. When a node shifts to the suspend state, the counter is reset in preparation for the next suspend.
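FIG. 3 is not reproduced in this text, so the following Python sketch only illustrates the counting rule described above; it is not SCore's code. The class `IdleMonitor`, the callables `wait_for_message` and `suspend`, and the threshold value are all hypothetical. `wait_for_message` stands in for select() and returns 0 on timeout; resetting the count when a message arrives is an assumption drawn from the phrase "since it began doing so".

```python
from typing import Callable

IDLE_COUNT_MAX = 3  # illustrative threshold, not a value from the patent


class IdleMonitor:
    """Counts select()-style timeouts and requests a suspend past a threshold."""

    def __init__(self,
                 wait_for_message: Callable[[], int],
                 suspend: Callable[[], None],
                 is_server_node: bool = False,
                 idle_count_max: int = IDLE_COUNT_MAX) -> None:
        self.wait_for_message = wait_for_message  # returns 0 when it times out
        self.suspend = suspend                    # e.g. a wrapper around APM
        self.is_server_node = is_server_node      # job-allocating node never sleeps
        self.idle_count_max = idle_count_max
        self.idle_count = 0

    def step(self) -> bool:
        """Run one pass of the idle loop; return True if suspend was requested."""
        if self.wait_for_message() != 0:
            self.idle_count = 0                   # assumption: work resets the count
            return False
        self.idle_count += 1
        if self.idle_count > self.idle_count_max and not self.is_server_node:
            self.idle_count = 0                   # reset for the next suspend
            self.suspend()
            return True
        return False
```

With a threshold of 3, the fourth consecutive timeout is the first one that exceeds the maximum, so that is when the suspend request is issued.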

[0031] The resume timing in the cluster system of this embodiment is described next. In MPC++, a multithreaded template library, templates such as those in FIG. 4 allow a function FUNC to be called on a node NODE. SCore itself is written in MPC++ and can make remote function calls with these templates.

[0032] Among these templates, invoke() is a synchronous call that waits for the function FUNC to finish, and ainvoke() is an asynchronous call that proceeds without waiting for it to finish. If NODE differs from the calling node, the call is a remote call. For a remote call, the remote node is resumed before the function is called.

[0033] A node is resumed by sending a magic packet built from the target node's MAC address. SCoreboard, the database server that manages information on the cluster's nodes and network, holds a table of node numbers and MAC addresses as Ethernet card information. When generating a magic packet, the MAC address of the target node can be obtained by querying SCoreboard.

[0034] Resuming on every remote call would send magic packets even to nodes that are already running, which is wasteful. This waste can be avoided by recording the state of each node and checking that record before resuming.

[0035] Because the node that suspends and the node that triggers the resume are different, the node states must be stored in a memory space that every node can reference. This can be done with the "GlobalPtr" class template, the global pointer of MPC++. The GlobalPtr class template takes an arbitrary type as a parameter and generates a global pointer to an object of that type. A global pointer can be accessed from every node.
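The check-before-resume idea can be sketched as follows. This is a single-process illustration under stated assumptions, not the MPC++ implementation: an ordinary dictionary stands in for the globally shared memory, another dictionary stands in for the node-number-to-MAC table held by SCoreboard, and the names `NodeState`, `NodeStateTable`, and `send_wake` are hypothetical. Marking a node as running immediately after the wake-up is sent is also a simplification.

```python
from enum import Enum
from typing import Callable, Dict


class NodeState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"


class NodeStateTable:
    """In-process stand-in for the cluster-wide record of node states."""

    def __init__(self,
                 mac_by_node: Dict[int, str],
                 send_wake: Callable[[str], None]) -> None:
        self.mac_by_node = dict(mac_by_node)  # node number -> MAC address
        self.state = {node: NodeState.RUNNING for node in self.mac_by_node}
        self.send_wake = send_wake            # e.g. a magic-packet sender

    def mark_suspended(self, node: int) -> None:
        """Record that a node has gone to sleep."""
        self.state[node] = NodeState.SUSPENDED

    def resume_if_needed(self, node: int) -> bool:
        """Wake the node only if it is recorded as suspended; True if woken."""
        if self.state[node] is NodeState.RUNNING:
            return False                      # no wasted magic packet
        self.send_wake(self.mac_by_node[node])
        self.state[node] = NodeState.RUNNING
        return True
```

A remote call would invoke `resume_if_needed(node)` first and only then dispatch the function, so nodes that are already running never receive a wake-up packet.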

[0036] Next, the operation when a job finishes is described. When a job finishes, SCore calls the sync_all() function to flush the hard disks of all nodes. This function invokes the sync() system call on every node through the ainvoke() template. Called as is, it would also resume suspended nodes to which no job had been assigned, just to invoke sync() on them. Because that resume is unnecessary, sync_all() was changed to invoke the sync() system call only on nodes that are running.

[0037] Depending on the timing with which different nodes in the cluster system change their operating states, a race condition can arise between them. Such a race condition can lead to deadlock; the means for avoiding it is described below.

[0038] The idle loop shown in FIG. 3 runs as a low-priority thread, so it may execute during periods such as while another node is being resumed or while waiting for synchronization. A race condition such as that in FIG. 5 can then cause a deadlock.

[0039] As shown in FIG. 5, suppose that node 1, when about to resume node 2, confirms that node 2 is not in the suspend state and then makes the remote function call. Node 2 may suspend in the interval between the state check and the remote call.

[0040] To prevent this, when a node is resumed, its count of idle-loop iterations is set to -1. When the idle loop finds the counter set to -1, it does not issue a suspend request. Setting the counter to -1 amounts to setting in the counter a state that prohibits stopping, which holds until the node shifts the operation of a node different from itself to the stopped state. When the job finishes, the counter is reset, returning the node to a state in which it can suspend. This prevents a node from suspending, and thereby causing a deadlock, between the time a job is assigned to it and the time the job finishes.
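The sentinel mechanism described above can be sketched as a small counter class. This is an illustration only: the names `IdleCounter`, `SUSPEND_INHIBITED`, and the three event methods are hypothetical, and the sketch ignores the thread synchronization a real implementation would need.

```python
SUSPEND_INHIBITED = -1  # sentinel: the node must not suspend while this is set


class IdleCounter:
    """Idle-loop counter whose value -1 blocks suspend requests."""

    def __init__(self, idle_count_max: int) -> None:
        self.idle_count_max = idle_count_max
        self.value = 0

    def on_resumed(self) -> None:
        """Called when the node is resumed: forbid suspending until the job ends."""
        self.value = SUSPEND_INHIBITED

    def on_job_finished(self) -> None:
        """Called when the job ends: the node may count toward a suspend again."""
        self.value = 0

    def on_idle_timeout(self) -> bool:
        """Called on each idle-loop timeout; True means a suspend is requested."""
        if self.value == SUSPEND_INHIBITED:
            return False                 # inhibited: never request a suspend
        self.value += 1
        if self.value > self.idle_count_max:
            self.value = 0               # reset for the next suspend
            return True
        return False
```

Because the sentinel is set at resume time and cleared only when the job finishes, idle timeouts that occur while the node waits on other nodes cannot put it back to sleep in the middle of a job.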

[0041] In the description above, the nodes of the cluster save power by shifting to the suspend state, but the power-saving state can also be varied according to the peripheral devices connected. For example, graduated power-saving states can be defined, such as stopping only the CRT, or stopping the CRT and the hard disk. Because stopping every peripheral device makes start-up take longer, this allows finer power-saving control that weighs idle time against start-up time, for example by gradually raising the power-saving level as the idle time grows longer.

[Second Embodiment] Process migration is a mechanism that writes the state of a running process to the hard disk and then reads that saved state on another node so that the process continues executing there.

[0042] As the number of nodes in a cluster grows, so does the number of peripheral devices such as power strips and interconnect hubs. Nodes that are physically far apart end up connected to different power strips and interconnect hubs. Process migration can move jobs running on nodes attached to different peripheral devices onto nearby nodes attached to the same peripheral device. This increases the number of unused peripheral devices, and powering them off lowers the power consumed by peripherals.
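One way to plan such a consolidation is sketched below. The patent does not prescribe a selection rule, so the heuristic here (gather jobs onto the hub that already runs the most, filling its free nodes in order) is an assumption, as are the function name `plan_consolidation` and its data shapes.

```python
from typing import Dict, List, Set, Tuple


def plan_consolidation(nodes_by_hub: Dict[str, List[str]],
                       running: Set[str]) -> Tuple[str, Dict[str, str], List[str]]:
    """Plan job moves onto one hub.

    Returns (target hub, {source node: destination node}, hubs left with no jobs).
    """
    # Assumed heuristic: the hub already running the most jobs is the target.
    target = max(nodes_by_hub,
                 key=lambda hub: sum(n in running for n in nodes_by_hub[hub]))
    free = [n for n in nodes_by_hub[target] if n not in running]

    moves: Dict[str, str] = {}
    for hub, nodes in nodes_by_hub.items():
        if hub == target:
            continue
        for node in nodes:
            if node in running and free:
                moves[node] = free.pop(0)   # migrate this job to a free target node

    still_running = (running - set(moves)) | set(moves.values())
    idle_hubs = [hub for hub, nodes in nodes_by_hub.items()
                 if not any(n in still_running for n in nodes)]
    return target, moves, idle_hubs
```

The hubs returned in the third element carry no running jobs after the moves, so their nodes, and possibly the hubs themselves, are candidates for being powered down. Destination nodes would need to be resumed before the jobs are migrated to them.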

[0043] FIG. 6 shows an example in which process migration is applied to the cluster system of this embodiment to save power. In FIG. 6, three groups of nodes are connected through hubs to form the cluster system. In FIG. 6(a), nodes N11 to N16 are connected to hub H1, nodes N21 to N26 to hub H2, and nodes N31 to N36 to hub H3. Nodes N11 and N12 on hub H1, nodes N23 and N24 on hub H2, and node N31 on hub H3 are operating; for convenience, operating nodes are drawn with bold lines in the figure. The other nodes are idle.

[0044] In the situation of FIG. 6(a), power is supplied to every node connected to hubs H1 to H3. Power is therefore consumed at every hub even though few of its nodes are running application processing, and no power is saved.

[0045] Therefore, as shown in FIG. 6B, the application processing executed on, for example, nodes N23 and N24 of the hub H2 and node N31 of the hub H3 is moved to nodes N13 to N15 of the hub H1. Nodes N13 to N15 of the hub H1 are resumed from the stopped state and started before the processing is moved to them. Application processing then no longer needs to be allocated to any node belonging to the hub H2 or the hub H3, so these nodes become idle and can be shifted to the stopped state. In this way, application processing that was distributed across hubs is concentrated on a specific hub. Power consumption at the hub H1 increases, but power is saved when the system is viewed as a whole.

[Third Embodiment] In addition to APM, ACPI (Advanced Configuration and Power Interface), a power control interface, has been proposed as a power management function for PCs, and many machines that support ACPI have been commercialized. ACPI specifies that the power supply of a node is managed from the OS provided in the node, and it has been adopted by many OSs such as MS-Windows (registered trademark) and Linux. The examples of the embodiments above described power management by APM. On machines that implement and support ACPI in place of APM, the OS of the node can perform power management flexibly.
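The consolidation of FIG. 6B can be sketched in a few lines. The following is a minimal illustration only, not the procedure of the embodiment: the function name `consolidate`, the dictionary layout, and the rule of choosing the hub that already runs the most jobs as the target are all assumptions made for this example.

```python
def consolidate(hubs, busy):
    """Plan moves that gather all running jobs onto a single hub.

    hubs: dict mapping hub name -> list of node names on that hub
    busy: set of node names currently running application processing
    Returns (target_hub, moves, to_resume, to_stop).
    """
    # Assumption: the hub already running the most jobs becomes the target
    # (ties go to the hub listed first).
    target = max(hubs, key=lambda h: sum(n in busy for n in hubs[h]))
    free = [n for n in hubs[target] if n not in busy]

    moves = {}  # source node -> destination node on the target hub
    for hub, nodes in hubs.items():
        if hub == target:
            continue
        for node in nodes:
            if node in busy:
                if not free:
                    raise ValueError("target hub has too few free nodes")
                moves[node] = free.pop(0)

    # Destination nodes must be resumed before the processing is moved.
    to_resume = sorted(moves.values())
    # Every node on the other hubs can then be shifted to the stopped state.
    to_stop = sorted(n for h, ns in hubs.items() if h != target for n in ns)
    return target, moves, to_resume, to_stop


# The situation of FIG. 6A: three hubs with six nodes each, five busy nodes.
hubs = {f"H{h}": [f"N{h}{i}" for i in range(1, 7)] for h in (1, 2, 3)}
busy = {"N11", "N12", "N23", "N24", "N31"}
target, moves, to_resume, to_stop = consolidate(hubs, busy)
```

With these inputs the plan moves the jobs of N23, N24, and N31 to N13, N14, and N15, which matches the example in the text. A real scheduler would also have to weigh migration cost and the capacity of the target hub.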

[0046] Although the cluster system of this embodiment has been described using an HPC-type cluster, the power-saving method of this embodiment is not limited to the HPC type. The method is applicable to any system in which application processing is performed on a plurality of nodes, and it can also be used for HA-type clusters.

[0047] Modes of implementation of the cluster system according to the present invention are listed below.

(Supplementary Note 1) A cluster system that manages the operation of nodes among which the processing of an application is distributed, wherein when a node is in an idle state the operation of the node is put into a stopped state, and the node is started when the node is to execute the processing.

(Supplementary Note 2) The cluster system according to Supplementary Note 1, wherein the node has a function of detecting its own idle state and putting the operation of the node into the stopped state.

(Supplementary Note 3) The cluster system according to Supplementary Note 2, comprising a counter that counts the number of idle loops of the node, wherein a state prohibiting the stopping of the operation is set in the counter until the node causes the operation of a node different from the node to transition to the stopped state.

(Supplementary Note 4) The cluster system according to Supplementary Note 1 or 2, comprising storage means for storing the operating states of the plurality of nodes, wherein, when a node is to be started, no start signal is issued to the node if the node is stored as being in the started state.

(Supplementary Note 5) The cluster system according to Supplementary Note 1 or 2, wherein power-saving states are set in a plurality of stages, and the system has a function of gradually raising the stage of the power-saving state as the idle time of the node becomes longer.

(Supplementary Note 6) The cluster system according to Supplementary Note 1 or 2, wherein the stopped state of the node includes a standby state, a suspend state, or a hibernation state.

(Supplementary Note 7) The cluster system according to any one of Supplementary Notes 1 to 6, having a process migration function that gathers jobs being executed on nodes connected to different peripheral devices onto nodes connected to the same peripheral device.

(Supplementary Note 8) The cluster system according to Supplementary Note 1 or 2, wherein the node is started when it receives a Wake-on-LAN message.

(Supplementary Note 9) The cluster system according to Supplementary Note 1 or 2, wherein the node is started when its built-in modem receives an incoming call signal.

(Supplementary Note 10) The cluster system according to Supplementary Note 1 or 2, wherein the node manages the power supply of the node by an advanced performance management function.

(Supplementary Note 11) The cluster system according to Supplementary Note 1 or 2, wherein the node has the power control interface ACPI, and the power supply of the node is managed, by means of the power control interface ACPI, from the OS provided in the node.
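The stored node states of Supplementary Note 4 and the staged power saving of Supplementary Note 5 can be illustrated with a short sketch. It is offered only as an example under stated assumptions: the class name `NodeStateTable`, the `send_wake` callback, and the idle-time thresholds are hypothetical and do not appear in the specification. Only the stage names (standby, suspend, hibernation) are taken from Supplementary Note 6.

```python
RUNNING = "running"

# Assumed thresholds in seconds; the specification gives no concrete values.
STAGES = [(60, "standby"), (300, "suspend"), (1800, "hibernation")]


def power_stage(idle_seconds):
    """Return the deepest power-saving stage reached for the given idle time."""
    stage = RUNNING
    for threshold, name in STAGES:
        if idle_seconds >= threshold:
            stage = name
    return stage


class NodeStateTable:
    """Remembers each node's state so that redundant start signals are skipped."""

    def __init__(self, send_wake):
        self._state = {}
        # Hypothetical callback that delivers the start signal,
        # for example a Wake-on-LAN sender.
        self._send_wake = send_wake

    def mark(self, node, state):
        """Record the current state of a node."""
        self._state[node] = state

    def wake(self, node):
        """Send a start signal only if the node is not recorded as running."""
        if self._state.get(node) == RUNNING:
            return False
        self._send_wake(node)
        self._state[node] = RUNNING
        return True


# Usage: a node idle for 400 s is recorded as suspended, then woken twice.
sent = []
table = NodeStateTable(sent.append)
table.mark("N13", power_stage(400))
first = table.wake("N13")   # signal is sent
second = table.wake("N13")  # node already recorded as running, no signal
```

Keeping the state in one table means the second request costs a dictionary lookup rather than a network message, which is the saving that Supplementary Note 4 describes.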

[0048]

[Effect of the Invention] As described above, according to the present invention, in a cluster system composed of a plurality of nodes, nodes in the idle state are put into the stopped state. This prevents the wasteful increase in power consumption that the presence of idle nodes would otherwise cause and achieves power saving for the cluster system.

[Brief Description of the Drawings]

FIG. 1 is a diagram illustrating the operating states of a node in a cluster system.

FIG. 2 is a diagram illustrating the transition to the suspend state and the return by resume within one node of the cluster system of the present embodiment.

FIG. 3 is a diagram illustrating the group of commands executed in the idle loop of the software.

FIG. 4 is a diagram showing the invoke template in the multithread template library.

FIG. 5 is a diagram illustrating a case where suspend and resume are in contention between different nodes.

FIG. 6 is a diagram illustrating a case where process migration is applied to the cluster system of the present embodiment.

[Explanation of Symbols]

1 ... Cluster system software
2 ... OS
3 ... BIOS
4 ... Hardware
5 ... Network interface

Claims

[Claims]

1. A cluster system that manages the operation of nodes among which the processing of an application is distributed, wherein when a node is in an idle state the operation of the node is put into a stopped state, and the node is started when the node is to execute the processing.

2. The cluster system according to claim 1, wherein the node has a function of detecting its own idle state and putting the operation of the node into the stopped state.

3. The cluster system according to claim 1 or 2, comprising storage means for storing the operating states of the plurality of nodes, wherein, when a node is to be started, no start signal is issued to the node if the node is stored as being in the started state.

4. The cluster system according to claim 1 or 2, wherein power-saving states are set in a plurality of stages, and the system has a function of gradually raising the stage of the power-saving state as the idle time of the node becomes longer.

5. The cluster system according to any one of claims 1 to 4, having a process migration function that gathers jobs being executed on nodes connected to different peripheral devices onto nodes connected to the same peripheral device.