JP4057729B2

JP4057729B2 - Fourier transform method and program recording medium

Info

Publication number: JP4057729B2
Application number: JP37768498A
Authority: JP
Inventors: 有作山本; 健直野
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-12-29
Filing date: 1998-12-29
Publication date: 2008-03-05
Anticipated expiration: 2018-12-29
Also published as: JP2000200261A

Description

【発明の属する技術分野】
本発明は、複数のプロセッサを有する計算機で実行するのに適したフーリエ変換方法に係り、とくに、ベクトル演算器を内蔵する複数のプロセッサからなるベクトル並列計算機で実行するのに適したフーリエ変換方法に関する。
【従来の技術】
科学技術計算において頻繁に利用される処理の一つに、フーリエ変換がある。フーリエ変換は、物理現象のシミュレーションその他に使用される。フーリエ変換は、ある実数区間で定義された複素数値をとる関数ｆ（ｘ）を複素指数関数ｅｘｐ（ｉｋｘ）の重ね合わせとして表す処理であり、計算機上で実現する場合には、扱いうる点の数が有限であることから、複素数の点列ｆ₀，ｆ₁，．．．，ｆ_N-1をＮ個の複素指数関数ｅｘｐ（２πｉｋｊ／Ｎ）（ただし、ｋ＝０，１，．．．，Ｎ−１で、ｉは虚数単位、πは円周率）の重ね合わせとして表す処理となる。すなわち、ｆ₀，ｆ₁，．．．，ｆ_N-1が与えられたときに、式１ａにより重ね合わせの係数ｃ₀，ｃ₁，．．．，ｃ_N-1を求めるのがフーリエ変換である。各点ｆ_jの値は、これらの係数を用いると、式１ｂによりあらわされる。
【数１】
c_k＝(1/N)Σ_j=0 ^N-1f_jexp(-2πikj/N)
（ただし、k=0,1,...,N-1）（１ａ）
f_j＝Σ_k=0 ^N-1c_kexp(2πikj/N)
（ただしj=0,1,...,N-1）（１ｂ）
しかし、この定義に基づいて計算を行うと、式の数がＮ本あり、各式がＮ個の項から成るため、複素指数関数ｅｘｐ（−２πｉｋｊ／Ｎ）の計算に加えて、複素数の加算と乗算が約Ｎ²回必要である。そこで実際には、アルゴリズム上の工夫により計算量を約ＮｌｏｇＮのオーダーに減少させた高速フーリエ変換という手法が広く使われている。
高速フーリエ変換を並列計算機上で効率的に行うための手法として、従来、転置アルゴリズムとバイナリ・エクスチェンジと呼ばれる２つの手法が提案されている（たとえばV. Kumar, A. Grama, A. Gupta and G. Karypis: "Introduction to Parallel Computing, The Benjamin/Cummings Publishing Company, 1994参照）。前者はプロセッサ間の通信を計算途中の一箇所にまとめて行う方式、後者はプロセッサ間で通信を行いながら計算を進める方式であり、プロセッサの台数をｐとすると、通信の回数は前者がｐ−１、後者がｌｏｇ₂ｐで、通信１回あたりに送るデータ量は、前者がＮ／ｐ²、後者がＮ／２ｐである。後者は前者に比べて通信の回数が少なくて済むため、通信のセットアップ時間が支配的となる小規模問題では通信時間が少なくて済むという利点があるが、通信すべきデータの総量は多くなるため、大規模データの場合には前者が有利となる。
半導体デバイスの特性計算、電子状態計算、気象予測のための計算などの科学技術計算では、数万から数百万に上る変数を扱う大規模シミュレーションが必要である。このような大規模問題を扱う手段としては、並列計算機が有力である。並列計算機は数十個から数万個に上る多数の高速プロセッサをネットワークで結んだシステムであり、従来の逐次型計算機に比べ、プロセッサ台数を増やすことでピーク性能をいくらでも高めることができるという利点を持つ。さらに、最近の並列計算機では、各プロセッサで、一連のデータに対して同じ演算を高速に実行するできるように演算器が構成されていることも多い。とくに、各プロセッサに、そのような演算器として同じ演算を複数のデータに対してパイプライン的に実行するベクトル演算器を有するベクトル並列計算機も開発されている。ベクトル並列計算機の中には、このベクトル演算器による演算を指定するベクトル命令を実行できるものもある。さらに、メモリとベクトル演算器の間に複数のベクトルレジスタが設けられている並列計算機もある。これらのベクトルレジスタはメモリと演算器のデータの転送時間が処理時間に及ぼす時間を軽減している。より高速にシミュレーションを実行可能になっている。また、厳密にはベクトル並列計算機ではないが、ベクトル並列計算機に類似の並列計算機として、ある演算を実行する演算器がベクトル演算器でなくても、一連のデータに対してその演算を高速に実行できるように構成されている演算器を使用する並列計算機も多い。
フーリエ変換は科学技術計算でもっともよく使われる処理の一つであり、最近では並列計算機用のライブラリとして提供されることも多い。たとえば日立製作所編「プログラムプロダクトＨＩ−ＵＸ／ＭＰＰ行列計算副プログラムライブラリＭＡＴＲＩＸ／ＭＰＰ」参照。
並列計算機で実行する大規模のシミュレーションがフーリエ変換を実行する場合には前述の転置アルゴリズムが使用されることが多い。上に記載したように、ベクトル型並列計算機あるいはそれらに類似の並列計算機で転置アルゴリズムを実行する場合には、変換すべき一次元空間の点列データを３次元空間に直方体状に並べ、これに対してたとえばＹ方向の変換、Ｘ方向の変換、Ｚ方向の変換を順次行うことによって、全データ点列に対して高速フーリエ変換を行ったのと同一の結果を得る。より具体的には、フーリエ変換の対象となる一次元のデータｆ₀，ｆ₁，．．．，ｆ_N-1を入力し、各辺の長さがＮＸ，ＮＹ，ＮＺの直方体状に並べる。ここで、ＮＸ，ＮＹ，ＮＺはＮＸ＊ＮＹ＊ＮＺ＝Ｎを満たす整数である。データを直方体状に並べるに当たっては、原点からたとえばＺ方向にデータを並べていき、ＮＺ個のデータを並べ終わったら次はＸ座標を１だけ増やしてデータを並べ、これを繰り返してＮＸ＊ＮＺ個のデータを並べ終わったら次はＹ座標を１だけ増やしてデータを並べる、という操作を行う。
このようにデータを並べた後、直方体をＺ軸に垂直にスライスし、こうしてできる各面を並列計算機の一つのプロセッサに割り当てる。次に、Ｙ方向の変換を行う。プロセッサへの入力データｆ_jの割り当て方式より、各ＸＹ平面は１台のプロセッサに担当されているから、この変換処理は通信なしに各プロセッサで独立に行える。次に、同様にして各プロセッサで独立にＹ方向の変換の結果データに対してＸ方向の変換を行う。Ｘ方向の変換の終了後、プロセッサ間でのＸ方向の変換の結果データの入れ替えを行い、今度はその結果データが構成する直方体をＸ軸に垂直にスライスし、こうしてできる各面を一つのプロセッサに割り当てる。この処理を転置と呼び、各プロセッサが自分以外の全プロセッサとデータの交換を行う必要がある。転置の終了後、今度は各プロセッサで独立にＺ方向の変換を行う。以上で、直方体状に並べられた一次元入力データｆ_jのフーリエ変換が終了し、直方体状に並べられた、重ね合わせの係数を表す出力データｃ_kが求まる。出力データｃ_kの並び方は、原点からまずＹ方向に、Ｙ方向にＮＹ個行ったら次はＸ座標が１だけ増え、ＸＹ平面上にＮＸ＊ＮＹ個のデータが並んだら次はＺ座標が１だけ増えるという順で並ぶ。
上記の転置アルゴリズムでは、入力データｆ_jの分割は、ｆ_jを第ＭＯＤ（j，ｐ）番のプロセッサが担当するという形でデータがプロセッサ間で分割されている。このデータ分割形式はサイクリック分割と呼ばれる。データ分割形式はデータのプロセッサへの割り当ての順序を表すものでもあり、本明細書ではデータ分割形式のことを割り当て順序あるいは割り当て態様とも呼ぶことがある。一方、出力データｃ_kの分割は、ＮＹ個の連続するデータを１台のプロセッサが担当するブロックサイクリック分割となり、入力データｆ_jとはプロセッサ間のデータ分割形式が異なる。
上記の転置アルゴリズムでは、入力データｆ_jの並べ方およびそのデータのプロセッサへの分割の仕方より分かるように、入力データの分割形式は、ｆ_jを第ＭＯＤ（j，ｐ）番のプロセッサが担当するというサイクリック分割となる。一方、転置後のデータのプロセッサへの分割の仕方、および変換で得られた出力データｃ_kの並び方より分かるように、出力データの分割形式は、ＮＹ個の連続するデータを１台のプロセッサが担当するというブロックサイクリック分割となる。
しかし多くの応用では、フーリエ変換と逆フーリエ変換とを対にして用い、しかも逆フーリエ変換はフーリエ変換プログラムを流用して行うため、フーリエ変換の入力データと出力データが同じデータ分割形式（データ割り当て順序）になっている方が都合がよい。そのため、従来の高速フーリエ変換方法では、以上の処理に従ってブロックサイクリック分割の出力データｃ_kを得た後、再びプロセッサ間でデータの転送を行い、データｃ_kをサイクリック分割に直して出力する必要がある。
【発明が解決しようとする課題】
本発明者の検討の結果、以上の従来のフーリエ変換方法では、フーリエ変換係数の計算後に行うデータ分割形式（データ割り当て順序）の変更のためのプロセッサ間でのデータ転送が、フーリエ変換時間の短縮を妨げていることが分かった。
したがって、本発明の目的は、フーリエ変換係数の計算後にデータ分割形式の変更のためにデータ転送を行わなくても、フーリエ変換結果データがフーリエ変換対象データと同一のデータ分割形式（データ割り当て順序）を持ち得るフーリエ変換方法を提供することである。
【課題を解決するための手段】
上記目的を達成するため、本発明によるフーリエ変換方法は、
各プロセッサにより、第１の変換処理、第２の変換処理、第３の変換処理を順次実行し、
上記複数のプロセッサの各々による、上記第１、第２の変換処理のいずれか一方の変換処理の実行後に、上記複数のプロセッサでのその一方の変換処理を実行した結果得られた一群の結果データを構成する複数の結果データ部分群が異なるプロセッサに割り当てられるように、上記一群の結果データを上記複数のプロセッサの間で交換するステップを有する。
上記第１から第３の変換処理は、一群の順序づけられた変換対象データに対する一群の順序づけられたフーリエ変換係数データを構成する複数のフーリエ変換係数データ部分群をそれぞれ異なるプロセッサにより生成するように定められ、各プロセッサには、上記一群の変換対象データを構成する複数の変換対象データ部分群の一つの変換対象データ部分群がそのプロセッサに対して予め割り当てられ、
上記一群のフーリエ変換係数データのそれぞれを生成したプロセッサの順序が、上記一群の変換対象データのそれぞれが割り当てられたプロセッサの順序と同一となるように、上記交換するステップで各プロセッサに割り当てられる結果データ部分群が定められているもの。
より具体的には、上記第１、第２、第３の変換処理は、それぞれ３次元データ空間の第１、第２、第３の座標軸に関する変換処理であり、
上記変換対象データ群の各々は、上記３次元データ空間の直方体形状に位置する格子点群の一つの座標をそれぞれ有し、
上記複数の変換対象データ部分群は、上記３次元データ空間の第３の座標軸の座標値が同じであり、上記３次元データ空間の第１、第２の座標軸の座標値が異なる全ての変換対象データが同一の変換対象データ部分群に含まれるように定められ、
上記フーリエ変換係数データ群の各々は、３次元係数空間の直方体形状に位置する格子点群の一つの座標をそれぞれ有し、
上記複数のフーリエ変換係数データ部分群は、上記３次元係数空間の第１の座標軸の座標値が同じであり、上記３次元波数空間の第２、第３の座標軸の座標値が異なる全てのフーリエ変換係数データが同一のフーリエ変換係数データ部分群に含まれるように定められている。
本発明の具体的な態様によるフーリエ変換方法では、
上記変換対象データ群の各々は、上記３次元データ空間の直方体形状に位置する格子点群の一つの座標をそれぞれ有し、
上記複数の変換対象データ部分群は、上記３次元データ空間の第３の座標軸の座標値が同じであり、上記３次元データ空間の第１、第２の座標軸の座標値が異なる全ての変換対象データが同一の変換対象データ部分群に含まれるように定められ、
上記フーリエ変換係数データ群の各々は、３次元係数空間の直方体形状に位置する格子点群の一つの座標をそれぞれ有し、
上記複数のフーリエ変換係数データ部分群は、上記３次元係数空間の第１の座標軸の座標値が同じであり、上記３次元波数空間の第２、第３の座標軸の座標値が異なる全てのフーリエ変換係数データが同一のフーリエ変換係数データ部分群に含まれるように定められる。
更に具体的には、
上記変換対象データ群が上記３次元データ空間に直方体形状に位置する格子点群に上記３次元空間に第３の座標軸、第２の座標軸、第１の座標軸の順に順次割り当てられ、
上記第１から第３の変換処理は、上記複数のフーリエ変換係数データが、３次元係数空間に直方体形状に位置する格子点群に、当該３次元係数空間の第１、第２、第３の座標軸の順序で割り当てられるように定められている。
更に具体的な態様では、
各プロセッサが上記第１の変換処理により生成する上記一つの第１の結果データ部分群は、上記３次元データ空間の第３の座標軸の座標値が所定の同じ値であり、上記３次元データ空間の第２の座標軸の座標値と上記３次元係数空間の第１の座標軸の座標値が異なる値を有する全ての複数の第１の結果データを含み、
上記交換ステップが上記第１の変換処理が上記複数のプロセッサにより実行された後に実行され、
上記複数のプロセッサは、この交換ステップで、上記３次元係数空間の第１の座標軸の座標値が所定の同じ値であり、上記３次元データ空間の第２、第３の座標軸の座標値が異なる値を有する全ての複数の第１の結果データを含む第１の結果データ部分群が同一のプロセッサに割り当てられるように、上記複数のプロセッサが生成した一群の第１の結果データを上記複数のプロセッサの間で交換し、各プロセッサが上記第２の変換処理により生成する上記一つの第２の結果データ部分群は、上記３次元係数空間の第１の座標軸の座標値が所定の同じ値であり、上記３次元波数空間の第２の座標軸の座標値と上記３次元データ空間の第３の座標軸の座標値が異なる値を有する全ての複数の第２の結果データを含み、
各プロセッサが上記第３の変換処理により生成する上記一つのフーリエ変換係数部分群は、上記３次元係数空間の第１の座標軸の座標値が所定の値であり、上記３次元波数空間の第２、第３の座標軸の座標値が異なる値を有する全ての複数のフーリエ変換係数を含む。
更に具体的な他の態様では、
各プロセッサが上記第１の変換処理により生成する上記一つの第１の結果データ部分群は、上記３次元データ空間の第３の座標軸の座標値が所定の同じ値であり、上記３次元データ空間の第２の座標軸の座標値と上記３次元係数空間の第１の座標軸の座標値が異なる値を有する全ての複数の第１の結果データを含み、
上記交換ステップが上記第２の変換処理が上記複数のプロセッサにより実行された後に実行され、
各プロセッサが上記第２の変換処理により生成する上記一つの第２の結果データ部分群は、上記３次元データ空間の第３の座標軸の座標値が所定の同じ値であり、上記３次元係数空間の第１、第２の座標軸の座標値が異なる値を有する全ての複数の第２の結果データを含み、
上記複数のプロセッサは、上記交換ステップにより、上記３次元係数空間の第１の座標軸の座標値が所定の同じ値であり、上記３次元係数空間の第１の座標軸の座標値と上記３次元データ空間の第３の座標軸の座標値が異なる値を有する全ての複数の第１の結果データを含む第１の結果データ部分群が同一のプロセッサに割り当てられるように、上記複数のプロセッサが生成した一群の第１の結果データを上記複数のプロセッサの間で交換し、
各プロセッサが上記第３の変換処理により生成する上記一つのフーリエ変換係数部分群は、上記３次元係数空間の第１の座標軸の座標値が所定の値であり、上記３次元係数空間の第２、第３の座標軸の座標値が異なる値を有する全ての複数のフーリエ変換係数を含む。
本発明のより具体的な態様では、
各プロセッサにより、３次元空間の第１、第２、第３の座標軸の座標にそれぞれ関する第１、第２、第３の変換処理を順次かつ他のプロセッサと並行して実行し、
各プロセッサが上記第１、第２の変換処理のいずれか一方を実行した後に、その一方の変換処理の結果それぞれのプロセッサで得られた複数の結果データを上記複数のプロセッサの間で交換するステップを有する。
ここで、一群の順序づけられた変換対象データが上記３次元空間に直方体の形に並べられ、
上記第１から第３の変換処理は、上記一群の変換対象データに対する一群の順序づけられた３次元空間の座標を有する複数のフーリエ変換係数データを生成するように定められ、
上記複数の変換対象データが構成する上記直方体を分割する上記３次元空間の上記第１の座標軸に垂直な複数の面の各々に含まれる複数の変換対象データが同一のプロセッサに割り当てられ、
上記交換ステップは、上記一方の変換処理の結果得られた上記複数の結果データが構成する３次元空間の直方体を、その３次元空間の第１の座標軸に垂直な複数の面で分割し直して、各面に属する複数の結果データを同一のプロセッサに割り当てるように、上記一方の変換処理の結果得られた上記複数の結果データを上記複数のプロセッサ間で交換するステップを有する。
とくに、望ましくは、上記一群の順序づけられた変換対象データを上記３次元空間に直方体の形に並べられる順序は、第３の座標軸、第２の座標軸、第１の座標軸の順であり、
上記第１から第３の変換処理は、上記複数のフーリエ変換係数データが３次元空間に第１、第２、第３の座標軸の順序で並べられるように定められている。
さらに望ましくは、本発明によるフーリエ変換方法は、各プロセッサがパイプライン演算器を含み、その演算器での演算の対象とするループ長がＬのときのその各プロセッサの演算性能を求めるための性能データを上記複数のプロセッサに共通に記憶し、
その性能データを用いて、上記直方体の上記第１、第２、第３の座標軸方向の長さを決定し、
その決定された上記第１、第２、第３の座標軸方向の長さを有する直方体に、上記順序づけられた複数の変換対象データを並べるステップをさらに有する。
本発明によるプログラム記憶媒体は、上記いろいろのフーリエ変換方法のいずれかを実行するようにプログラムされたプログラムが記憶する。
さらに、本発明によるシミュレーション方法は、上記いろいろのフーリエ変換方法のいずれかを使用してシミュレーションを実行する。
本発明による他のプログラム記憶媒体は、上記シミュレーション方法を実行するようにプログラムされたプログラムを記憶する。
【発明の実施の形態】
以下、本発明に係るフーリエ変換方法、それを用いるシミュレーション方法およびプログラムを図面に示したいくつかの実施の形態を参照してさらに詳細に説明する。なお、以下においては、同じ参照番号は同じものもしくは類似のものを表わすものとする。また、第２の実施の形態以降では、第１の実施の形態との相違点を主に説明するに止める。
＜発明の実施の形態１＞
（１）装置の概略構成
本発明によるフーリエ変換方法を実行するための並列計算機システムの一例を図１に示す。並列計算機２８は、それぞれがメモリ２６を備えた複数のプロセッサ２７と、プログラムおよびデータを格納するための複数の外部記憶装置３１から構成され、これらの装置は、内部データ転送ネットワーク２９を介して相互にデータを交換可能なように構成されている。外部記憶装置３１には、たとえば、多くのユーザの利用に供するために並列計算機２８に予め準備された複数のプログラムライブラリ４４とそれらが使用するデータ３０等が記憶される。各プロセッサのメモリ２６は、いわゆるローカルメモリであり、このメモリに記憶されたデータに割り当てられるアドレスは、そのプロセッサで定められたローカルなアドレス空間に属するアドレスであり、この種のメモリは一般に分散メモリと呼ばれ、この種のメモリを有する計算機は分散メモリ型の並列計算機と呼ばれる。並列計算機２８は、各プロセッサ２９が、一連のデータ要素からなるベクトルデータに対して同じ演算をパイプライン的に連続して実行できるベクトル演算器（図示せず）を備えるベクトル並列計算機であると仮定する。
これらのプロセッサ２７内の特定の一つのプロセッサには、ユーザが操作可能な計算機、たとえばワークステーション１がＬＡＮ等のネットワーク２を介して接続されている。この計算機は他の計算機たとえばパーソナルコンピュータでもよい。このワークステーション１には、並列計算機２８に対する指示あるいはデータを入力するための入力装置３（典型的には、キーボードとマウス）と、並列計算機２８からの計算結果を出力するための出力装置２９（典型的には、表示装置と印刷装置）が接続されている。なお、ワークステーション１内には、並列計算機２８に送るべきプログラムおよびそのプログラムで使用するデータを記憶する記憶装置（図示せず）も設けられている。
上記特定のプロセッサは、並列計算機２８内で計算を司るプロセッサの役目とユーザ用のワークステーション１との通信の役目とを兼ねる。すなわち、このプロセッサは、ワークステーション１から送付されるプログラムとデータを受信し、それらを外部記憶装置３１の一つに記憶し、その後、並列計算機２８の内部に記憶された適当なプログラムにより、ユーザ指定のプログラムを複数のプロセッサ（具体的には全プロセッサ）にロードし、ユーザ指定のデータの異なる部分を、それぞれそれらのプロセッサの異なるものに割り当て、そのユーザ指定のプログラムを起動する。
しかしながら、本発明によるフーリエ変換方法を実施するためには、並列計算機２８は、複数のプロセッサを有することが必要であるが、それ以外の点では特に限定した構造を有しなくてもよいことは言うまでもない。並列計算機２８は、ベクトル並列計算機であると仮定したが、このベクトル演算器はごく一部の演算のみを実行でき、他の演算はベクトル演算器ではないスカラ演算器で実行されてもよい。さらに、並列計算機２８は、対してこのような演算器を有しなくてもよい。もちろん、一連のデータに対する同じ演算を高速に実行できるように構成されている演算器を有することが望ましい。また、並列計算機２８は、メモリ２９と演算器（図示せず）の間に複数のベクトルレジスタを有しないと仮定するが、これらのレジスタが使用することはより望ましいことである。
さらに、それらのプロセッサの具体的な構造あるいはそれらの間のデータ転送ネットワークの構造、あるいはそれらのプロセッサと入力装置あるいは出力装置との接続形態がいろいろであっても、本発明はそれらの並列計算機に適用可能である。たとえば、ワークステーションと通信可能な複数のプロセッサが設けられていてもよく、また、ワークステーションと通信可能な少なくとも一つのプロセッサが計算用のプロセッサとは別に設けられていてもよい。また、実行すべきプログラムとデータを並列計算機２８に送付する方法は他の方法に依ってもよいことは明らかである。
ユーザは上記複数のプロセッサ２７を使用して種々の計算を実行できる。最も典型的な計算は、物理現象などのシミュレーションであり、たとえば、地球の気象の予測もシミュレーションにより行われる。半導体デバイスの設計も、半導体デバイスの物理的な動作をシミュレーションして行われている。このようなシミュレーションを並列計算機を使用して実行する場合、シミュレーション対象の物理領域を複数の部分領域に区分し、各部分領域を一つのプロセッサに割り当て、そのプロセッサにおいて、その部分領域についてのシミュレーションを、たとえば一つまたは複数の物理量に関する偏微分方程式を解いて実行することが多い。この場合、シミュレーションに使用される複数のプロセッサは、同じシミュレーションプログラムを互いに並列に実行する。したがって、このようなプログラムは並列プログラムとも呼ばれる。各プロセッサが実行するシミュレーションプログラムが使用するデータは異なる。たとえばシミュレーション領域の位置と形状を表すデータ、シミュレーションすべき物理量の初期値、シミュレーション領域の物質に関する物質定数、あるいは各部分領域に関する境界条件など異なる。各プロセッサは、計算の途中で得られた結果データを他の適当なプロセッサに転送し、あるいは他のプロセッサから計算結果データを受け取り、さらにシミュレーションを続けていく。
このシミュレーションプログラムの中にはフーリエ変換を使用するものもある。本実施の形態では、いろいろのシミュレーションの利用に供するために、本発明によるフーリエ変換方法にしたがってフーリエ変換を実行するようにプログラムされたフーリエ変換ライブラリがいずれかの外部記憶装置３１に記憶される。さらにプロセッサ間の通信を実行するための通信ライブラリも外部記憶装置３１に記憶される。シミュレーションプログラムは、上記フーリエ変換ライブラリあるいは通信ライブラリを必要な時点でコールするようにプログラムされる。並列計算機２８は、ワークステーション１から送信されたユーザ指定のシミュレーションプログラムと、そのシミュレーションプログラムが使用するライブラリ（今の場合には上記フーリエ変換ライブラリと上記通信ライブラリ）を各プロセッサにロードする。さらに、並列計算機２８は、それぞれのプロセッサでシミュレーションプログラムが使用する、ワークステーション１から送信されたユーザ指定のデータをそれぞれのプロセッサにロードする。なお、シミュレーションプログラムは全プロセッサにロードされてもよく、一部のプロセッサにロードされてもよいが、以下では簡単化のために、シミュレーションプログラムは、全てのプロセッサにロードされると仮定する。上記ライブラリあるいは上記シミュレーションプログラムは、並列計算機２８の命令セットあるいはハード構造の特徴、ソフトウエア上の制約等を反映するコンパイラによりコンパイルされたものである。本発明によりフーリエ変換方法を実行する上記ライブラリあるいは上記シミュレーションプログラムに上記ライブラリに含まれたフーリエ変換のためのプログラム部分を組み込んだプログラムを磁気記憶装置のようなプログラム記録媒体に記憶して販売できる。
（２）並列高速フーリエ変換の原理
すでに述べたごとく、フーリエ変換は、Ｎ個の入力データｆ₀，ｆ₁，．．．，ｆ_N-1からＮ個の出力データｃ₀，ｃ₁，．．．，ｃ_N-1を、式１ａを用いて計算する処理である。
入力データｆ_j、出力データｃ_kは、実数データであっても複素データであってもよい。入力データｆ_j、出力データｃ_kはそれぞれ実空間のデータ、波数空間のデータと呼ばれることがある。すなわち、入力データｆ_jの添え字jは、一次元の実空間の格子点の座標を表し、出力データｃ_kの添え字ｋは、一次元の波数空間の格子点の座標を表すと考えることができる。言い換えると、上記の式によるフーリエ変換は、一次元の実空間のデータを一次元の波数空間のデータに変換する処理である。したがって、本明細書では、入力データｆ_jの添え字jを一次元実空間の格子点座標あるいは単に座標と呼び、出力データｃ_kの添え字ｋを一次元波数空間の格子点の座標あるいは単に座標と呼ぶことがある。あるいは、それらのデータはその座標を有すると呼ぶことがある。しかし、入力データｆ_j、出力データｃ_kが実際にはそのような実空間、波数空間に属するデータでなくてもよい。
一般に、並列計算機を使用して演算を行う場合、できるだけ多くのプロセッサが互いに並列に動作する時間を増大するとともに、プロセッサ間のデータ通信の総回数を少なくすることが望ましいことが知られている。データ通信は、プロセッサ内部の計算時間に比べて時間が掛かる上に、通信はあるプロセッサからのデータの送信と他のプロセッサでの受信となからなり、受信側のプロセッサでは、ある処理を実行する前に他のプロセッサからそこでの演算結果データを受信するようにプログラムされた場合、そのプロセッサは、受信すべき演算結果データが受信されるまで、その処理を開始することができない。したがって、各プロセッサでは、通信の発生に伴い、受信待ち時間が増大し、他のプロセッサと並列に動作する時間が減少する。したがって、並列計算機で演算を高速に行うには、プロセッサ間の通信の総回数を減らすことが望ましいことが知られている。このことは演算としてフーリエ変換を並列計算機で実行する場合も同じである。このためには、演算で使用するデータをどのプロセッサに割り当てるか、いつプロセッサ間で演算結果データを交換するかが重要な問題である。
並列計算機でフーリエ変換を行うには、従来の転置アルゴリズムでは、変換対象データｆ_jを以下のようにして３次元の実空間のデータに写像し、それを用いて変換対象データを割り当てるプロセッサを決定することになる。
いま、ＮＸ，ＮＹ，ＮＺをＮＸ＊ＮＹ＊ＮＺ＝Ｎを満たす正の整数とし、１次元の添字ｊ，ｋを３次元の添字（ｊ_x，ｊ_y，ｊ_z）、（ｋ_x，ｋ_y，ｋ_z）に次の式２、３によって置換する。
【数２】

【数３】

ここで、記号＊は乗算を表す。この置換は、１次元実空間の格子点座標ｊ、１次元の波数空間の格子点座標ｋを、それぞれ３次元の実空間の格子点座標（ｊ_x，ｊ_y，ｊ_z）、３次元の波数空間の格子点座標（ｋ_x，ｋ_y，ｋ_z）に写像することであるとも言える。
３次元の実空間の座標（ｊ_x，ｊ_y，ｊ_z）は、１次元の実空間の座標ｊから次式により計算される。
ｊ x=MOD(j/NZ，NX)
ｊy=(j/(NX*NZ))↓
ｊz=MOD(j，NZ)
ここで、（）↓は、括弧内の数値の整数部分のみを表し、小数点以下を切り捨てることを表す。
したがって、ｊが０，１，２，３，，，（Ｎ−１）と変化したときに、（ｊ_x，ｊ_y，ｊ_z）は、（０，０，０），（０，０，１），（０，０，２），（０，０，３），，，（０，０，ＮＺ−１）と変化し、さらに、（１，０，０），（１，０，１），（１，０，２），（１，０，３），，，（１，０，ＮＺ−１）と変化し、この変化を（ＮＸ−１，０，ＮＺ−１）まで変化した後に、ｊ_yを１に変えて上記変化を座標（ＮＸ−１，ＮＹ−１，ＮＺ−１）に達するまで繰り返す。すなわち、一次元の順次異なる座標点ｊに対応する３次元の座標点（ｊ_x，ｊ_y，ｊ_z）は、Ｚ方向、Ｘ方向、Ｙ方向の順に変化する。本明細書では、このようにフーリエ変換の対象となる一次元実空間のデータｆ₀，ｆ₁，．，ｆ_N-1を３次元実空間のデータに写像することを、簡単化のために各辺の長さがＮＸ，ＮＹ，ＮＺの直方体状に並べるともいう。以上の置換は、言い換えると、図２に示すように、原点からまずＺ方向にデータを並べていき、ＮＺ個のデータを並べ終わったら次はＸ座標を１だけ増やしてデータを並べ、これを繰り返してＮＸ＊ＮＺ個のデータを並べ終わったら次はＹ座標を１だけ増やしてデータを並べるという操作を行うことと等価である。但し、図２では、Ｎは５１２であり、ＮＸ，ＮＹ，ＮＺはともに８に等しいと仮定した。
３次元の波数空間の座標（ｋ_x，ｋ_y，ｋ_z）は、１次元の波数空間の座標ｋから次式により計算される。
ｋ_x=MOD(k/NY,NX)
ｋ_y=MOD(k,NY)
ｋ_z=(k/(NX*NY))↓
したがって、ｋが０，１，２，３，，，（ＮＸ＊ＮＹ＊ＮＺ−１）と変化したときに、（ｋ_x，ｋ_y，ｋ_z）は、（０，０，０），（０，１，０），（０，２，０），（０，３，０），，，（０，ＮＹ−１，０）と変化し、さらに、（１，０，０），（１，１，０），（１，２，０），（１，３，０），，，（１，ＮＹ−１，０）と変化し、この変化を（ＮＸ−１，ＮＹ−１，０）まで変化した後に、ｋ_zを１に変えて上記変化を座標（ＮＸ−１，ＮＹ−１，ＮＺ−１）に達するまで繰り返す。すなわち、一次元の順次異なる座標点ｋに対応する３次元の座標点（ｋ_x，ｋ_y，ｋ_z）は、Ｙ方向、Ｘ方向、Ｚ方向の順に変化する。したがって、求めるべきフーリエ変換係数ｃ_kとそれに対応する３次元の波数空間の座標（ｋ_x，ｋ_y，ｋ_z）との関係は図３に示すとおりになる。但し、図３では、Ｎは５１２であり、ＮＸ，ＮＹ，ＮＺはともに８に等しいと仮定した。
なお，転置アルゴリズム自体は、１次元空間のデータ列ｆ_j、それに対するフーリエ変換結果データｃ_kを２次元空間のデータｆ_jx,jy,ckx,kyに変換して行うこともできる。この場合には、１次元のフーリエ変換を２次元のフーリエ変換に置き直すことになる。しかし、ここで記載するように、１次元空間のデータ列ｆ_j、それに対するフーリエ変換結果データｃ_kを３次元空間のデータｆ_jx,jy,jzとｃ_kx,ky,kzに変換してフーリエ変換を行うのは，並列計算機の個々のプロセッサがベクトル演算器を持つ場合に、そのベクトル演算器をうまく利用するためである。この場合には、１次元のフーリエ変換を３次元のフーリエ変換に置き直すことになる。すなわち，以下の実施の形態でも述べるように，変換の各ステップでは，Ｘ方向、Ｙ方向、Ｚ方向のうちのどれかの方向について変換を行い，残りの２方向のうちの１方向を用いて並列化を行い，更に残りの１方向を用いてベクトル化を行う。このため，データを３つの方向を持つ直方体状に並べる必要がある。原理的には，式２，３と同様の変換を行って，データを２次元空間あるいは４次元以上の空間に並べ直すこともできるが，２次元では並列化とベクトル化の両方を行うには次元が足りず，また，４次元以上では不要な次元ができ，その分だけベクトル化対象のループ長が短くなってしまうので性能的に不利である。そのため，ベクトル演算器を持つ並列計算機上で高速フーリエ変換を行うには，データを式２，３のように３次元に並べ直すことがことが望ましい。このようなデータの変換は、各プロセッサがベクトル演算器を持たない場合にも演算の高速化に有効な場合が多い。
置換式２，３を使用すると、式１ａは次のように書き換えられる。
【数４】

さらに、この変換式は次の３式で表される３ステップの変換をそれらの式の順に順次実行することにより実現される変換であることが分かる。
【数５】
ｃ'_jx,ky,jz＝Σ_jy=0 ^Ny-1ｆ_jx,jy,jz*exp(-2πiｋ_yｊ_y/NY) （５）
【数６】

【数７】

式５は、ｊ_x、ｊ_zが特定の値であり、ｊ_yの値が異なるＮＹ個の入力データｆ_jx,jy,jzに対するフーリエ変換を、ｊ_x、ｊ_zが採りうる値の組合わせの数（ＮＸ＊ＮＺ組）だけ行い、それにより上記二つの実空間座標（ｊ_x、ｊ_z）の組のひとつにそれぞれ対応する複数（ＮＸ＊ＮＺ）組の、３次元の波数空間の一つの座標（ｋ_y）に関する一次変換結果データ（ｃ'_jx,ky,jz）（ｋ_y＝０〜ＮＹ−１、ｊ_x＝０〜ＮＸ−１、ｊ_z＝０〜ＮＺ−１）を得る処理を表す。
式６も、複素指数関数の中にｋ_y／ＮＹという余分な項が入る以外はフーリエ変換と同じ変換を表し、具体的には、この式は、ｊ_z、ｋ_yが特定の値であり、ｊ_xの値が異なるＮＸ個の一次変換結果データｃ'_jx,ky,jzに対してフーリエ変換と類似の変換を、ｊ_z、ｋ_yが採りうる値の組合わせの数（ＮＹ＊ＮＺ組）だけ行い、それにより、座標ｊ_zの異なる値に対する、３次元の波数空間の二つの座標（ｋ_x、ｋ_y）に関する２次変換結果データ（ｃ''_kx,ky,jz）（ｋ_x＝０〜ＮＸ−１、ｋ_y＝０〜ＮＹ−１、ｊ_z＝０〜ＮＺ−１）を得る処理を表す。
式７も、複素指数関数の中にｋ_x／ＮＹ＋ｋ_y／（ＮＸ＊ＮＹ）という余分な項が入る以外はフーリエ変換と同じ変換を表し、具体的には、この式は、ｋ_x、ｋ_yが特定の値であり、ｊ_zの値が異なるＮＺ個のデータｃ''_kx,ky,jzに対してフーリエ変換と類似の変換を、ｋ_x、ｋ_yが採りうる値の組合わせの数（ＮＸ＊ＮＹ組）だけ行い、それにより３次元の波数空間の３つの座標（ｋ_x,ｋ_y,ｋ_z）に関する最終的なフーリエ変換結果データ（ｃ_kx,ky,kz）（ｋ_x＝０〜ＮＸ−１、ｋ_y＝０〜ＮＹ−１、ｋ_z＝０〜ＮＺ−１）を得る処理を表す。したがって、これらの３つの変換は、すべて高速フーリエ変換のアルゴリズムを用いて実行することができる。
以下では、これらの変換をそれぞれＹ方向の変換、Ｘ方向の変換、Ｚ方向の変換と呼ぶ。本明細書では、これらの変換を簡単化のためにそれぞれＹ方向のフーリエ変換、Ｘ方向のフーリエ変換、Ｚ方向のフーリエ変換と呼ぶこともある。ここで、Ｘ方向等は、式２、３で定めた座標変換できまる方向である。すなわち、一次元の順次異なる座標点ｊに対応して最初に順次変化する座標がＺ座標であり、その後に変化する座標がＸ座標であり、最後に変化する座標がＹ座標である。座標変換式を式２、３から変更すれば、Ｘ方向等の変換の内容が変わるのは言うまでもない。したがって、本明細書では、より一般的には、これらの変換は以下の変換を指す。
Ｙ方向の変換とは、式５により例示されたように、実空間の第１、第３の座標軸の座標（ｊ_x，ｊ_z）が特定の値であり、第２の座標軸の座標（ｊ_y）の値が異なる複数（ＮＹ）個の入力データｆ_jx,jy,jzに対してフーリエ変換を行い、３次元実空間の第１、第３の座標軸の座標（ｊ_x，ｊ_z）の組のひとつにそれぞれ対応する複数（ＮＸ＊ＮＺ）組の、３次元波数空間の第２の座標軸の座標（ｋ_y）に関連する一次変換結果データ（ｃ'_jx,ky,jz）（ｊ_x＝０〜ＮＸ−１、ｋ_y＝０〜ＮＹ−１、ｊ_z＝０〜ＮＺ−１）を得る処理を指す。あるいは、言い換えると、Ｙ方向の変換は、入力データｆ_jx,jy,jzに対して、実空間の第２の座標軸に関してフーリエ変換を行い、３次元波数空間の第２の座標軸の座標と、３次元実空間の第１の座標軸の座標と、第３の座標軸の座標との関数である一次変換結果データを得る処理を指すとも言える。
さらに、Ｘ方向の変換とは、式６により例示されたように、上記第３の実空間座標系の第１の座標系の座標（ｊ_z）と３次元波数空間の第２の座標系の座標（ｋ_y）とが特定の値であり、上記第１の実空間座標系の座標（ｊ_x）の値が異なる複数（ＮＸ）個の一次変換結果データ（ｃ'_jx,ky,jz）に対してフーリエ変換に類似の変換を行い、上記第３の実空間座標軸の座標（ｊ_z）の異なる値の一つにそれぞれ対応する複数（Ｎ_z）個の、上記３次元の波数空間の第１、第２の座標軸の座標（ｋ_x、ｋ_y）に関連する２次変換結果データ（ｃ''_kx,ky,jz）（ｋ_x＝０〜ＮＸ−１、ｋ_y＝０〜ＮＹ−１、ｊ_z＝０〜ＮＺ−１）を得る処理を指す。あるいは、言い換えると、Ｘ方向の変換は、一次変換結果データに対して、３次元実空間の第１の座標軸に関してフーリエ変換に類似の変換を行い、３次元実空間の第３の座標軸の座標と、３次元波数空間の第１、第２の座標軸の座標との関数である２次変換結果データを得る処理を指すとも言える。
さらに、Ｚ方向の変換とは、式７により例示されたように、３次元波数空間の第１、第２の座標系の座標（ｋ_x，ｋ_y）とが特定の値であり、３次元実空間の第３の座標系の座標（ｊ_z）の値が異なる複数（ＮＺ）個の２次変換結果データ（ｃ''_kx,ky,jz）に対してフーリエ変換に類似な変換を行い、３次元波数空間の第１、第２、第３の座標軸の座標（ｋ_x，ｋ_y，ｋ_Z）に関連する、入力データに対する最終的なフーリエ変換結果データ（ｃ_kx,ky,kz）（ｋ_x＝０〜ＮＸ−１、ｋ_y＝０〜ＮＹ−１、ｋ_z＝０〜ＮＺ−１）を得る処理を指すとも言える。あるいは、言い換えると、Ｚ方向の変換は、２次変換結果データに対して、３次元実空間の第３の座標軸に関してフーリエ変換に類似の変換を行い、３次元波数空間の第１、第２、第３の座標軸の座標の関数である最終的なフーリエ変換結果データを得る処理を指すとも言える。
Ｙ方向の変換は、式５にて示されるように、ＮＸ＊ＮＺ組のデータに対する互いに独立な変換からなる。同様に、Ｘ方向の変換は、式６にて示されるように、ＮＹ＊ＮＺ組のデータに対する互いに独立な変換からなる。同様に、Ｚ方向の変換は、式７にて示されるように、ＮＸ＊ＮＹ組のデータに対する互いに独立な変換からなる。
従来の転置アルゴリズムによるフーリエ変換方法では、この特徴を利用してプロセッサ間の通信を少なくするように、変換対象データを並列計算機の異なるプロセッサに割り当てている。すなわち、式２に従って、また、図２に例示されるように、変換対象データｆj（j＝０〜Ｎ）を直方体状に並べ、３次元実空間のＺ軸に並行な平面でこのデータを分割し、図４に例示するように、jz＝０，１，，，７をそれぞれ有する複数のデータはプロセッサ０，１，，７に割り当てられている。すなわち、特定の値のＺ座標jzを有する全ての変換対象データは、それらのＸ座標jx、Ｙ座標jyの値に依らないで同一のプロセッサに割り当てる。図２では、Ｎ＝５１２，ＮＸ＝ＮＹ＝ＮＺ＝８と仮定し、図４ではプロセッサの総数ＮＰＵ＝８と仮定したが、これらの数値がここに仮定の数値と異なる場合でも、特定の値のＺ座標jzを有する全ての変換対象データは、それらのＸ座標、Ｙ座標の値に依らないで同一のプロセッサに割り当てればよい。たとえば、プロセッサの総数ＮＰＵ＝ＮＸ＝ＮＺとし、ＮＹ＝（Ｎ／（ＮＸ＊ＮＺ））↑とすればよい。ここで、（）↑は、括弧内の数値の小数点以下を切り上げた後の整数を示す。たとえば、Ｎ＝５１２、ＮＰＵ＝４のときには、ＮＸ＝ＮＺ＝４、ＮＹ＝３２であればよい。
このようなデータの割り当てを行った後、フーリエ変換を以下のように実行する。この方法では式５によるＹ方向の変換と式６によるＸ方向の変換とは、プロセッサ間の通信を使用しないで行うことができる。
（ステップ１）Ｙ方向の変換
まず、Ｙ方向の変換を実行する。式５から分かるように、Ｙ方向の変換では、Ｘ座標j_xとＺ座標j_zが特定の値であり、Ｙ座標j_yが異なる複数の変換対象データに対してフーリエ変換が実行される。しかし、これらの複数のデータは同じプロセッサに割り当てられている。こうして、各プロセッサでは、式５にしたがって、プロセッサ間の通信を使用しないで、Ｘ座標j_xとＺ座標j_zが特定の値であり、波数空間のＹ座標ｋ_yがいろいろの値を有する複数の一次変換結果データが得られる。
（ステップ２）Ｘ方向の変換
次に、Ｘ方向の変換を実行する。式６から分かるように、Ｘ方向の変換では、Ｚ座標j_zと波数空間のＹ座標ｋ_yが特定の値であり、Ｘ座標j_xが異なる複数の一次変換結果データに対してフーリエ変換に類似の変換が実行される。しかし、これらの複数のデータは同じプロセッサでのＹ方向の変換によりすでに得られている。こうして、各プロセッサは、式６にしたがって、他のプロセッサとの通信をしないで、そのプロセッサに割り当てられた、３次元実空間のＺ座標j_zの特定の値に対する、３次元波数空間の座標（ｋ_x，ｋ_y）に関連するＮＸ＊ＮＹ個の２次変換結果データ（ｃ''_kx,ky,jz）（ｋ_x＝０〜ＮＸ−１、ｋ_y＝０〜ＮＹ−１）を得る。
（ステップ３）データの転置
式７によるＺ方向の変換を行うには、３次元波数空間の座標（ｋ_x，ｋ_y）の特定の値と、３次元実空間のＺ座標j_zの全ての値に対して得られた２次変換結果データ（ｃ''_kx,ky,jz）が必要である。そこで、各プロセッサに、３次元波数空間の座標ｋ_xの特定の値を割り当てて、Ｚ方向の変換を実行するに必要なデータをプロセッサ間で転送する処理が実行される。すなわち、各プロセッサへの座標ｋ_xの値の割り当てでは、プロセッサ０，１，２，，，７には、ｋ_x＝０，１，２，３，，，を順次割り当てる。このことは、図５に示すように、３次元波数空間を、そのｋ_x軸に垂直な平面で分割して、分割後の部分空間の各々を一つのプロセッサに割り当てることを意味する。
各プロセッサが、そのプロセッサに割り当てられた３次元波数空間の座標ｋ_xの特定の値と、３次元波数空間の座標ｋ_yの全ての値と、３次元実空間のＺ座標j_zの全ての値とに対して得られた全ての２次変換結果データ（ｃ''_kx,ky,jz）を使用してＺ方向の変換を実行できるように、全プロセッサ間で２次変換結果データの転送が行われる。このデータ転送は、データの転置あるいはデータの並び替えとも言われる。すなわち、各プロセッサは、３次元実空間のＺ座標j_zの全ての値と、３次元波数空間の座標（ｋ_x，ｋ_y）の全ての値との組に対して得られた２次変換結果データ（ｃ''_kx,ky,jz）（但し、ｋ_x＝特定値、ｋ_y＝０〜ＮＹ―１、ｋ_z＝０〜（ＮＺ―１））の内、自プロセッサが生成しなかったデータを他のプロセッサから受信するように、全プロセッサの間で２次変換結果データを転送する。
（ステップ４）Ｚ方向の変換
各プロセッサは、そのプロセッサに割り当てられた３次元波数空間の座標ｋ_xの特定の値と、３次元波数空間の他の二つの座標ｋ_y，ｋ_zの全ての値を有する最終的なフーリエ変換結果データｃ_kx,ky,kz（但し、ｋ_x＝特定値、ｋ_y＝０〜ＮＹ―１、ｋｚ＝０〜ＮＺ―１）を計算する。各プロセッサはこの計算を互いに並列に実行できる。
（ステップ５）データの転置
しかし以上の処理だけでは、変換対象データｆ_jと変換結果データｃ_kのデータ分割形式が異なり、実用上不便であるという問題がある。すなわち、変換対象データｆ_jは、図２に示すように、３次元実空間のデータに写像され、後者のデータは、図４にしたがって分割されて複数のプロセッサに割り当てられた。したがって、図６に示すように、変換対象データｆ_jの分割は、ｆ_jを第ＭＯＤ（j，ｐ）番のプロセッサが担当するサイクリック分割となる。一方、変換結果データｃ_kは、図３に示すように、３次元波数空間のデータに写像され、３次元波数空間は図５にしたがって分割されて複数のプロセッサに割り当てられた。したがって、図７に示すように、変換結果データｃ_kのデータ分割は、ＮＹ個の連続するデータを１台のプロセッサが担当するブロックサイクリック分割となる。
多くの応用では、変換対象データをフーリエ変換して得られる変換結果データに対してある処理を施し、その処理結果に対して再び逆フーリエ変換を行う。逆フーリエ変換はフーリエ変換のプログラムを流用して行われることが多い。すなわち、次の逆フーリエ変換の式
【数８】
ｆ_j=Σ_k=0 ^N-1ｃ_kexp(2πikj/N)
（ただし、ｊ=0,1,...,N-1）．．．（８）
を変形すると次式が得られる。
【数９】
ｆ_j=(Σ_k=0 ^N-1ｃ_k ^*exp(-2πikj/N))^*
（ただし、ｊ=0,1,...,N-1）．．．（９）
ここで、＊印は複素共役を示す。したがって、式１ａと９の比較より明らかなように、逆フーリエ変換は、フーリエ変換結果データの複素共役をフーリエ変換し、得られた結果データの複素共役を取ることに等価である。したがって、原理的には、逆フーリエ変換はフーリエ変換のプログラムを流用して実行できることが分かる。しかし、並列計算機でフーリエ変換を実行するときには、変換対象データと変換結果データとのデータ分割形式が異なると、いずれかのプロセッサに割り当てられた変換対象データに対する変換結果データがそのプロセッサに割り当てられていないことになり、そのプロセッサは、その変換結果データを他のプロセッサから受信しないとフーリエ変換のプログラムを流用して逆フーリエ変換を行うことができなくなる。
このような不便を避けるため、従来の転置アルゴリズムを使ったフーリエ変換プログラムでは、上記Ｚ方向の変換を実行した後に、３次元フーリエ変換結果データｃ_kx,ky,kzのプロセッサへの割り当てを変更し、再びプロセッサ間でフーリエ変換結果データｃ_kx,ky,kzの転置（入れ替え）を行い、フーリエ変換結果データｃ_kx,ky,kzのデータ分割をサイクリック分割に直すのが一般的であった。すなわち、図８に示すように、３次元波数空間を、Ｙ座標軸に垂直な平面で切断し、同じＹ座標値ｋ_yを有するフーリエ変換結果データｃ_kx,ky,kzを同一のプロセッサに割り当てる。具体的には、ｋ_y＝０，１，２，，，を有するフーリエ変換結果データｃ_kx,ky,kzを順次プロセッサ０，１，２，，，に割り当てる。各プロセッサが、この割り当てにしたがってそのプロセッサに割り当てられたフーリエ変換結果データｃ_kx,ky,kzの内、自プロセッサが生成しなかったデータを他のプロセッサから受信するように、全プロセッサの間でフーリエ変換結果データｃ_kx,ky,kzを転送する。こうして、サイクリック分割された３次元フーリエ変換結果データが得られる。
こうして、得られた最終的３次元フーリエ変換結果データｃ_kx,ky,kzから目的とする１次元フーリエ変換結果データｃ_kは式３より得ることができる。データｃ_kとその三次元座標ｋ_x，ｋ_y，ｋ_zとの関係は図３に示されたとおりである。
しかし、本発明者による検討によれば、従来のデータのデータ入れ替えのためのプロセッサ間での余分な通信は、並列化効率を低下し、フーリエ変換に必要な処理時間を増大する原因であることが判明した。
そこで本発明では、従来の転置アルゴリズムにおけるデータの分割方式を見直し、プロセッサ間のデータ転送量の削減の側面から最適なデータ分割方式を以下のようにして決定した。
従来の転置アルゴリズムは、Ｙ方向、Ｘ方向、Ｚ方向の各変換と、プロセッサ間でのデータの入れ替えを行う転置操作から構成される。そのアルゴリズムでは、Ｙ方向の変換を行う際に、変換対象データの空間をＺ軸に垂直な複数の平面で切り、各面を一つのプロセッサに割り当てて、次にＸ方向の変換を行う際には、その割り当てをそのまま使用し、さらにＺ方向の変換を行う際には、Ｘ軸に垂直な複数の平面で変換対象データの空間を切り、各面を一つのプロセッサに割り当てていた。これにより、その変換そのものは、プロセッサ間のデータ転送なしに実行できた。
しかし、たとえば最初のＹ方向の変換を行う際には、変換の対象となる同一のＸ座標とＺ座標を持つＮＹ個のデータが１台のプロセッサ上にありさえすれば、その変換そのものは、プロセッサ間のデータ転送なしに実行できる。したがって、この変換対象データの分割は、Ｚ軸に垂直な平面によってではなく、Ｘ軸に垂直な平面によって行ってもよい。このことは、Ｘ方向、Ｚ方向の変換についても言える。したがって、望ましいデータ分割方式が満たすべき第一の条件は、「ある方向の変換を行うときには、その変換の変換対象データをその方向以外の方向に垂直な複数の平面で分割してプロセッサへのデータ割り当てをする」というデータ分割形式が、Ｙ方向、Ｘ方向、Ｚ方向のすべてに採用されていることである。
今ひとつ考慮すべきことは転置のためのデータ転送回数である。たとえば、Ｙ方向、Ｘ方向、Ｚ方向の変換を行うとき、変換対象データをそれぞれＺ軸、Ｙ軸、Ｘ軸に垂直な複数の平面で切って分割するというデータ分割方式は、上記の第１の条件を満たすが、このデータ分割方式では、データ分割形式がＸ方向、Ｚ方向の変換を行うときという２回にわたって変更され、その変更の度に転置のためのデータ転送が必要になる。したがって、望ましいデータ分割方式が満たすべき条件として、上記の第１の条件に加えて、「データ分割形式の変更のための転置処理は一回に限る」という第二の条件を付加する。
これら２つの条件を満たすデータ分割方式を数え上げた結果を図９に示す。上記第１、第２の条件を満たすデータ分割方式は４通りあり、これらの内で、フーリエ変換対象データのデータ分割形式とフーリエ変換結果データのデータ分割形式が同一のデータ分割方式が求めるものである。
方式１が従来の転置アルゴリズムで採用されているものである。方式４は入力データがＸ方向に分割、出力データがＹ方向に分割だから、図３および図６と照らし合わせてみると、フーリエ変換対象データがブロックサイクリック分割、フーリエ変換結果データがサイクリック分割であり、方式１とはデータ分割形式がちょうど逆になってはいるものの、これらの二つの種類のデータの間でデータ分割形式が異なるという方式１と同様の欠点を抱えていることがわかる。
一方、方式２では、従来の転置アルゴリズムと同じく、フーリエ変換対象データがＺ方向に沿って分割され、Ｙ方向の変換、Ｘ方向の変換も従来の転置アルゴリズムと同じように実行されるが、Ｚ方向の変換は、従来の転置アルゴリズムと異なり、Ｘ方向の変換の結果データがＹ方向に沿って分割された後に実行される。Ｘ方向の変換とＺ方向の変換の間では、データ転置が必要である。図２および図３と照らし合わせてみると、フーリエ変換対象データもフーリエ変換結果データもサイクリック分割になることが分かる。したがって、方式２のデータ分割方式を採用すると、フーリエ変換係数の計算後にプロセッサ間でデータ転送を行わなくても、フーリエ変換対象データとフーリエ変換結果データとが同じデータ分割形式を保つ。
この方式２では、フーリエ変換は具体的には以下のようにして実行される。以下に記載するＹ方向、Ｘ方向、Ｚ方向の変換はＦＦＴのアルゴリズムにより計算される。
（ステップａ）Ｙ方向の変換
Ｙ方向の変換は、従来の転置アルゴリズムについて既に述べたステップ１の要領で実行される。既に述べたごとく、フーリエ変換対象データのデータ分割は、サイクリック分割である。
（ステップｂ）Ｘ方向の変換
さらに、Ｘ方向の変換は、Ｙ方向の変換の結果データに対して、従来の転置アルゴリズムについて既に述べたステップ２の要領でなされる。
（ステップｃ）データ転置
式７によるＺ方向の変換を行うには、３次元波数空間の座標（ｋx，ｋy）の特定の値と、３次元実空間のＺ座標jzの全ての値に対して得られた２次変換結果データ（ｃ’’kx, ky, jz）が必要である。方式２では、従来の転置アルゴリズムと異なり、Ｘ方向の変換で得られた２次変換結果データ（ｃ’’kx, ky, jz）は、Ｙ軸に垂直な複数の平面で分割される。このことは、図８に示すように、３次元波数空間を、そのｋy軸に垂直な複数の平面で分割して、分割後の部分空間（上記複数の平面）の各々を一つのプロセッサに割り当てることを意味する。すなわち、各プロセッサへ座標ｋyの特定の値を割り当てる。具体的には、ｋ y＝０，１，２，３，，，の２次変換結果データ（ｃ’’kx, ky, jz）を順次プロセッサ０，１，２，，，７に割り当てる。
この割り当てに従い、Ｚ方向の変換を実行するに必要なデータをプロセッサ間で転送する処理が実行される。各プロセッサが、そのプロセッサに割り当てられた３次元波数空間の座標ｋ_yの特定の値と、３次元波数空間の座標ｋ_xの全ての値と、３次元実空間のＺ座標j_zの全ての値とに対して得られた全ての２次変換結果データ（ｃ''_kx,ky,jz）を使用してＺ方向の変換を実行できるように、全プロセッサ間で２次変換結果データの転送が行われる。すなわち、各プロセッサは、３次元実空間のＺ座標j_zの全ての値と、３次元波数空間の座標ｋ_xの全ての値と、座標ｋ_yの特定の値との組に対して得られた２次変換結果データ（ｃ''_kx,ky,jz）（但し、ｋ_x＝０〜（ＮＸ―１）、ｋ_y＝特定値、ｋ_z＝０〜（ＮＺ―１））の内、自プロセッサが生成しなかったデータを他のプロセッサから受信するように、全プロセッサの間で２次変換結果データを転送する。
（ステップｄ）Ｚ方向の変換
各プロセッサは、そのプロセッサに割り当てられた３次元波数空間の座標ｋ_yの特定の値と、３次元波数空間の他の二つの座標ｋ_x、ｋ_zの全ての値を有する最終的なフーリエ変換結果データｃ_kx,ky,kz（但し、ｋ_x＝０〜ＮＸ―１、ｋ_y＝特定値、ｋ_z＝０〜ＮＺ―１）を計算する。各プロセッサはこの計算を互いに並列に実行できる。
この結果、座標ｋ_y＝０，１，２，３，，，に対応するフーリエ変換係数ｃ₀，ｃ₁，ｃ₂，，，が、順次プロセッサ０，１，２，，，で生成され、全フーリエ変換結果データｃ_kx,ky,kzは、プロセッサ間でサイクリックに分割されていることが分かる。
本実施の形態では、上記データ分割方式２を使用する。なお、データ分割方式３も後に詳細に説明するように、フーリエ変換対象データもフーリエ変換結果データもサイクリック分割になっている。したがって、この方式３も使用することができる。後に述べるように、実際にフーリエ変換ライブラリを構成する場合に、方式３は、方式２に比べて、ライブラリが生成する複素指数関数の値のテーブルのサイズが小さくてよいという利点を有する。
なお、計算機により入力データｆ_jに対して以上の変換を実施するときには、一般に配列データが使用される。すなわち、同じプロセッサでＹ方向の変換を施すべき一群のデータは、３次元配列に格納され、その配列に対してＹ方向の変換が実行される。その結果得られた１次変換結果データは、同じ配列あるいは他の３次元配列に格納されてもよい。他の方向の変換も先に実行された変換の結果データを格納する配列に対してなされる。また、プロセッサ間でのデータの転置も各プロセッサが生成した配列の内容を交換するようになされる。したがって、このようにそれらの変換において同じ３次元配列を使用するときには、その３次元配列の各次元のインデックスは、あるときには３次元実空間の各座標軸に対応し、他の時にはある方向の変換後の結果データが属する３次元空間の各座標軸に対応し、最終的にはフーリエ変換係数が属する３次元波数空間の各座標軸に対応することになる。しかし、このように同じ配列を異なる種類の一群のデータの格納に使用された場合でも、ある時点でその配列に格納されている一群のデータは、その一群のデータが属する３次元空間に属し、その配列の各インデックスは、その一群のデータが属する３次元空間のいずれかの座標軸を表すことには変わりはない。したがって、本発明を実施するにあたって一群のデータを格納するのに使用する配列の具体的な構造は特定のもの限定されない。さらに、以上の原理で説明したいくつかの３次元空間のいずれか一つに属する一群のデータを格納する配列の構造が、その３次元空間に直接対応しないものであっても、その配列に含まれた各データは、その３次元空間の座標を有すると見なすことができ、以上の原理説明がその配列に対してもあてはまるのは言うまでもない。
(3) 多次元フーリエ変換への応用
以上、１次元フーリエ変換のためのアルゴリズムを説明したが、本方式は多次元フーリエ変換の場合へも簡単に拡張できる。次式の２次元フーリエ変換を例に採って説明する。
【数１０】

この式は、次式１１に変形できる。この式１１は、更に次の２ステップからなる変換式１１ａ、１１ｂとして書くことができる。
【数１１】

したがって、２次元フーリエ変換は、まず式１１ａのようにＮ₁個のデータに対する１次元フーリエ変換をＮ₂組行い、次に式１１ｂのようにＮ₂個のデータに対する１次元フーリエ変換をＮ₁組行うことに帰着する。したがって、これらの１次元フーリエ変換において、本発明の方式を適用できる。
本発明の方式を用いて並列計算機上で２次元フーリエ変換を行うには、２次元データｆ_j1,j2に対し、添え字ｊ₁の方向（以下、これを第１方向と呼ぶ）にサイクリック分割を行う。すなわち、第i番目のプロセッサにｆ_m*NPU+i,j2（但し、m=0,1,...,((N1/NPU)-1)、j2=0,1,...,N2)番目の要素を割り当てる。ここで、ＮＰＵはプロセッサの台数である。
すると、式１１ａのステップは、ｊ₂が同じＮ₁個の要素の間での１次元フーリエ変換をＮ₂組行うことであり、このＮ₁個の要素はプロセッサ間にサイクリック分割されているから、このステップは本発明の方式による１次元フーリエ変換をＮ₂組行うことに帰着する。変換後のデータｃ'_k1,j2は、第１方向にサイクリック分割されている。
次に式１１ｂのステップでは、ｊ₁が同じＮ₂個の要素の間での１次元フーリエ変換をＮ₁組行うが、これらＮ₂個の要素は同一プロセッサ上にあるため、この変換は通信なしに各プロセッサごとに独立に行える。以上により２次元フーリエ変換が完了し、変換後のデータｃ_k1,k2は第１方向にサイクリック分割される。
なお、以上では２次元の場合を示したが、より次元の大きい場合も、第１方向にサイクリック分割を行い、第１方向の変換のみを本発明の方式を用いて行い、以下の変換はプロセッサごとに独立に行うことにより、本発明のフーリエ変換方式を適用可能である。
（４）並列高速フーリエ変換ライブラリ
図１に戻り、並列計算機２８上で使用される高速フーリエ変換ライブラリは、具体的には、たとえば以下のように構成される。但し、本発明を適用したフーリエ変換ライブラリは、これに限定されないことは言うまでもない。本ライブラリは、全てのプロセッサにロードされ、そのプロセッサ内のシミュレーションプログラムから必要に応じてサブルーチンとしてコールされる。
サブルーチン名称をＦＦＴ１Ｄとし、これを実行するには
CALL FFT1D (NX, NY, NZ, NPU, F, TB, IOPT, IER)
のように所定の引数を指定して、いずれかのプロセッサにロードされたすべてのシミュレーションプログラムから同時にコールする必要がある。ここで、Ｎ＝ＮＸ＊ＮＹ＊ＮＺはフーリエ変換対象データｆ_jの個数、ＮＰＵはプロセッサ台数、Ｆはライブラリのコール時はフーリエ変換対象データｆ_j、ライブラリからのリターン時はフーリエ変換結果データｃ_kを格納する配列、ＴＢは複素指数関数の値を格納するテーブル、ＩＯＰＴはサブルーチンの機能を指定する入力、ＩＥＲは実行時エラーが生じたか否かを示す出力である。
ここで、配列Ｆは各プロセッサがそれぞれ持つ部分配列である。フーリエ変換の原理説明で説明したように、全入力データ（フーリエ変換対象データ）ｆ_jは、図２のように３次元実空間に直方体状に配置され、各プロセッサには、この直方体の内、一つまたは複数の特定のＺ座標を有するＺ軸に垂直な平面に属する入力データが割り当てられる。この割り当てられた入力データが、上記引数Ｆで指定される部分配列に格納されている。すなわち、フーリエ変換対象データとフーリエ変換結果データは、ともにサイクリック分割されるので、第ｉ番目のプロセッサは、ｍ＊ＮＰＵ＋ｉ（ｍ＝０，１，．．．，Ｎ／ＮＰＵ−１）番目の要素のみを持つ。すなわち、第ｉ番目のプロセッサの配列Ｆには、Ｎ個の入力データ列ｆ_j（ｊ＝０〜Ｎ−１）の内、次式で示されるように、一群の入力データｆ_m*NPU+iを格納する。
Ｆ（ｍ）＝ｆ_m*NPU+i(m=0,1,...,N/NPU- 1)
したがって各プロセッサの持つ配列Ｆの大きさはＮ／ＮＰＵである。また、ＴＢは、第１回目のコールで計算した複素指数関数の値を格納しておくテーブルであり、２回目のコールからはここに格納した値を再利用することにより、新たな計算が不要となる。また、第１回目のコールではＩＯＰＴ＝１を指定し、このときは複素指数関数のテーブルを作成する。ＩＯＰＴ＝２は２回目以降のコールを意味し、このときは既にＴＢに格納されている値を用いる。
本ライブラリのフローチャートを図１０に示す。本ライブラリは、コールされると (ステップ４５) 、まず引数をチェックする（ステップ４６）。すなわち、Ｎ＝ＮＸ＊ＮＹ＊ＮＺとＮＰＵとが１以上の整数であるかどうか、ＩＯＰＴが１または２の値であるかどうかなど、引数の有効性を調べる。入力データに無効な値が入っていた場合は、ＩＥＲ＝１０００と設定して（処理４７）リターンする。
次に、他のプロセッサに本ライブラリがコールされたことを通知する（ステップ４８）。この通知は、実際には、そのライブラリがロードされている通信ライブラリに、他の全てのプロセッサに当該プロセッサでのライブラリコールの発生を通知することを要求し、その通信ライブラリが、その発生を他の全てのプロセッサに通知するメッセージを送信し、それぞれの他のプロセッサでは、そこにロードされた通信ライブラリが、このメッセージを受信して、そのプロセッサでロードされた本ライブラリに、送信元のプロセッサでの本ライブラリのコールを通知する。
次に、ライブラリが引数で指定した通りにＮＰＵ台のプロセッサでコールされているかどうかをチェックする（ステップ４９）。このチェックは、上に述べた他の全てのプロセッサから本ライブラリに対するライブラリコールが発生したとの通知を受信したか否かに基づいて行われる。この条件が満たされていない場合は、ＩＥＲ＝２０００と設定して（ステップ５０）リターンする。次にＩＯＰＴの値をチェックし（ステップ５１）、ＩＯＰＴ＝１の場合は、現在のコールが、最初のコールである。したがって、そのコール元のプロセッサでのフーリエ変換を実行するための準備を行う。具体的には、Ｘ、Ｙ、Ｚの方向の変換のために、そのプロセッサで式５，６，７で使用する複素指数関数の値を前もって計算し、複素指数関数のテーブルを生成し、配列ＴＢとして格納する（ステップ５５）。計算すべき複素指数関数の値は、そのプロセッサに対するデータの割り当てにより定まる。すなわち、Ｘ、Ｙ、Ｚの方向の変換の各々においてそのプロセッサが処理すべきデータの３次元実空間の座標ｊ_x，ｊ_y，ｊ_zと３次元実空間の座標ｋ_x，ｋ_y，ｋ_zとを決定し、この結果により、式５から７の複素指数の偏角が採りうるいろいろの値を決定し、それぞれの偏角に対する余弦関数の値と正弦関数の値を計算し、配列ＴＢに格納する。上記決定では、各方向での変換に使用されるデータ分割形式とそのコール元のプロセッサに予め割り当てられたプロセッサ番号と、式２、３が使用される。このプロセッサ番号は、シミュレーションプログラムのロード時に予め各プロセッサに並列計算機２８により指定されるものである。各方向での変換に使用されるデータ分割形式は、使用されるデータ分割方式、本実施の形態では前述の方式２、により定まる。
なお、ＩＯＰＴ＝１でない場合は、現在のコールが、２回目以降のコールである。このようなコールは、ライブラリのコール元のシミュレーションプログラムが、異なる物理量に対するフーリエ変換を行うようにプログラムされている場合において生じる。たとえば、シミュレーションプログラムが、第１の物理量に対するフーリエ変換のために本ライブラリをコールした後に、第２の物理量に対するフーリエ変換のために本ライブラリを再度コールした場合である。この場合、第２の物理量を表すフーリエ変換対象データも第１の物理量を表すフーリエ変換対象データと同じ添え字を有することが多い。この場合には、第２の物理量に対するフーリエ変換の実行時に、先に配列ＢＴに格納した複素指数の値が使用できる。したがって、ＩＯＰＴ＝１でない場合は、ステップ５５を実行しない。
次に、Ｙ方向の変換を行う（ステップ５６）。本実施の形態では、全プロセッサが持つフーリエ変換対象データを仮想的に図２のような各辺の長さが引数ＮＸ，ＮＹ，ＮＺの直方体状に並べ、図４に示すように、特定の座標を有するフーリエ変換対象データを同一のプロセッサに割り当てられる。このＹ方向の変換では、既にステップ１あるいはステップａとして述べたように、各プロセッサは、同じＸ座標とＺ座標とを持つＮＹ個のデータについて、高速フーリエ変換が式５に従い行う。このようなデータの組は全部でＮＸ＊ＮＺ組あるため、結局、ＮＸ＊ＮＺ個の独立なＮＹ次の高速フーリエ変換を行うことになる。プロセッサへのデータの割り当て方式より、各ＸＹ平面は１台のプロセッサに担当されているから、この変換処理は通信なしに各プロセッサで独立に行える。
本ライブラリの場合、本ライブラリにより各プロセッサが処理すべき変換対象データは、そのプロセッサで実行されるシミュレーションプログラムにより、引数Ｆで指定される配列として、そのプロセッサのメモリ（２６（図１））にコール前に格納されている。その配列Ｆの添え字と３次元実空間の座標ｊ_x，ｊ_y，ｊ_zとの関係は、データ分割形式により定まる。したがって、このＹ方向の変換では、この関係を使用して配列Ｆ内の変換対象データに対して式５で指定される変換を実行する。変換で得られた一次変換結果データはそのプロセッサのメモリに記憶される。
具体的には、各プロセッサでは、本ライブラリがコールされると、適当なタイミングで（たとえば、ステップ４６で入力データに無効な値が入っていないと判定された時）、各プロセッサは、データ格納用の３次元配列及び第１、第２、第３の３次元の作業配列をメモリ上に確保する。ここでは、データ格納用の３次元配列は、３次元のインデックスの長さが引数ＮＸ、ＮＹ、ＮＺに等しい。以下ではこれらの３次元の作業配列もデータ格納用の３次元配列と同じ大きさを有すると仮定する。しかし、これらの３次元の作業配列は、以下に説明するデータを格納できる大きさを有すればよく、したがって、これらの作業配列の大きさは適宜変更可能である。さらに、これらの３次元の作業配列の構造も、その使用目的に合致する限り、変更することができる。
データ格納用の３次元配列には、上記引数が指定する配列Ｆに含まれるデータ点列を以下のようにして格納できる。そのプロセッサに、図２のＺ軸に垂直な一つの平面が割り当てられているときには、その一つの平面に属するＮＸ＊ＮＹ個の入力データが、それぞれのデータのＸ、Ｙ、Ｚ座標に対応する、上記データ用３次元配列のインデックスを有する位置に格納される。そのプロセッサにＺ軸に垂直な複数の平面が割り当てられたときには、各面のデータは同様にして、上記データ用３次元配列の対応するインデックスの位置に格納される。
各プロセッサでは、Ｙ方向の変換はこのデータ格納用の配列を使用して、Ｚ座標が特定の値を有し、Ｙ座標とＸ座標がいろいろの値を有する一群の入力データに対して、式５により行なわれる。このとき、Ｚ座標が特定の値を有し、Ｘ座標が異なる一群の入力データに対してＹ方向の変換が高速フーリエ変換アルゴリズム（ＦＦＴ）を用いて実行される。この変換の実行にあっては、Ｘ座標が異なる一群のデータに対して、プロセッサ内のベクトル演算器（図示せず）が使用され、パイプライン的に計算が実行される。
その結果得られる１次変換結果データｃ’_jx,ky,jz（但し、ｊ_x＝０〜ＮＸ−１，ｋ_y＝０〜ＮＹ−１，ｊ_z＝特定値）は、第１の３次元の作業配列の、これらの座標値ｊ_x，ｋ_y，ｊ_zに対応するインデックスのところに格納される。したがって、図２の場合、各プロセッサでは、図２の一つの平面上の一群の入力データに対する１次変換結果データｃ’_jx,ky,jzが、第１の３次元作業配列の、特定の座標ｊ_zを有する平面上に格納される。もし、図２において、Ｚ軸に垂直な複数の平面に属する入力データがそのプロセッサに割り当てられているときには、それぞれのＺ面に対応する、上記第１の３次元作業配列内の、Ｚ軸に垂直な複数の平面のそれぞれに対応する１次変換結果データｃ'_jx,ky,jzが格納される。
Ｙ方向の変換の終了後、同様にしてＸ方向の変換を行う（ステップ５７）。すなわち、各プロセッサは、すでにステップ２あるいはステップｂとして述べたように、各プロセッサは、Ｙ方向の変換で得られた一次変換結果データに対して式６で指定される変換を実行する。変換で得られた２次変換結果データはそのプロセッサのメモリに記憶される。この変換処理も通信なしに各プロセッサで独立に行える。
具体的には、各プロセッサでは、Ｘ方向の変換は、Ｚ座標が特定の値を有し、Ｘ座標とｋ_y座標とがいろいろの値を有する一群の１次変換結果データｃ’_jx,ky,jzに対して、式６により行なわれる。このとき、Ｚ座標が特定の値を有し、ｋ_y座標が異なる一群の入力データに対してＸ方向の変換が高速フーリエ変換アルゴリズム（ＦＦＴ）を用いて実行される。この変換は、上記第１の３次元作業配列を使用して実行される。この変換の実行にあっては、ｋ_y座標が異なる一群のデータに対して、プロセッサ内のベクトル演算器が使用され、計算がパイプライン的に実行される。
その結果得られる２次変換結果データｃ''_kx,ky,jz（但し、k_x＝０〜ＮＸ−１，ｋ_y＝０〜ＮＹ−１，ｊ_z＝特定値）は、第２の３次元作業配列の、これらの座標値ｋ_x，ｋ_y，ｊ_zに対応するインデックスのところに格納される。したがって、図２の場合、各プロセッサでは、２次変換結果データｃ''_kx,ky,jzは、第２の３次元作業配列の、特定の座標ｊ_zを有する一つの平面に格納される。もし、図２において、Ｚ軸に垂直な複数の平面に属する入力データがそのプロセッサに割り当てられているときには、それぞれのＺ面に対応する、上記第２の３次元作業配列内の、Ｚ軸に垂直な複数の平面のそれぞれに対応する２次変換結果データｃ''_kx,ky,jzが格納される。
Ｘ方向の変換の終了後、プロセッサ間でのデータの転置（入れ替え）を行う。すなわち、今度は既にステップｃで述べたように、２次変換結果データの直方体を図８のようにＹ軸に垂直にスライスし、こうしてできる各面を一つのプロセッサに割り当てる（ステップ５８）。既にステップｃで述べたように、この割り当てに従い、各プロセッサが自分以外の全プロセッサとの間でそれぞれのプロセッサが生成した２次変換結果データの交換を行う。
具体的には、この転置時には、各プロセッサは、上記第２の作業配列に、そのプロセッサに割り当てられた座標ｋ_yの値を有するＹ軸に垂直な平面に属すべき、ｋ_yが特定値で、k_x，ｊ_zが種々の値を持つ２次変換結果データｃ''_kx,ky,jz（但し、k_x＝０〜ＮＸ−１，ｋ_y＝特定値，ｊ_z＝０〜ＮＺ−１）を受信するように、プロセッサ間で２次変換結果データｃ''_kx,ky,jzを交換する。
転置の終了後、Ｚ方向の変換を行う（ステップ５９）。すなわち、各プロセッサは、既にステップｄで記載したように、そのプロセッサに新たに割り当てられた２次変換結果データに対して、式７により指定される変換を実行し、最終的な３次元のフーリエ変換結果データを生成する。転置により各ＸＺ平面は１台のプロセッサに担当されているから、この変換処理も通信なしに各プロセッサで独立に行える。
具体的には、各プロセッサでは、Ｚ方向の変換は、ｋ_y座標が特定の値を有し、ｋ_x座標とＺ座標とがいろいろの値を有する一群の２次変換結果データｃ''_kx,ky,jzに対して、式７により行なわれる。このとき、ｋ_y座標が特定の値を有し、ｋ_x座標が異なる一群の入力データに対してＺ方向の変換が高速フーリエ変換アルゴリズム（ＦＦＴ）を用いて実行される。この変換は、上記第２の３次元作業配列を使用して実行される。この変換の実行にあっては、ｋ_z座標が異なる一群のデータに対して、プロセッサ内のベクトル演算器が使用され、計算がパイプライン的に実行される。
その結果得られる最終フーリエ変換結果データｃ_kx,ky,kz（但し、k_x＝０〜ＮＸ−１，ｋ_y＝特定値，ｋ_z＝０〜ＮＺ−１）は、第３の３次元作業配列の、これらの座標値ｋ_x，ｋ_y，ｋ_zに対応するインデックスのところに格納される。したがって、図８のように、一つのプロセッサに一つの座標ｋ_yを有す一つの平面が割り当てられた場合、各プロセッサでは、最終フーリエ変換結果データｃ_kx,ky,kzは、上記第３の３次元作業配列の、特定の座標ｋ_yを有する一つの平面に格納される。もし、図８において、ｋ_y軸に垂直な複数の平面がそのプロセッサに割り当てられているときには、それぞれの平面に対応する、上記第３の３次元作業配列内の、ｋ_y軸に垂直な複数の平面のそれぞれに対応する最終フーリエ変換結果データｃ_kx,ky,kzが格納される。
Ｚ方向の変換が終了すると、一次元の変換対象データｆ_jのフーリエ変換が終了し、重ね合わせの係数ｃ_kが求まる。ｃ_kの並び方は、原点からまずＹ方向に、Ｙ方向にＮＹ個行ったら次はＸ座標が１だけ増え、ＸＹ平面上にＮＸ＊ＮＹ個のデータが並んだら次はＺ座標が１だけ増える、という順で並ぶ（図３）。このデータの並び方と図８のデータの分割形式とを照らし合わせることにより、本実施の形態では、出力データｃ_kもサイクリック分割になっていることがわかる。上記第３の３次元作業配列内でも、最終フーリエ変換結果データｃ_kx,ky,kzはこの並びに対応する並びを有する。ライブラリはこのデータｃ_kx,ky,kzを一次元座標ｋの順に並び替えて一次元配列Ｆに格納し（ステップ６１）、リターンする（ステップ６２）。本ライブラリでは、従来法で必要であった変換後のデータ分割形式の変更が不要となり、通信の削減により従来法を上回る並列化効率を得ることが可能となり、フーリエ変換時間を低減できる。
なお、ＮＸ，ＮＹ，ＮＺの決め方としては、プロセッサ台数をｐとすると、Ｙ方向、Ｘ方向の変換でＺ方向に垂直な面でデータを分割することから、ＮＺ≧ｐが成り立つ必要がある。また、Ｚ方向の変換ではＹ方向に垂直な面でデータを分割することから、ＮＹ≧ｐも成り立つ必要がある。
また、並列計算機２８の各プロセッサ２９がベクトル演算器（図示せず）を備えると仮定した。このような並列計算機では、このベクトル演算器を効率的に使うためには、ベクトル化の対象となるループの長さ（すなわち、同じ演算を受けるデータ群（ベクトルデータ）の要素数であり、ベクトル長とも言われる）をできるだけ長く取る必要があることが知られている。本アルゴリズムで式５から７を計算するときには、このベクトル演算器が使用される。ベクトル化の対象となるループは、フーリエ変換にも並列化にも使わない座標軸の方向で複数のデータに対して同じ演算を実行する計算であり、Ｙ方向、Ｘ方向、Ｚ方向の変換において、それぞれＸ方向、Ｙ方向、Ｘ方向での演算となる。したがって、ベクトル演算器の性能を引き出すには、ＮＹ≧ｐ，ＮＺ≧ｐの２つの条件を満たしつつＮＸとＮＹをできるだけ大きく取るようにＮＸ，ＮＹ，ＮＺを決めることが望ましい。
なお、並列計算機２８は、ベクトル演算器を有すると仮定したいが、この演算器がフーリエ変換において必要な全ての演算の一部の演算をパイプライン的に実行できるものでもよい。さらに、並列計算機２８がベクトル演算器を有しない並列計算機であっても、ループ長を大きくすることが高速化に有効である場合が多い。また、以上の動作の説明では、並列計算機２８がメモリ２９と演算器（図示せず）の間に複数のベクトルレジスタを有しないと仮定し、各方向の変換で利用される配列はメモリ２９から直接演算器に読み出され、あるいはその変換で生成される配列はメモリ２９に直接演算器から書き込まれるかのように説明した。しかし、複数のベクトルレジスタを有する並列計算機では、メモリ２９上の配列に対する演算あるいはその演算の結果得られた配列のメモリ２９への格納は、これらのレジスタを介して実行させればよいことは当業者に明らかである。
本実施の形態では本ライブラリは逆フーリエ変換を実行するためのプログラムを有しない。後で説明するように、シミュレーションプログラムが逆フーリエ変換を必要とするときには、シミュレーションプログラムの方で、逆フーリエ変換の対象のデータの複素共役データを生成し、そのデータに対してフーリエ変換を本ライブラリに要求する。この複素共役データに対して得られたフーリエ変換データの複素共役をシミュレーションプログラムが生成する。
しかし、本ライブラリにフーリエ変換の機能を持たせることもできる。すなわち、本ライブラリの引数としてフーリエ変換か逆フーリエ変換かを指定する引数を追加し、シミュレーションプログラムが逆フーリエ変換を要求したときには、本ライブラリで、変換対象データの複素共役を求め、これにフーリエ変換を上記のようにして実行し、得られた結果データの複素共役を求め、それを逆フーリエ変換結果データとしてシミュレーションプログラムに戻せばよい。
（５）シミュレーションプログラム
本実施の形態において使用するシミュレーションプログラムの例として気象計算のための並列プログラムを図１１に示す。気象計算は本来３次元の計算であるが、現在は計算機能力の制約から２次元で行うことも多い。そこで本実施の形態では、２次元の気象予測対象とする領域（これが計算対象領域となる）の場合を例にとってシミュレーションプログラムを説明する。ユーザは、予め全計算対象領域を複数の部分計算領域に区分し、それぞれをいずれか一つのプロセッサに割り当てる。さらに、各部分計算領域のサイズＮ１，Ｎ２、フーリエ変換で用いるＮＸ，ＮＹ，ＮＺなどのパラメータを指定する。ユーザが本プログラムを並列計算機２８で使用するときには、まず、ワークステーション１が、並列計算機２８内の特定のプロセッサ（たとえばプロセッサ０）と交信して、このプログラムと上記ユーザ指定の情報と、空気の熱伝導率などの計算に使用する物質定数と、全計算対象領域に内の観測によって得られた温度・風速・圧力などの初期値データとを、その特定のプロセッサ０を介して外部記憶装置３１に記憶する。その後、その特定のプロセッサ０が、各プロセッサに本プログラムをロードし、全プロセッサで本プログラムを起動する。本プログラムは、並列計算機２８内の全プロセッサで全く同じようにして並行して実行される。
本プログラムでは、起動されると、まず計算領域のサイズＮ１，Ｎ２、フーリエ変換で用いるＮＸ，ＮＹ，ＮＺ、空気の熱伝導率などの物質定数などのパラメータと、観測によって得られた温度・風速・圧力などの初期値データとを外部記憶装置３１から入力する（ステップ３２）。本プログラムは、それがロードされたプロセッサがどの部分計算領域に関する計算を実行するかを判断するようにプログラムされていると仮定する。このステップでは、各プロセッサは、プロセッサに依らないで使用される上記パラメータを入力するとともに、外部記憶装置３１に記憶された全計算対象領域に対する初期値データの内、そのプロセッサに割り当てられた部分計算領域に関する初期値データを選択して外部記憶装置３１から入力する。
なお、上記（３）の「多次元フーリエ変換への応用」の項で述べたように、本発明の方式による２次元高速フーリエ変換では、入力データが第１の座標方向にサイクリック分割されている必要がある。すなわち、サイズＮ１×Ｎ２のメッシュ上で定義されたある物理量Ａ_j1,j2（ただしｊ₁＝０, １， ... ，Ｎ１−１，ｊ₂＝０, １， ... ，Ｎ２−１）のうち，要素Ａ_m*NPU+i,j2（ｍ=０,１,...,Ｎ１／ＮＰＵ−１，ｉ＝０，１，...，ＮＰＵ−１，ｊ₂＝０, １， ... ，Ｎ２−１)は第ｉ番目のプロセッサに割り当てられている必要がある。そこで，本実施の形態のシミュレーションプログラムでも，２次元高速フーリエ変換での入力形式に合わせて，このように第１の座標方向をサイクリック分割することによって得られる部分計算領域を用いる。
その後計算に必要な前処理を行う（ステップ３３）。ここで前処理とは、観測によって得られた温度・風速・圧力などのデータに対して補間を行い、計算に必要なメッシュポイントでの温度・風速・圧力などのデータを得ることである。
これらの処理が終わった後、以下に説明する繰り返しループにより各時間ステップでの温度・風速・圧力などの量を順々に求めていく。基礎となる方程式は、以下に示す風速に対する運動方程式、質量保存の式、温度変化を表す式の３本である。
【数１２】
ｄｕ／ｄｔ＝−２Ω×ｕ−（１／ρ）∇ｐ＋Ｆ_u ，．．．（１２）
【数１３】
ｄρ／ｄｔ＝−ρ∇・ｕ．．．（１３）
【数１４】
ｄＴ／ｄｔ＝−κ∇²Ｔ＋ｕ・∇Ｔ．．．（１４）
ここで、ｕは風速、ｐは圧力、Ｔは温度を表し、Ωはコリオリ力と呼ばれる地球の自転による力、Ｆ_uはそれ以外の外力、ρは空気の密度、κは空気の熱伝導率を表す。これらの式から次の時刻でのデータの値を求めるには、まずフーリエ変換により格子点上の温度Ｔ、圧力ｐおよび風速ｕをそれぞれ波数空間でのデータに変換する。そのために、それぞれの物理量についてのデータについて２次元高速フーリエ変換ライブラリＦＦＴ２Ｄを順次コールする（ステップ３４）。ライブラリＦＦＴ２Ｄのコール時には、既に述べた引数を指定する。
波数空間でそれぞれの物理量のデータを微分する（ステップ３５）。すなわち、ライブラリＦＦＴ２Ｄから与えられる、各物理量に関するフーリエ変換係数データを波数空間で微分し、その物理量についての、波数空間の格子点上での温度勾配∇Ｔ、２次微分∇²Ｔ、圧力勾配∇ｐ、風速の発散∇・ｕ等の微分に関連するデータを求める。
各物理量についての上記微分に関連するデータを逆フーリエ変換して、実空間の格子点上での温度勾配∇Ｔ、２次微分∇²Ｔ、圧力勾配∇ｐ、風速の発散∇・ｕ等の微分に関連するデータを求める（ステップ３６）。逆フーリエ変換するには、すでに述べた式８、９を使用する。すなわち、各物理量についての上記微分後のデータの複素共役なデータを生成し、ライブラリＦＦＴ２Ｄをコールしてこの複素共役なデータに対するフーリエ変換を要求する。さらに、得られたフーリエ変換結果データの複素共役なデータを生成し、この生成された複素共役なフーリエ変換結果データを逆フーリエ変換結果データとして使用する。
この後、上記逆フーリエ変換で得られた微分に関連するデータを、式１２−１４の右辺に代入し、風速ｕ、空気密度ρ、温度Ｔのそれぞれに関する時間微分を決定し、得られたそれらの時間微分を用いて、次の時間ステップでの温度・風速・圧力を求める（ステップ３７）。
なお、ステップ３４，３５，３６で、フーリエ変換により実空間上の格子点のデータを波数空間上のデータに変換してそこで微分関連データを求め、得られた微分関連データを逆フーリエ変換して実空間に関する微分関連データを得るのは、その方が微分が精度良く計算できるからであり、本シミュレーションプログラムでは、この計算部分で２次元フーリエ変換ライブラリＦＦＴ２Ｄを用いる。
上記のループでは、各時間ステップ毎に、求める時刻までの計算が終了したかどうかを判定し（ステップ３８）、終了したら、後処理を行い（ステップ３９）、予測結果データとして出力する（ステップ４０）。後処理では、主に計算を行うメッシュポイントと予測結果データが必要な点とがずれている場合に、計算結果データを補間して必要な点での予測値を計算するなどの処理を行う。出力処理４０では、各プロセッサは、生成したデータを外部記憶装置３１に書き戻し、上記特定のプロセッサはシミュレーションプログラムの実行終了時に、このデータをワークステーション１に一つの結果データとして転送する。
なお、以上ではシミュレーションプログラムは、並列計算機２８内の全プロセッサで全く同じようにして並行して実行されると仮定した。さらに、それがロードされたプロセッサがどの部分計算領域に関する計算を実行するかを判断するようにプログラムされていると仮定する。しかし、本発明に依るフーリエ変換を利用するプログラムはこのようなプログラムに限定されないことはいうまでもない。上記フーリエ変換ライブラリＦＦＴ２Ｄを使用するには、それぞれのライブラリが要求する上記複数の引数を指定することが必要であり、それらの引数の生成あるいは獲得は他の方法でも良い。たとえば、本プログラムは、並列計算機２８内の特定の一つのプロセッサで実行される単独処理部分と全プロセッサで並行して実行される並列処理部分とから構成されてもよい。たとえば、本プログラムがいずれかのプロセッサで起動されたときに、そのプロセッサが上記特定の一つのプロセッサであるときにその単独処理部分が実行され、そうでないときには上記並列処理部分のみが実行される。上記単独処理部分では、各プロセッサが担当する部分計算領域を判断し、その結果を他のプロセッサに通知するように構成できる。
上記の例では気象予測計算を行う場合を例にとって説明したが、本発明の手法は、これ以外の応用例についても、並列計算機上で高速フーリエ変換を用いてシミュレーションを行う場合に適用できることは明らかである。一次元フーリエ変換ライブラリＦＦＴ１Ｄについても全く同様である。
＜発明の実施の形態２＞
上記の実施の形態１では、フーリエ変換ライブラリは、図９の方式２を採用した。しかし、本実施の形態では、フーリエ変換ライブラリは、図９の方式３を採用する。この方式３では、Ｙ方向の変換を行った後で転置を行い、その後Ｘ方向とＺ方向の変換を行う。
この方式２では、フーリエ変換は具体的には以下のようにして実行される。
（ステップａ’）Ｙ方向の変換
Ｙ方向の変換は、従来の転置アルゴリズムについて既に述べたステップ１あるいはａ’の要領で実行される。既に述べたごとく、フーリエ変換対象データのデータ分割は、サイクリック分割である。
（ステップｂ’）データ転置
式６によるＸ方向の変換を行うには、３次元波数空間の座標ｋ_yの特定の値と、３次元実空間の座標（j_x，j_z）の全ての値に対して得られた１次変換結果データｃ'_jx,ky,jzが必要である。方式３では、方式２と異なり、Ｙ方向の変換で得られた１次変換結果データｃ'_jx,ky,jzは、Ｙ軸に垂直な複数の平面で分割される。このことは、図８に示すように、３次元波数空間を、そのｋ_y軸に垂直な複数の平面で分割して、分割後の部分空間（上記複数の平面）の各々を一つのプロセッサに割り当てることを意味する。すなわち、各プロセッサへの座標ｋ_yの特定の値を割り当てる。具体的には、ｋ_y＝０，１，２，３，，，の１変換結果データｃ'_jx,ky,jzを順次プロセッサ０，１，２，，，７に割り当てる。
この割り当てに従い、後にＸ方向の変換を実行するに必要なデータをプロセッサ間で転送する処理がここのステップｂ’で実行される。各プロセッサが、そのプロセッサに割り当てられた３次元波数空間の座標ｋ_yの特定の値と、３次元実空間の座標（j_x，j_z）の全ての値とに対して得られた全ての１次変換結果データｃ'_jx,ky,jzを使用してＸ方向の変換を実行できるように、全プロセッサ間で１次変換結果データの転送が行われる。すなわち、各プロセッサは、３次元実空間の座標（j_x，j_z）の全て値と、３次元波数空間の座標ｋ_yの特定の値との組に対して得られた１次変換結果データｃ'_jx,ky,jz（但し、ｊ_x＝０〜（ＮＸ―１）、ｊ_z＝０〜（ＮＺ―１）、ｋ_y＝特定値）の内、自プロセッサが生成しなかったデータを他のプロセッサから受信するように、全プロセッサの間で２次変換結果データを転送する。
（ステップｃ’）Ｘ方向の変換
次に、Ｘ方向の変換を実行する。各プロセッサは、式６により、他のプロセッサとの通信をしないで、そのプロセッサに割り当てられた３次元波数空間の座標ｋ_yの特定の値と、３次元波数空間の座標ｋ_xの全ての値と、３次元実空間の座標ｊ_zの全ての値に関連するＮＸ＊ＮＺ個の２次変換結果データ（ｃ''_kx,ky,jz）（ｋ_x＝０〜ＮＸ−１、ｊ_z＝０〜ＮＺ−１）を得る。
（ステップｄ’）Ｚ方向の変換
各プロセッサは、そのプロセッサに割り当てられた３次元波数空間の座標ｋ_yの特定の値と、３次元波数空間の他の二つの座標ｋ_x、ｋ_zの全ての値を有する最終的なフーリエ変換結果データｃ_kx,ky,kz（但し、ｋ_x＝０〜ＮＸ―１、ｋ_y＝特定値、ｋ_z＝０〜ＮＺ―１）を計算する。各プロセッサはこの計算を互いに並列に実行できる。
この結果、座標ｋ_y＝０，１，２，３，，，に対応するフーリエ変換係数ｃ₀，ｃ₁，ｃ₂，，，が、順次プロセッサ０，１，２，，，で生成され、全フーリエ変換結果データｃ_kx,ky,kzは、プロセッサ間でサイクリックに分割されていることが分かる。
本方式は、実施の形態１で使用した方式２に比べ、複素指数関数テーブルを格納する配列ＢＴの容量の点で有利となる。実際、式６により、Ｘ方向の変換における複素指数関数の値はＸ方向、Ｙ方向のインデックスのみに依存し、Ｚ方向のインデックスには依存しない。したがって、実施の形態１のようにＸ方向の変換においてＺ軸に垂直な分割を採用した場合には、各プロセッサが同じテーブルを重複して持つことになる。それに対して本方式では、分割をＹ軸に垂直な面で行うので、各プロセッサが自分の計算に必要なテーブルの一部分のみを持つことになり、重複はない。これにより、本方式ではＸ方向の変換に必要なテーブルの大きさが１／（プロセッサ台数）に削減できる。
本実施の形態では、ベクトル化の対象となるループはＹ方向、Ｘ方向、Ｚ方向の変換において、それぞれＸ方向、Ｚ方向、Ｘ方向となるので、ベクトル並列計算機の性能を引き出すには、ＮＹ≧ｐ，ＮＺ≧ｐの２つの条件を満たしつつＮＸとＮＺをできるだけ大きく取るのがよい。
＜発明の実施の形態３＞
本実施の形態において対象となる並列計算機は、実施の形態１で説明した図１の並列計算機システムとほぼ同様であるが、各プロセッサは同一のベクトル演算器を内蔵し、かつ、外部記憶装置中３２に、そのベクトル演算器の性能に関するデータベースを持つ。ベクトル演算器の演算性能とはたとえば単位時間あたりに実行可能な演算回数である。ベクトル演算器性能データベース中には、ベクトル演算器の性能データがループ長Ｌの関数（ｇ（Ｌ））の形で格納されている。
実施の形態１では、フーリエ変換のためのパラメータＮＸ，ＮＹ，ＮＺはプログラムへの入力として決定していたが、これを並列計算機２８の各々のプロセッサを構成するベクトル演算器の特性に応じて最適化することにより、さらに効率的な計算が可能となる。一般に、ベクトル演算器の演算性能は、ループ長Ｌの関数ｇ（Ｌ）である。ｇ（Ｌ）は通常、Ｌに対して単調に増加する関数である。いま、実施の形態１のフーリエ変換方式でＹ方向の変換を計算するステップの演算量を考えると、ＮＹ次のフーリエ変換を一回行うための演算量はＮＹｌｏｇＮＹであり、これをＮＸ＊ＮＺ組計算するから、全体での演算量はＮＸ＊ＮＹ＊ＮＺｌｏｇＮＹ＝ＮｌｏｇＮＹである。同様にして、ＮＸ方向、ＮＺ方向での演算量は、それぞれＮｌｏｇＮＸ、ＮｌｏｇＮＺである。一方、それぞれの演算におけるベクトル化のループ長は、実施の形態１で述べたようにＮＸ，ＮＹ，ＮＸであるから、ベクトル演算器の演算性能はそれぞれｇ（ＮＸ），ｇ（ＮＹ），ｇ（ＮＸ）となる。演算時間ｔは演算量を演算性能で割ることによって得られ、合計で
t=NlogNY/g(NX)+NlogNX/g(NY)+NlogNZ/g(NX)
となる。したがって、プロセッサ台数をｐとするとき、ＮＸ≧ｐ，ＮＺ≧ｐという条件の下でｔを最小化するようにＮＸ，ＮＹ，ＮＺを決めることにより、ベクトル演算器の性能を最大限に発揮できる高速フーリエ変換が実現できる。
本実施の形態でのライブラリのフローチャートを図１２に示す。処理は、ＮＸ，ＮＹ，ＮＺの決定部分（ステップ４３）を除いては、実施の形態１（図１０）と同様である。ステップ４３では、上記ベクトル演算器性能データベースを用いて、ＮＸ≧ｐ，ＮＺ≧ｐという条件の下で上記演算時間ｔを最小化するようにＮＸ，ＮＹ，ＮＺを決める。その後の処理は、実施の形態１と同様である。このライブラリへのコール文ではシミュレーションプログラムはこれらのパラメータＮＸ，ＮＹ，ＮＺを指定する必要はない。本実施の形態の方式によれば、ユーザは自分でＮＸ，ＮＹ，ＮＺを計算することなく、ベクトル演算器を内蔵する並列計算機の性能を最大限に引き出すことが可能となる。
なお、ＮとＮＰＵとが一般の整数の場合には、ＮＸ，ＮＹ，ＮＺを変えることにより、入力データのプロセッサへの分割形式も変更する必要があるが、フーリエ変換でもっともよく利用される、ＮおよびＮＰＵが共に２のべき乗の場合には、ＮＸ，ＮＹ，ＮＺを変えても、分割形式を変更する必要がない場合がある。
実際、２つの組（ＮＸ，ＮＹ，ＮＺ）＝（ＮＸ１，ＮＹ１，ＮＺ１）、（ＮＸ２，ＮＹ２，ＮＺ２）が共にＮＹ≧ＮＰＵ，ＮＺ≧ＮＰＵの２つの条件を満たしているとき、入力データｆ_jを図２に示す順番で直方体状に並べ、これをＺ軸に垂直な面でスライスして、各面をサイクリックにプロセッサ０，１，．．．，ＮＰＵ−１に割り当てたとする。すると、（ＮＸ，ＮＹ，ＮＺ）＝（ＮＸ１，ＮＹ１，ＮＺ１）の場合は、ｆ_jの属する面は上からＭＯＤ（ｆ_j，ＮＺ１）＋１番目であり、この面を担当するプロセッサの番号は
MOD(MOD(ｆ_j,NZ1),NPU)
である。一方、（ＮＸ，ＮＹ，ＮＺ）＝（ＮＸ２，ＮＹ２，ＮＺ２）の場合も同様にして、ｆ_jを担当するプロセッサの番号は
MOD(MOD(ｆ_j,NZ2),NPU)
となる。ところが、いまＮＺ１≧ＮＰＵ，ＮＺ２≧ＮＰＵであり、ＮＺ１，ＮＺ２，ＮＰＵはすべて２のべき乗であるから、ＮＺ１，ＮＺ２は共にＮＰＵの倍数である、したがって、

すなわち、ｆ_jを担当するプロセッサの番号は、どちらの場合も同じである。
以上の考察より、ＮおよびＮＰＵが共に２のべき乗で、ＮＹ≧ＮＰＵ，ＮＺ≧ＮＰＵの２つの条件が成り立っている限り、ＮＸ，ＮＹ，ＮＺを変えても、入力データｆ_jのプロセッサへの分割形式は変更する必要がないことがわかる。
このことを利用し、分割形式を変えずに済む範囲でＮＸ，ＮＹ，ＮＺの最適化を行えば、分割形式変更に伴う新たな通信オーバーヘッドを生じることなく、ベクトル演算器を含む並列計算機での処理速度、具体的には、フーリエ変換速度を向上させることができる。
＜発明の実施の形態４＞
本発明による高速フーリエ変換を用いてシミュレーションを行う他の例として、半導体デバイス等における電子構造計算を説明する。電子構造計算は、その結果を利用して半導体デバイスの設計、とくにデバイス構造の決定に使用されている。
電子構造計算では、２次元または３次元のメッシュで定義された電子の波動関数ｕ（ｒ）を、次のシュレディンガー方程式
【数１５】

に従って計算することにより、半導体の性質を決定するバンドギャップの大きさや、結晶の構造安定性などを求める。ただし、上式で、ｈはプランク定数、ｍは電子の質量、Ｅは対象とする波動関数のエネルギーレベル、Ｖは結晶中の原子や他の電子によるポテンシャルエネルギーを表す。
式１５の計算では、波動関数ｕ（ｒ）の２次微分∇²ｕ（ｒ）が必要であるが、気象計算の例において述べたのと同様な理由により、この部分はｕ（ｒ）をフーリエ変換により波数空間に移してから計算し、結果を逆フーリエ変換で再び実空間に戻す。したがって、並列計算機上で電子構造計算を行う場合には、この部分で本発明の高速フーリエ変換方法が適用できる。
＜変形例＞
本発明は、以上の実施の形態に限定されるのではなく、以下に例示する変更あるいは変形以外のいろいろの変更あるいは変形により実現可能である。
（１）本発明によるフーリエ変換方法は、シミュレーションに限らず他の用途にも使用できるのは言うまでもない。たとえば、伝送される信号あるいは地震波等の波動の解析に利用でき、解析の結果を用いて、信号伝送に関係する装置、例えば伝送装置あるいは伝送線路の設計を行うことができ、あるいは地震を利用した応用、例えば資源開発等にも利用できる。
（２）以上の実施の形態では、フーリエ変換変換はそのために用意されたフーリエ変換ライブラリにより実行された。しかし、本発明は、フーリエ変換を使用するアプリケーションプログラム自身にこのフーリエ変換手順を実行するプログラムを組み込んでもよいことは明らかである。このようなシミュレーションプログラムは。プログラムを磁気記憶装置のようなプログラム記録媒体に記憶して販売できる。
（３）本発明は、フーリエ変換対象データｆ_jが実データであるときにも適用できる。その場合に、Ｙ方向等の変換のときに、係数の計算においては、虚数部の計算を省略することができる。
以上から明らかなように、本発明によれば、並列計算機を使用してフーリエ変換を従来より高速に実行できる。たとえば本出願人により開発された並列計算機ＳＲ２２０１を用いて本発明の効果を評価した結果では以下の通りである。３次元フーリエ変換を実行する場合、従来法では、２５６×２５６×２５６のサイズのデータを２５６台のプロセッサを用いて変換するのに、約０．２６秒の時間が必要である。この内訳は、計算に０．１４秒、途中でのデータの転置に０．０６秒、最後のデータの並べ替えに０．０６秒の時間がかかる。実施の形態１あるいは２に記載の方法によれば、計算と転置の時間は従来法と同じであり、最後のデータの並べ替えが省略できるので、０．２０秒でフーリエ変換を行うことができ、約２４％の高速化が達成できる。とくに、実施の形態１で記載した気象計算を、３次元フーリエ変換を用いて行う場合、気象計算では全計算時間の約５０％がフーリエ変換で占められるため、約１２％の高速化が得られる。また、実施の形態４で記載した電子構造計算を３次元フーリエ変換を用いて行う場合、通常全実行時間の３０％程度がフーリエ変換で占められるため、約７％の高速化が達成できる。
【発明の効果】
以上説明したように、本発明によれば、並列計算機上でのフーリエ変換を高速に実行できる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態で使用する並列計算機の概略構成図。
【図２】本発明の第１の実施の形態で使用する一次元変換対象データの３次元データへの写像を説明する図。
【図３】本発明の第１の実施の形態で使用する一次元変換結果データの３次元データへの写像を説明する図。
【図４】本発明の第１の実施の形態で使用する、プロセッサへデータを割り当てる第１の方法を示す図。
【図５】従来技術で使用する、プロセッサへデータを割り当てる他の方法を示す図。
【図６】本発明の第１の実施の形態で使用する、一次元変換対象データのプロセッサ間データ分割形式を説明する図。
【図７】従来技術による、一次元変換結果データのプロセッサ間データ分割形式を説明する図。
【図８】本発明の第１の実施の形態で使用する、プロセッサへデータを割り当てる第２の方法を示す図。
【図９】本発明に至る前に比較検討した、複数のフーリエ変換変換手順を示す図。
【図１０】本発明の実施の形態１で使用するフーリエ変換ライブラリのフローチャート。
【図１１】本発明の実施の形態１で使用するシミュレーションプログラムのフローチャート。
【図１２】本発明の実施の形態３で使用するフーリエ変換ライブラリのフローチャート。BACKGROUND OF THE INVENTION
The present invention relates to a Fourier transform method suitable for execution on a computer having a plurality of processors, and more particularly to a Fourier transform method suitable for execution on a vector parallel computer composed of a plurality of processors incorporating a vector calculator. .
[Prior art]
One of the processes frequently used in scientific and engineering calculations is Fourier transform. The Fourier transform is used for simulation of physical phenomena and the like. The Fourier transform is a process of representing a function f (x) taking a complex value defined in a certain real interval as a superposition of a complex exponential function exp (ikx). When it is realized on a computer, Since the number is finite, a complex point sequence f₀, F₁,. . . , F_N-1Is expressed as a superposition of N complex exponential functions exp (2πikj / N) (where k = 0, 1,..., N−1, i is an imaginary unit, and π is a pi). . That is, f₀, F₁,. . . , F_N-1Is given by Equation 1a, the superposition coefficient c₀, C₁,. . . , C_N-1It is the Fourier transform that finds. Each point f_jThe value of is expressed by Equation 1b using these coefficients.
[Expression 1]
c_k= (1 / N) Σ_{j = 0} ^N-1f_jexp (-2πikj / N)
(However, k = 0,1, ..., N-1) (1a)
f_j= Σ_{k = 0} ^N-1c_kexp (2πikj / N)
(However, j = 0,1, ..., N-1) (1b)
However, if the calculation is performed based on this definition, the number of expressions is N, and each expression is composed of N terms. Therefore, in addition to the calculation of the complex exponential function exp (−2πikj / N), addition of complex numbers And multiplication is about N²Is required. Therefore, in practice, a technique called fast Fourier transform in which the amount of calculation is reduced to the order of about NlogN by means of an algorithm is widely used.
Conventionally, two methods called transpose algorithm and binary exchange have been proposed for efficient fast Fourier transform on parallel computers (for example, V. Kumar, A. Grama, A. Gupta and G Karypis: "Introduction to Parallel Computing, The Benjamin / Cummings Publishing Company, 1994." The former is a method in which communication between processors is centralized in one place in the middle of calculation, and the latter is performed while communication is performed between processors. If the number of processors is p, the number of communications is p-1 for the former and log for the latter.₂p, the amount of data sent per communication is N / p²The latter is N / 2p. The latter has the advantage that less communication time is required compared to the former, so there is an advantage that less communication time is required for small-scale problems where the communication setup time is dominant, but the total amount of data to be communicated increases. In the case of large-scale data, the former is advantageous.
Scientific and technological calculations such as semiconductor device characteristic calculations, electronic state calculations, and weather forecast calculations require large-scale simulations that handle tens of thousands to millions of variables. As a means for handling such a large-scale problem, a parallel computer is promising. A parallel computer is a system in which dozens to tens of thousands of high-speed processors are connected via a network. Compared to conventional sequential computers, the number of processors can increase peak performance as much as possible. Have. Furthermore, in recent parallel computers, an arithmetic unit is often configured so that each processor can execute the same operation on a series of data at high speed. In particular, a vector parallel computer has been developed in which each processor has a vector operation unit that executes the same operation on a plurality of data in a pipeline manner as such an operation unit. Some vector parallel computers can execute a vector instruction that specifies an operation by the vector calculator. Further, there is a parallel computer in which a plurality of vector registers are provided between the memory and the vector calculator. These vector registers reduce the time that the transfer time of data between the memory and the arithmetic unit has on the processing time. Simulation can be executed at higher speed. Strictly speaking, it is not a vector parallel computer, but as a parallel computer similar to a vector parallel computer, even if the operation unit that performs a certain operation is not a vector operation unit, the operation is performed on a series of data at high speed. Many parallel computers use an arithmetic unit configured to be able to do so.
The Fourier transform is one of the most frequently used processes in scientific and technical computation, and recently, it is often provided as a library for parallel computers. For example, see “Product Program HI-UX / MPP Matrix Calculation Subprogram Library MATRIX / MPP” edited by Hitachi, Ltd.
When a large-scale simulation executed on a parallel computer performs Fourier transform, the above-described transposition algorithm is often used. As described above, when the transposition algorithm is executed by a vector type parallel computer or a parallel computer similar thereto, the point sequence data to be converted is arranged in a three-dimensional space in a rectangular parallelepiped shape. On the other hand, for example, the Y direction conversion, the X direction conversion, and the Z direction conversion are sequentially performed, thereby obtaining the same result as that obtained by performing the fast Fourier transform on all the data point sequences. More specifically, one-dimensional data f subject to Fourier transform₀, F₁,. . . , F_N-1Are arranged in a rectangular parallelepiped shape with the length of each side being NX, NY, NZ. Here, NX, NY, and NZ are integers that satisfy NX * NY * NZ = N. When arranging the data in a rectangular parallelepiped shape, arrange the data in the Z direction from the origin, for example, and after arranging the NZ data, arrange the data by increasing the X coordinate by 1, and repeat this to repeat NX * NZ After the data is arranged, the next operation is to increase the Y coordinate by 1 and arrange the data.
After arranging the data in this way, the rectangular parallelepiped is sliced perpendicularly to the Z axis, and each surface thus formed is assigned to one processor of the parallel computer. Next, conversion in the Y direction is performed. Input data f to the processor_jSince each XY plane is assigned to one processor by this allocation method, this conversion process can be performed independently by each processor without communication. Next, in the same way, each processor independently converts the result data of the Y direction in the X direction. After the conversion in the X direction is completed, the result data of the conversion in the X direction between the processors is exchanged. This time, the rectangular parallelepiped formed by the result data is sliced perpendicularly to the X axis, and each surface thus formed is converted into one processor. Assign to. This process is called transposition, and each processor needs to exchange data with all other processors. After the transposition is completed, each processor performs conversion in the Z direction independently. Thus, the one-dimensional input data f arranged in a rectangular parallelepiped shape_jOutput data c representing the coefficient of superposition arranged in a rectangular parallelepiped shape_kIs obtained. Output data c_kAre arranged in the Y direction first from the origin, and then the number of X in the Y direction is increased, then the X coordinate is incremented by 1. Next, when NX * NY pieces of data are arranged on the XY plane, the Z coordinate is incremented by 1. Line up with.
In the above transposition algorithm, the input data f_jIs divided into f_jIs divided among the processors in such a manner that the processor of the MOD (j, p) number is in charge. This data division format is called cyclic division. The data division format also represents the order of allocation of data to processors, and in this specification, the data division format may be referred to as an allocation order or an allocation mode. On the other hand, output data c_kIs divided into block cyclic division in which NY continuous data is handled by one processor, and the input data f_jAnd the data division format between processors is different.
In the above transposition algorithm, the input data f_jAs can be seen from the way of arranging and dividing the data into processors, the division format of the input data is f_jIs divided by the MOD (j, p) -th processor. On the other hand, how to divide the transposed data into processors, and output data c obtained by conversion_kAs can be seen from the arrangement of the output data, the output data division format is block cyclic division in which one processor is in charge of NY continuous data.
However, in many applications, the Fourier transform and inverse Fourier transform are used in pairs, and the inverse Fourier transform is carried out using a Fourier transform program. Therefore, the Fourier transform input data and output data are the same data division format (data allocation). It is more convenient if it is in order. Therefore, in the conventional fast Fourier transform method, the output data c of the block cyclic division according to the above processing._kData is transferred again between the processors, and the data c_kNeed to be converted to cyclic division and output.
[Problems to be solved by the invention]
As a result of the study by the present inventors, in the above conventional Fourier transform method, data transfer between processors for changing the data division format (data allocation order) performed after the calculation of Fourier transform coefficients shortens the Fourier transform time. I found out that
Accordingly, an object of the present invention is to provide a data division format (data allocation order) in which the Fourier transform result data is the same as the data to be subjected to Fourier transformation without performing data transfer to change the data division format after the calculation of Fourier transform coefficients. It is to provide a Fourier transform method that can have
[Means for Solving the Problems]
In order to achieve the above object, the Fourier transform method according to the present invention comprises:
Each processor sequentially executes the first conversion process, the second conversion process, and the third conversion process,
A group of result data obtained as a result of executing one of the first and second conversion processes by each of the plurality of processors and executing one of the conversion processes by the plurality of processors. Exchanging the group of result data among the plurality of processors such that the plurality of result data subgroups constituting the group are assigned to different processors.
The first to third conversion processes are defined so that a plurality of Fourier transform coefficient data subgroups constituting a group of ordered Fourier transform coefficient data for the group of ordered conversion target data are generated by different processors. Each processor is pre-assigned to the processor one conversion target data portion group of the plurality of conversion target data portion groups constituting the group of conversion target data,
The result assigned to each processor in the step of exchanging so that the order of the processors that generated each of the group of Fourier transform coefficient data is the same as the order of the processors to which each of the group of transformation target data is assigned. Data subgroups are defined.
More specifically, the first, second, and third conversion processes are conversion processes related to the first, second, and third coordinate axes of the three-dimensional data space, respectively.
Each of the conversion target data groups has one coordinate of a lattice point group located in the rectangular parallelepiped shape of the three-dimensional data space,
The plurality of conversion target data subgroups are all conversion targets having the same coordinate value of the third coordinate axis in the three-dimensional data space and different coordinate values of the first and second coordinate axes in the three-dimensional data space. The data is determined to be included in the same conversion target data subgroup,
Each of the Fourier transform coefficient data groups has one coordinate of a grid point group located in a rectangular parallelepiped shape in the three-dimensional coefficient space,
The plurality of Fourier transform coefficient data subgroups have the same coordinate value of the first coordinate axis of the three-dimensional coefficient space, and all Fouriers having different coordinate values of the second and third coordinate axes of the three-dimensional wave number space. It is determined that the transform coefficient data is included in the same Fourier transform coefficient data subgroup.
In a Fourier transform method according to a specific aspect of the present invention,
Each of the conversion target data groups has one coordinate of a lattice point group located in the rectangular parallelepiped shape of the three-dimensional data space,
The plurality of conversion target data subgroups are all conversion targets having the same coordinate value of the third coordinate axis in the three-dimensional data space and different coordinate values of the first and second coordinate axes in the three-dimensional data space. The data is determined to be included in the same conversion target data subgroup,
Each of the Fourier transform coefficient data groups has one coordinate of a grid point group located in a rectangular parallelepiped shape in the three-dimensional coefficient space,
The plurality of Fourier transform coefficient data subgroups have the same coordinate value of the first coordinate axis of the three-dimensional coefficient space, and all Fouriers having different coordinate values of the second and third coordinate axes of the three-dimensional wave number space. It is determined that the transform coefficient data is included in the same Fourier transform coefficient data subgroup.
More specifically,
The conversion target data group is sequentially assigned to the grid point group located in a rectangular parallelepiped shape in the three-dimensional data space in the order of the third coordinate axis, the second coordinate axis, and the first coordinate axis in the three-dimensional space,
In the first to third transformation processes, the plurality of Fourier transform coefficient data is converted into lattice point groups located in a rectangular parallelepiped shape in the three-dimensional coefficient space, and the first, second, and third of the three-dimensional coefficient space. It is determined to be assigned in the order of the coordinate axes.
In a more specific aspect,
The one first result data portion group generated by each processor through the first conversion processing has the same coordinate value of the third coordinate axis of the three-dimensional data space, and the three-dimensional data space. A plurality of first result data having a value different from the coordinate value of the second coordinate axis of the first coordinate axis of the three-dimensional coefficient space,
The exchange step is performed after the first conversion process is performed by the plurality of processors;
In the exchange step, the coordinate values of the first coordinate axes in the three-dimensional coefficient space are the same predetermined value, and the coordinate values of the second and third coordinate axes in the three-dimensional data space are different in the exchange step. The group of first result data generated by the plurality of processors is assigned to the same processor so that the first result data subgroup including all of the plurality of first result data having values is assigned to the same processor. The one second result data sub-group generated by the second conversion process by each processor has the same coordinate value of the first coordinate axis of the three-dimensional coefficient space. A plurality of second result data having a value different from the coordinate value of the second coordinate axis of the three-dimensional wave number space and the coordinate value of the third coordinate axis of the three-dimensional data space,
In the one Fourier transform coefficient subgroup generated by each processor by the third transform process, the coordinate value of the first coordinate axis of the three-dimensional coefficient space is a predetermined value, and the second value of the three-dimensional wave number space is the second value. , Including all the plurality of Fourier transform coefficients having different values of the coordinate values of the third coordinate axis.
In yet another specific embodiment,
The one first result data portion group generated by each processor through the first conversion processing has the same coordinate value of the third coordinate axis of the three-dimensional data space, and the three-dimensional data space. A plurality of first result data having a value different from the coordinate value of the second coordinate axis of the first coordinate axis of the three-dimensional coefficient space,
The exchange step is performed after the second conversion process is performed by the plurality of processors;
The one second result data portion group generated by each processor by the second conversion processing has the same coordinate value of the third coordinate axis of the three-dimensional data space, and the three-dimensional coefficient space. Including a plurality of second result data having different values of the coordinate values of the first and second coordinate axes,
In the plurality of processors, the coordinate value of the first coordinate axis in the three-dimensional coefficient space is the same as the predetermined value in the exchange step, and the coordinate value of the first coordinate axis in the three-dimensional coefficient space and the three-dimensional data A group generated by the plurality of processors such that the first result data subgroup including all the plurality of first result data having different values of the coordinate values of the third coordinate axes of the space are assigned to the same processor. Exchange the first result data of the plurality of processors,
In the one Fourier transform coefficient subgroup generated by each processor by the third transform process, the coordinate value of the first coordinate axis of the three-dimensional coefficient space is a predetermined value, and the second value of the three-dimensional coefficient space is second. , Including all the plurality of Fourier transform coefficients having different values of the coordinate values of the third coordinate axis.
In a more specific aspect of the present invention,
By each processor, first, second, and third conversion processes relating to the coordinates of the first, second, and third coordinate axes in the three-dimensional space are executed sequentially and in parallel with other processors,
After each processor executes one of the first and second conversion processes, a step of exchanging a plurality of result data obtained by the respective processors as a result of the one conversion process among the plurality of processors Have
Here, a group of ordered conversion target data is arranged in a rectangular parallelepiped shape in the three-dimensional space,
The first to third transformation processes are defined to generate a plurality of Fourier transform coefficient data having a group of ordered three-dimensional space coordinates for the group of transformation target data,
A plurality of conversion target data included in each of a plurality of planes perpendicular to the first coordinate axis of the three-dimensional space that divides the rectangular parallelepiped formed by the plurality of conversion target data is assigned to the same processor;
In the exchange step, the rectangular parallelepiped of the three-dimensional space formed by the plurality of result data obtained as a result of the one conversion process is re-divided by a plurality of planes perpendicular to the first coordinate axis of the three-dimensional space. And a step of exchanging the plurality of result data obtained as a result of the one conversion process between the plurality of processors so as to assign the plurality of result data belonging to each surface to the same processor.
In particular, preferably, the order in which the group of ordered conversion target data is arranged in a rectangular parallelepiped shape in the three-dimensional space is the order of the third coordinate axis, the second coordinate axis, and the first coordinate axis.
The first to third transformation processes are defined so that the plurality of Fourier transform coefficient data are arranged in a three-dimensional space in the order of the first, second, and third coordinate axes.
More preferably, in the Fourier transform method according to the present invention, each processor includes a pipeline arithmetic unit, and the performance for obtaining the arithmetic performance of each processor when the loop length to be operated by the arithmetic unit is L. Storing data in common to the plurality of processors,
Using the performance data, the lengths of the first, second, and third coordinate axes in the rectangular parallelepiped are determined,
The method further includes a step of arranging the plurality of ordered conversion target data in a rectangular parallelepiped having the determined lengths in the first, second, and third coordinate axis directions.
A program storage medium according to the present invention stores a program programmed to execute any of the various Fourier transform methods described above.
Furthermore, the simulation method according to the present invention executes the simulation using any of the various Fourier transform methods described above.
Another program storage medium according to the present invention stores a program programmed to execute the simulation method.
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a Fourier transform method, a simulation method using the same, and a program according to the present invention will be described in more detail with reference to some embodiments shown in the drawings. In the following, the same reference numerals represent the same or similar items. In the second and subsequent embodiments, the differences from the first embodiment will be mainly described.
<Embodiment 1 of the Invention>
(1) Schematic configuration of the device
An example of a parallel computer system for executing the Fourier transform method according to the present invention is shown in FIG. The parallel computer 28 includes a plurality of processors 27 each having a memory 26 and a plurality of external storage devices 31 for storing programs and data. These devices are mutually connected via an internal data transfer network 29. It is configured to be able to exchange data. The external storage device 31 stores, for example, a plurality of program libraries 44 prepared in advance in the parallel computer 28 for use by many users, data 30 used by them, and the like. The memory 26 of each processor is a so-called local memory, and an address assigned to data stored in the memory is an address belonging to a local address space defined by the processor, and this type of memory is generally a distributed memory. A computer having this type of memory is called a distributed memory type parallel computer. The parallel computer 28 is assumed to be a vector parallel computer in which each processor 29 includes a vector operation unit (not shown) that can continuously execute the same operation on vector data composed of a series of data elements in a pipeline manner. To do.
A computer that can be operated by a user, for example, a workstation 1 is connected to a specific one of the processors 27 via a network 2 such as a LAN. This computer may be another computer such as a personal computer. The workstation 1 has an input device 3 (typically a keyboard and a mouse) for inputting instructions or data to the parallel computer 28, and an output device 29 (for outputting a calculation result from the parallel computer 28). Typically, a display device and a printing device) are connected. The workstation 1 is also provided with a storage device (not shown) for storing a program to be sent to the parallel computer 28 and data used in the program.
The specific processor also serves as a processor that performs computation in the parallel computer 28 and a communication role with the user workstation 1. In other words, this processor receives the program and data sent from the workstation 1, stores them in one of the external storage devices 31, and then uses the appropriate program stored in the parallel computer 28 for the user. A designated program is loaded onto a plurality of processors (specifically, all processors), different portions of user-designated data are assigned to different ones of the processors, and the user-designated program is started.
However, in order to carry out the Fourier transform method according to the present invention, the parallel computer 28 needs to have a plurality of processors, but otherwise it may not have a particularly limited structure. Needless to say. Although the parallel computer 28 is assumed to be a vector parallel computer, this vector computing unit can execute only a part of the operations, and other operations may be performed by a scalar computing unit that is not a vector computing unit. Furthermore, the parallel computer 28 does not need to have such an arithmetic unit. Of course, it is desirable to have an arithmetic unit configured to execute the same operation on a series of data at high speed. Further, it is assumed that the parallel computer 28 does not have a plurality of vector registers between the memory 29 and the arithmetic unit (not shown), but it is more preferable that these registers be used.
Furthermore, the present invention can be applied to these parallel computers regardless of the specific structure of these processors or the structure of the data transfer network between them, or the connection form of these processors and input devices or output devices. Applicable. For example, a plurality of processors capable of communicating with the workstation may be provided, and at least one processor capable of communicating with the workstation may be provided separately from the processor for calculation. Further, it is apparent that the method for sending the program and data to be executed to the parallel computer 28 may depend on other methods.
The user can perform various calculations using the plurality of processors 27. The most typical calculation is a simulation of a physical phenomenon, for example, the prediction of the earth's weather is also performed by the simulation. A semiconductor device is also designed by simulating the physical operation of the semiconductor device. When such a simulation is executed using a parallel computer, the physical area to be simulated is divided into a plurality of partial areas, each partial area is assigned to one processor, and the processor performs the simulation for the partial area. For example, a partial differential equation relating to one or more physical quantities is often solved and executed. In this case, the plurality of processors used for the simulation execute the same simulation program in parallel with each other. Therefore, such a program is also called a parallel program. Data used by the simulation program executed by each processor is different. For example, data representing the position and shape of the simulation region, initial values of physical quantities to be simulated, material constants relating to substances in the simulation region, or boundary conditions relating to each partial region are different. Each processor transfers the result data obtained in the middle of the calculation to another appropriate processor or receives the calculation result data from another processor and continues the simulation.
Some of these simulation programs use Fourier transform. In the present embodiment, in order to use various simulations, a Fourier transform library programmed to execute Fourier transform according to the Fourier transform method of the present invention is stored in any of the external storage devices 31. Further, a communication library for executing communication between processors is also stored in the external storage device 31. The simulation program is programmed to call the Fourier transform library or the communication library when necessary. The parallel computer 28 loads the simulation program specified by the user transmitted from the workstation 1 and the libraries used by the simulation program (in this case, the Fourier transform library and the communication library) to each processor. Furthermore, the parallel computer 28 loads the data designated by the user transmitted from the workstation 1 and used by the simulation program in each processor into each processor. Note that the simulation program may be loaded on all the processors or a part of the processors, but for the sake of simplicity, the simulation program is assumed to be loaded on all the processors. The library or the simulation program is compiled by a compiler that reflects the instruction set of the parallel computer 28 or the characteristics of the hardware structure, software restrictions, and the like. According to the present invention, a program in which a program part for Fourier transform included in the library is incorporated into the library or the simulation program for executing the Fourier transform method can be stored and sold in a program recording medium such as a magnetic storage device.
(2) Principle of parallel fast Fourier transform
As already mentioned, the Fourier transform is the N input data f₀, F₁,. . . , F_N-1To N output data c₀, C₁,. . . , C_N-1Is a process of calculating using Equation 1a.
Input data f_j, Output data c_kMay be real data or complex data. Input data f_j, Output data c_kAre sometimes called real space data and wave number space data. That is, the input data f_jThe subscript j represents the coordinates of the lattice points in the one-dimensional real space, and the output data_kThe subscript k can be considered to represent the coordinates of lattice points in the one-dimensional wave number space. In other words, the Fourier transform according to the above equation is a process of converting one-dimensional real space data into one-dimensional wave number space data. Therefore, in this specification, the input data f_jThe subscript j is called grid point coordinates or simply coordinates in the one-dimensional real space, and the output data c_kIs sometimes referred to as a coordinate of a lattice point in the one-dimensional wave number space or simply as a coordinate. Alternatively, the data may be referred to as having that coordinate. However, the input data f_j, Output data c_kHowever, the data may not actually belong to such real space or wave number space.
In general, when performing operations using a parallel computer, it is known that it is desirable to increase the time during which as many processors as possible operate in parallel with each other and to reduce the total number of data communications between the processors. Data communication takes time compared to the calculation time in the processor, and communication consists of data transmission from one processor and reception by another processor, and the receiving processor executes a certain process. If previously programmed to receive operation result data there from another processor, that processor cannot start its processing until the operation result data to be received is received. Therefore, in each processor, the reception waiting time increases with the occurrence of communication, and the time for operating in parallel with other processors decreases. Therefore, it is known that it is desirable to reduce the total number of communications between processors in order to perform operations at high speed on a parallel computer. This also applies to the case where the Fourier transform is executed by a parallel computer as an operation. For this purpose, it is an important issue to which processor the data used in the calculation is assigned and when the calculation result data is exchanged between the processors.
In order to perform Fourier transform on a parallel computer, the conventional transposition algorithm uses the transformation target data f_jIs mapped to three-dimensional real space data as described below, and the processor to which the conversion target data is allocated is determined using this.
Now, let NX, NY, NZ be a positive integer that satisfies NX * NY * NZ = N, and replace the one-dimensional subscripts j, k with the three-dimensional subscripts (j_x, J_y, J_z), (K_x, K_y, K_z) By the following

formulas

2 and 3.
[Expression 2]

[Equation 3]

Here, the symbol * represents multiplication. In this replacement, the lattice point coordinate j in the one-dimensional real space, the lattice point coordinate k in the one-dimensional wave number space, and the lattice point coordinate (j_x, J_y, J_z) Lattice point coordinates (k_x, K_y, K_z)).
3D real space coordinates (j_x, J_y, J_z) Is calculated from the coordinate j in the one-dimensional real space by the following equation.
  j x= MOD (j / NZ, NX)
  jy = (j / (NX * NZ)) ↓
  jz = MOD (j, NZ)
  Here, () ↓ represents only the integer part of the numerical value in parentheses, and represents that the decimal part is rounded down.
Therefore, when j changes to 0, 1, 2, 3,._x, J_y, J_z) Changes to (0,0,0), (0,0,1), (0,0,2), (0,0,3),, (0,0, NZ-1) Furthermore, (1,0,0), (1,0,1), (1,0,2), (1,0,3),, (1,0, NZ-1) After changing the change to (NX-1, 0, NZ-1), j_yIs changed to 1, and the above change is repeated until the coordinates (NX-1, NY-1, NZ-1) are reached. That is, a three-dimensional coordinate point (j_x, J_y, J_z) Changes in the order of Z direction, X direction, and Y direction. In this specification, the data f in the one-dimensional real space to be subjected to Fourier transformation in this way₀, F₁,. , F_N-1For the sake of simplicity, the length of each side is arranged in a rectangular parallelepiped shape of NX, NY, NZ. In other words, in the above replacement, as shown in FIG. 2, the data is first arranged in the Z direction from the origin, and after arranging NZ data, the X coordinate is increased by 1 and the data is arranged repeatedly. When the arrangement of NX * NZ pieces of data is completed, the next operation is equivalent to the operation of increasing the Y coordinate by 1 and arranging the data. However, in FIG. 2, it is assumed that N is 512 and NX, NY, and NZ are all equal to 8.
3D wave number space coordinates (k_x, K_y, K_z) Is calculated from the coordinate k in the one-dimensional wave number space by the following equation.
k_x= MOD (k / NY, NX)
k_y= MOD (k, NY)
k_z= (k / (NX * NY)) ↓
Therefore, when k changes to 0, 1, 2, 3,... (NX * NY * NZ-1),_x, K_y, K_z) Changes to (0,0,0), (0,1,0), (0,2,0), (0,3,0),, (0, NY-1, 0), Further, (1, 0, 0), (1, 1, 0), (1, 2, 0), (1, 3, 0),, (1, NY-1, 0) After changing the change to (NX-1, NY-1, 0), k_zIs changed to 1, and the above change is repeated until the coordinates (NX-1, NY-1, NZ-1) are reached. That is, a three-dimensional coordinate point (k_x, K_y, K_z) Changes in the order of Y direction, X direction, and Z direction. Therefore, the Fourier transform coefficient c to be obtained_kAnd the corresponding coordinates in the three-dimensional wavenumber space (k_x, K_y, K_z3) is as shown in FIG. However, in FIG. 3, it is assumed that N is 512 and NX, NY, and NZ are all equal to 8.
Note that the transposition algorithm itself is a data string f in a one-dimensional space._j, Fourier transform result data c_kIs two-dimensional space data f_{jx, jy, ckx, ky}It can also be converted to In this case, the one-dimensional Fourier transform is replaced with the two-dimensional Fourier transform. However, as described here, the data string f in the one-dimensional space_j, Fourier transform result data c_kIs the data f in the three-dimensional space_{jx, jy, jz}And c_{kx, ky, kz}The reason why the Fourier transform is performed by converting to is that when each processor of the parallel computer has a vector operation unit, the vector operation unit is used effectively. In this case, the one-dimensional Fourier transform is replaced with a three-dimensional Fourier transform. That is, as described in the following embodiment, in each step of conversion, conversion is performed in any one of the X direction, the Y direction, and the Z direction, and one of the remaining two directions is used. Parallelization is performed, and vectorization is performed using the remaining one direction. For this reason, it is necessary to arrange data in a rectangular parallelepiped shape having three directions. In principle, it is possible to rearrange the data into a two-dimensional space or a space of four or more dimensions by performing the same transformation as

Equations

2 and 3, but in two dimensions both parallelization and vectorization can be performed. There are not enough dimensions, and an unnecessary dimension is created in four or more dimensions, and the loop length of the vectorization target is shortened accordingly, which is disadvantageous in terms of performance. Therefore, in order to perform fast Fourier transform on a parallel computer having a vector computing unit, it is desirable to rearrange the data in three dimensions as shown in

equations

2 and 3. Such data conversion is often effective for speeding up computation even when each processor does not have a vector computing unit.
Using

substitution equations

2 and 3, equation 1a can be rewritten as:
[Expression 4]

Further, it can be seen that this conversion equation is a conversion realized by sequentially executing the three-step conversion represented by the following three equations in the order of the equations.
[Equation 5]
c '_{jx, ky, jz}= Σ_{jy = 0} ^Ny-1f_{jx, jy, jz}* exp (-2πik_yj_y/ NY) (5)
[Formula 6]

[Expression 7]

Equation 5 is j_x, J_zIs a specific value and j_yNY pieces of input data f having different values_{jx, jy, jz}Fourier transform for j_x, J_zIs performed by the number of combinations of values that can be taken (NX * NZ pairs), whereby the two real space coordinates (j_x, J_z) One coordinate (k) of a three-dimensional wave number space of a plurality of (NX * NZ) sets respectively corresponding to one of the sets_y) Primary conversion result data (c ′)_{jx, ky, jz}) (K_y= 0 to NY-1, j_x= 0 to NX-1, j_z= 0 to NZ-1).
Equation 6 also includes k in the complex exponential function._yThis represents the same transformation as the Fourier transform, except that the extra term / NY is included._z, K_yIs a specific value and j_xNX primary conversion result data c ′ having different values_{jx, ky, jz}A transformation similar to the Fourier transform for j_z, K_yThe number of combinations of values that can be taken by (NY * NZ pairs) is performed, so that coordinates j_zTwo coordinates in the three-dimensional wavenumber space (k_x, K_y) Secondary conversion result data (c ″)_{kx, ky, jz}) (K_x= 0 to NX-1, k_y= 0 to NY-1, j_z= 0 to NZ-1).
Equation 7 also includes k in the complex exponential function._x/ NY + k_yThis represents the same transformation as the Fourier transform, except that an extra term / (NX * NY) is included._x, K_yIs a specific value and j_zNZ pieces of data c '' having different values_{kx, ky, jz}A transformation similar to the Fourier transform for k_x, K_yThe number of combinations of values that can be taken (NX * NY pairs) is performed, so that the three coordinates (k in the three-dimensional wave number space (k_x,k_y,k_z) Final Fourier transform result data (c)_{kx, ky, kz}) (K_x= 0 to NX-1, k_y= 0 to NY-1, k_z= 0 to NZ-1). Thus, all three of these transformations can be performed using fast Fourier transform algorithms.
Hereinafter, these conversions are referred to as Y-direction conversion, X-direction conversion, and Z-direction conversion, respectively. In this specification, for the sake of simplicity, these transformations may be referred to as Y-direction Fourier transformation, X-direction Fourier transformation, and Z-direction Fourier transformation, respectively. Here, the X direction or the like is a direction in which the coordinate conversion determined by

Equations

2 and 3 can be performed. That is, the first sequentially changing coordinate corresponding to the one-dimensionally different coordinate point j is the Z coordinate, the subsequently changing coordinate is the X coordinate, and the last changing coordinate is the Y coordinate. It goes without saying that if the coordinate conversion formula is changed from

Formulas

2 and 3, the content of the conversion in the X direction or the like changes. Thus, in this document, more generally, these transformations refer to the following transformations:
The conversion in the Y direction is, as exemplified by Equation 5, the coordinates (j of the first and third coordinate axes in the real space._x, J_z) Is a specific value, and the coordinates (j_y) Multiple (NY) input data with different values_{jx, jy, jz}Fourier transform is performed on the first and third coordinate axes of the three-dimensional real space (j_x, J_z) Of the second coordinate axis in the three-dimensional wave number space (k * k) of a plurality (NX * NZ) of sets corresponding to one of the sets of_y) Related to the primary transformation result data (c ′)_{jx, ky, jz}) (J_x= 0 to NX-1, k_y= 0 to NY-1, j_z= 0 to NZ-1). Or, in other words, the conversion in the Y direction is the input data f_{jx, jy, jz}For the second coordinate axis in the real space, the Fourier transform is performed, the coordinates of the second coordinate axis in the three-dimensional wave number space, the coordinates of the first coordinate axis in the three-dimensional real space, the coordinates of the third coordinate axis, It can also be said to refer to processing for obtaining primary conversion result data that is a function of
Further, the transformation in the X direction is the coordinate (j of the first coordinate system of the third real space coordinate system, as exemplified by Equation 6._z) And coordinates of the second coordinate system in the three-dimensional wave number space (k_y) Are specific values, and the coordinates (j of the first real space coordinate system are_x) Different primary conversion result data (c ′) having different values_{jx, ky, jz}) To the coordinates of the third real space coordinate axis (j_z) Corresponding to one of the different values (N_z) Coordinates (k) of the first and second coordinate axes of the three-dimensional wave number space._x, K_y) -Related secondary conversion result data (c ″)_{kx, ky, jz}) (K_x= 0 to NX-1, k_y= 0 to NY-1, j_z= 0 to NZ-1). Or, in other words, the transformation in the X direction performs transformation similar to the Fourier transformation on the first coordinate axis in the three-dimensional real space with respect to the primary transformation result data, and the coordinates of the third coordinate axis in the three-dimensional real space It can also be said to indicate a process of obtaining secondary conversion result data that is a function of the coordinates of the first and second coordinate axes in the three-dimensional wave number space.
Further, the conversion in the Z direction refers to the coordinates (k of the first and second coordinate systems in the three-dimensional wave number space, as exemplified by Equation 7._x, K_y) Is a specific value, and the coordinate (j of the third coordinate system in the three-dimensional real space_z) Secondary transformation result data (c ″) having different values._{kx, ky, jz}) To the coordinates (k) of the first, second, and third coordinate axes in the three-dimensional wave number space._x, K_y, K_Z) Final Fourier transform result data for the input data (c)_{kx, ky, kz}) (K_x= 0 to NX-1, k_y= 0 to NY-1, k_zIt can also be said that the process of obtaining = 0 to NZ-1). Or, in other words, the transformation in the Z direction performs a transformation similar to the Fourier transformation on the third coordinate axis of the three-dimensional real space with respect to the secondary transformation result data, and performs the first, second, It can also be said that it refers to processing for obtaining final Fourier transform result data that is a function of the coordinates of the third coordinate axis.
The conversion in the Y direction is made up of independent conversions for NX * NZ sets of data, as shown in Equation 5. Similarly, the conversion in the X direction includes independent conversions on NY * NZ sets of data, as shown in Equation 6. Similarly, the conversion in the Z direction includes independent conversions on NX * NY sets of data, as shown in Equation 7.
In a conventional Fourier transform method using a transposition algorithm, data to be transformed is allocated to different processors of a parallel computer so as to reduce communication between processors using this feature. That is, according to Equation 2 and as illustrated in FIG. 2, the conversion target data fj (j = 0 to N) are arranged in a rectangular parallelepiped shape, and this data is divided on a plane parallel to the Z axis of the three-dimensional real space. AndFIG., A plurality of data having jz = 0, 1,, and 7 are assigned to

processors

0, 1, and 7, respectively. That is, all the conversion target data having the Z coordinate jz having a specific value are assigned to the same processor without depending on the values of the X coordinate jx and the Y coordinate jy. 2 assumes that N = 512, NX = NY = NZ = 8, and FIG. 4 assumes that the total number of processors NPU = 8. All the conversion target data having the Z coordinate jz may be assigned to the same processor without depending on the values of the X coordinate and the Y coordinate. For example, the total number of processors NPU = NX = NZ may be set and NY = (N / (NX * NZ)) ↑. Here, () ↑ indicates an integer after rounding up the number in the parentheses after the decimal point. For example, when N = 512 and NPU = 4, NX = NZ = 4 and NY = 32 may be satisfied.
After assigning such data, Fourier transformation is performed as follows. In this method, the conversion in the Y direction according to Expression 5 and the conversion in the X direction according to Expression 6 can be performed without using communication between processors.
(Step 1) Y direction conversion
First, conversion in the Y direction is executed. As can be seen from Equation 5, the X coordinate j_xAnd Z coordinate j_zIs a specific value, Y coordinate j_yFourier transform is performed on a plurality of data to be converted having different values. However, these plural data are assigned to the same processor. Thus, in each processor, according to Equation 5, without using communication between processors, the X coordinate j_xAnd Z coordinate j_zIs a specific value and the Y coordinate k of the wave number space_yA plurality of primary conversion result data having various values can be obtained.
(Step 2) X direction conversion
Next, conversion in the X direction is performed. As can be seen from Equation 6, the Z coordinate j_zAnd Y coordinate k of wave number space_yIs a specific value and the X coordinate j_xA transformation similar to the Fourier transformation is performed on a plurality of primary transformation result data having different values. However, the plurality of data has already been obtained by conversion in the Y direction by the same processor. Thus, according to Equation 6, each processor does not communicate with other processors, and is assigned to the Z coordinate j of the three-dimensional real space assigned to that processor._zThe coordinates of the three-dimensional wave number space (k_x, K_yNX * NY secondary transformation result data (c ″)_{kx, ky, jz}) (K_x= 0 to NX-1, k_y= 0 to NY-1).
(Step 3) Data transposition
To perform the transformation in the Z direction according to Equation 7, the coordinates (k_x, K_y) And the Z coordinate j of the three-dimensional real space_zSecondary conversion result data (c ″) obtained for all values of_{kx, ky, jz})is required. Therefore, each processor has coordinates k in the three-dimensional wave number space._xA process of transferring data necessary for executing the conversion in the Z direction between the processors is executed. That is, the coordinate k to each processor_x, The

processor

0, 1, 2,._x= 0, 1, 2, 3,. This means that, as shown in FIG._xThis means that each of the divided subspaces is assigned to one processor by being divided by a plane perpendicular to the axis.
Each processor has a coordinate k of the three-dimensional wave number space assigned to that processor._xA specific value of and the coordinate k of the three-dimensional wave number space_yAll the values of and the Z coordinate j of the three-dimensional real space_zAnd all secondary conversion result data (c ″) obtained for all values of_{kx, ky, jz}) Is used to transfer the secondary conversion result data among all the processors. This data transfer is also called data transposition or data rearrangement. That is, each processor has a Z coordinate j of a three-dimensional real space._zAnd the coordinates of the three-dimensional wave number space (k_x, K_y) Obtained as a result of the transformation with all values (c ″)_{kx, ky, jz}(However, k_x= Specific value, k_y= 0 to NY-1, k_z= 0 to (NZ-1)), the secondary conversion result data is transferred among all the processors so that the data not generated by the own processor is received from the other processors.
(Step 4) Z direction conversion
Each processor has a coordinate k of the three-dimensional wave number space assigned to that processor._xA specific value of and the other two coordinates k in the three-dimensional wavenumber space_y, K_zFinal Fourier transform result data c having all values of_{kx, ky, kz}(However, k_x= Specific value, k_y= 0 to NY-1, kz = 0 to NZ-1). Each processor can perform this computation in parallel with each other.
(Step 5) Data transposition
However, with the above processing alone, the conversion target data f_jAnd conversion result data c_kThere is a problem that the data division formats are different and inconvenient in practice. That is, the conversion target data f_jAs shown in FIG. 2, the data is mapped to data in a three-dimensional real space, and the latter data is divided according to FIG. 4 and assigned to a plurality of processors. Therefore, as shown in FIG._jIs divided into f_jIs cyclic division in which the processor of the MOD (j, p) number is in charge. On the other hand, conversion result data c_kAs shown in FIG. 3, the three-dimensional wave number space is mapped into data in a three-dimensional wave number space, and the three-dimensional wave number space is divided according to FIG. 5 and assigned to a plurality of processors. Therefore, as shown in FIG._kThe data division is block cyclic division in which one continuous processor is in charge of NY continuous data.
In many applications, a certain process is performed on the transformation result data obtained by Fourier transforming the transformation target data, and the inverse Fourier transformation is performed again on the processing result. Inverse Fourier transform is often performed using a Fourier transform program. That is, the following inverse Fourier transform formula
[Equation 8]
f_j= Σ_{k = 0} ^N-1c_kexp (2πikj / N)
(However, j = 0,1, ..., N-1). . . (8)
The following equation is obtained by transforming.
[Equation 9]
f_j= (Σ_{k = 0} ^N-1c_k ^*exp (-2πikj / N))^*
(However, j = 0,1, ..., N-1). . . (9)
Here, * indicates a complex conjugate. Therefore, as is clear from the comparison between Equations 1a and 9, the inverse Fourier transform is equivalent to Fourier transform of the complex conjugate of the Fourier transform result data and taking the complex conjugate of the obtained result data. Therefore, in principle, it can be understood that the inverse Fourier transform can be executed by utilizing a Fourier transform program. However, when the Fourier transform is executed by a parallel computer, if the data division format of the conversion target data and the conversion result data is different, the conversion result data corresponding to the conversion target data allocated to one of the processors is allocated to that processor. Therefore, the processor cannot perform the inverse Fourier transform by using the Fourier transform program unless the transform result data is received from another processor.
In order to avoid such inconvenience, in the conventional Fourier transform program using the transpose algorithm, the three-dimensional Fourier transform result data c is executed after the transformation in the Z direction is executed._{kx, ky, kz}To the processor is changed, and the Fourier transform result data c between the processors again._{kx, ky, kz}Is transposed (replaced), and Fourier transform result data c_{kx, ky, kz}It was common to convert the data division into cyclic division. That is, as shown in FIG. 8, the three-dimensional wave number space is cut along a plane perpendicular to the Y coordinate axis, and the same Y coordinate value k_yFourier transform result data c having_{kx, ky, kz}Are assigned to the same processor. Specifically, k_y= Fourier transform result data c having 0, 1, 2,._{kx, ky, kz}Are sequentially assigned to the

processors

0, 1, 2,. Each processor assigns Fourier transform result data c assigned to the processor according to this assignment._{kx, ky, kz}Among all the processors so that the data not generated by the processor is received from other processors._{kx, ky, kz}Forward. In this manner, cyclic divided three-dimensional Fourier transform result data is obtained.
Thus obtained final three-dimensional Fourier transform result data c_{kx, ky, kz}To target 1D Fourier transform result data c_kCan be obtained from Equation 3. Data c_kAnd its three-dimensional coordinate k_x, K_y, K_zThe relationship is as shown in FIG.
However, according to the study by the present inventor, the extra communication between the processors for data exchange of the conventional data is a cause of reducing the parallelization efficiency and increasing the processing time required for the Fourier transform. There was found.
Therefore, in the present invention, the data division method in the conventional transposition algorithm is reviewed, and the optimum data division method is determined as follows from the aspect of reducing the data transfer amount between processors.
The conventional transposition algorithm is composed of transposition in the Y direction, X direction, and Z direction, and transposition operation for exchanging data between processors. In the algorithm, when converting in the Y direction, the space of the data to be converted is cut by a plurality of planes perpendicular to the Z axis, each surface is assigned to one processor, and then the conversion in the X direction is performed. In this case, the assignment is used as it is, and when the conversion in the Z direction is further performed, the space of the conversion target data is cut by a plurality of planes perpendicular to the X axis, and each plane is assigned to one processor. Thus, the conversion itself could be performed without data transfer between processors.
However, for example, when performing the first conversion in the Y direction, as long as there are NY data having the same X coordinate and Z coordinate to be converted on one processor, the conversion itself is: Can be executed without data transfer between processors. Therefore, the division of the conversion target data may be performed not by a plane perpendicular to the Z axis but by a plane perpendicular to the X axis. This is also true for conversion in the X and Z directions. Therefore, the first condition to be satisfied by a desirable data division method is that “when data is converted in a certain direction, the data to be converted is divided into a plurality of planes perpendicular to the direction other than that direction, The data division format “assign” is adopted in all of the Y direction, the X direction, and the Z direction.
Another thing to consider is the number of data transfers for transposition. For example, when the conversion in the Y direction, the X direction, and the Z direction is performed, the data division method of dividing the data to be converted along a plurality of planes perpendicular to the Z axis, the Y axis, and the X axis, respectively, is the first method described above. However, in this data division method, the data division format is changed twice when the conversion in the X direction and the Z direction is performed, and data transfer for transposition is necessary for each change. Therefore, in addition to the first condition described above, a second condition that “the transposition process for changing the data division format is limited to one time” is added as a condition that should be satisfied by the desirable data division method.
FIG. 9 shows the results of enumerating data division methods that satisfy these two conditions. There are four data division methods that satisfy the first and second conditions, and among these, a data division method in which the data division format of the Fourier transform target data and the data division format of the Fourier transformation result data are the same is obtained. is there.
Method 1 is adopted in the conventional transposition algorithm. In Method 4, since the input data is divided in the X direction and the output data is divided in the Y direction, the Fourier transform target data is divided into block cyclic divisions and the Fourier transformation result data is divided into cyclic divisions in comparison with FIGS. Although the data division format is exactly opposite to that of the method 1, it can be seen that the data division format is different between these two types of data and has the same drawback as the method 1.
On the other hand, in the method 2, as in the conventional transposition algorithm, the Fourier transform target data is divided along the Z direction, and the transformation in the Y direction and the transformation in the X direction are executed in the same manner as the conventional transposition algorithm. Unlike the conventional transposition algorithm, the direction conversion is executed after the result data of the X direction conversion is divided along the Y direction. Data transposition is required between X-direction conversion and Z-direction conversion. 2 and 3, it can be seen that both the Fourier transform target data and the Fourier transform result data are cyclically divided. Therefore, when the data division method of method 2 is adopted, the Fourier transform target data and the Fourier transformation result data maintain the same data division format without performing data transfer between processors after calculation of Fourier transform coefficients.
In the method 2, the Fourier transform is specifically executed as follows. The conversion in the Y direction, X direction, and Z direction described below is calculated by an FFT algorithm.
(Step a) Y-direction conversion
The conversion in the Y direction is performed in the same manner as in Step 1 already described for the conventional transposition algorithm. As already described, the data division of the Fourier transform target data is cyclic division.
(Step b) X direction conversion
Further, the conversion in the X direction is performed on the result data of the conversion in the Y direction in the same manner as in Step 2 described above for the conventional transposition algorithm.
(Step c) Data transposition
To perform the transformation in the Z direction according to Equation 7, the secondary transformation obtained for a specific value of coordinates (kx, ky) in the three-dimensional wave number space and all values of the Z coordinate jz in the three-dimensional real space. Result data (c ″ kx, ky, jz) is required. In the method 2, unlike the conventional transposition algorithm, the secondary transformation result data (c ″ kx, ky, jz) obtained by the transformation in the X direction is divided by a plurality of planes perpendicular to the Y axis. This means that, as shown in FIG. 8, the three-dimensional wave number space is divided by a plurality of planes perpendicular to the ky axis, and each of the divided subspaces (the plurality of planes) is assigned to one processor. Means that. That is, a specific value of the coordinate ky is assigned to each processor. In particular,k y= 0, 1, 2, 3,... Are sequentially assigned to the

processors

0, 1, 2,.
In accordance with this assignment, a process of transferring data necessary for executing the conversion in the Z direction between the processors is executed. Each processor has a coordinate k of the three-dimensional wave number space assigned to that processor._yA specific value of and the coordinate k of the three-dimensional wave number space_xAll the values of and the Z coordinate j of the three-dimensional real space_zAnd all secondary conversion result data (c ″) obtained for all values of_{kx, ky, jz}) Is used to transfer the secondary conversion result data among all the processors. That is, each processor has a Z coordinate j of a three-dimensional real space._zAll the values of and the coordinate k of the three-dimensional wave number space_xAll the values and coordinates k_ySecondary conversion result data (c ″) obtained for a pair with a specific value of_{kx, ky, jz}(However, k_x= 0 to (NX-1), k_y= Specific value, k_z= 0 to (NZ-1)), the secondary conversion result data is transferred among all the processors so that the data not generated by the own processor is received from the other processors.
(Step d) Z-direction conversion
Each processor has a coordinate k of the three-dimensional wave number space assigned to that processor._yA specific value of and the other two coordinates k in the three-dimensional wavenumber space_x, K_zFinal Fourier transform result data c having all values of_{kx, ky, kz}(However, k_x= 0 to NX-1, k_y= Specific value, k_z= 0 to NZ-1) is calculated. Each processor can perform this computation in parallel with each other.
As a result, the coordinate k_y= Fourier transform coefficient c corresponding to 0, 1, 2, 3,.₀, C₁, C₂,, Are sequentially generated by the

processors

0, 1, 2,._{kx, ky, kz}Is divided cyclically among the processors.
In the present embodiment, the data division method 2 is used. As will be described later in detail in the data division method 3, both the Fourier transform target data and the Fourier transform result data are cyclic division. Therefore, this method 3 can also be used. As will be described later, when the Fourier transform library is actually configured, the method 3 has an advantage that the size of the table of complex exponential function values generated by the library may be smaller than the method 2.
Input data f is calculated by the computer._jIn general, sequence data is used when performing the above conversion on. That is, a group of data to be converted in the Y direction by the same processor is stored in a three-dimensional array, and the conversion in the Y direction is executed on the array. The primary conversion result data obtained as a result may be stored in the same array or another three-dimensional array. Conversion in the other direction is also performed on the array that stores the result data of the previously executed conversion. Further, the transposition of data between the processors is also performed by exchanging the contents of the array generated by each processor. Therefore, when using the same three-dimensional array in these transformations in this way, the index of each dimension of the three-dimensional array corresponds to each coordinate axis of the three-dimensional real space in some cases, and after transformation in a certain direction in other cases. Corresponds to each coordinate axis of the three-dimensional space to which the result data belongs, and finally corresponds to each coordinate axis of the three-dimensional wave number space to which the Fourier transform coefficient belongs. However, even when the same array is used to store a group of data of different types in this way, the group of data stored in the array at a certain time belongs to the three-dimensional space to which the group of data belongs, Each index of the array represents one of the coordinate axes in the three-dimensional space to which the group of data belongs. Accordingly, the specific structure of the array used to store a group of data in carrying out the present invention is not limited to a specific one. Furthermore, even if the structure of an array that stores a group of data belonging to any one of several three-dimensional spaces described in the above principle does not directly correspond to the three-dimensional space, it is included in the array. Each data obtained can be regarded as having the coordinates of the three-dimensional space, and it goes without saying that the above description of the principle also applies to the arrangement.
(3) Application to multidimensional Fourier transform
Although the algorithm for the one-dimensional Fourier transform has been described above, this method can be easily extended to the case of the multi-dimensional Fourier transform. An explanation will be given by taking the following two-dimensional Fourier transform as an example.
[Expression 10]

This equation can be transformed into the following equation 11. Expression 11 can be written as conversion expressions 11a and 11b, which are further composed of the following two steps.
## EQU11 ##

Therefore, the two-dimensional Fourier transform is first expressed as N₁One-dimensional Fourier transform of N pieces of data₂And then N, as in equation 11b₂One-dimensional Fourier transform of N pieces of data₁It comes down to doing a group. Therefore, the method of the present invention can be applied to these one-dimensional Fourier transforms.
To perform a two-dimensional Fourier transform on a parallel computer using the method of the present invention, two-dimensional data f_{j1, j2}For subscript j₁(Hereinafter, this is referred to as a first direction). That is, the i-th processor has f_{m * NPU + i, j2}(However, m = 0,1, ..., ((N1 / NPU) -1), j2 = 0,1, ..., N2) -th element is assigned. Here, NPU is the number of processors.
Then the step of equation 11a is j₂N is the same₁1-D Fourier transform between elements₂This is to do the group, this N₁Since the elements are cyclically divided between the processors, this step is a N-dimensional Fourier transform according to the method of the present invention.₂It comes down to doing a group. Data c ′ after conversion_{k1, j2}Are cyclically divided in the first direction.
Next, in step 11b, j₁N is the same₂1-D Fourier transform between elements₁N₂Since the elements are on the same processor, this conversion can be done independently for each processor without communication. With the above, the two-dimensional Fourier transform is completed, and the transformed data c_{k1, k2}Are cyclically divided in the first direction.
In the above, the two-dimensional case is shown. However, even when the dimension is larger, cyclic division is performed in the first direction, and only the conversion in the first direction is performed using the method of the present invention. By performing the processing independently for each processor, the Fourier transform method of the present invention can be applied.
(4) Parallel fast Fourier transform library
Returning to FIG. 1, the fast Fourier transform library used on the parallel computer 28 is specifically configured as follows, for example. However, it goes without saying that the Fourier transform library to which the present invention is applied is not limited to this. This library is loaded into all the processors, and is called as a subroutine from a simulation program in the processors as necessary.
To execute the subroutine name FFT1D
CALL FFT1D (NX, NY, NZ, NPU, F, TB, IOPT, IER)
Thus, it is necessary to simultaneously call from all simulation programs loaded on any one of the processors by specifying predetermined arguments. Here, N = NX * NY * NZ is the Fourier transform target data f_j, NPU is the number of processors, F is the Fourier transform target data f when the library is called_jAt the time of return from the library, Fourier transform result data c_k, TB is a table for storing values of complex exponential functions, IOPT is an input for specifying a subroutine function, and IER is an output indicating whether or not a runtime error has occurred.
Here, the array F is a partial array that each processor has. As described in the explanation of the principle of Fourier transform, all input data (Fourier transform target data) f_j2 is arranged in a rectangular parallelepiped shape in a three-dimensional real space as shown in FIG. 2, and each processor receives input data belonging to a plane perpendicular to the Z axis having one or a plurality of specific Z coordinates. Assigned. The assigned input data is stored in the partial array specified by the argument F. That is, since the Fourier transform target data and the Fourier transform result data are both cyclically divided, the i-th processor is the m * NPU + i (m = 0, 1,..., N / NPU-1) th Has only elements. That is, the array F of the i-th processor has N input data strings f._jAmong (j = 0 to N−1), as shown by the following equation, a group of input data f_{m * NPU + i}Is stored.
F (m) = f_{m * NPU + i}(m = 0,1, ..., N / NPU-1)
Therefore, the size of the array F of each processor is N / NPU. TB is a table for storing the value of the complex exponential function calculated in the first call, and no new calculation is required by reusing the value stored here from the second call. It becomes. In the first call, IOPT = 1 is designated. At this time, a table of complex exponential functions is created. IOPT = 2 means the second and subsequent calls. At this time, the value already stored in the TB is used.
A flowchart of this library is shown in FIG. When this library is called (step 45), it first checks the arguments (step 46). That is, the validity of the argument is examined, such as whether N = NX * NY * NZ and NPU are integers of 1 or more, and whether IOPT is a value of 1 or 2. If there is an invalid value in the input data, IER = 1000 is set (process 47) and the process returns.
Next, the other processor is notified that this library has been called (step 48). This notification actually requires the communication library in which the library is loaded to notify all other processors of the occurrence of a library call on that processor, and the communication library A message is sent to all the processors in the network, and in each of the other processors, the communication library loaded therein receives this message and sends it to the library loaded in that processor. Notify this library call.
Next, it is checked whether the library is called by NPU processors as specified by the argument (step 49). This check is performed based on whether or not a notification that a library call to the library has occurred is received from all the other processors described above. If this condition is not satisfied, IER = 2000 is set (step 50) and the process returns. Next, the value of IOPT is checked (step 51). If IOPT = 1, the current call is the first call. Therefore, it prepares to execute the Fourier transform in the calling processor. Specifically, for conversion in the X, Y, and Z directions, the processor calculates in advance the values of the complex exponential functions used in

Equations

5, 6, and 7 to generate a table of complex exponential functions. Store as TB (step 55). The value of the complex exponential function to be calculated is determined by the assignment of data to the processor. That is, the coordinate j of the three-dimensional real space of the data to be processed by the processor in each of the transformations in the X, Y, and Z directions_x, J_y, J_zAnd the coordinate k of the three-dimensional real space_x, K_y, K_zFrom these results, various values that can be taken by the declination of the complex exponents of Equations 5 to 7 are determined, the cosine function value and the sine function value for each declination are calculated, and the array TB is calculated. Store. In the above determination, the data division format used for conversion in each direction, the processor number assigned in advance to the calling processor, and

Equations

2 and 3 are used. This processor number is assigned to each processor by the parallel computer 28 in advance when the simulation program is loaded. The data division format used for conversion in each direction is determined by the data division method used, that is, the above-described method 2 in the present embodiment.
If IOPT = 1, the current call is the second and subsequent calls. Such a call occurs when the simulation program from which the library is called is programmed to perform a Fourier transform on different physical quantities. For example, this is a case where the simulation program calls the library for the Fourier transform for the first physical quantity and then calls the library again for the Fourier transform for the second physical quantity. In this case, the Fourier transform target data representing the second physical quantity often has the same subscript as the Fourier transform target data representing the first physical quantity. In this case, the value of the complex exponent previously stored in the array BT can be used when executing the Fourier transform on the second physical quantity. Therefore, if IOPT = 1, step 55 is not executed.
Next, conversion in the Y direction is performed (step 56). In the present embodiment, the Fourier transform target data possessed by all the processors are virtually arranged in a rectangular parallelepiped with arguments NX, NY, and NZ having the lengths of the sides as shown in FIG. 2, and as shown in FIG. The Fourier transform target data having coordinates is assigned to the same processor. In this conversion in the Y direction, as already described in Step 1 or Step a, each processor performs Fast Fourier Transform on NY data having the same X coordinate and Z coordinate according to Equation 5. Since there are NX * NZ sets of data in total, NX * NZ independent NY-order fast Fourier transforms are eventually performed. Since each XY plane is assigned to one processor by the method of assigning data to the processors, this conversion process can be performed independently by each processor without communication.
In the case of this library, the conversion target data to be processed by each processor by this library is stored in the memory (26 (FIG. 1)) of the processor as an array specified by the argument F by the simulation program executed by the processor. Stored before the call. The subscript of the array F and the coordinate j of the three-dimensional real space_x, J_y, J_zIs determined by the data division format. Therefore, in this conversion in the Y direction, the conversion specified by Expression 5 is performed on the conversion target data in the array F using this relationship. The primary conversion result data obtained by the conversion is stored in the memory of the processor.
Specifically, when this library is called in each processor, each processor stores data at an appropriate timing (for example, when it is determined in step 46 that the input data does not contain an invalid value). And the first, second, and third three-dimensional work arrays are secured on the memory. Here, in the three-dimensional array for storing data, the length of the three-dimensional index is equal to the arguments NX, NY, and NZ. In the following, it is assumed that these three-dimensional work arrays have the same size as the three-dimensional array for storing data. However, these three-dimensional work arrays need only have a size capable of storing data described below, and therefore the sizes of these work arrays can be changed as appropriate. Furthermore, the structure of these three-dimensional work arrangements can be changed as long as it matches the purpose of use.
In the three-dimensional array for storing data, the data point sequence included in the array F specified by the argument can be stored as follows. When one plane perpendicular to the Z axis in FIG. 2 is assigned to the processor, NX * NY input data belonging to the one plane correspond to the X, Y, and Z coordinates of the respective data. , And stored in a position having an index of the three-dimensional array for data. When a plurality of planes perpendicular to the Z-axis are assigned to the processor, the data of each plane is stored in the corresponding index positions in the data three-dimensional array in the same manner.
In each processor, the transformation in the Y direction uses this data storage array, and for a group of input data in which the Z coordinate has a specific value and the Y coordinate and the X coordinate have various values, 5 is performed. At this time, the transformation in the Y direction is performed using a fast Fourier transform algorithm (FFT) on a group of input data having a specific value in the Z coordinate and different X coordinates. In executing this conversion, a vector calculator (not shown) in the processor is used for a group of data having different X coordinates, and calculation is executed in a pipeline manner.
Resulting primary conversion result data c '_{jx, ky, jz}(However, j_x= 0 to NX-1, k_y= 0 to NY-1, j_z= Specific value) is the coordinate value j of the first three-dimensional work array_x, K_y, J_zIs stored at the index corresponding to. Therefore, in the case of FIG. 2, in each processor, primary conversion result data c ′ for a group of input data on one plane of FIG. 2._{jx, ky, jz}Is the specific coordinate j of the first three-dimensional work array_zAre stored on a plane having In FIG. 2, when input data belonging to a plurality of planes perpendicular to the Z-axis are assigned to the processor, the Z-axis in the first three-dimensional work array corresponding to each Z-plane is assigned to the Z-axis. Primary conversion result data c ′ corresponding to each of a plurality of vertical planes_{jx, ky, jz}Is stored.
After completing the conversion in the Y direction, the conversion in the X direction is similarly performed (step 57). That is, as already described in step 2 or step b, each processor performs the conversion specified by Expression 6 on the primary conversion result data obtained by the Y-direction conversion. The secondary conversion result data obtained by the conversion is stored in the memory of the processor. This conversion process can also be performed independently by each processor without communication.
Specifically, in each processor, the transformation in the X direction is such that the Z coordinate has a specific value and the X coordinate and k_yA group of primary transformation result data c 'whose coordinates have various values_{jx, ky, jz}On the other hand, this is performed according to Equation 6. At this time, the Z coordinate has a specific value, and k_yX-direction transformation is performed on a group of input data having different coordinates using a fast Fourier transform algorithm (FFT). This conversion is performed using the first three-dimensional work array. In performing this conversion, k_yA vector calculator in the processor is used for a group of data having different coordinates, and the calculation is executed in a pipeline manner.
Secondary conversion result data c '' obtained as a result_{kx, ky, jz}(However, k_x= 0 to NX-1, k_y= 0 to NY-1, j_z= Specific value) is the coordinate value k of the second three-dimensional work array_x, K_y, J_zIs stored at the index corresponding to. Therefore, in the case of FIG. 2, each processor has secondary conversion result data c ″._{kx, ky, jz}Is the specific coordinate j of the second 3D work array_zAre stored in one plane. If input data belonging to a plurality of planes perpendicular to the Z axis in FIG. 2 is assigned to the processor, the Z axis in the second three-dimensional work array corresponding to each Z plane is assigned to the Z axis. Secondary conversion result data c '' corresponding to each of a plurality of vertical planes_{kx, ky, jz}Is stored.
After the conversion in the X direction is completed, the data is transposed (replaced) between the processors. That is, this time, as already described in step c, the rectangular parallelepiped of the secondary transformation result data is sliced perpendicularly to the Y axis as shown in FIG. 8, and each surface thus formed is assigned to one processor (step 58). As already described in step c, in accordance with this assignment, each processor exchanges secondary conversion result data generated by each processor with all other processors.
Specifically, at the time of this transposition, each processor assigns the coordinate k assigned to the processor to the second work array._yShould belong to a plane perpendicular to the Y-axis having a value of k_yIs a specific value, k_x, J_zSecondary conversion result data c ″ having various values_{kx, ky, jz}(However, k_x= 0 to NX-1, k_y= Specific value, j_z= 0 to NZ-1) so as to receive the secondary conversion result data c '' between the processors._{kx, ky, jz}Replace.
After the transposition is completed, conversion in the Z direction is performed (step 59). That is, as already described in step d, each processor performs the transformation specified by Equation 7 on the secondary transformation result data newly assigned to the processor, and obtains the final three-dimensional Fourier Generate conversion result data. Since each XZ plane is assigned to one processor by transposition, this conversion process can also be performed independently by each processor without communication.
Specifically, in each processor, the transformation in the Z direction is k_yThe coordinate has a specific value and k_xA group of secondary transformation result data c '' whose coordinates and Z coordinates have various values_{kx, ky, jz}On the other hand, this is performed according to Equation 7. At this time, k_yThe coordinate has a specific value and k_xTransformation in the Z direction is performed on a group of input data having different coordinates using a fast Fourier transform algorithm (FFT). This conversion is performed using the second three-dimensional work array. In performing this conversion, k_zA vector calculator in the processor is used for a group of data having different coordinates, and the calculation is executed in a pipeline manner.
Final Fourier transform result data c obtained as a result_{kx, ky, kz}(However, k_x= 0 to NX-1, k_y= Specific value, k_z= 0 to NZ-1) are the coordinate values k of the third three-dimensional work array._x, K_y, K_zIs stored at the index corresponding to. Accordingly, as shown in FIG. 8, one coordinate k is assigned to one processor._yIs assigned to the final Fourier transform result data c in each processor._{kx, ky, kz}Is a specific coordinate k of the third three-dimensional work array._yAre stored in one plane. In FIG. 8, k_yWhen a plurality of planes perpendicular to the axis are assigned to the processor, k in the third three-dimensional work array corresponding to each plane,_yFinal Fourier transform result data c corresponding to each of a plurality of planes perpendicular to the axis c_{kx, ky, kz}Is stored.
When the conversion in the Z direction ends, the one-dimensional conversion target data f_jIs completed, and the superposition coefficient c_kIs obtained. c_kAre arranged in the Y direction from the origin, and when the number of NY is changed in the Y direction, the X coordinate is incremented by 1. Next, when NX * NY pieces of data are arranged on the XY plane, the Z coordinate is incremented by 1. They are arranged in order (Fig. 3). In the present embodiment, the output data c is checked by collating this data arrangement with the data division format shown in FIG._kIt can also be seen that it is a cyclic division. Even in the third three-dimensional work array, the final Fourier transform result data c_{kx, ky, kz}Has a corresponding sequence. The library has this data c_{kx, ky, kz}Are rearranged in the order of the one-dimensional coordinate k, stored in the one-dimensional array F (step 61), and the process returns (step 62). In this library, it is not necessary to change the data division format after conversion, which is necessary in the conventional method, and it is possible to obtain parallelization efficiency that exceeds the conventional method by reducing communication, thereby reducing the Fourier transform time.
Note that NX, NY, and NZ are determined by dividing the data in a plane perpendicular to the Z direction by conversion in the Y direction and X direction, assuming that the number of processors is p, so that NZ ≧ p must be satisfied. Further, since conversion in the Z direction divides data on a plane perpendicular to the Y direction, it is necessary to satisfy NY ≧ p.
Further, it is assumed that each processor 29 of the parallel computer 28 includes a vector computing unit (not shown). In such a parallel computer, in order to efficiently use this vector computing unit, the length of the loop to be vectorized (that is, the number of elements of the data group (vector data) subjected to the same computation, It is known that it is necessary to take as long as possible. This vector computing unit is used when equations 5 to 7 are calculated by this algorithm. The loop to be vectorized is a calculation for performing the same operation on a plurality of data in the direction of the coordinate axis that is not used for Fourier transformation or parallelization. In the transformation in the Y direction, the X direction, and the Z direction, Calculations are performed in the X direction, the Y direction, and the X direction, respectively. Therefore, in order to extract the performance of the vector computing unit, it is desirable to determine NX, NY, and NZ so that NX and NY are as large as possible while satisfying the two conditions of NY ≧ p and NZ ≧ p.
It is assumed that the parallel computer 28 has a vector computing unit, but this computing unit may be capable of executing a part of all computations necessary for Fourier transform in a pipeline manner. Furthermore, even if the parallel computer 28 is a parallel computer that does not have a vector calculator, increasing the loop length is often effective for speeding up. In the above description of the operation, it is assumed that the parallel computer 28 does not have a plurality of vector registers between the memory 29 and an arithmetic unit (not shown), and the array used for the conversion in each direction is from the memory 29. The array read out directly to the computing unit or generated by the conversion has been described as if it was directly written into the memory 29 from the computing unit. However, in a parallel computer having a plurality of vector registers, operations on the array on the memory 29 or storage of the array obtained as a result of the operation in the memory 29 may be executed via these registers. It is clear to the contractor.
In this embodiment, this library does not have a program for executing the inverse Fourier transform. As will be described later, when the simulation program requires inverse Fourier transform, the simulation program generates complex conjugate data of the data to be subjected to inverse Fourier transform, and performs Fourier transform on this data in this library. To request. The simulation program generates a complex conjugate of the Fourier transform data obtained for this complex conjugate data.
However, this library can also have a Fourier transform function. That is, when an argument specifying Fourier transform or inverse Fourier transform is added as an argument of this library and the simulation program requests the inverse Fourier transform, the library obtains the complex conjugate of the data to be transformed, and this is the Fourier transform. Is executed as described above, a complex conjugate of the obtained result data is obtained, and it is returned to the simulation program as inverse Fourier transform result data.
(5) Simulation program
A parallel program for weather calculation is shown in FIG. 11 as an example of a simulation program used in the present embodiment. Although the weather calculation is originally a three-dimensional calculation, at present, it is often performed in a two-dimensional manner due to the limitation of calculation capability. Therefore, in the present embodiment, the simulation program will be described by taking as an example the case of a two-dimensional weather prediction target region (this is a calculation target region). The user divides the entire calculation target area into a plurality of partial calculation areas in advance, and assigns each of the calculation target areas to any one processor. Further, parameters such as sizes N1 and N2 of each partial calculation region and NX, NY, and NZ used in Fourier transform are designated. When a user uses this program on the parallel computer 28, first, the workstation 1 communicates with a specific processor (for example, processor 0) in the parallel computer 28, and this program, the user-specified information, and the air A material constant used for calculation such as thermal conductivity, and initial value data such as temperature, wind speed, and pressure obtained by observation in the entire calculation target area are connected via the specific processor 0 to the external storage device 31. To remember. Thereafter, the specific processor 0 loads the program on each processor and starts the program on all the processors. This program is executed in parallel in the same manner on all the processors in the parallel computer 28.
When this program is started, it first calculates the parameters N1, N2, Nx, NY, NZ used in Fourier transform, material constants such as thermal conductivity of air, temperature and wind speed obtained by observation. Input initial value data such as pressure from the external storage device 31 (step 32). The program assumes that the processor into which it is loaded is programmed to determine which partial computation region to perform computations. In this step, each processor inputs the above parameters to be used without depending on the processor, and among the initial value data for all the calculation target areas stored in the external storage device 31, the partial calculation assigned to the processor. Initial value data relating to the area is selected and input from the external storage device 31.
Note that, as described in the section “Application to Multidimensional Fourier Transform” in (3) above, in the two-dimensional fast Fourier transform according to the method of the present invention, input data is cyclically divided in the first coordinate direction. Need to be. That is, a physical quantity A defined on a mesh of size N1 × N2_{j1, j2}(However, j₁= 0, 1, ..., N1-1, j₂= 0, 1,..., N2-1), element A_{m * NPU + i, j2}(M = 0,1, ..., N1 / NPU-1, i = 0,1, ..., NPU-1, j₂= 0, 1,..., N2-1) must be assigned to the i-th processor. Therefore, the simulation program according to the present embodiment also uses a partial calculation area obtained by cyclic division of the first coordinate direction in this manner in accordance with the input format in the two-dimensional fast Fourier transform.
Thereafter, preprocessing necessary for the calculation is performed (step 33). Here, the preprocessing is to interpolate data such as temperature, wind speed, and pressure obtained by observation to obtain data such as temperature, wind speed, and pressure at mesh points necessary for calculation.
After these processes are completed, the amount of temperature, wind speed, pressure, etc. at each time step is obtained in order by a repetitive loop described below. There are three basic equations: the equation of motion for the wind speed, the equation for mass conservation, and the equation for temperature change.
[Expression 12]
du / dt = −2Ω × u− (1 / ρ) ∇p + F_u ,. . . (12)
[Formula 13]
dρ / dt = −ρ∇ · u. . . (13)
[Expression 14]
dT / dt = −κ∇²T + u · ∇T. . . (14)
Here, u represents wind speed, p represents pressure, T represents temperature, Ω represents the force caused by the rotation of the earth called Coriolis force, F_uIs the other external force, ρ is the density of the air, and κ is the thermal conductivity of the air. In order to obtain the data value at the next time from these equations, first, the temperature T, the pressure p and the wind speed u on the lattice point are converted into data in the wave number space by Fourier transformation. For this purpose, the two-dimensional fast Fourier transform library FFT2D is sequentially called for the data for each physical quantity (step 34). When the library FFT2D is called, the arguments already described are specified.
Each physical quantity data is differentiated in the wave number space (step 35). That is, Fourier transform coefficient data relating to each physical quantity given from the library FFT2D is differentiated in the wave number space, and the temperature gradient ∇T and the second derivative ∇ on the lattice point of the wave number space for the physical quantity.²Data related to differentiation such as T, pressure gradient ∇p, and wind speed divergence ｕ · u are obtained.
Inverse Fourier transform is performed on the data related to the above-mentioned differentiation for each physical quantity, and the temperature gradient ∇T and the second-order derivative ∇ on the lattice points in the real space.²Data relating to differentiation such as T, pressure gradient ∇p, and wind speed divergence ∇ · u are obtained (step 36). For the inverse Fourier transform, the equations 8 and 9 already described are used. That is, complex conjugate data of the differentiated data for each physical quantity is generated, and the library FFT2D is called to request Fourier transform for the complex conjugate data. Further, complex conjugate data of the obtained Fourier transform result data is generated, and the generated complex conjugate Fourier transform result data is used as inverse Fourier transform result data.
After that, the data related to the differentiation obtained by the inverse Fourier transform is substituted into the right side of the expression 12-14, and the time differentiation regarding each of the wind speed u, the air density ρ, and the temperature T is determined, and those obtained are obtained. Is used to obtain temperature, wind speed, and pressure at the next time step (step 37).
In steps 34, 35, and 36, the data of the lattice points in the real space is converted into data in the wave number space by Fourier transformation, and the differentiation related data is obtained there, and the obtained differentiation related data is subjected to inverse Fourier transform. The reason why the differentiation-related data regarding the real space is obtained is that the derivative can be calculated with higher accuracy. In this simulation program, the two-dimensional Fourier transform library FFT2D is used in this calculation part.
In the above loop, it is determined at each time step whether or not the calculation up to the required time has been completed (step 38). When the calculation is completed, post-processing is performed (step 39) and output as prediction result data (step 40). ). In the post-processing, when a mesh point to be calculated and a point where the prediction result data is necessary are misaligned, processing such as calculating a prediction value at a necessary point by interpolating the calculation result data is performed. In the output process 40, each processor writes back the generated data to the external storage device 31, and the specific processor transfers this data to the workstation 1 as one result data when the execution of the simulation program is completed.
In the above, it is assumed that the simulation program is executed in parallel in the same manner on all the processors in the parallel computer 28. Further assume that it is programmed to determine which partial computational domain the processor into which it is loaded performs. However, it goes without saying that the program using the Fourier transform according to the present invention is not limited to such a program. In order to use the Fourier transform library FFT2D, it is necessary to specify the plurality of arguments required by each library, and the generation or acquisition of these arguments may be performed by other methods. For example, this program may be composed of a single processing part executed by a specific processor in the parallel computer 28 and a parallel processing part executed in parallel by all the processors. For example, when the program is started by any one of the processors, the single processing portion is executed when the processor is the specific one processor, and only the parallel processing portion is executed otherwise. The single processing part can be configured to determine a partial calculation area in charge of each processor and notify the result to another processor.
In the above example, the case where the weather prediction calculation is performed has been described as an example. However, it is obvious that the method of the present invention can be applied to other application examples when performing simulation using a fast Fourier transform on a parallel computer. It is. The same applies to the one-dimensional Fourier transform library FFT1D.
<Embodiment 2 of the Invention>
In the first embodiment, the Fourier transform library employs method 2 in FIG. However, in the present embodiment, the Fourier transform library employs method 3 in FIG. In this method 3, the transposition is performed after the conversion in the Y direction, and then the conversion in the X direction and the Z direction is performed.
In the method 2, the Fourier transform is specifically executed as follows.
(Step a ') Conversion in Y direction
The conversion in the Y direction is performed in the same manner as in Step 1 or a ′ described above for the conventional transposition algorithm. As already described, the data division of the Fourier transform target data is cyclic division.
(Step b ') Data transposition
To convert in the X direction according to Equation 6, the coordinate k in the three-dimensional wave number space_ySpecific value of, and the coordinates (j_x, J_z) Obtained as a result of the primary conversion for all values c)_{jx, ky, jz}is required. In the method 3, unlike the method 2, the primary conversion result data c ′ obtained by the conversion in the Y direction._{jx, ky, jz}Are divided by a plurality of planes perpendicular to the Y-axis. This means that, as shown in FIG._yThis means that the image is divided by a plurality of planes perpendicular to the axis, and each of the divided subspaces (the plurality of planes) is assigned to one processor. That is, the coordinate k to each processor_yAssign a specific value of. Specifically, k_y= 1 conversion result data c ′ of 0, 1, 2, 3,._{jx, ky, jz}Are sequentially assigned to the

processors

0, 1, 2,.
In accordance with this assignment, a process of transferring data necessary for executing the conversion in the X direction later between the processors is executed in step b 'here. Each processor has a coordinate k of the three-dimensional wave number space assigned to that processor._ySpecific value of, and the coordinates (j_x, J_z) And all primary conversion result data c ′ obtained for all values_{jx, ky, jz}The primary conversion result data is transferred among all the processors so that the conversion in the X direction can be performed using the. That is, each processor has coordinates (j_x, J_z) And the coordinate k in the three-dimensional wave number space_yPrimary conversion result data c ′ obtained for a pair with a specific value of_{jx, ky, jz}(However, j_x= 0 to (NX-1), j_z= 0 to (NZ-1), k_y(= Specific value), the secondary conversion result data is transferred among all the processors so that the data not generated by the own processor is received from other processors.
(Step c ') Conversion in the X direction
Next, conversion in the X direction is performed. Each processor does not communicate with other processors according to Equation 6, but the coordinate k of the three-dimensional wave number space assigned to that processor._yA specific value of and the coordinate k of the three-dimensional wave number space_xAll the values of, and the coordinate j in the three-dimensional real space_zNX * NZ secondary transformation result data (c ″) related to all values of_{kx, ky, jz}) (K_x= 0 to NX-1, j_z= 0 to NZ-1).
(Step d ') Conversion in the Z direction
Each processor has a coordinate k of the three-dimensional wave number space assigned to that processor._yA specific value of and the other two coordinates k in the three-dimensional wavenumber space_x, K_zFinal Fourier transform result data c having all values of_{kx, ky, kz}(However, k_x= 0 to NX-1, k_y= Specific value, k_z= 0 to NZ-1) is calculated. Each processor can perform this computation in parallel with each other.
As a result, the coordinate k_y= Fourier transform coefficient c corresponding to 0, 1, 2, 3,.₀, C₁, C₂,, Are sequentially generated by the

processors

0, 1, 2,._{kx, ky, kz}Is divided cyclically among the processors.
This method is more advantageous than the method 2 used in the first embodiment in terms of the capacity of the array BT that stores the complex exponential function table. Actually, according to Equation 6, the value of the complex exponential function in the transformation in the X direction depends only on the index in the X direction and the Y direction, and does not depend on the index in the Z direction. Therefore, when the division perpendicular to the Z-axis is employed in the X-direction conversion as in the first embodiment, each processor has the same table. On the other hand, in this method, since the division is performed on a plane perpendicular to the Y axis, each processor has only a part of the table necessary for its calculation, and there is no overlap. Thereby, in this system, the size of the table required for the conversion in the X direction can be reduced to 1 / (number of processors).
In the present embodiment, loops to be vectorized become X direction, Z direction, and X direction in Y direction, X direction, and Z direction conversion, respectively. NX and NZ should be as large as possible while satisfying the two conditions of ≧ p and NZ ≧ p.
<Third Embodiment of the Invention>
The target parallel computer in the present embodiment is substantially the same as the parallel computer system of FIG. 1 described in the first embodiment, but each processor has the same vector computing unit and is in an external storage device. 32 has a database relating to the performance of the vector computing unit. The calculation performance of the vector calculator is, for example, the number of calculations that can be executed per unit time. In the vector computing unit performance database, vector computing unit performance data is stored in the form of a loop length L function (g (L)).
In the first embodiment, the parameters NX, NY, and NZ for the Fourier transform are determined as inputs to the program. However, this is optimal according to the characteristics of the vector computing units constituting each processor of the parallel computer 28. Therefore, more efficient calculation is possible. In general, the calculation performance of the vector calculator is a function g (L) of the loop length L. g (L) is usually a function that increases monotonously with L. Now, considering the amount of computation in the step of calculating the transformation in the Y direction by the Fourier transform method of the first embodiment, the amount of computation for performing the NY-order Fourier transform once is NYlogNY, and this is represented by NX * NZ sets. Since the calculation is performed, the total calculation amount is NX * NY * NZlogNY = NlogNY. Similarly, the calculation amounts in the NX direction and the NZ direction are NlogNX and NlogNZ, respectively. On the other hand, since the vectorization loop length in each calculation is NX, NY, NX as described in the first embodiment, the calculation performance of the vector calculator is g (NX), g (NY), g, respectively. (NX). The calculation time t is obtained by dividing the calculation amount by the calculation performance.
t = NlogNY / g (NX) + NlogNX / g (NY) + NlogNZ / g (NX)
It becomes. Therefore, when the number of processors is p, by determining NX, NY, and NZ so as to minimize t under the conditions of NX ≧ p and NZ ≧ p, the performance of the vector calculator can be maximized. Fast Fourier transform can be realized.
FIG. 12 shows a flowchart of the library in the present embodiment. The processing is the same as that of the first embodiment (FIG. 10) except for the determination part (step 43) of NX, NY, and NZ. In step 43, NX, NY, and NZ are determined using the vector computing unit performance database so as to minimize the computation time t under the conditions of NX ≧ p and NZ ≧ p. The subsequent processing is the same as in the first embodiment. In the call statement to this library, the simulation program does not need to specify these parameters NX, NY, and NZ. According to the system of the present embodiment, the user can maximize the performance of the parallel computer incorporating the vector computing unit without calculating NX, NY, and NZ by himself / herself.
When N and NPU are general integers, it is necessary to change the division format of input data into the processor by changing NX, NY, and NZ, but it is most often used in Fourier transform. When N and NPU are both powers of 2, even if NX, NY, and NZ are changed, it may not be necessary to change the division format.
Actually, when the two sets (NX, NY, NZ) = (NX1, NY1, NZ1), (NX2, NY2, NZ2) both satisfy the two conditions of NY ≧ NPU and NZ ≧ NPU, the input data f_jAre arranged in a rectangular parallelepiped shape in the order shown in FIG. 2 and sliced along a plane perpendicular to the Z-axis, and the

respective processors

0, 1,. . . , NPU-1 is assigned. Then, when (NX, NY, NZ) = (NX1, NY1, NZ1), f_jThe surface to which the_j, NZ1) +1, and the number of the processor responsible for this aspect is
MOD (MOD (f_j, NZ1), NPU)
It is. On the other hand, when (NX, NY, NZ) = (NX2, NY2, NZ2), f_jThe processor number responsible for
MOD (MOD (f_j, NZ2), NPU)
It becomes. However, since NZ1 ≧ NPU, NZ2 ≧ NPU and NZ1, NZ2, and NPU are all powers of 2, NZ1 and NZ2 are both multiples of NPU.

That is, f_jThe number of the processor in charge of is the same in both cases.
From the above consideration, as long as N and NPU are both powers of 2 and two conditions of NY ≧ NPU and NZ ≧ NPU are satisfied, even if NX, NY, and NZ are changed, the input data f_jIt can be seen that there is no need to change the division format of the processor.
By using this fact and optimizing NX, NY, and NZ within a range that does not require changing the division format, there is no new communication overhead associated with the change of the division format. The processing speed, specifically, the Fourier transform speed can be improved.
<Embodiment 4 of the Invention>
As another example of performing simulation using the fast Fourier transform according to the present invention, electronic structure calculation in a semiconductor device or the like will be described. Electronic structure calculations are used to design semiconductor devices, particularly to determine device structures, using the results.
In the electronic structure calculation, an electron wave function u (r) defined by a two-dimensional or three-dimensional mesh is expressed by the following Schrodinger equation.
[Expression 15]

Thus, the size of the band gap that determines the properties of the semiconductor and the structural stability of the crystal are obtained. In the above equation, h is the Planck constant, m is the mass of the electron, E is the energy level of the wave function of interest, and V is the potential energy due to atoms and other electrons in the crystal.
In the calculation of Equation 15, the second derivative of the wave function u (r)²Although u (r) is necessary, for the same reason as described in the example of weather calculation, this part is calculated after u (r) is moved to the wave number space by Fourier transform, and the result is obtained by inverse Fourier transform. Return to real space again. Therefore, when electronic structure calculation is performed on a parallel computer, the fast Fourier transform method of the present invention can be applied to this portion.
<Modification>
The present invention is not limited to the above-described embodiment, and can be realized by various changes or modifications other than the changes or modifications exemplified below.
(1) It goes without saying that the Fourier transform method according to the present invention can be used not only for simulation but also for other purposes. For example, it can be used for analysis of transmitted signals or waves such as seismic waves, and the results of the analysis can be used to design devices related to signal transmission, such as transmission devices or transmission lines, or using earthquakes It can also be used for applications such as resource development.
(2) In the above embodiments, the Fourier transform transform is executed by the Fourier transform library prepared for that purpose. However, it is obvious that the present invention may incorporate a program for executing the Fourier transform procedure in an application program that uses the Fourier transform. Such a simulation program. The program can be stored and sold in a program recording medium such as a magnetic storage device.
(3) The present invention provides Fourier transform target data f_jIt can also be applied when is real data. In that case, the calculation of the imaginary part can be omitted in the calculation of the coefficient during the conversion in the Y direction or the like.
As is clear from the above, according to the present invention, the Fourier transform can be executed at a higher speed than before by using a parallel computer. For example, the results of evaluating the effects of the present invention using the parallel computer SR2201 developed by the present applicant are as follows. When performing a three-dimensional Fourier transform, in the conventional method, it takes about 0.26 seconds to convert 256 × 256 × 256 size data by using 256 processors. The breakdown takes 0.14 seconds for calculation, 0.06 seconds for transposition of data in the middle, and 0.06 seconds for rearrangement of the last data. According to the method described in the first or second embodiment, the calculation and transposition time is the same as that of the conventional method, and rearrangement of the last data can be omitted, so that the Fourier transform can be performed in 0.20 seconds. About 24% speedup can be achieved. In particular, when the meteorological calculation described in the first embodiment is performed using a three-dimensional Fourier transform, about 50% of the total calculation time is occupied by the Fourier transform in the meteorological calculation, so that a speed increase of about 12% can be obtained. . In addition, when the electronic structure calculation described in the fourth embodiment is performed using the three-dimensional Fourier transform, about 30% of the total execution time is usually occupied by the Fourier transform, so that a speed increase of about 7% can be achieved.
【The invention's effect】
As described above, according to the present invention, Fourier transformation on a parallel computer can be executed at high speed.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram of a parallel computer used in a first embodiment of the present invention.
FIG. 2 is a diagram for explaining mapping of one-dimensional conversion target data to three-dimensional data used in the first embodiment of the present invention.
FIG. 3 is a diagram illustrating mapping of one-dimensional conversion result data to three-dimensional data used in the first embodiment of the present invention.
FIG. 4 is a diagram showing a first method for assigning data to a processor used in the first embodiment of the present invention;
FIG. 5 is a diagram showing another method for allocating data to a processor used in the prior art.
FIG. 6 is a diagram for explaining an inter-processor data division format for one-dimensional conversion target data used in the first embodiment of the present invention.
FIG. 7 is a diagram for explaining an inter-processor data division format of one-dimensional conversion result data according to a conventional technique.
FIG. 8 is a diagram showing a second method for allocating data to a processor used in the first embodiment of the present invention;
FIG. 9 is a diagram showing a plurality of Fourier transform transform procedures that were compared and studied before reaching the present invention.
FIG. 10 is a flowchart of a Fourier transform library used in the first embodiment of the present invention.
FIG. 11 is a flowchart of a simulation program used in the first embodiment of the present invention.
FIG. 12 is a flowchart of a Fourier transform library used in the third embodiment of the present invention.

Claims

それぞれに専有のメモリを備えた複数のプロセッサを有する分散メモリ型並列計算機で実行するためのフーリエ変換方法であって、
一次元の変換対象データf₀,f₁,・・・,f_N-1を、Ｘ軸、Ｙ軸、Ｚ軸を持つ３次元データ空間の直方体状の格子点上に、座標がＺ方向、Ｘ方向、Ｙ方向の順に変化するように順次に並べ、並べることで得られる３次元配列データについて同じＺ座標をもつ全てのデータは同一のプロセッサに分配されるようにＺ軸と垂直な複数平面で分割し、分割された３次元配列データを各々上記複数のプロセッサのいずれかに割り当て、
上記３次元配列データのＹ方向に関する第１のフーリエ変換処理、及びＸ方向に関する第２のフーリエ変換処理を、上記データ割り当てをされた各プロセッサで順次行い、
上記各プロセッサの前記第２のフーリエ変換の結果である３次元波数空間上の配列データ（ｃ’’kx, ky, jz）を、同じｋｙ座標をもつ全てのデータは同一のプロセッサに分配されるようにｋｙ軸と垂直な複数平面で分割した分割状態に、上記複数のプロセッサ間で割り当て直すために、上記各プロセッサ間で上記配列データの転置を行い、
上記各プロセッサ間のデータの転置が行われた配列データ（ｃ’’kx, ky, jz）のＺ方向に関する第３のフーリエ変換処理を、上記転置に伴い割り当て直された各プロセッサで行い、
もって上記変換対象データf₀,f₁,・・・,f_N-1のフーリエ変換結果C₀,C₁,・・・,C _N-1を上記複数のプロセッサ間でサイクリック分割された状態で得ることを特徴とするフーリエ変換方法。A Fourier transform method for execution on a distributed memory parallel computer having a plurality of processors each having its own memory,
The one-dimensional conversion target data f ₀ , f ₁ ,..., F _N−1 are coordinated in the Z direction on a rectangular grid point in a three-dimensional data space having an X axis, a Y axis, and a Z axis. A plurality of planes perpendicular to the Z axis so that all data having the same Z coordinate is distributed to the same processor with respect to the three-dimensional array data obtained by arranging and arranging sequentially so as to change in the order of X direction and Y direction. And dividing the divided three-dimensional array data into one of the plurality of processors,
A first Fourier transform process related to the Y direction of the three-dimensional array data and a second Fourier transform process related to the X direction are sequentially performed by the processors assigned with the data,
The array data (c ″ kx, ky, jz) in the three-dimensional wave number space, which is the result of the second Fourier transform of each processor, is all distributed to the same processor with the same ky coordinates. In order to reassign between the plurality of processors to the divided state divided by a plurality of planes perpendicular to the ky axis, the array data is transposed between the processors,
A third Fourier transform process in the Z direction of the array data (c ″ kx, ky, jz) in which the transposition of data between the processors is performed is performed in each processor reassigned with the transposition,
Therefore, the Fourier transform results C ₀ , C ₁ ,..., C _N-1 of the conversion target data f ₀ , f ₁ ,..., F _N-1 are cyclically divided among the plurality of processors. A Fourier transform method, characterized in that

それぞれに専有のメモリを備えた複数のプロセッサを有する分散メモリ型並列計算機によりフーリエ変換を実行するためのプログラムを記憶した、上記計算機により読み取り可能なプログラム記録媒体であって、
上記プログラムは、
一次元の変換対象データf₀,f₁,・・・,f_N-1を、Ｘ軸、Ｙ軸、Ｚ軸を持つ３次元データ空間の直方体状の格子点上に、座標がＺ方向、Ｘ方向、Ｙ方向の順に変化するように順次に並べ、並べたことで得られる３次元配列データについて同じＺ座標をもつ全てのデータは同一のプロセッサに分配されるようにＺ軸と垂直な複数平面で分割し、分割された３次元配列データを各々上記複数のプロセッサのいずれかに割り当てるステップ、
上記３次元配列データのＹ方向に関する第１のフーリエ変換処理、及びＸ方向に関する第２のフーリエ変換処理を、上記データ割り当てをされた各プロセッサで順次行うステップ、
上記各プロセッサの前記第２のフーリエ変換の結果である３次元波数空間上の配列データ（ｃ’’kx, ky, jz）を、同じｋｙ座標もつ全てのデータは同一のプロセッサに分配されるようにｋｙ軸と垂直な複数平面で分割した分割状態に、上記複数のプロセッサ間で割り当て直すために、上記各プロセッサ間で上記配列データの転置を行うステップ、及び
上記各プロセッサ間のデータの転置が行われた配列データ（ｃ’’kx, ky, jz）のＺ方向に関する第３のフーリエ変換処理を、上記転置に伴い割り当て直された各プロセッサで行い、もって上記変換対象データf₀,f₁,・・・,f_N-1のフーリエ変換結果C₀,C₁,・・・,C _N-1を上記複数のプロセッサ間でサイクリック分割された状態で得るステップを有することを特徴とするプログラム記録媒体。A program recording medium readable by the computer, storing a program for performing Fourier transform by a distributed memory parallel computer having a plurality of processors each having its own memory,
The above program
The one-dimensional conversion target data f ₀ , f ₁ ,..., F _N−1 are coordinated in the Z direction on a rectangular grid point in a three-dimensional data space having an X axis, a Y axis, and a Z axis. A plurality of pieces perpendicular to the Z axis so that all data having the same Z coordinate is distributed to the same processor for the three-dimensional array data obtained by arranging and arranging in order so as to change in the order of X direction and Y direction. Dividing on a plane and assigning the divided three-dimensional array data to any of the plurality of processors,
A step of sequentially performing a first Fourier transform process on the Y direction of the three-dimensional array data and a second Fourier transform process on the X direction in each processor assigned with the data;
The array data (c ″ kx, ky, jz) in the three-dimensional wave number space, which is the result of the second Fourier transform of each processor, is all distributed to the same processor with the same ky coordinates. And the transposition of the array data between the processors in order to reassign between the plurality of processors in the divided state divided by a plurality of planes perpendicular to the ky axis, and the transposition of the data between the processors. The third Fourier transform processing relating to the Z direction of the array data (c ″ kx, ky, jz) performed is performed by each of the processors reassigned along with the transposition, and thus the data to be converted f ₀ , f ₁ ,..., F _N-1 Fourier transform results C ₀ , C ₁ ,..., C _N-1 are obtained in a state of being cyclically divided among the plurality of processors. Program recording medium.