JP3784990B2

JP3784990B2 - Configuration method of systolic array processor

Info

Publication number: JP3784990B2
Application number: JP10358099A
Authority: JP
Inventors: 孝浩浅井; 正松本
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 1999-04-12
Filing date: 1999-04-12
Publication date: 2006-06-14
Anticipated expiration: 2019-04-12
Also published as: JP2000293510A

Description

[0001]
[Technical field of the invention]
The present invention relates to a method of configuring a systolic array processor that performs the processing of the recursive least squares algorithm (hereinafter abbreviated as the RLS algorithm).
[0002]
[Prior art]
FIG. 1 shows a basic model of an adaptive filter. An adaptive filter is a variable-coefficient filter 11 with input signal $u(n)$ and output signal $y(n)$; a coefficient update algorithm 12 updates the filter coefficients so that the output signal $y(n)$ approaches the reference signal $d(n)$. Because each signal is sampled at every symbol timing, the sampled signals are written with a timing index $n$. Consider the transversal filter shown in FIG. 2 as the adaptive filter. Delay elements D1, D2, ..., DN-1, each with a delay of one timing period, are connected in series, and the input signal $u(n)$ is fed to the delay element D1 at one end. The input signal $u(n)$ and the outputs $u(n-1), \ldots, u(n-N+1)$ of the delay elements D1, ..., DN-1 are multiplied by the tap coefficients $w_0(n), w_1(n), \ldots, w_{N-1}(n)$ in multipliers M0, M1, ..., MN-1, respectively, and the products are summed by an adder 13 to give the output signal $y(n)$. With the tap coefficients of the filter denoted $w_k$, the relationship between input and output is expressed by the following equation.
[0003]
$$y(n) = \sum_{k=0}^{N-1} w_k(n)\, u(n-k) \tag{1}$$
Here, assuming that the number of taps is N, vectors of input signals and tap coefficients are respectively defined by the following equations.
$$\mathbf{u}(n) = \left(u(n),\, u(n-1),\, \ldots,\, u(n-N+1)\right)^t \tag{2}$$
$$\mathbf{w}(n) = \left(w_0(n),\, w_1(n),\, \ldots,\, w_{N-1}(n)\right)^t \tag{3}$$
The superscript t denotes transposition. The input/output relationship of the filter is then expressed by the following equation.
[0004]
$$y(n) = \mathbf{w}^t(n)\,\mathbf{u}(n) \tag{4}$$
The error signal e(n) is expressed by the following equation.
$$e(n) = d(n) - \mathbf{w}^t(n)\,\mathbf{u}(n) \tag{5}$$
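Equations (4) and (5) can be sketched in a few lines of Python. This is only an illustration: the function names and the numeric values are made up for the example and do not come from the patent.

```python
def transversal_output(w, u_hist):
    """y(n) = sum_k w_k(n) * u(n-k), i.e. equations (1) and (4).

    w      -- tap coefficients w_0 .. w_{N-1}
    u_hist -- input samples u(n), u(n-1), ..., u(n-N+1)
    """
    return sum(wk * uk for wk, uk in zip(w, u_hist))


def error_signal(d, w, u_hist):
    """e(n) = d(n) - w^t(n) u(n), i.e. equation (5)."""
    return d - transversal_output(w, u_hist)


# Hypothetical 3-tap example (values chosen only for illustration).
w = [0.5, -0.25, 0.125]
u_hist = [1.0, 2.0, 4.0]
y = transversal_output(w, u_hist)   # 0.5 - 0.5 + 0.5
e = error_signal(1.0, w, u_hist)    # reference sample d(n) = 1.0
```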
The weighted mean square of this error signal is defined as the evaluation function $J(\mathbf{w})$.
$$J(\mathbf{w}) = \sum_{m=0}^{n} \lambda^{n-m}\, |e(m)|^2 \tag{6}$$
Here $\lambda$ is the forgetting factor. The adaptive filter updates each tap coefficient of the filter so as to minimize the evaluation function $J(\mathbf{w})$. The solution $\mathbf{w}(n)$ that minimizes $J(\mathbf{w})$ satisfies
$$R_{xx}(n)\,\mathbf{w}(n) = P(n) \tag{7}$$
where $R_{xx}(n)$ is the autocorrelation matrix of the input signal and $P(n)$ is the cross-correlation matrix of the input signal and the reference signal. Equation (7) is called the normal equation. If $R_{xx}(n)$ is nonsingular, equation (7) gives the optimal coefficient vector $\mathbf{w}(n)$ as follows.
[0005]
$$\mathbf{w}(n) = R_{xx}^{-1}(n)\, P(n) \tag{8}$$
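For a two-tap, real-valued case, equation (8) can be evaluated directly. The sketch below assumes the usual exponentially weighted definitions, $R_{xx}(n) = \sum_m \lambda^{n-m}\mathbf{u}(m)\mathbf{u}^t(m)$ and $P(n) = \sum_m \lambda^{n-m}\mathbf{u}(m)d(m)$, which the text above does not spell out; the function name and the data are hypothetical.

```python
def solve_normal_equations_2tap(u_vectors, d_values, lam):
    """Solve R_xx(n) w(n) = P(n) for two real taps (equations (7) and (8)).

    Assumes exponentially weighted correlations with forgetting factor lam.
    """
    n = len(u_vectors) - 1
    R = [[0.0, 0.0], [0.0, 0.0]]
    P = [0.0, 0.0]
    for m, (u, d) in enumerate(zip(u_vectors, d_values)):
        weight = lam ** (n - m)
        for i in range(2):
            P[i] += weight * u[i] * d
            for j in range(2):
                R[i][j] += weight * u[i] * u[j]
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    if det == 0:
        raise ValueError("R_xx is singular, equation (8) does not apply")
    # Explicit 2x2 inverse applied to P.
    w0 = (R[1][1] * P[0] - R[0][1] * P[1]) / det
    w1 = (R[0][0] * P[1] - R[1][0] * P[0]) / det
    return [w0, w1]


# Made-up data generated by d = 2*u0 - 1*u1, so the exact solution is [2, -1].
w_opt = solve_normal_equations_2tap(
    [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)], [2.0, -1.0, 1.0], lam=0.5
)
```

Even in this tiny case the matrix has to be rebuilt and inverted for every new sample, which is the cost the recursive formulation avoids.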
Updating the filter tap coefficients according to equation (8) is optimal in the sense that it minimizes the evaluation function $J(\mathbf{w})$, but it requires a large amount of computation because $R_{xx}(n)$ must be inverted. Several algorithms that obtain the tap coefficients with less computation are therefore known, and among them the RLS algorithm converges particularly quickly. The RLS algorithm obtains the tap coefficients recursively, so the matrix inversion of equation (8) is unnecessary and the amount of computation is reduced. Specifically, the RLS algorithm performs the following operations.
[0006]
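$$\mathbf{X}(n) = \frac{1}{\lambda}\, R^{-1}(n-1)\, \mathbf{u}(n)$$
$$\mathbf{K}(n) = \frac{\mathbf{X}(n)}{1 + \mathbf{u}^H(n)\, \mathbf{X}(n)}$$
$$\alpha(n) = y(n) - \mathbf{w}^H(n-1)\, \mathbf{u}(n)$$
$$\mathbf{w}(n) = \mathbf{w}(n-1) + \mathbf{K}(n)\, \alpha^*(n)$$
$$R^{-1}(n) = \frac{1}{\lambda}\, R^{-1}(n-1) - \mathbf{K}(n)\, \mathbf{X}^H(n) \tag{9}$$
Here, the superscript * denotes the complex conjugate and H denotes the conjugate transpose.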
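The recursion of equation (9) can be sketched in plain Python as follows. This is an illustrative sketch rather than the patent's implementation: the name `rls_step`, the list-based matrices, the initial value of $R^{-1}$ and the test data are assumptions, and the sketch forms the error term against the reference sample, assuming that is what the $y(n)$ in the third line of equation (9) stands for.

```python
def rls_step(w, r_inv, u, d, lam):
    """One iteration of equation (9) for N taps (real or complex values)."""
    n_taps = len(u)
    # X(n) = (1/lam) R^-1(n-1) u(n)
    x_vec = [sum(r_inv[i][j] * u[j] for j in range(n_taps)) / lam
             for i in range(n_taps)]
    # K(n) = X(n) / (1 + u^H(n) X(n))
    denom = 1 + sum(u[i].conjugate() * x_vec[i] for i in range(n_taps))
    k_vec = [x / denom for x in x_vec]
    # alpha(n) = d(n) - w^H(n-1) u(n)   (assumption: d(n) in place of y(n))
    alpha = d - sum(w[i].conjugate() * u[i] for i in range(n_taps))
    # w(n) = w(n-1) + K(n) alpha*(n)
    w_new = [w[i] + k_vec[i] * alpha.conjugate() for i in range(n_taps)]
    # R^-1(n) = (1/lam) R^-1(n-1) - K(n) X^H(n)
    r_inv_new = [[r_inv[i][j] / lam - k_vec[i] * x_vec[j].conjugate()
                  for j in range(n_taps)] for i in range(n_taps)]
    return w_new, r_inv_new


# Made-up identification problem: d = 2*u0 - 1*u1.
w = [0.0, 0.0]
r_inv = [[1e6, 0.0], [0.0, 1e6]]   # large initial R^-1, a common heuristic
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
           ((1.0, 1.0), 1.0), ((2.0, -1.0), 5.0)]
for u, d in samples:
    w, r_inv = rls_step(w, r_inv, u, d, lam=1.0)
```

No matrix is inverted inside the loop; each step costs on the order of $N^2$ multiplications.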
[0007]
Even with the RLS algorithm, the amount of computation grows as the number of taps of the adaptive filter increases, and real-time processing becomes difficult. Systolic array processors that can carry out the RLS processing in a pipelined, parallel manner are therefore known. A systolic array processor arranges cells with identical functions in a regular pattern and performs the RLS calculations in parallel as a pipeline. In principle, pipelined parallel processing is made possible by using the QR decomposition of a matrix (decomposing a matrix into an orthogonal matrix and an upper triangular matrix in order to obtain the eigenvalues of the given matrix efficiently) and the Givens rotation (converting a given matrix into an upper triangular matrix by means of an orthogonal matrix derived from the values of that matrix). In the systolic array processor, the RLS calculation is carried out by repeatedly performing a simple calculation in each cell and passing the result to the adjacent cells. The configuration of a conventional systolic array processor is shown in FIG. 3A, where the number of filter taps is three. The systolic array processor has three types of cells, a boundary cell BC (FIG. 3B), an internal cell IC (FIG. 3C) and a final cell FC (FIG. 3D), and each cell performs the operation shown in FIG. 3. A delay unit delays the transfer of data by one timing. $u(n)$ denotes the input signal to each tap of the filter, and $d(n)$ denotes the reference signal.
[0008]
As many boundary cells as there are taps, BC1, BC2 and BC3, are connected in series through delay elements D, each with a delay of one timing period, and the output side of boundary cell BC3 is connected to the final cell FC through delay element D3. Internal cells IC11, IC12 and IC13 are connected in sequence to the other output side of boundary cell BC1, internal cells IC22 and IC23 are connected in sequence to the other output side of boundary cell BC2, and internal cell IC33 is connected to the other output side of boundary cell BC3. The other output sides of internal cells IC11, IC12 and IC13 are connected to boundary cell BC2 and internal cells IC22 and IC23, respectively; the other output sides of internal cells IC22 and IC23 are connected to boundary cell BC3 and internal cell IC33, respectively; and the other output side of internal cell IC33 is connected to the final cell FC. In other words, the cells BC, IC and FC are arranged in a triangular array.
[0009]
Each boundary cell BC receives an input $\delta_{in}$ from the preceding boundary cell through the delay element D and an input $u_{in}$ from the internal cell IC connected to that boundary cell. When $u_{in} = 0$ or $\delta_{in} = 0$, it performs the following calculation.
$$x = \beta^2 x,\quad s = 0,\quad z = u_{in},\quad \delta_{out} = \delta_{in} \tag{10}$$
When $u_{in} \neq 0$ and $\delta_{in} \neq 0$, it performs the following calculation.
[0010]
$$z = u_{in},\quad x' = \beta^2 x + \delta_{in}|z|^2,\quad c = \beta^2 x / x',\quad s = \delta_{in} z / x',\quad x = x',\quad \delta_{out} = c\,\delta_{in} \tag{11}$$
Here, $x$ is the value held by the boundary cell itself and is a positive real number, and $\beta$ is the square root of the forgetting factor $\lambda$ of the RLS algorithm. The values $s$ and $z$ calculated in the boundary cell BC are passed to the adjacent internal cell IC in the row direction, and $\delta_{out}$, calculated from $\delta_{in}$, is delayed by one timing through the delay element and passed to the next boundary cell. In equations (10) and (11), $x$, $\beta$, $\delta_{in}$, $\delta_{out}$ and $c$ are positive real numbers, and $s$, $z$ and $u_{in}$ are complex numbers.
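The conventional boundary cell of equations (10) and (11) might be written as a single function. The following is a floating-point sketch for illustration only; the function name, the argument order and the returned tuple layout are assumptions.

```python
def boundary_cell(x, u_in, delta_in, beta):
    """Conventional boundary cell, equations (10) and (11).

    x        -- value stored in the cell (positive real)
    u_in     -- complex input arriving from above
    delta_in -- real input from the previous boundary cell
    beta     -- square root of the forgetting factor
    Returns (x_new, s, z, delta_out).
    """
    z = u_in
    if u_in == 0 or delta_in == 0:
        # Equation (10): only apply the forgetting factor, no rotation.
        return beta**2 * x, 0.0, z, delta_in
    # Equation (11).
    x_prime = beta**2 * x + delta_in * abs(z)**2
    c = beta**2 * x / x_prime
    s = delta_in * z / x_prime
    return x_prime, s, z, c * delta_in


# Hypothetical inputs chosen so the results are exact.
rotated = boundary_cell(x=1.0, u_in=1j, delta_in=1.0, beta=1.0)
bypassed = boundary_cell(x=1.0, u_in=0, delta_in=1.0, beta=0.5)
```

Both `c` and `s` are obtained by dividing by `x_prime`, which is where the fixed-point difficulties discussed later originate.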
[0011]
Each internal cell performs the following calculation from the inputs $s$ and $z$ arriving in the row direction and the input $u_{in}$ arriving in the column direction.
$$u_{out} = u_{in} - z x,\quad x = x + s^* u_{out} \tag{12}$$
$u_{out}$ is output in the column direction, $s$ and $z$ are passed on unchanged in the row direction, and the value $x$ held by the cell is updated. Here the value of $x$ is a complex number.
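Equation (12) translates almost literally into code. The sketch below is illustrative; the function name and the example values are assumptions.

```python
def internal_cell(x, s, z, u_in):
    """Internal cell, equation (12).

    x    -- complex value stored in the cell
    s, z -- values arriving along the row from the boundary cell
    u_in -- value arriving down the column
    Returns (x_new, u_out); s and z are forwarded unchanged by the caller.
    """
    u_out = u_in - z * x
    x_new = x + complex(s).conjugate() * u_out
    return x_new, u_out


# Hypothetical values: u_out = 3 - 2*0.5 = 2, x_new = 0.5 + 0.5*2 = 1.5.
x_new, u_out = internal_cell(x=0.5, s=0.5, z=2.0, u_in=3.0)
```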
[0012]
Finally, the final cell FC derives the error signal $e(n)$ by multiplying $u_{in}$ from the internal cell IC33 connected to it by $\delta_{in}$, which is input from the immediately preceding boundary cell BC3 through delay element D3.
$$e(n) = u_{in}\, \delta_{in} \tag{13}$$
As shown in FIG. 3A, the input signals u(1), u(2), u(3), ... are input in sequence at each timing to the first-stage boundary cell BC1; 0, u(1), u(2), ... are input in sequence to internal cell IC11, delayed by one timing relative to boundary cell BC1; 0, 0, u(1), u(2), ... are input in sequence to internal cell IC12, delayed by two timings relative to boundary cell BC1; and the reference signals d(1), d(2), d(3), ... are input in sequence to internal cell IC13, delayed by three timings relative to boundary cell BC1.
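The skewed input schedule described above can be tabulated with a short helper. This is a sketch under assumptions: the function name is hypothetical, and padding the reference column and the tail of each column with zeros is a choice made here, since the text only states the delays.

```python
def skewed_inputs(u, d, n_taps):
    """Return one tuple per timing with the values entering the top row.

    Column k (k = 0 .. n_taps-1) carries the input sequence delayed by k
    timings; the last column carries the reference signal delayed by
    n_taps timings. Unused slots are filled with 0 (an assumption).
    """
    total = len(u) + n_taps
    columns = []
    for k in range(n_taps):
        columns.append([0] * k + list(u) + [0] * (n_taps - k))
    columns.append([0] * n_taps + list(d))
    return [tuple(col[t] for col in columns) for t in range(total)]


# Three taps as in FIG. 3A, with made-up sample values.
schedule = skewed_inputs(u=[1, 2, 3], d=[10, 20, 30], n_taps=3)
# schedule[0] feeds (BC1, IC11, IC12, IC13) at the first timing, and so on.
```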
[0013]
By repeating the process in which the boundary cells and internal cells perform the above calculations, pass the results to the adjacent cells, and calculate again in each cell, the RLS algorithm is processed as a whole, and the error signal and the tap coefficients are obtained.
When a systolic array processor that performs this RLS processing is designed with a programmable device such as an ASIC (Application Specific IC), it operates faster if the internal operations use fixed-point arithmetic rather than floating-point arithmetic. Fixed-point arithmetic, however, suffers from problems such as overflow and rounding error. Scaling (normalizing the input values to 1 in advance) avoids overflow caused by multiplication, but overflow caused by division and by addition and subtraction remains. Pseudo floating-point arithmetic, which determines the optimum bit shift for each operation and shifts accordingly, eliminates the overflow problem but lowers the operating speed. In terms of operating speed, fixed-point arithmetic is the fastest, followed by pseudo floating-point arithmetic, and floating-point arithmetic is the slowest.
[0014]
When fixed-point arithmetic is used in a systolic array processor, overflow may still occur because of division and of addition and subtraction, even if scaling is applied. In equations (11) and (12), overflow may occur because of the additions. Moreover, in equation (11), when $\beta^2 x \ll \delta_{in}|z|^2$ and $|z| \ll 1$, $x' \approx \delta_{in}|z|^2$ and hence $|s| = 1/|z|$, so the value of $s$ may become large and overflow. Once overflow occurs, the RLS algorithm is no longer processed correctly.
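A quick numerical check shows how large $s$ gets in that regime. The numbers below are invented for illustration, and the limit of $2^9$ assumes a signed word whose 10 integer bits include the sign bit, which is an assumption and not something the text states.

```python
def boundary_s(x, z, delta_in, beta):
    """The s output of equation (11), evaluated in floating point."""
    x_prime = beta**2 * x + delta_in * abs(z)**2
    return delta_in * z / x_prime


Q_LIMIT = 2**9   # hypothetical magnitude limit of the fixed-point word

z = 1e-3                                   # |z| << 1
s = boundary_s(x=1e-12, z=z, delta_in=1.0, beta=1.0)   # beta^2 x << delta_in |z|^2
ratio = abs(s) * abs(z)                    # close to 1, i.e. |s| is about 1/|z|
overflows = abs(s) >= Q_LIMIT              # s is about 1000, beyond the limit
```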
[0015]
[Problems to be solved by the invention]
An object of the present invention is to reduce the error caused by overflow and similar effects when a systolic array processor is designed with fixed-point arithmetic on a programmable device such as an ASIC (Application Specific IC).
[0016]
[Means for Solving the Problems]
In the present invention, fixed-point arithmetic is used, a threshold value is provided for the boundary cell operation, and the operation to be executed is selected according to whether or not any of the inputs is at or below the threshold value.
With this configuration, the error of the result obtained by fixed-point arithmetic is reduced and brought close to the error of the result obtained by floating-point arithmetic.
[0017]
[Embodiments of the invention]
The embodiment of the present invention also configures a systolic array processor that performs the same RLS processing as the conventional one shown in FIG. 3, but in the present invention a threshold value min with a small value is introduced into the conditional statement of the boundary cell operation, and the following calculation is performed.
[0018]
If $u_{in} < \mathrm{min}$ or $\delta_{in} < \mathrm{min}$ or $\beta^2 x + \delta_{in}|z|^2 < \mathrm{min}$, the boundary cell computes
$$\{\, x = \beta^2 x,\quad s = 0,\quad z = u_{in},\quad \delta_{out} = \delta_{in} \,\}$$
and otherwise it computes
$$\{\, z = u_{in},\quad x' = \beta^2 x + \delta_{in}|z|^2,\quad c = \beta^2 x / x',\quad s = \delta_{in} z / x',\quad x = x',\quad \delta_{out} = c\,\delta_{in} \,\} \tag{14}$$
As a result, even when $\beta^2 x \ll \delta_{in}|z|^2$ and $|z| \ll 1$, if the input $u_{in}$ is smaller than the threshold value min, then $u_{in} < \mathrm{min}$ holds and $s = 0$, so the overflow of $s$ caused by the division is avoided. Furthermore, $1/x'$ is evaluated when computing $c$ and $s$, and $x'$ can become small, for instance when $\beta^2$ ($\beta$ being the square root of the forgetting factor $\lambda$) is very small and $\delta_{in}$ is also very small; the evaluation of $1/x'$ then overflows. In the present invention, when $\delta_{in} < \mathrm{min}$ or $x' < \mathrm{min}$, $c$ and $s$ are not computed, that is, $1/x'$ is not evaluated, so the overflow that would result from $1/x'$ becoming large does not occur.
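The thresholded boundary cell of equation (14) could be sketched as follows. This is a floating-point illustration, not the fixed-point hardware: the function name and argument order are assumptions, and because $u_{in}$ is complex the sketch compares its magnitude with the threshold, which is an interpretation of the condition $u_{in} < \mathrm{min}$. The threshold 0.0001 is the example value the description uses in its evaluation.

```python
def boundary_cell_thresholded(x, u_in, delta_in, beta, min_thr):
    """Boundary cell with the threshold test of equation (14).

    Returns (x_new, s, z, delta_out).
    """
    z = u_in
    x_prime = beta**2 * x + delta_in * abs(z)**2
    # Assumption: the complex input is compared by magnitude.
    if abs(u_in) < min_thr or delta_in < min_thr or x_prime < min_thr:
        # Bypass branch: no division by x_prime is performed.
        return beta**2 * x, 0.0, z, delta_in
    c = beta**2 * x / x_prime
    s = delta_in * z / x_prime
    return x_prime, s, z, c * delta_in


MIN_THR = 1e-4
# Tiny input: the conventional cell would produce |s| of roughly 1/|z|.
small = boundary_cell_thresholded(1e-12, 5e-5, 1.0, 1.0, MIN_THR)
# Ordinary input: the rotation branch is taken as before.
normal = boundary_cell_thresholded(0.25, 0.5, 1.0, 1.0, MIN_THR)
```

With `min_thr` set to 0 the function reduces to the conventional cell for non-negative inputs, which makes the threshold easy to switch off when comparing the two variants.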
[0019]
The preferable value of the threshold value min varies depending on the fluctuation state of the input signal, the number of calculation bits, and the like. Therefore, a preferable threshold value corresponding to various conditions may be obtained in advance by simulation, and a threshold value min suitable for the use condition may be selected. As a result, errors caused by the influence of overflow can be reduced.
[0020]
[Effect of the invention]
According to the present invention, when a systolic array processor is designed using a fixed-point operation by a programmable device such as an ASIC (Application Specific IC), an error caused by the influence of overflow or the like can be reduced. As a result, the error of the result obtained by the fixed-point operation can be brought close to the error of the result obtained by the floating-point operation, and the error can be reduced.
[0021]
As an example, the effect of the present invention is shown for the case where the number of parameters to be estimated is 16. FIG. 4 shows, for a suitable input signal $u(n)$ and reference signal $d(n)$ and an adaptive filter with 16 taps, the error between the tap coefficients derived by floating-point arithmetic using equation (9) and the tap coefficients derived using the systolic array processor. The unit of the error is %. With the conventional systolic array processor, which sets no threshold in the boundary cell, there is almost no error between the tap coefficients derived with floating-point arithmetic and those derived by floating-point arithmetic using equation (9). This is because the calculation of equation (9) and the calculation performed in the systolic array processor are mathematically equivalent. Next, when the tap coefficients are derived by floating-point arithmetic using the method of equation (14), which sets a threshold in the boundary cell (here 0.0001, as an example), a slight error arises. This is because setting the threshold makes the operations performed inside the systolic array processor no longer mathematically equivalent to the calculation of equation (9). Next, when fixed-point arithmetic (10-bit integer part, 22-bit fractional part) is used and the boundary cell performs the operations of equations (10) and (11) without a threshold, a large error arises. This is thought to be due to overflow and similar effects. Finally, in the case of the present invention, where fixed-point arithmetic (10-bit integer part, 22-bit fractional part) is used and a threshold is set in the conditional statement of the boundary cell operation, the error is greatly reduced compared with the case without a threshold. This is because the overflow of $s$ caused by the division when $\beta^2 x \ll \delta_{in}|z|^2$ and $|z| \ll 1$ is avoided. The present invention can therefore reduce the error caused by overflow and similar effects when fixed-point arithmetic is used.
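To get a feel for the fixed-point format used in this evaluation (10-bit integer part, 22-bit fractional part), a small conversion helper is sketched below. The details are assumptions made for the illustration: a 32-bit two's-complement word with the sign counted among the integer bits, round-to-nearest quantization, and wrap-around on overflow. The description does not specify any of these.

```python
INT_BITS = 10
FRAC_BITS = 22
WORD_BITS = INT_BITS + FRAC_BITS
RAW_MIN = -(1 << (WORD_BITS - 1))
RAW_MAX = (1 << (WORD_BITS - 1)) - 1


def to_fixed(value):
    """Quantize a real number; return (raw_word, overflowed)."""
    raw = round(value * (1 << FRAC_BITS))
    overflowed = not (RAW_MIN <= raw <= RAW_MAX)
    # Two's-complement wrap-around (assumed overflow behaviour).
    wrapped = (raw - RAW_MIN) % (1 << WORD_BITS) + RAW_MIN
    return wrapped, overflowed


def from_fixed(raw):
    """Convert a raw word back to a real number."""
    return raw / (1 << FRAC_BITS)


ok_raw, ok_flag = to_fixed(0.5)        # fits comfortably
bad_raw, bad_flag = to_fixed(1000.0)   # e.g. s = 1/|z| with |z| = 0.001
```

Under these assumptions the value 1000 wraps to a completely unrelated number, which illustrates why a single overflow in $s$ can corrupt the subsequent updates.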
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a basic model of an adaptive filter.
FIG. 2 is a diagram illustrating a transversal filter.
FIG. 3A is a diagram illustrating the configuration of a conventional systolic array processor that performs the processing of the RLS algorithm, and FIGS. 3B, 3C and 3D are diagrams illustrating the calculations performed in the boundary cell, the internal cell and the final cell, respectively.
FIG. 4 is a diagram showing, for 16 parameters to be estimated, the error between the tap coefficients derived by floating-point arithmetic from the conventional RLS algorithm equations and the tap coefficients derived using the systolic array processor, for four cases: floating-point arithmetic and fixed-point arithmetic, each with and without a threshold set in the boundary cell.

Claims

A method of configuring a systolic array processor that performs the calculation of the recursive least squares algorithm by repeating operations in each of a boundary cell, an internal cell and a final cell, the boundary cell performing the operation
$$x = \beta^2 x,\quad s = 0,\quad z = u_{in},\quad \delta_{out} = \delta_{in}$$
if $u_{in} = 0$ or $\delta_{in} = 0$, and otherwise the operation
$$z = u_{in},\quad x' = \beta^2 x + \delta_{in}|z|^2,\quad c = \beta^2 x / x',\quad s = \delta_{in} z / x',\quad x = x',\quad \delta_{out} = c\,\delta_{in},$$
the method being characterized in that the boundary cell performs the operation
$$x = \beta^2 x,\quad s = 0,\quad z = u_{in},\quad \delta_{out} = \delta_{in}$$
if $u_{in} < \mathrm{min}$ or $\delta_{in} < \mathrm{min}$ or $\beta^2 x + \delta_{in}|z|^2 < \mathrm{min}$, where min is a threshold value selected according to the conditions of use.