JPH05233679A

JPH05233679A - Parallel data processor

Info

Publication number: JPH05233679A
Application number: JP4000064A
Authority: JP
Inventors: Susumu Yasuda; 晋安田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1992-01-06
Filing date: 1992-01-06
Publication date: 1993-09-10

Abstract

PURPOSE: To halve the number of input registers without changing the throughput of the operation, by providing a register that holds the addition result between adjacent processor elements (PEs) and by feeding the input data to each PE shifted in time by one bit. CONSTITUTION: A plurality of PEs (101-10N) are arranged, each consisting of a full adder in which one input is connected to an input data register 111, the other input is connected to an input terminal of the arithmetic operation unit (ALU), and the carry output is connected to the carry input through a 1-bit register 131. Between adjacent PEs, the sum output terminal of the full adder of one PE is connected to the input terminal of the other PE through 1-bit registers 171-17N, and input data is fed to adjacent PEs successively, one time slot after another. With this configuration the input registers can be halved, and because the processing is pipelined bit by bit, the overall throughput changes little compared with the conventional design.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a parallel data processor, and more particularly to a parallel processor that performs summation with high throughput when bit-serial processors are used.

[0002]

[Prior Art] In the field of image processing, high-speed processing is sometimes achieved by providing a processor for each pixel of a screen (a processor array section) and processing all pixels simultaneously, as shown in FIG. 3. In the figure, 20 is an input terminal, 21 is an input interface section, 22 is the processor array section, 23 is an output interface section, and 24 is an output terminal.

[0003] Operations often used in such a configuration include filtering and matrix calculation. In these cases, a summation over the processor elements (hereinafter, PEs) is performed.

[0004] Conventionally, such a summation has been performed with the configuration shown in FIG. 4. Here, the summation is assumed to add one row of the processor array of FIG. 3. In the figure, 101 to 10N denote the PEs; 111 to 11N are registers holding the value of each PE to be summed; and 121 to 12N are ALUs (arithmetic operation units), which here act as full adders. 131 to 13N are registers holding carry values, 141 to 14N are carry output terminals, and 151 to 15N are carry input terminals. The data held in the PEs are denoted a₀a₁…a_n, b₀b₁…b_n, …, d₀d₁…d_n, where the subscripts 0 to n correspond to LSB to MSB.

[0005] The operation of this circuit is as follows. Registers 111 to 11N shift on every operating clock and feed data to ALUs 121 to 12N. At the first clock, data a₀, b₀, c₀, …, d₀ are input to ALUs 121 to 12N of the respective PEs, and each is added to the output of the adjacent ALU (the value of the same digit of the addition result).

[0006] Because PE 101 has no preceding adjacent PE, its inputs are 0 and a₀. PE 102 adds the addition result of PE 101 and b₀, and PE 103 adds the addition result of PE 102 and c₀. The same operation takes place in every PE, and the LSB of the summation is output from the ALU of PE 10N. At this point, registers 131 to 13N of the PEs hold, as carry values, the values belonging to the digit LSB+1.

[0007] Next, at the second clock, data a₁, b₁, c₁, …, d₁ are input to the ALU of each PE and are added to the output of the adjacent ALU and to the carry values held in registers 131 to 13N. As a result, bit LSB+1 of the summation is output from the ALU of PE 10N. Registers 131 to 13N of the PEs now hold, as carry values, the values belonging to the digit LSB+2.

[0008] These operations are repeated. At the n-th clock, data a_n, b_n, c_n, …, d_n are added, and the n-th bit of the result is output from PE 10N to register 16. At this point, registers 131 to 13N of the PEs hold the values for bit n+1. From clock n+1 onward, the input registers 111 to 11N supply 0, and the values in registers 131 to 13N are added over N-1 clock periods. The summation is therefore completed in n+N-1 clock periods.
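The conventional scheme of paragraphs [0005] to [0008] can be sketched as a small software model. This is only an illustration under stated assumptions: the function name `conventional_bit_serial_sum` is hypothetical, each value is assumed to fit in `n_bits` bits, and the adder chain is modeled as settling within one clock.

```python
def conventional_bit_serial_sum(values, n_bits):
    """Model of the FIG. 4 style summation (illustrative sketch only).

    Every clock, each PE receives the same bit position of its own value,
    adds it to the sum bit coming from the preceding PE and to its stored
    carry, and passes the new sum bit on to the next PE.
    """
    n_pe = len(values)
    carry = [0] * n_pe          # carry registers (131..13N in the text)
    out_bits = []               # bits arriving at the output register (16)

    # n_bits clocks of data, then N-1 clocks feeding 0 to flush the carries.
    for clock in range(n_bits + n_pe - 1):
        sum_bit = 0             # the first PE has no neighbor, so it sees 0
        for i, value in enumerate(values):
            data_bit = (value >> clock) & 1 if clock < n_bits else 0
            total = sum_bit + data_bit + carry[i]
            sum_bit = total & 1       # sum output of the full adder
            carry[i] = total >> 1     # carry kept for the next clock
        out_bits.append(sum_bit)      # output of the last PE, LSB first

    return sum(bit << k for k, bit in enumerate(out_bits))
```

For four 6-bit values, the loop runs for 6 + 4 - 1 = 9 clocks, which matches the n+N-1 clock periods stated in the text.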

[0009]

[Problems to Be Solved by the Invention] In summation by this conventional bit-serial parallel processor, the value of each digit is computed in a single clock period, so processing is fast. However, the register of each PE cannot accept the values for the next calculation while a calculation is in progress. Otherwise, an extra register must be provided for the next calculation.

[0010] In real-time image processing, on the other hand, input image data arrives without interruption, so the input cannot wait for the preceding processing to finish. This method therefore requires buffer registers 31 to 3N to hold the data for the next calculation during the calculation period, which has the drawback of increasing the hardware scale.

[0011] An object of the present invention is to eliminate this drawback and to provide a parallel data processor with half the buffer registers and smaller hardware.

[0012]

[Means for Solving the Problems] The parallel data processor of the present invention is characterized in that a plurality of processor elements are arranged, each consisting of a full adder in which one input is connected to an input data register, the other input is connected to an input terminal, and the carry output is connected to the carry input through a 1-bit register; and in that, between adjacent processor elements, the sum output terminal of the full adder of one processor element is connected to the input terminal of the other processor element through a 1-bit register, so that input data is input sequentially to adjacent processor elements in successive time slots.
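A single processor element as described here (a full adder whose carry output feeds back through a 1-bit register) can be modeled in a few lines. The class and function names are hypothetical and chosen only for illustration; the sketch shows the element adding two bit streams LSB first.

```python
class ProcessorElement:
    """Illustrative model of one PE: a full adder with a 1-bit carry register."""

    def __init__(self):
        self.carry = 0  # 1-bit register between carry output and carry input

    def step(self, data_bit, in_bit):
        """Process one time slot and return the sum bit."""
        total = data_bit + in_bit + self.carry
        self.carry = total >> 1   # carry output, stored for the next slot
        return total & 1          # sum output


def serial_add(x, y, n_bits):
    """Add two numbers bit-serially through one PE (LSB first)."""
    pe = ProcessorElement()
    bits = [pe.step((x >> k) & 1, (y >> k) & 1) for k in range(n_bits)]
    return sum(bit << k for k, bit in enumerate(bits))
```

`serial_add` assumes that the result fits in `n_bits` bits; a carry left in the register after the last slot is simply dropped in this sketch.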

[0013]

[Embodiment] FIG. 1 is a block diagram of a bit-serial parallel processor according to an embodiment of the present invention, configured to compute a summation. Blocks with the same functions as in the conventional example of FIG. 4 are given the same reference symbols. Compared with the conventional example, this embodiment adds registers 171, 172, … that hold the output values of PEs 101 to 10N, which makes registers 31 to 3N of the conventional example unnecessary.

[0014] To explain the operation of this embodiment, FIG. 2 shows the contents of the registers of each PE when the number of data items is 4 and the number of bits is 6. First, at clock 1, input data is input to register 111 of PE 101. Next, at clock 2, input data is input to register 112 of PE 102. In ALU 121, a₀ (the LSB) and 0 are added, and the result is held in register 171. At the same time, in ALU 122, the value of register 171 at clock 2 and b₀ (the LSB) are added; the result is held in register 172 and the carry value is held in register 132.

[0015] Next, at clock 4, input data is input to register 114 of PE 104. In ALU 121, a₂ (bit LSB+2) and 0 are added, and the result is held in register 171. At the same time, in ALU 122, the value of register 171 at clock 3 and b₁ (bit LSB+1) are added; the result is held in register 172 and the carry value is held in register 132. Also at the same time, in ALU 123, the value of register 172 at clock 3 and c₀ (the LSB) are added; the result is held in register 173 and the carry value is held in register 133. The same operation is repeated for clock 4, clock 5, and so on.
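The pipelined operation described in paragraphs [0014] and [0015] can be sketched as follows. This is a rough model under stated assumptions, not the circuit itself: the function name `pipelined_bit_serial_sum` is hypothetical, step `t` of the loop corresponds to one clock, PE number i (counted from 0) is assumed to handle bit `t - i` at step `t`, and each value is assumed to fit in `n_bits` bits.

```python
def pipelined_bit_serial_sum(values, n_bits):
    """Model of the FIG. 1 style summation with a 1-bit register per stage."""
    n_pe = len(values)
    out_len = n_bits + n_pe - 1       # result bits collected from the last PE
    carry = [0] * n_pe                # carry registers (131..13N)
    stage = [0] * n_pe                # sum-output registers (171, 172, ...)
    out_bits = []

    for t in range(out_len + n_pe - 1):
        prev = stage[:]               # register contents latched last clock
        for i, value in enumerate(values):
            k = t - i                 # bit position this PE handles now
            if k < 0:
                continue              # this PE has not received data yet
            data_bit = (value >> k) & 1 if k < n_bits else 0
            in_bit = prev[i - 1] if i > 0 else 0   # first PE adds 0
            total = in_bit + data_bit + carry[i]
            stage[i] = total & 1
            carry[i] = total >> 1
        if t >= n_pe - 1:             # last PE starts producing result bits
            out_bits.append(stage[-1])

    return sum(bit << k for k, bit in enumerate(out_bits))
```

Because each PE reads only the value its neighbor latched on the previous clock, no PE has to wait for a long adder chain to settle, and each input register can be reloaded as soon as its own bits have been shifted out.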

[0016] At clock 5, the sum of the LSBs a₀, b₀, c₀, d₀ of the four data items is output from ALU 124; at clock 6, bit LSB+1 is output; and the bits of the result are output one after another, starting from the LSB. At clock 7, register 111 has run out of data, so the next input data is input and operation returns to that of clock 1. In this example, the number of bits is larger than the number of data items, so data cannot be input at clocks 5 and 6 and a buffer must be provided. In ordinary image data processing, however, the number of data items is larger, so no buffer is needed.
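The timing in this example (4 data items, 6 bits) can be tabulated with a short helper. The function name `bit_schedule` is hypothetical, and the rule that ALU number i (counted from 1) handles bit k at clock i + k + 1 is an assumption inferred from the clock 4 and clock 5 descriptions above, not a statement taken from FIG. 2.

```python
def bit_schedule(n_pe, n_bits, n_clocks):
    """Return {clock: [bit index handled by each ALU, or None if idle]}."""
    table = {}
    for clock in range(1, n_clocks + 1):
        row = []
        for pe in range(1, n_pe + 1):
            k = clock - pe - 1        # assumed timing rule (see lead-in)
            row.append(k if 0 <= k < n_bits else None)
        table[clock] = row
    return table


if __name__ == "__main__":
    for clock, row in bit_schedule(n_pe=4, n_bits=6, n_clocks=10).items():
        print(clock, row)
```

Under this assumption, the row for clock 5 shows the last ALU handling bit 0, which agrees with the LSB of the sum appearing at ALU 124 at clock 5.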

[0017] As the above description shows, the absolute time until the addition result of the input data appears is longer than in the conventional example, but because the processing is pipelined bit by bit, the overall throughput is unchanged from the conventional example.

[0018]

[Effects of the Invention] As described above, when a summation is computed with a bit-serial parallel processor, the present invention provides a register that holds the addition result between adjacent PEs and inputs the data to each PE shifted in time by one bit. Compared with the conventional approach of inputting data to all PEs at the same time, this halves the number of input registers without changing the throughput of the operation.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of an embodiment of the present invention.

FIG. 2 is a data flow diagram illustrating the operation of the embodiment of FIG. 1.

FIG. 3 is a block diagram illustrating the device configuration of a conventional example.

FIG. 4 is a block diagram for the case where the summation of FIG. 3 is performed.

[Explanation of Symbols]

101 to 10N: Processor elements
111 to 11N: Input data registers
121 to 12N: ALUs
131 to 13N: Carry holding registers
141 to 14N: Carry output terminals
151 to 15N: Carry input terminals
16: Output data register
171 to 17(N-1): Sum output holding registers
20: Data input terminal
21: Input interface
22: Processor array
23: Output interface
24: Data output terminal
31 to 3N: Second input data registers

Claims

[Claims]

1. A bit-serial parallel data processing apparatus, characterized in that a plurality of processor elements are arranged, each consisting of a full adder in which one input is connected to an input data register, the other input is connected to an input terminal, and the carry output is connected to the carry input through a 1-bit register; and in that, between adjacent processor elements, the sum output terminal of the full adder of one processor element is connected to the input terminal of the other processor element through a 1-bit register, so that input data is input sequentially to adjacent processor elements in successive time slots.