JP2009282744A

JP2009282744A - Computing unit and semiconductor integrated circuit device

Info

Publication number: JP2009282744A
Application number: JP2008134001A
Authority: JP
Inventors: Kenju Osanai; 建樹小山内
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2008-05-22
Filing date: 2008-05-22
Publication date: 2009-12-03

Abstract

PROBLEM TO BE SOLVED: To provide a computing unit and a semiconductor integrated circuit device capable of operating at a higher speed.

SOLUTION: The computing unit 10 executes a SIMD instruction, whose data units S0 to S3 each consist of multiple bits, over a plurality of processing cycles PS1 and PS2. It comprises a plurality of first operation parts 11-0 to 11-3, each performing a first operation on one of the data units S0 to S3 without moving bits across the data units, and a second operation part 12 performing a second operation that moves bits across the data units S0 to S3. The SIMD instruction is executed by the first operation and the second operation together, and the two operations are executed with a latency of one or more processing cycles between them.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a computing unit and a semiconductor integrated circuit device, for example a processor that supports SIMD (single instruction multiple data) instructions.

SIMD is a processing scheme in which one instruction applies the same processing to a plurality of data. A processor that supports SIMD instructions processes several pieces of multi-bit unit data (for example, 32 bits each), which are the units of operation, at the same time. The computing units in the processor are therefore logically independent for each unit data.

For example, a processor that processes four unit data simultaneously can execute a parallel addition instruction by providing four 32-bit adders in parallel. Logical operation instructions (AND, OR, EXOR, and so on), although not SIMD instructions, can likewise be handled by four computing units arranged in parallel, each operating on 32 bits. This is because these operations are essentially bitwise, so no data (bits or bytes) moves between unit data (see, for example, Non-Patent Document 1).

On the other hand, some arithmetic instructions do cause data to be exchanged between unit data, for example a shuffle instruction. Conventionally, when a computing unit for such instructions is physically implemented with a logic synthesis tool, the operating speed of the processor deteriorates.
Non-Patent Document 1: Toshiba, “TX System Risc TX79 Core Architecture (Symmetric 2-way superscalar 64-bit CPU) Rev. 2.0”, page B-48, PAND instruction, April 2001, Internet <URL: http://lukasz.dk/files/tx79architecture.pdf>

The present invention provides a computing unit and a semiconductor integrated circuit device capable of improving the operating speed.

A computing unit according to one aspect of the present invention executes, over a plurality of processing cycles, a SIMD instruction in which a plurality of bits form one data unit. The computing unit comprises a plurality of first operation parts, each of which performs a first operation on its own data unit without moving bits between the data units, and a second operation part that performs a second operation involving the movement of bits between the data units. The SIMD instruction is executed by the first operation and the second operation, and the first operation and the second operation are executed with a latency of one processing cycle or more relative to each other.

A semiconductor integrated circuit device according to one aspect of the present invention performs a pipeline operation using a plurality of processing stages, and comprises a computing unit that uses a plurality of the processing stages to execute a SIMD instruction in which a plurality of bits form one data unit. In an i-th processing stage (i is a natural number of 1 or more), the computing unit performs a first operation for each data unit without moving bits between the data units. In an (i + j)-th processing stage (j is a natural number of 1 or more), it performs a second operation that involves the movement of bits between the data units.

According to the present invention, a computing unit and a semiconductor integrated circuit device capable of improving the operating speed can be provided.

Embodiments of the present invention will be described below with reference to the drawings. In the description, common parts are denoted by common reference symbols throughout the drawings.

[First Embodiment]
A computing unit and a semiconductor integrated circuit device according to a first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram of a processor according to the present embodiment.

<Overall configuration of the processor>
As illustrated, the processor 1 includes two registers RA and RB, a first computing unit 10, and a second computing unit 20. The processor 1 performs a pipeline operation that includes a processing stage PS1 and a processing stage PS2 that follows the processing stage PS1. The first computing unit 10 performs operations that span the processing stages PS1 and PS2, while the second computing unit 20 performs operations that complete within the processing stage PS1. The processor 1 also performs SIMD operation with, for example, 32 bits as the unit data. Hereinafter, this 32-bit unit data is called a “slice”, and the case where the processor 1 processes four slices simultaneously is described as an example. These four slices are referred to as slices S0 to S3. The slice width is not limited to 32 bits and may be, for example, 16 bits or 64 bits, and the number of slices processed simultaneously is not limited to four.

The register RA can hold data for four slices, that is, (32 bits × 4) = 128 bits of data. Hereinafter, the 32-bit data in the register RA corresponding to the slices S0 to S3 are referred to as words 0A to 3A, respectively.

Like the register RA, the register RB can hold data for four slices, that is, 128 bits of data. Hereinafter, the 32-bit data in the register RB corresponding to the slices S0 to S3 are referred to as words 0B to 3B, respectively.

The first computing unit 10 includes four first operation parts 11-0 to 11-3 and a second operation part 12. The first operation parts 11-0 to 11-3 perform operations between the slices S0 to S3 in the register RA and the slices S0 to S3 in the register RB, respectively. That is, the first operation part 11-0 operates on the word 0A and the word 0B, the first operation part 11-1 on the word 1A and the word 1B, the first operation part 11-2 on the word 2A and the word 2B, and the first operation part 11-3 on the word 3A and the word 3B. The processing performed in each of the first operation parts 11-0 to 11-3 therefore causes no data movement between slices. In other words, each process completes within its own slice without any data communication between slices. Hereinafter, when the first operation parts 11-0 to 11-3 need not be distinguished, they are collectively referred to as the first operation parts 11. The processing in the first operation parts 11 is executed in the processing stage PS1.

The second operation part 12 performs an operation that takes as its data the processing results of as many first operation parts as there are slices processed simultaneously in the processor 1, that is, the four first operation parts 11-0 to 11-3. In other words, it operates on the results (32 bits × 4) that the first operation parts 11-0 to 11-3 produced for the slices S0 to S3. The processing performed in the second operation part 12 is therefore processing that causes data communication between slices. The processing in the second operation part 12 is executed in the processing stage PS2. The 32-bit operation result obtained in the second operation part 12 is output as the operation result of the first computing unit 10.

The second computing unit 20 includes four first operation parts 21-0 to 21-3. The first operation parts 21-0 to 21-3 perform operations using the slices S0 to S3 in the register RA and the slices S0 to S3 in the register RB as data, respectively. That is, the first operation part 21-0 operates on the words 0A and 0B, the first operation part 21-1 on the words 1A and 1B, the first operation part 21-2 on the words 2A and 2B, and the first operation part 21-3 on the words 3A and 3B. The processing performed in each of the first operation parts 21-0 to 21-3 thus involves no data communication between slices and completes within each slice. Hereinafter, when the first operation parts 21-0 to 21-3 need not be distinguished, they are collectively referred to as the first operation parts 21. The processing in the first operation parts 21 is executed in the processing stage PS1. The 32-bit operation result obtained in each first operation part 21 is output as the operation result of the second computing unit 20.

<Operation of the processor>
FIG. 2 is a timing chart of the pipeline processing in the processor 1, showing the flow of processing in the first computing unit 10 and the second computing unit 20.

First, the processing in the second computing unit 20 will be described. As illustrated, in each processing cycle of the second computing unit 20, the first operation parts 21-0 to 21-3 perform the operations on the slices S0 to S3 in parallel. The operations in each processing cycle are independent of one another, and each operation completes within its slice.

Next, the processing in the first computing unit 10 will be described. As illustrated, the processing of the first operation parts 11 in the first computing unit 10 is the same as that of the first operation parts 21 in the second computing unit 20. The difference from the second computing unit 20 is that the second operation part 12 runs in parallel with the first operation parts 11. That is, in the n-th cycle (n is a natural number of 2 or more), the second operation part 12 processes the results obtained by the first operation parts 11-0 to 11-3 in the (n−1)-th cycle, which is processing that spans the slices S0 to S3. In other words, the operation in the second operation part 12 is executed with a latency of one cycle relative to the operation in the first operation parts 11.
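The one-cycle latency between the two stages can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `run_pipeline`, the single-operand inputs, and the two callbacks are hypothetical and do not come from the specification, which describes hardware rather than code.

```python
def run_pipeline(inputs, per_slice_op, cross_slice_op):
    """Toy model of stages PS1 and PS2 (all names here are illustrative).

    inputs: one list of four slice values per issued instruction.
    Returns (cycle, result) pairs; cycles are numbered from 1.
    """
    latch = None  # pipeline register between PS1 and PS2
    outputs = []
    # One extra cycle (None) drains the last instruction out of PS2.
    for cycle, item in enumerate(list(inputs) + [None], start=1):
        # PS2: consumes what PS1 produced in the previous cycle and may
        # move data across slices.
        if latch is not None:
            outputs.append((cycle, cross_slice_op(latch)))
        # PS1: every slice is processed independently of the others.
        latch = [per_slice_op(x) for x in item] if item is not None else None
    return outputs
```

With two instructions issued in cycles 1 and 2, the cross-slice results appear in cycles 2 and 3, so PS2 of one instruction overlaps PS1 of the next.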

Next, specific examples of the processing performed by the first computing unit 10 and the second computing unit 20 will be described.
First, the second computing unit 20. As described above, the processing performed by the second computing unit 20 causes no data communication between slices. Specific examples are an addition instruction, a subtraction instruction, a logical operation instruction, and a multiply-accumulate instruction. For an addition instruction, the first operation part 21-0 adds the words 0A and 0B, the first operation part 21-1 adds the words 1A and 1B, the first operation part 21-2 adds the words 2A and 2B, and the first operation part 21-3 adds the words 3A and 3B.
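As a minimal sketch of the slice-wise addition described above, the following models each register as a list of four 32-bit words. The name `simd_add` and the wrap-around behaviour on overflow are assumptions made for illustration; the text does not say how overflow is handled.

```python
MASK32 = 0xFFFFFFFF  # each slice is 32 bits wide

def simd_add(ra, rb):
    """Add word k of RA to word k of RB for k = 0..3 (illustrative model).

    Overflow is assumed to wrap within the slice, so no carry ever
    crosses from one slice into its neighbour.
    """
    return [(a + b) & MASK32 for a, b in zip(ra, rb)]
```

For instance, adding 1 to a slice holding 0xFFFFFFFF yields 0 in that slice and leaves the adjacent slices untouched, which is the sense in which the operation completes within each slice.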

Next, the first computing unit 10. As described above, the processing performed by the first computing unit 10 causes data communication between slices. Specific examples are a shift instruction, a rotate instruction, and a shuffle instruction.

These three instructions are described in detail below. First, the data held in the registers RA and RB will be described with reference to FIGS. 3 and 4, which are conceptual diagrams schematically showing the configurations of the registers RA and RB, respectively.

As illustrated, the 1-byte data contained in order in the words 0A to 3A and the words 0B to 3B of the registers RA and RB are hereinafter referred to as bytes B00 to B31. That is, the word 0A contains bytes B00 to B03 in order from the upper bits, the word 1A contains bytes B04 to B07, the word 2A contains bytes B08 to B11, and the word 3A contains bytes B12 to B15. Likewise, the word 0B contains bytes B16 to B19 in order from the upper bits, the word 1B contains bytes B20 to B23, the word 2B contains bytes B24 to B27, and the word 3B contains bytes B28 to B31.
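The byte numbering can be summarised by a small helper that maps a byte number to its register, word, and position within the word. The function name `locate_byte` and the tuple layout are hypothetical and used only for illustration.

```python
def locate_byte(n):
    """Map byte number n (0..31, i.e. B00..B31) to (register, word, position).

    word is 0..3 (word 0A..3A or 0B..3B, which is also the slice number);
    position 0 is the most significant byte of that word.
    """
    if not 0 <= n <= 31:
        raise ValueError("byte number must be in 0..31")
    register = "RA" if n < 16 else "RB"
    return register, (n % 16) // 4, n % 4
```

For example, `locate_byte(11)` gives `("RA", 2, 3)`, the last byte of the word 2A, and `locate_byte(16)` gives `("RB", 0, 0)`, the first byte of the word 0B.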

[Shift instruction]
First, the shift instruction will be described. FIG. 5 is a conceptual diagram of a program showing the processing executed by the shift instruction. FIG. 5 shows the case where the contents of the register RA are shifted byte by byte toward the upper-bit side (left shift). The numbers at the left end of FIG. 5 are program line numbers, and “<−” indicates that the value on the right side is assigned to the left side. RA and RB denote the registers RA and RB, and RT denotes a target register. The target register RT is provided in the processor 1 (omitted from FIG. 1) and stores the operation results of the first computing unit 10 and the second computing unit 20. The notation “(i:j)” written immediately after RA, RB, or RT denotes the data from the i-th bit to the j-th bit of that register. Since each register can hold four slices (= 4 × 32 bits), i and j take values from 0 to 127.

As illustrated, a value s is first assigned the value (any of 0 to 31) indicated by, for example, bits 27 to 31 of the register RB (see line 1 of FIG. 5). s represents the shift amount: if s = 1 the data is shifted by one byte, and if s = 2 by two bytes.

If s = 0 to 15, the contents of bits (s × 8) to 127 of the register RA are stored as bits 0 to (127 − (s × 8)) of the target register RT (see lines 2 and 3 of FIG. 5). “0” is stored in the remaining lower bits of the target register (see line 4 of FIG. 5).

On the other hand, if s = 16 to 31, all bits of the target register RT are set to “0” (see lines 5 and 6 of FIG. 5). As a result, the target register RT holds the data of the register RA shifted left by s bytes.
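The shift semantics described for FIG. 5 can be sketched as follows. The sketch assumes each 128-bit register is modelled as a 16-byte `bytes` object whose index 0 is the most significant byte (B00), so that bit 0 is the most significant bit; under that assumption bits 27 to 31 of RB are the low five bits of its byte 3. The function name `shift_left_bytes` is hypothetical.

```python
def shift_left_bytes(ra: bytes, rb: bytes) -> bytes:
    """Model of the byte-wise left shift (illustrative, not the hardware).

    ra, rb: 16-byte registers, index 0 = most significant byte.
    Returns the 16-byte content of the target register RT.
    """
    if len(ra) != 16 or len(rb) != 16:
        raise ValueError("registers must be 16 bytes wide")
    s = rb[3] & 0x1F          # shift amount from bits 27..31 of RB
    if s > 15:
        return bytes(16)      # s = 16..31: every bit of RT becomes 0
    # Bits (s*8)..127 of RA move to the top of RT; the rest is zero-filled.
    return ra[s:] + bytes(s)
```

With RA holding the bytes 0x00 to 0x0F and s = 1, RT holds 0x01 to 0x0F followed by one zero byte.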

FIG. 6 is a conceptual diagram showing a configuration for executing the shift instruction, and also shows the target register RT. The target register RT can hold data for four slices, that is, (32 bits × 4) = 128 bits of data. Hereinafter, the data held in the register RT are referred to as bytes B32 to B47 in order from the upper bits. Bytes B32 to B35 are called the word 0T, bytes B36 to B39 the word 1T, bytes B40 to B43 the word 2T, and bytes B44 to B47 the word 3T.

The computing unit includes a decoder 13 and a plurality of selection circuits (byte-level multiplexers) 14. Hereinafter, every “selection circuit” in this specification is a byte-level multiplexer unless otherwise specified. The decoder 13 decodes the value (s) of bits 27 to 31 in the register RB and controls the selection circuits 14 according to the decoding result.

A selection circuit 14 is provided for each byte of the target register RT. Under the control of the decoder 13, each selection circuit 14 selects the data to be stored in its associated byte (one of bytes B00 to B15 in the register RA, or “0”) and stores it in the target register RT.

For example, the selection circuit 14 associated with byte B32 of the target register RT selects one of bytes B00 to B15. That is, it selects byte B00 if s = 0, byte B01 if s = 1, and byte B15 if s = 15, and assigns the selected byte to byte B32 of the target register RT.

The selection circuit 14 associated with byte B33 of the target register RT selects one of bytes B01 to B15 or “0”. That is, it selects byte B01 if s = 0, byte B02 if s = 1, and “0” if s = 15, and assigns the selection to byte B33 of the target register RT. The same applies to the selection circuits 14 associated with bytes B34 to B47 of the target register RT.

The shift instruction therefore moves data (bytes) between slices. For example, if s = 1, bytes B32 to B35 of the word 0T receive bytes B01 to B03 of the register RA (which belong to the slice S0) and byte B04 (which belongs to the slice S1).
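The per-byte selection and the resulting slice crossing can be checked with a short sketch. Both function names are hypothetical, and `None` is used here merely as a stand-in for the constant “0” input of a selection circuit.

```python
def shift_mux_sources(s):
    """For target bytes B32..B47 (index 0..15), return the index of the RA
    byte each selection circuit picks for shift amount s, or None for "0"."""
    return [i + s if i + s <= 15 else None for i in range(16)]

def crosses_slices(s):
    """True if any target byte takes its data from a different slice.

    Byte k of a register belongs to slice k // 4.
    """
    return any(
        src is not None and src // 4 != dst // 4
        for dst, src in enumerate(shift_mux_sources(s))
    )
```

For s = 1 the first four target bytes select RA bytes 1, 2, 3 and 4, so the word 0T receives one byte from the slice S1, matching the example in the text.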

[Rotate instruction]
Next, the rotate instruction will be described. FIG. 7 is a conceptual diagram of a program showing the processing executed by the rotate instruction, written in the same notation as FIG. 5. The rotate instruction takes the data that the shift instruction would discard, for example the data pushed out of the target register RT by a left shift, and stores it in the lower bits of the target register RT. FIG. 7 shows the case where the contents of the register RA are rotated byte by byte toward the upper-bit side (left rotation).

As illustrated, a value s is first assigned the value (any of 0 to 15) indicated by, for example, bits 28 to 31 of the register RB (see line 1 of FIG. 7). s represents the amount of rotation: if s = 1 the data is rotated by one byte, and if s = 2 by two bytes.

If s = 0, bits 0 to 127 of the register RA are assigned unchanged to bits 0 to 127 of the target register RT (see lines 2 and 3 of FIG. 7).

If s = 1 to 15, bits (s × 8) to 127 of the register RA are assigned to bits 0 to (127 − (s × 8)) of the target register RT, and bits 0 to ((s × 8) − 1) of the register RA are assigned to bits (128 − (s × 8)) to 127 of the target register RT.

FIG. 8 is a conceptual diagram showing the configuration of a computing unit that executes the rotate instruction, and also shows the target register RT. As illustrated, the computing unit includes a decoder 15 and a plurality of selection circuits 16. The decoder 15 decodes the value of bits 28 to 31 in the register RB and controls the selection circuits 16 according to the decoding result (= s).

A selection circuit 16 is provided for each byte of the target register RT. Under the control of the decoder 15, each selection circuit 16 selects the data to be stored in its associated byte (one of bytes B00 to B15 in the register RA) and stores it in the target register RT.

For example, when s = 1, the selection circuit 16 associated with byte B32 of the target register RT selects byte B01, the one associated with byte B33 selects byte B02, and the one associated with byte B47 selects byte B00. When s = 2, the selection circuit 16 associated with byte B32 selects byte B02, the one associated with byte B33 selects byte B03, the one associated with byte B46 selects byte B00, and the one associated with byte B47 selects byte B01.
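The rotate semantics, including the selections listed above, can be sketched in the same style. As before, the registers are modelled as 16-byte `bytes` objects with index 0 as the most significant byte, which is an assumption of this sketch, and the name `rotate_left_bytes` is hypothetical.

```python
def rotate_left_bytes(ra: bytes, rb: bytes) -> bytes:
    """Model of the byte-wise left rotate (illustrative, not the hardware).

    The rotate amount s is taken from bits 28..31 of RB, i.e. the low four
    bits of byte 3 when bit 0 is the most significant bit of the register.
    """
    if len(ra) != 16 or len(rb) != 16:
        raise ValueError("registers must be 16 bytes wide")
    s = rb[3] & 0x0F
    # Bytes pushed out at the top re-enter at the bottom of RT.
    return ra[s:] + ra[:s]
```

Target byte index k corresponds to B(32 + k), so for s = 2 the last two target bytes (B46 and B47) receive B00 and B01.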

[Shuffle instruction]
Next, the shuffle instruction will be described, taking a byte-wise shuffle instruction as an example. The shuffle instruction assigns to each byte of the target register RT one of bytes B00 to B15 in the register RA, one of bytes B16 to B31 in the register RB, or a fixed value. FIG. 9 is a conceptual diagram of a program showing the processing executed by the shuffle instruction, written in the same notation as FIG. 5. When the shuffle instruction is executed, a register RC is used in addition to the registers RA and RB, and the value stored in the target register RT is determined by the data in the register RC.

In FIG. 9, the values 0 to 15 are assigned to j in turn. Hereinafter, the prefix “0b” indicates that a number is written in binary, and “0x” indicates hexadecimal. Numbers without a prefix are decimal.

If the value of bits (j × 8) to (j × 8 + 1) of the register RC is “0b10”, the fixed value “0b00000000” (= “0x00”) is assigned to bits (j × 8) to ((j + 1) × 8 − 1) of the target register RT. If the value of bits (j × 8) to (j × 8 + 2) of the register RC is “0b110”, the fixed value “0b11111111” (= “0xFF”) is assigned to those bits of the target register RT. If the value of bits (j × 8) to (j × 8 + 2) of the register RC is “0b111”, the fixed value “0b10000000” (= “0x80”) is assigned to them.

In all other cases, bits (j × 8) to ((j + 1) × 8 − 1) of the target register RT are assigned the value of bits (b × 8) to ((b + 1) × 8 − 1) of the registers RA and RB. Here b is the value indicated by bits ((j × 8) + 3) to ((j × 8) + 7) of the register RC and takes a value from 0 to 31. The bit positions (b × 8) to ((b + 1) × 8 − 1), which range over 0 to 255, correspond to bytes B00 in the register RA through B31 in the register RB. Therefore, when b = 16, the value of bits 128 to 135, that is, byte B16 (the first byte of the register RB), is assigned to the corresponding byte of the target register RT.

This assignment is repeated from j = 0 to j = 15. When j = i (i is any of 0 to 15), the assignment is made to byte B(i + 32) of the target register RT.
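The loop over j described above can be sketched as follows. The registers are again modelled as 16-byte `bytes` objects with index 0 as the most significant byte, so control byte j of RC covers bits (j × 8) to (j × 8 + 7). The function name `shuffle_bytes` is hypothetical, and the sketch follows the rules as stated in the text rather than any particular instruction set.

```python
def shuffle_bytes(ra: bytes, rb: bytes, rc: bytes) -> bytes:
    """Model of the byte-wise shuffle (illustrative, not the hardware)."""
    if not (len(ra) == len(rb) == len(rc) == 16):
        raise ValueError("registers must be 16 bytes wide")
    source = ra + rb               # bytes B00..B31
    rt = bytearray(16)             # bytes B32..B47
    for j in range(16):
        control = rc[j]
        if control >> 6 == 0b10:   # leading bits 10  -> fixed 0x00
            rt[j] = 0x00
        elif control >> 5 == 0b110:  # leading bits 110 -> fixed 0xFF
            rt[j] = 0xFF
        elif control >> 5 == 0b111:  # leading bits 111 -> fixed 0x80
            rt[j] = 0x80
        else:
            b = control & 0x1F     # bits +3..+7 of the control byte
            rt[j] = source[b]
    return bytes(rt)
```

A control byte of 16 therefore selects B16, the first byte of RB, while 0x80, 0xC0 and 0xE0 produce the three fixed values.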

FIG. 10 is a block diagram conceptually showing the configuration of a computing unit that executes the shuffle instruction, and also shows the target register RT. As illustrated, the first computing unit 10 includes a plurality of decoders 17, a plurality of selection circuits 18, and a plurality of mask logic circuits 19, one of each being provided for each byte B(i + 32) of the register RT (i is any of 0 to 15).

Each decoder 17 decodes the corresponding bits (i × 8) to ((i × 8) + 7) of the register RC and controls the corresponding selection circuit 18 and mask logic 19 according to the decoding result.

Each of the selection circuits 18 selects one of bytes B00 to B15 in the register RA and bytes B16 to B31 in the register RB under the control of the corresponding decoder 17, and outputs it to the mask logic 19.

マスク論理１９は、対応するデコーダ１７の制御に従い、固定値、または対応する選択回路１８から与えられるデータ（Ｂ００〜Ｂ３１のいずれか）を、対応するバイトに格納する。すなわち、デコーダ１７におけるデコードの結果、レジスタＲＣの値が“０ｂ１０”、“０ｂ１１０”、または“０ｂ１１１”であれば、ターゲットレジスタＲＴに固定値を格納し、それ以外の場合には選択回路１８から与えられるデータを格納する。 Under the control of the corresponding decoder 17, the mask logic 19 stores either a fixed value or the data supplied from the corresponding selection circuit 18 (one of B00 to B31) in the corresponding byte. That is, if decoding by the decoder 17 shows that the value in register RC is “0b10”, “0b110”, or “0b111”, a fixed value is stored in the target register RT; otherwise, the data supplied from the selection circuit 18 is stored.

以上のようにシャッフル命令においては、レジスタＲＣにおける制御データに応じてランダムに、レジスタＲＡ、ＲＢにおけるいずれかのバイトがターゲットレジスタＲＴに格納される。従ってシャッフル命令では、スライス間でのデータ（バイト）の移動が生じる。 As described above, in the shuffle instruction, an arbitrary byte in registers RA and RB is stored in the target register RT according to the control data in register RC. The shuffle instruction therefore moves data (bytes) between slices.
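
The decoder, selection circuit, and mask logic described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: bit 0 of each control byte is taken to be the most significant bit (so bits 3 to 7 are the low five bits), the patterns "0b10", "0b110", and "0b111" are taken to be leading bits of the control byte, and the three fixed values are placeholders that do not come from this text. The function name `shuffle` is hypothetical.

```python
# Illustrative model of the shuffle instruction: for every target byte there
# is one decode step, one selection step, and one mask step.
# ASSUMPTION: the fixed values are placeholders, not taken from the text.
FIXED_10, FIXED_110, FIXED_111 = 0x00, 0xFF, 0x80


def shuffle(ra: bytes, rb: bytes, rc: bytes) -> bytes:
    """Return the 16 target bytes selected from RA/RB by the control bytes in RC."""
    if not (len(ra) == len(rb) == len(rc) == 16):
        raise ValueError("RA, RB and RC must each hold 16 bytes")
    source = ra + rb  # bytes B00..B15 come from RA, B16..B31 from RB
    rt = bytearray(16)
    for j in range(16):
        ctrl = rc[j]
        # Decoder: ASSUMPTION that the patterns are leading bits of the byte.
        if ctrl >> 6 == 0b10:
            rt[j] = FIXED_10       # mask logic forces a fixed value
        elif ctrl >> 5 == 0b110:
            rt[j] = FIXED_110
        elif ctrl >> 5 == 0b111:
            rt[j] = FIXED_111
        else:
            b = ctrl & 0x1F        # low five bits: source byte index 0..31
            rt[j] = source[b]      # selection circuit output passes through
    return bytes(rt)
```

With `ra = bytes(range(16))` and `rb = bytes(range(16, 32))`, a control byte of 16 selects the first byte held in RB, which matches the b = 16 case in the text.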

上記の図６、図８、及び図１０に示す演算器が、図１における第１演算器１０に相当する。そして、各演算器において、スライス間でのデータの行き来の生じない演算は処理ステージＰＳ１で実行され、スライス間でのデータの行き来の生じる演算は処理ステージＰＳ２で実行される。 The arithmetic units shown in FIGS. 6, 8, and 10 correspond to the first arithmetic unit 10 in FIG. In each arithmetic unit, an operation in which data does not pass between slices is executed in the processing stage PS1, and an operation in which data goes back and forth between slices is executed in the processing stage PS2.

＜効果＞ <Effect>
以上のように、この発明の第１の実施形態に係るプロセッサであると、ＬＳＩの自動設計において、半導体素子の配置の乱れを抑制出来る。その結果、配置及びレイアウトや動作タイミングの設計を容易にし、且つ高速化出来る。以下、本効果について詳細に説明する。 As described above, the processor according to the first embodiment of the present invention can suppress disorder in the placement of semiconductor elements in automated LSI design. As a result, the placement, layout, and operation timing become easier to design, and the operating speed can be increased. This effect is described in detail below.

placement drivenの論理合成ツールを用いることにより演算ユニットを一括して物理実装する場合、加算器など、用いるデータがスライス毎に独立している演算器は、それを構成するスタンダードセルはスライス毎に集まって配置される傾向にある。また、同一演算器内において、互いに異なるスライスを使用するスタンダードセル群は、比較的距離が離れて配置されやすい。これは、同一スライスを使用するスタンダードセルのゲート同士は、相互に接続されているために、近接してまとめて配置された方が配線遅延を小さく出来るからである。これにより、動作タイミングの観点から優れた設計が可能となる。また、異なるスライスを使用するスタンダードセル群間では、相互に大きな束縛条件が無いからである。 When arithmetic units are physically implemented together using a placement-driven logic synthesis tool, the standard cells that make up an arithmetic unit whose data is independent for each slice, such as an adder, tend to be placed together slice by slice. Within the same arithmetic unit, groups of standard cells that use different slices tend to be placed relatively far apart. This is because the gates of standard cells that use the same slice are connected to one another, so placing them close together reduces wiring delay, which allows a design that is excellent in terms of operation timing. It is also because there are no strong mutual constraints between groups of standard cells that use different slices.

更に、各演算器のスタンダードセル群は、使用するデータ、つまりスライスが同一であるもの同士（例えば加算器のスライスＳ０と、論理演算器のスライスＳ０など）が、近くに配置されやすい。これは、スライス番号が共通な演算器は、データのソースとなる汎用レジスタのビット番号が共通であるからである。つまり、パイプライン動作するプロセッサでは、パイプラインの処理ステージの配線長を短くして高速に物理実装するためには、スライス番号が共通の演算器同士を近づけて配置した方が有利だからである。 Furthermore, in the standard cell group of each arithmetic unit, data to be used, that is, those having the same slice (for example, the slice S0 of the adder and the slice S0 of the logical arithmetic unit) are easily arranged close to each other. This is because the arithmetic units having the same slice number share the same general-purpose register bit number as the data source. That is, in a processor that operates in a pipeline, it is advantageous to arrange arithmetic units having a common slice number close to each other in order to shorten the wiring length of the processing stage of the pipeline and perform physical mounting at high speed.

その結果、演算ユニットを一括してplacement drivenの論理合成ツールで物理実装する場合、ＬＳＩのレイアウトは、図１１に示すようになる。図１１はＬＳＩのレイアウトを示す模式図であり、演算器の配置をスライス毎に示したものである。図示するように、汎用レジスタ２（またはオペランドＦ／Ｆ）を始点にして、スライス毎の大きな領域３−０〜３−３が自然に形成されて配置され易い。図中において、領域３−０〜３−３はそれぞれ、スライスＳ０〜Ｓ３を使用するスタンダードセル（演算器）の集合である。そしてこのような配置とすることで、各領域３−０〜３−３における配線長を短く出来、動作タイミングの観点からも優れた設計となる。 As a result, when the arithmetic units are physically implemented together with a placement-driven logic synthesis tool, the LSI layout is as shown in FIG. 11. FIG. 11 is a schematic diagram of the LSI layout, showing the placement of the arithmetic units for each slice. As illustrated, large regions 3-0 to 3-3, one per slice, tend to form naturally, starting from the general-purpose register 2 (or operand F/F). In the figure, regions 3-0 to 3-3 are sets of standard cells (arithmetic units) that use slices S0 to S3, respectively. This placement shortens the wiring within each of the regions 3-0 to 3-3 and gives a design that is also excellent in terms of operation timing.

しかし、シャッフル演算を行う演算器（以下、シャッフラー（shuffler））の場合には、その性質としてスライス間の通信をサポートするという特徴がある。従って、スライス間の境界に注意をしない実装をすると、演算器を構成するスタンダードセル群は、配置上、スライスの区別の無い大きなかたまりになりやすい。そのため、シャッフラーがＳＩＭＤ演算器群や論理演算器と混在するＬＳＩでは、シャッフラーを構成するスタンダードセル群のかたまりの大きさから、ＬＳＩ全体として演算器の配置が乱れるという問題がある。その結果、一部のオペランド供給のバスが必要以上に長くなったり、演算器が小さくまとまりきれずに配置されたりするため、配線遅延が増大し、動作タイミングの最適化が困難となる。そしてこの問題により、ＬＳＩの動作速度が低下するという問題があった。 However, an arithmetic unit that performs a shuffle operation (hereinafter, a shuffler) by its nature supports communication between slices. Consequently, if it is implemented without attention to the boundaries between slices, the standard cells that make up the unit tend to be placed as one large cluster with no distinction between slices. In an LSI in which a shuffler coexists with SIMD arithmetic units and logical operation units, the size of the shuffler's standard-cell cluster therefore disturbs the placement of arithmetic units across the LSI as a whole. As a result, some operand supply buses become longer than necessary and arithmetic units cannot be placed compactly, so wiring delay increases and operation timing becomes difficult to optimize. This in turn lowers the operating speed of the LSI.

また、従来のシャッフラーの設計はカスタム設計により行われていた。つまり、人手によって、演算器の配置やレイアウト及びタイミングの制御が行われていた。従って、そもそも上記のような自動設計を用いた際に生ずる問題に対する配慮は行われてこなかった。しかし人手による設計は、近年の高速設計の時代に適うものでは無く、また開発者にとって大きな負担となるとともに、開発費の高騰の原因ともなっていた。以上の問題はシャッフラーだけでなく、スライス間の通信をサポートする全ての演算器において共通する問題である。 Further, the conventional shuffler has been designed by custom design. That is, the arrangement, layout, and timing of the arithmetic units are controlled manually. Therefore, in the first place, no consideration has been given to the problems that occur when using the automatic design as described above. However, manual design is not suitable for the era of high-speed design in recent years, has become a heavy burden on developers, and has caused a rise in development costs. The above problem is not only a shuffler but also a problem common to all arithmetic units that support communication between slices.

しかし、本実施形態に係るＬＳＩであると、スライス間の通信をサポートする演算器において、スライス間の通信が生じる演算部分と、生じない演算部分とを分離している。より具体的には、それぞれを異なる処理ステージに配置している。更に言い換えれば、スライス間でビットまたはバイトの移動を要する演算と、要しない演算との間に、パイプライン処理における１つ以上の処理ステージに相当するレイテンシを設けている。 In the LSI according to the present embodiment, however, an arithmetic unit that supports communication between slices is divided into an operation part in which communication between slices occurs and an operation part in which it does not. More specifically, the two parts are placed in different processing stages. In other words, a latency corresponding to one or more processing stages of the pipeline is provided between the operation that requires bits or bytes to move between slices and the operation that does not.

前述の通りパイプライン動作を行うＬＳＩでは、配線遅延や動作タイミング等の条件が厳しいのは、同一の処理ステージ内においてである。従って、自動設計の場合には、同一の処理ステージ内におけるスタンダードセルが、スライス毎に集まって配置されやすい。 As described above, in an LSI that performs pipeline operation, conditions such as wiring delay and operation timing are severe within the same processing stage. Therefore, in the case of automatic design, standard cells in the same processing stage are easily collected and arranged for each slice.

従って本実施形態に係る構成であると、スライス間の通信が生じない演算部分を含む処理ステージでは、図１１に示すような配置が可能となる。すなわち、placement drivenの論理合成ツールを用いて自動設計を行った場合であっても、スタンダードセルがスライス毎に配置される。そのため、配線遅延を低減し、動作タイミングを最適化出来、その結果、ＬＳＩの動作速度を向上出来る。 Therefore, with the configuration according to the present embodiment, an arrangement as shown in FIG. 11 is possible in a processing stage including a calculation part in which communication between slices does not occur. That is, even when automatic design is performed using a placement driven logic synthesis tool, standard cells are arranged for each slice. Therefore, the wiring delay can be reduced and the operation timing can be optimized, and as a result, the operation speed of the LSI can be improved.

なお、図１における処理ステージＰＳ２では、第２演算部１２においてスライス間の通信が生じる演算が行われている。従って、処理ステージＰＳ２ではスタンダードセルの配置の乱れが生じるが、ＬＳＩ全体として見た配線の乱れは、従来に比べて大幅に低減出来る。 Note that, in the processing stage PS2 in FIG. 1, the second arithmetic unit 12 performs an operation that causes communication between slices. Accordingly, the standard cell arrangement is disturbed in the processing stage PS2, but the wiring disturbance as seen in the entire LSI can be greatly reduced as compared with the conventional case.

以上のように、自動設計を用いた場合であっても、高速動作出来るＬＳＩの設計が可能となる。従って、人手による設計を削減し、ＬＳＩ開発の省力化を図ることが出来るとともに、開発費を削減出来る。 As described above, even if automatic design is used, it is possible to design an LSI that can operate at high speed. Therefore, manual design can be reduced, labor saving of LSI development can be achieved, and development costs can be reduced.

［第２の実施形態］ [Second Embodiment]
次に、この発明の第２の実施形態に係る演算器及び半導体集積回路装置について説明する。本実施形態は、上記第１の実施形態で説明したプロセッサ１のより具体的な構成に関するものである。 Next, an arithmetic unit and a semiconductor integrated circuit device according to a second embodiment of the present invention will be described. The present embodiment relates to a more specific configuration of the processor 1 described in the first embodiment.

＜プロセッサの全体構成について＞ <Overall configuration of the processor>
図１２は、本実施形態に係るプロセッサ１の回路図である。図示するようにプロセッサ１は、順次処理が行われる５段の処理ステージＰＳ１〜ＰＳ５を含むパイプライン動作を行い、また第１の実施形態と同様にスライス単位でＳＩＭＤ動作を行う。スライスの一例は３２ビットである。またプロセッサ１は、汎用レジスタ２、バスＢＳ−Ａ、ＢＳ−Ｂ、ＢＳ−Ｃ、ＢＳ−ＰＳ４、ＢＳ−ＰＳ５、ＢＳ−ＬＳ、ＢＳ−Ａ０〜ＢＳ−Ａ３、ＢＳ−Ｂ０〜ＢＳ−Ｂ３、ＢＳ−Ｃ０〜ＢＳ−Ｃ３、ロードストアユニット（load and store unit）３０、選択回路３１〜４１、６８、６９、７０、フリップフロップ４２〜４８、及び積和演算器５０、算術演算器５１、論理演算器５２、及びシャッフラー５３を備えている。 FIG. 12 is a circuit diagram of the processor 1 according to the present embodiment. As illustrated, the processor 1 performs a pipeline operation comprising five sequentially executed processing stages PS1 to PS5 and, as in the first embodiment, performs SIMD operation in units of slices. A slice is, for example, 32 bits. The processor 1 includes a general-purpose register 2; buses BS-A, BS-B, BS-C, BS-PS4, BS-PS5, BS-LS, BS-A0 to BS-A3, BS-B0 to BS-B3, and BS-C0 to BS-C3; a load and store unit 30; selection circuits 31 to 41, 68, 69, and 70; flip-flops 42 to 48; a product-sum operation unit 50; an arithmetic operation unit 51; a logical operation unit 52; and a shuffler 53.

汎用レジスタ２は、第１の実施形態で説明したレジスタＲＡ〜ＲＣ、ＲＴを含み、データを保持する。汎用レジスタ２の構成について図１３を用いて説明する。図１３は、汎用レジスタ２の内部構成を示すブロック図である。 The general-purpose register 2 includes the registers RA to RC and RT described in the first embodiment and holds data. The configuration of the general-purpose register 2 will be described with reference to FIG. 13. FIG. 13 is a block diagram showing the internal configuration of the general-purpose register 2.

図示するように汎用レジスタ２は、メモリ７１及びレジスタ７２〜７５を備えたマルチポートレジスタファイルである。メモリ７１は、各々が例えば１２８ビットのデータを保持可能な１２８（＝４スライス）個のエントリＥ０〜Ｅ１２７を備えた半導体メモリである。 As illustrated, the general-purpose register 2 is a multiport register file including a memory 71 and registers 72 to 75. The memory 71 is a semiconductor memory having 128 (= 4 slices) entries E0 to E127 each capable of holding, for example, 128-bit data.

レジスタ７２〜７４は、フリップフロップと選択回路とを備える。そして、選択回路がメモリ７１のエントリＥ０〜Ｅ１２７の各々からいずれか１ビットを選択し、選択したデータをフリップフロップに格納する。その結果、レジスタ７２〜７４は、メモリ７１から読み出した１２８ビットのデータを保持する。そしてレジスタ７２〜７４がそれぞれ、上記第１の実施形態で説明したレジスタＲＡ〜ＲＣに相当する。 The registers 72 to 74 include a flip-flop and a selection circuit. The selection circuit selects one bit from each of the entries E0 to E127 of the memory 71, and stores the selected data in the flip-flop. As a result, the registers 72 to 74 hold 128-bit data read from the memory 71. The registers 72 to 74 correspond to the registers RA to RC described in the first embodiment.

レジスタ７５も、フリップフロップと選択回路とを備える。そして、選択回路６９から与えられる１２８ビットのデータを、フリップフロップに格納する。更に、レジスタ７５内において選択回路が、フリップフロップに格納された１２８ビットのデータを１ビットずつ選択し、選択したビットをメモリ７１のエントリＥ０〜Ｅ１２７の各々に格納する。レジスタ７５は、上記第１の実施形態で説明したターゲットレジスタＲＴに相当する。 The register 75 also includes a flip-flop and a selection circuit. Then, the 128-bit data supplied from the selection circuit 69 is stored in the flip-flop. Further, a selection circuit in the register 75 selects 128-bit data stored in the flip-flop bit by bit, and stores the selected bit in each of the entries E0 to E127 of the memory 71. The register 75 corresponds to the target register RT described in the first embodiment.

図１２に戻って、引き続きプロセッサ１の構成について説明する。 Returning to FIG. 12, the description of the configuration of the processor 1 continues.
バスＢＳ−Ａ〜ＢＳ−Ｃのそれぞれには、レジスタＲＡ〜ＲＣからそれぞれ与えられる１２８ビットのデータを伝送する。またバスＢＳ−ＰＳ４、ＢＳ−ＰＳ５は、処理ステージＰＳ４、ＰＳ５で得られた１２８ビット（４スライス）のデータをそれぞれ伝送する。更にバスＢＳ−ＬＳは、ロードストアユニット３０から与えられる１２８ビットのデータを伝送する。 The buses BS-A to BS-C transmit the 128-bit data supplied from the registers RA to RC, respectively. The buses BS-PS4 and BS-PS5 transmit the 128-bit (4-slice) data obtained in the processing stages PS4 and PS5, respectively. The bus BS-LS transmits 128-bit data supplied from the load store unit 30.

ロードストアユニット３０は、バスＢＳ−Ａ〜ＢＳ−Ｃを介して、必要なデータをレジスタＲＡ〜ＲＣから読み出すと共に、必要に応じてデータをバスＢＳ−ＬＳに出力する。 The load / store unit 30 reads necessary data from the registers RA to RC via the buses BS-A to BS-C and outputs data to the bus BS-LS as necessary.

選択回路３１は、バスＢＳ−Ｃ、ＢＳ−ＰＳ４、ＢＳ−ＰＳ５、ＢＳ−ＬＳ上のデータのいずれかを選択し、選択したデータをスライス毎にフリップフロップ４２に格納する。選択回路３２は、バスＢＳ−Ａ、ＢＳ−ＰＳ４、ＢＳ−ＰＳ５、ＢＳ−ＬＳ上のデータのいずれかを選択し、選択したデータをスライス毎にフリップフロップ４３に格納する。選択回路３３は、バスＢＳ−Ｂ、ＢＳ−ＰＳ４、ＢＳ−ＰＳ５、ＢＳ−ＬＳ上のデータのいずれかを選択し、選択したデータをスライス毎にフリップフロップ４４に格納する。 The selection circuit 31 selects any of the data on the buses BS-C, BS-PS4, BS-PS5, and BS-LS, and stores the selected data in the flip-flop 42 for each slice. The selection circuit 32 selects any of the data on the buses BS-A, BS-PS4, BS-PS5, and BS-LS, and stores the selected data in the flip-flop 43 for each slice. The selection circuit 33 selects any of the data on the buses BS-B, BS-PS4, BS-PS5, and BS-LS, and stores the selected data in the flip-flop 44 for each slice.

フリップフロップ４２〜４４の各々は、それぞれスライスと同じ数（４個）設けられ、それぞれがスライスＳ０〜Ｓ３に対応づけられている。そして、選択回路３１〜３３によって与えられるデータ（スライス）をそれぞれ保持する。 Each of the flip-flops 42 to 44 is provided in the same number (four) as the slices, and each is associated with the slices S0 to S3. Data (slices) given by the selection circuits 31 to 33 are held.

選択回路３４、３６、３８、４０は、バスＢＳ−Ｂ、ＢＳ−ＰＳ４、ＢＳ−ＰＳ５、ＢＳ−ＬＳ上のデータのいずれかを選択し、選択したデータをフリップフロップ４５〜４８にそれぞれ格納する。フリップフロップ４５〜４８の各々は、それぞれ選択回路３４、３６、３８、４０によって与えられるデータをそれぞれ保持する。そしてフリップフロップ４５〜４８は同じデータを保持し、それぞれに保持されたデータがスライスＳ０〜Ｓ３の制御用データとして使用される。 The selection circuits 34, 36, 38, and 40 select any of the data on the buses BS-B, BS-PS4, BS-PS5, and BS-LS, and store the selected data in the flip-flops 45 to 48, respectively. . Each of flip-flops 45 to 48 holds data provided by selection circuits 34, 36, 38, and 40, respectively. The flip-flops 45 to 48 hold the same data, and the data held in the flip-flops 45 to 48 are used as control data for the slices S0 to S3.

選択回路３５、３７、３９、４１は、バスＢＳ−Ｃ、ＢＳ−ＰＳ４、ＢＳ−ＰＳ５、ＢＳ−ＬＳ上のデータのいずれかを選択し、選択したデータをレジスタＲＤ、ＲＥ、ＲＦ、ＲＧにそれぞれ格納する。レジスタＲＤ〜ＲＧの各々は、それぞれ選択回路３５、３７、３９、４１によって与えられるデータをそれぞれ保持する。シャッフラー５３においてシャッフル演算を行う場合には、レジスタＲＤ〜ＲＧの全てにレジスタＲＣのデータが格納される。そして、レジスタＲＤ〜ＲＧのそれぞれに格納されたデータが、スライスＳ０〜Ｓ３に関する演算の制御用データとして使用される。 The selection circuits 35, 37, 39, and 41 each select one of the data on the buses BS-C, BS-PS4, BS-PS5, and BS-LS, and store the selected data in the registers RD, RE, RF, and RG, respectively. Each of the registers RD to RG holds the data supplied by the selection circuits 35, 37, 39, and 41, respectively. When the shuffler 53 performs a shuffle operation, the data of register RC is stored in all of the registers RD to RG. The data stored in the registers RD to RG is then used as control data for the operations on slices S0 to S3, respectively.

バスＢＳ−Ａ０〜ＢＳ−Ａ３のそれぞれは、フリップフロップ４３の各々によって与えられる、レジスタＲＡにおけるスライスＳ０〜Ｓ３を伝送する。また、バスＢＳ−Ｂ０〜ＢＳ−Ｂ３のそれぞれは、フリップフロップ４４の各々によって与えられる、レジスタＲＢにおけるスライスＳ０〜Ｓ３を伝送する。更にバスＢＳ−Ｃ０〜ＢＳ−Ｃ３のそれぞれは、フリップフロップ４２の各々によって与えられる、レジスタＲＣにおけるスライスＳ０〜Ｓ３を伝送する。 The buses BS-A0 to BS-A3 transmit slices S0 to S3 of register RA, respectively, as supplied by the flip-flops 43. The buses BS-B0 to BS-B3 transmit slices S0 to S3 of register RB, respectively, as supplied by the flip-flops 44. The buses BS-C0 to BS-C3 transmit slices S0 to S3 of register RC, respectively, as supplied by the flip-flops 42.

積和演算器５０は、２つの処理ステージＰＳ３、ＰＳ４において、レジスタＲＡ〜ＲＣにおけるデータの積和処理を行う。積和処理とは、ある時刻において入力データの乗算を行ってこれを蓄積し、次の時刻において次の入力信号の乗算を行い、更にその乗算結果と蓄積されている前の時刻の乗算結果とを加算して蓄積する処理のことを言う。本実施形態に係る積和演算器５０は、スライス単位のＳＩＭＤ動作により上記積和処理を実行する。そして積和処理においては、スライス間でのデータの通信は行われない。従って積和演算器５０は、上記第１の実施形態で説明した第２演算器２０に相当する。 The product-sum operation unit 50 performs product-sum processing of data in the registers RA to RC in the two processing stages PS3 and PS4. Multiply-and-accumulate processing multiplies input data at a certain time and accumulates it, multiplies the next input signal at the next time, and further multiplies the multiplication result and the accumulated multiplication result at the previous time. This is the process of adding and accumulating. The product-sum operation unit 50 according to the present embodiment performs the product-sum processing by the SIMD operation in units of slices. In the product-sum process, data communication between slices is not performed. Therefore, the product-sum calculator 50 corresponds to the second calculator 20 described in the first embodiment.

算術演算器５１は、処理ステージＰＳ３において、レジスタＲＡ、ＲＢにおけるデータの算術演算を行う。算術演算の具体例は、例えば加算命令や減算命令である。本実施形態に係る算術演算器５１は、スライス単位のＳＩＭＤ動作により上記算術演算を実行する。そして上記算術処理においては、スライス間でのデータの通信は行われない。従って算術演算器５１も、上記第１の実施形態で説明した第２演算器２０に相当する。 The arithmetic operator 51 performs an arithmetic operation on the data in the registers RA and RB in the processing stage PS3. A specific example of the arithmetic operation is, for example, an addition instruction or a subtraction instruction. The arithmetic operator 51 according to the present embodiment performs the arithmetic operation by SIMD operation in units of slices. In the arithmetic processing, data communication between slices is not performed. Therefore, the arithmetic operator 51 also corresponds to the second operator 20 described in the first embodiment.

論理演算器５２は、処理ステージＰＳ３において、レジスタＲＡ、ＲＢにおけるデータの論理演算を行う。論理演算の具体例は、例えば論理積（ＡＮＤ）命令、論理和（ＯＲ）命令、または排他的論理和（ＸＯＲ）命令である。本実施形態に係る論理演算器５２は、スライス単位のＳＩＭＤ動作により上記論理演算を実行する。そして上記論理処理においては、スライス間でのデータの通信は行われない。従って論理演算器５２は、上記第１の実施形態で説明した第２演算器２０に相当する。 The logical operation unit 52 performs logical operation on data in the registers RA and RB in the processing stage PS3. Specific examples of the logical operation are, for example, a logical product (AND) instruction, a logical sum (OR) instruction, or an exclusive logical sum (XOR) instruction. The logical operation unit 52 according to the present embodiment executes the logical operation by a SIMD operation in units of slices. In the logical processing, data communication between slices is not performed. Therefore, the logical operation unit 52 corresponds to the second operation unit 20 described in the first embodiment.

シャッフラー５３は、２つの処理ステージＰＳ３、ＰＳ４において、レジスタＲＡ、ＲＢにおけるデータを用いたシャッフル演算を行う。シャッフル演算の詳細は第１の実施形態で説明した通りである。本実施形態に係るシャッフラー５３も、スライス単位のＳＩＭＤ動作により上記シャッフル演算を実行する。シャッフラー５３は、上記第１の実施形態で説明した第１演算器１０に相当する。 The shuffler 53 performs a shuffle operation using data in the registers RA and RB in the two processing stages PS3 and PS4. The details of the shuffle calculation are as described in the first embodiment. The shuffler 53 according to the present embodiment also performs the shuffle calculation by the SIMD operation in units of slices. The shuffler 53 corresponds to the first computing unit 10 described in the first embodiment.

選択回路７０は処理ステージＰＳ４において、算術演算器５１における演算結果（１２８ビット＝４スライス）と、論理演算器５２における演算結果（１２８ビット＝４スライス）とのいずれかを選択し、バスＢＳ−ＰＳ４に出力する。 In the processing stage PS4, the selection circuit 70 selects either the operation result in the arithmetic operator 51 (128 bits = 4 slices) or the operation result in the logic operator 52 (128 bits = 4 slices), and the bus BS− Output to PS4.

フリップフロップ６６の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ４において、算術演算器５１におけるスライスＳ０〜Ｓ３についての演算結果をそれぞれ格納する。 Each of the flip-flops 66 is associated with each of the slices S0 to S3. In the processing stage PS4, the calculation results for the slices S0 to S3 in the arithmetic calculator 51 are stored.

フリップフロップ６７の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ４において、論理演算器５２におけるスライスＳ０〜Ｓ３についての演算結果をそれぞれ格納する。 Each of the flip-flops 67 is associated with slices S0 to S3. Then, in the processing stage PS4, the calculation results for the slices S0 to S3 in the logic calculator 52 are stored.

選択回路６８は処理ステージＰＳ５において、積和演算器５０における演算結果（１２８ビット＝４スライス）、フリップフロップ６６に格納されたデータ（１２８ビット＝４スライス）、フリップフロップ６７に格納されたデータ（１２８ビット＝４スライス）、及びシャッフラー５３における演算結果（１２８ビット＝４スライス）のいずれかを選択し、バスＢＳ−ＰＳ５に出力する。 In the processing stage PS5, the selection circuit 68 calculates the operation result in the sum-of-products calculator 50 (128 bits = 4 slices), the data stored in the flip-flop 66 (128 bits = 4 slices), and the data stored in the flip-flop 67 ( One of 128 bits = 4 slices) and the operation result in the shuffler 53 (128 bits = 4 slices) is selected and output to the bus BS-PS5.

次に、上記構成のプロセッサ１における積和演算器５０、算術演算器５１、論理演算器５２、及びシャッフラー５３の構成について説明する。 Next, the configuration of the product-sum operation unit 50, the arithmetic operation unit 51, the logical operation unit 52, and the shuffler 53 in the processor 1 having the above configuration will be described.
＜積和演算器５０について＞ <About the product-sum operation unit 50>
本実施形態に係る積和演算器５０は、複数の乗算器５４、複数の加算器５６、及び複数のフリップフロップ５５、５７を備えている。 The product-sum operation unit 50 according to the present embodiment includes a plurality of multipliers 54, a plurality of adders 56, and a plurality of flip-flops 55 and 57.

乗算器５４の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられており、処理ステージＰＳ３においてそれぞれスライスＳ０〜Ｓ３についての乗算を行う。すなわち乗算器５４の各々は、バスＢＳ−Ａ０〜Ａ３、ＢＳ−Ｂ０〜Ｂ３、ＢＳ−Ｃ０〜Ｃ３からデータを受け取る。そして、レジスタＲＡ〜ＲＣのスライスＳ０の乗算、レジスタＲＡ〜ＲＣのスライスＳ１の乗算、レジスタＲＡ〜ＲＣのスライスＳ２の乗算、及びレジスタＲＡ〜ＲＣのスライスＳ３の乗算をそれぞれ行う。 Each of the multipliers 54 is associated with the slices S0 to S3, and performs multiplication for the slices S0 to S3 in the processing stage PS3. That is, each of the multipliers 54 receives data from the buses BS-A0 to A3, BS-B0 to B3, and BS-C0 to C3. Then, multiplication of the slice S0 of the registers RA to RC, multiplication of the slice S1 of the registers RA to RC, multiplication of the slice S2 of the registers RA to RC, and multiplication of the slice S3 of the registers RA to RC are performed.

フリップフロップ５５の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ３において、乗算器５４の各々におけるスライスＳ０〜Ｓ３についての乗算結果を格納する。 Each of the flip-flops 55 is associated with each of the slices S0 to S3. In the processing stage PS3, the multiplication results for the slices S0 to S3 in each of the multipliers 54 are stored.

加算器５６の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられており、処理ステージＰＳ４においてそれぞれスライスＳ０〜Ｓ３についての加算を行う。すなわち加算器５６の各々は、フリップフロップ５５の各々に保持される乗算結果と、当該乗算結果が得られた乗算の直前の処理サイクルに処理ステージＰＳ３で得られた乗算結果とを、スライス毎に加算する。 Each of the adders 56 is associated with one of the slices S0 to S3 and performs addition for that slice in the processing stage PS4. That is, each adder 56 adds, slice by slice, the multiplication result held in the corresponding flip-flop 55 and the multiplication result obtained in the processing stage PS3 in the processing cycle immediately preceding the multiplication that produced that result.

フリップフロップ５７の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ４において、加算器５６の各々におけるスライスＳ０〜Ｓ３についての加算結果を格納する。そしてフリップフロップ５７に保持されたデータが、選択回路６８に与えられる。 Each of the flip-flops 57 is associated with each of the slices S0 to S3. Then, in the processing stage PS4, the addition results for the slices S0 to S3 in each of the adders 56 are stored. Then, the data held in the flip-flop 57 is given to the selection circuit 68.
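
As a rough illustration of the per-slice product-sum processing, the following Python sketch accumulates products independently in each of the four slices. The unsigned 32-bit wraparound, the explicit accumulator list, and the names `mac_step` and `mac` are assumptions made for this sketch; the text does not specify operand signedness or overflow behaviour.

```python
MASK32 = 0xFFFFFFFF  # ASSUMPTION: each slice is an unsigned 32-bit value


def mac_step(acc, a, b):
    """One product-sum step: per slice, acc[s] + a[s] * b[s], wrapped to 32 bits."""
    return [(c + x * y) & MASK32 for c, x, y in zip(acc, a, b)]


def mac(stream):
    """Accumulate products over a stream of (a, b) pairs, one list entry per slice."""
    acc = [0, 0, 0, 0]  # slices S0..S3 start from zero
    for a, b in stream:
        acc = mac_step(acc, a, b)
    return acc
```

Each list position only ever combines with the same position of the other operands, which is the property the text relies on when it says that no data is communicated between slices.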

＜算術演算器５１について＞ <About the arithmetic operation unit 51>
本実施形態に係る算術演算器５１は、複数の演算器（加算器または減算器）５８、及び複数のフリップフロップ５９を備えている。 The arithmetic operation unit 51 according to this embodiment includes a plurality of operation units (adders or subtractors) 58 and a plurality of flip-flops 59.

演算器５８の各々はそれぞれスライスＳ０〜Ｓ３に対応づけられており、処理ステージＰＳ３においてそれぞれスライスＳ０〜Ｓ３についての加算または減算を行う。すなわち演算器５８の各々は、バスＢＳ−Ａ０〜Ａ３及びＢＳ−Ｂ０〜Ｂ３からデータを受け取る。そして、レジスタＲＡ、ＲＢのスライスＳ０の加算または減算、レジスタＲＡ、ＲＢのスライスＳ１の加算または減算、レジスタＲＡ、ＲＢのスライスＳ２の加算または減算、及びレジスタＲＡ、ＲＢのスライスＳ３の加算または減算を行う。 Each of the operation units 58 is associated with one of the slices S0 to S3 and performs addition or subtraction for that slice in the processing stage PS3. That is, the operation units 58 receive data from the buses BS-A0 to A3 and BS-B0 to B3, and perform addition or subtraction of slice S0 of registers RA and RB, of slice S1 of registers RA and RB, of slice S2 of registers RA and RB, and of slice S3 of registers RA and RB, respectively.

フリップフロップ５９の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ３において、演算器５８の各々におけるスライスＳ０〜Ｓ３についての算術演算結果を格納する。そしてフリップフロップ５９に保持されたデータが、選択回路７０及びフリップフロップ６６に与えられる。 Each of the flip-flops 59 is associated with each of the slices S0 to S3. Then, in the processing stage PS3, the arithmetic operation results for the slices S0 to S3 in each of the calculators 58 are stored. The data held in the flip-flop 59 is supplied to the selection circuit 70 and the flip-flop 66.
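
The slice-wise arithmetic described above can be sketched in Python by treating a 128-bit register as four independent 32-bit slices. The slice ordering (slice index 0 taken as the least significant 32 bits) and the helper names are assumptions for illustration only; the point of the sketch is that a carry or borrow never crosses a slice boundary.

```python
SLICE_BITS = 32
NUM_SLICES = 4
MASK = (1 << SLICE_BITS) - 1


def split_slices(value):
    """Split a 128-bit integer into four 32-bit slices (index 0 = lowest bits, by assumption)."""
    return [(value >> (SLICE_BITS * i)) & MASK for i in range(NUM_SLICES)]


def join_slices(slices):
    """Reassemble four 32-bit slices into one 128-bit integer."""
    out = 0
    for i, s in enumerate(slices):
        out |= (s & MASK) << (SLICE_BITS * i)
    return out


def simd_add(ra, rb):
    """Add slice by slice; each sum wraps inside its own slice."""
    return join_slices([(a + b) & MASK for a, b in zip(split_slices(ra), split_slices(rb))])


def simd_sub(ra, rb):
    """Subtract slice by slice; each difference wraps inside its own slice."""
    return join_slices([(a - b) & MASK for a, b in zip(split_slices(ra), split_slices(rb))])
```

For example, `simd_add(0xFFFFFFFF, 1)` wraps the lowest slice to zero without carrying into the next slice, whereas an ordinary 128-bit addition would set bit 32.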

＜論理演算器５２について＞ <About the logical operation unit 52>
本実施形態に係る論理演算器５２は、複数の演算器６０、及び複数のフリップフロップ６１を備えている。演算器６０は、例えばＡＮＤ演算、ＯＲ演算、またはＸＯＲ演算を行う。 The logical operation unit 52 according to the present embodiment includes a plurality of operation units 60 and a plurality of flip-flops 61. The operation units 60 perform, for example, an AND operation, an OR operation, or an XOR operation.

演算器６０の各々はそれぞれスライスＳ０〜Ｓ３に対応づけられており、処理ステージＰＳ３においてそれぞれスライスＳ０〜Ｓ３についての論理演算を行う。すなわち演算器６０の各々は、バスＢＳ−Ａ０〜Ａ３及びＢＳ−Ｂ０〜Ｂ３からデータを受け取る。そして、レジスタＲＡ、ＲＢのスライスＳ０の論理演算、レジスタＲＡ、ＲＢのスライスＳ１の論理演算、レジスタＲＡ、ＲＢのスライスＳ２の論理演算、及びレジスタＲＡ、ＲＢのスライスＳ３の論理演算を行う。 Each of the computing units 60 is associated with the slices S0 to S3, and performs logical operations on the slices S0 to S3 in the processing stage PS3. That is, each of the arithmetic units 60 receives data from the buses BS-A0 to A3 and BS-B0 to B3. Then, the logical operation of the slice S0 of the registers RA and RB, the logical operation of the slice S1 of the registers RA and RB, the logical operation of the slice S2 of the registers RA and RB, and the logical operation of the slice S3 of the registers RA and RB are performed.

フリップフロップ６１の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ３において、演算器６０の各々におけるスライスＳ０〜Ｓ３についての論理演算結果を格納する。そしてフリップフロップ６１に保持されたデータが、選択回路７０及びフリップフロップ６７に与えられる。 Each of the flip-flops 61 is associated with each of the slices S0 to S3. Then, in the processing stage PS3, the logical operation results for the slices S0 to S3 in each of the arithmetic units 60 are stored. Then, the data held in the flip-flop 61 is given to the selection circuit 70 and the flip-flop 67.

＜シャッフラー５３について＞ <About the shuffler 53>
本実施形態に係るシャッフラー５３は、複数の第１演算部６２−０〜６２−３、第２演算部６４、及び複数のフリップフロップ６３−０〜６３−３、６５を備えている。 The shuffler 53 according to the present embodiment includes a plurality of first arithmetic units 62-0 to 62-3, a second arithmetic unit 64, and a plurality of flip-flops 63-0 to 63-3 and 65.

第１演算部６２−０〜６２−３の各々は、第１の実施形態で説明した図１における第１演算部１１−０〜１１−３に相当する。すなわち第１演算部６２−０〜６２−３は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ３において、それぞれスライスＳ０〜Ｓ３についてのシャッフル演算に関する処理を行う。すなわち第１演算部６２−０〜６２−３の各々は、バスＢＳ−Ａ０〜Ａ３及びＢＳ−Ｂ０〜Ｂ３から、シャッフルすべきデータを受け取る。更に、レジスタＲＤ〜ＲＧ及びフリップフロップ４５〜４８からそれぞれ制御信号を受け取る。そして、これらのデータと制御信号とに基づいて、スライス間でのデータ通信を必要としない処理を行う。 Each of the first calculation units 62-0 to 62-3 corresponds to the first calculation units 11-0 to 11-3 in FIG. 1 described in the first embodiment. That is, the first arithmetic units 62-0 to 62-3 are associated with the slices S0 to S3, respectively. Then, in the processing stage PS3, processing related to shuffle calculation is performed for each of the slices S0 to S3. That is, each of the first arithmetic units 62-0 to 62-3 receives data to be shuffled from the buses BS-A0 to A3 and BS-B0 to B3. Further, control signals are received from the registers RD to RG and the flip-flops 45 to 48, respectively. Based on these data and the control signal, processing that does not require data communication between slices is performed.

フリップフロップ６３−０〜６３−３の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ３において、第１演算部６２−０〜６２−３の各々におけるスライスＳ０〜Ｓ３についての演算結果を格納する。 Each of flip-flops 63-0 to 63-3 is associated with slices S0 to S3, respectively. In the processing stage PS3, the calculation results for the slices S0 to S3 in each of the first calculation units 62-0 to 62-3 are stored.

第２演算部６４は、第１の実施形態で説明した図１における第２演算部１２に相当する。すなわち第２演算部６４は、フリップフロップ６３−０〜６３−３に保持されるデータにつき、スライス間でのデータの通信を必要とする処理を行い、シャッフル演算を完了させる。 The second calculation unit 64 corresponds to the second calculation unit 12 in FIG. 1 described in the first embodiment. That is, the second calculation unit 64 performs processing that requires data communication between slices on the data held in the flip-flops 63-0 to 63-3, and completes the shuffle calculation.

フリップフロップ６５の各々は、それぞれスライスＳ０〜Ｓ３に対応づけられている。そして処理ステージＰＳ４において、第２演算部６４で得られたシャッフル演算結果を格納する。そしてフリップフロップ６５に保持されたデータが、選択回路６８に与えられる。 Each of the flip-flops 65 is associated with each of the slices S0 to S3. Then, in the processing stage PS4, the shuffle calculation result obtained by the second calculation unit 64 is stored. Then, the data held in the flip-flop 65 is given to the selection circuit 68.
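
One way to picture the split between the first arithmetic units and the second arithmetic unit is the two-stage selection sketched below in Python. The way the 5-bit index b is divided (low two bits for the byte within a slice, next two bits for the slice, top bit for RA versus RB) is an assumption made for this sketch, as are the function names; fixed-value handling by the mask logic is omitted.

```python
SLICES = 4           # slices S0..S3
BYTES_PER_SLICE = 4  # a 32-bit slice holds four bytes


def stage1_per_slice(ra, rb, rc):
    """Stage PS3: inside each slice, pre-select one RA byte and one RB byte per target byte."""
    cand = []
    for s in range(SLICES):
        base = s * BYTES_PER_SLICE
        a_slice = ra[base:base + BYTES_PER_SLICE]  # only this slice's data is touched
        b_slice = rb[base:base + BYTES_PER_SLICE]
        per_target = []
        for j in range(16):
            k = rc[j] & 0b11                       # byte offset within the slice
            per_target.append((a_slice[k], b_slice[k]))
        cand.append(per_target)
    return cand  # cand[s][j] = (RA candidate, RB candidate), held between stages


def stage2_cross_slice(cand, rc):
    """Stage PS4: for each target byte, pick the slice and the source register."""
    rt = bytearray(16)
    for j in range(16):
        b = rc[j] & 0x1F
        s = (b >> 2) & 0b11   # which slice the byte comes from
        from_rb = b >> 4      # 0 selects RA, 1 selects RB
        rt[j] = cand[s][j][from_rb]
    return bytes(rt)


def shuffle_two_stage(ra, rb, rc):
    return stage2_cross_slice(stage1_per_slice(ra, rb, rc), rc)
```

Under these assumptions the first stage needs only the control data, which is copied to every slice, plus the data of its own slice, so all movement of bytes between slices is confined to the second stage.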

＜プロセッサ全体の大まかな動作について＞ <Rough operation of the entire processor>
次に、上記構成のプロセッサ１の大まかな動作について説明する。パイプライン動作は、処理ステージＰＳ１からＰＳ２、ＰＳ３、ＰＳ４、及びＰＳ５の順序で実行される。 Next, a rough operation of the processor 1 having the above configuration will be described. The pipeline operation is executed in the order of processing stages PS1 to PS2, PS3, PS4, and PS5.

処理ステージＰＳ１においては、汎用レジスタ２内においてレジスタＲＡ〜ＲＣにデータが読み出される。また、必要に応じてデータがターゲットレジスタＲＴに書き込まれる。 In the processing stage PS1, data is read into the registers RA to RC in the general-purpose register 2. Further, data is written to the target register RT as necessary.

次に処理ステージＰＳ２においては、レジスタＲＡ〜ＲＣ内のデータが、バスＢＳ−Ａ〜ＢＳ−Ｃ、ＢＳ−ＰＳ４、ＢＳ−ＰＳ５に供給される。すなわち、各バスにオペランドが供給される。また必要に応じて、ロードストアユニット３０がバスＢＳ−ＬＳにデータを供給する。 Next, in the processing stage PS2, the data in the registers RA to RC are supplied to the buses BS-A to BS-C, BS-PS4, and BS-PS5. That is, an operand is supplied to each bus. Further, the load store unit 30 supplies data to the bus BS-LS as necessary.

次に処理ステージＰＳ３においては、積和演算器５０、算術演算器５１、論理演算器５２、及びシャッフラー５３における演算が行われる。但し、シャッフラー５３において行われる演算は、スライス間でのデータ通信を生じない演算に限られる。 Next, in the processing stage PS3, calculations in the product-sum calculator 50, the arithmetic calculator 51, the logic calculator 52, and the shuffler 53 are performed. However, operations performed in the shuffler 53 are limited to operations that do not cause data communication between slices.

引き続き処理ステージＰＳ４においては、積和演算器５０及びシャッフラー５３における演算が行われる。シャッフラー５３において行われる演算は、スライス間でのデータ通信が生じる演算である。また、処理ステージＰＳ４では算術演算器５１及び論理演算器５２における演算が既に完了している。従って、本処理ステージＰＳ４において、算術演算器５１及び論理演算器５２における演算結果を、バスＢＳ−ＰＳ４に供給しても良い。バスＢＳ−ＰＳ４に供給された演算結果の汎用レジスタ２への書き戻しは、上記処理ステージＰＳ１において行われる。 Next, in the processing stage PS4, the operations in the product-sum operation unit 50 and the shuffler 53 are performed. The operation performed in the shuffler 53 at this stage is the one in which data is communicated between slices. By the processing stage PS4, the operations in the arithmetic operation unit 51 and the logical operation unit 52 have already been completed. The results from the arithmetic operation unit 51 and the logical operation unit 52 may therefore be supplied to the bus BS-PS4 in this processing stage PS4. The results supplied to the bus BS-PS4 are written back to the general-purpose register 2 in the processing stage PS1.

最後に処理ステージＰＳ５においては、フリップフロップ５７、６６、６７、６５に格納された演算結果が、バスＢＳ−ＰＳ５に供給される。バスＢＳ−ＰＳ５に供給された演算結果の汎用レジスタ２への書き戻しは、上記処理ステージＰＳ１において行われる。 Finally, in the processing stage PS5, the operation results stored in the flip-flops 57, 66, 67, 65 are supplied to the bus BS-PS5. The calculation result supplied to the bus BS-PS5 is written back to the general-purpose register 2 in the processing stage PS1.
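The stage roles described above can be summarized in a small lookup table. This is only an illustrative sketch: the dictionary name `ACTIVE_STAGES` and the helper `result_bus` are hypothetical, and the rule that a unit finishing in PS4 hands its result to BS-PS5 is an assumption inferred from the text, which states explicitly only that the units 51 and 52 may drive BS-PS4.

```python
# Which processing stages each execution unit is active in, per the description.
ACTIVE_STAGES = {
    "product-sum calculator 50": ("PS3", "PS4"),
    "arithmetic calculator 51": ("PS3",),
    "logic calculator 52": ("PS3",),
    "shuffler 53": ("PS3", "PS4"),  # PS3: within a slice, PS4: across slices
}


def result_bus(unit: str) -> str:
    """Return the result bus a unit would use (assumed rule, see lead-in)."""
    last_stage = ACTIVE_STAGES[unit][-1]
    # A unit done in PS3 may supply its result to BS-PS4 during PS4;
    # a unit still computing in PS4 is assumed to use BS-PS5 during PS5.
    return "BS-PS4" if last_stage == "PS3" else "BS-PS5"


for name in ACTIVE_STAGES:
    print(name, "->", result_bus(name))
```

In both cases the write-back to the general-purpose register 2 takes place in the processing stage PS1, as stated above.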

次に、上記シャッフラー５３の詳細な構成と動作について説明する。 Next, a detailed configuration and operation of the shuffler 53 will be described.

＜シャッフラー５３の構成の詳細について＞ <Details of the configuration of the shuffler 53>
図１４は、シャッフラー５３の詳細を示す回路図である。図１４では紙面の都合上、シャッフラー５３の一部のみ（バイトＢ３２に代入すべきバイトを選択する部分のみ）を示すと共に、説明の便宜上、レジスタＲＡ、ＲＢ、ＲＤ〜ＲＧ、ＲＴも示している。 FIG. 14 is a circuit diagram showing details of the shuffler 53. Because of space limitations, FIG. 14 shows only a part of the shuffler 53 (only the part that selects the byte to be assigned to the byte B32); for convenience of explanation, it also shows the registers RA, RB, RD to RG, and RT.

図示するようにシャッフラー５３は、選択回路８０−０Ａ〜８０−３Ａ、８０−０Ｂ〜８０−３Ｂ、８１−０〜８１−３、８２、マスク論理８３、及び前述のフリップフロップ６３−０〜６３−３を備えている。 As shown in the figure, the shuffler 53 includes selection circuits 80-0A to 80-3A, 80-0B to 80-3B, 81-0 to 81-3, and 82, mask logic 83, and the aforementioned flip-flops 63-0 to 63-3.

選択回路８０−０Ａ〜８０−３Ａ、８０−０Ｂ〜８０−３Ｂは、それぞれ１６個ずつ設けられると共に、それぞれスライスＳ０〜Ｓ３に対応づけられている。すなわち、１６個の選択回路８０−０Ａは、レジスタＲＡにおけるスライスＳ０のいずれかのバイト、すなわちバイトＢ００〜Ｂ０３のいずれかを選択する。そして１６個の選択回路８０−０Ａの各々は、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 Sixteen each of the selection circuits 80-0A to 80-3A and 80-0B to 80-3B are provided, and they are associated with the slices S0 to S3, respectively. That is, each of the 16 selection circuits 80-0A selects one byte of the slice S0 in the register RA, that is, one of the bytes B00 to B03. The 16 selection circuits 80-0A are associated one-to-one with the bytes B32 to B47 in the target register RT.

また１６個の選択回路８０−１Ａは、レジスタＲＡにおけるスライスＳ１のいずれかのバイト、すなわちバイトＢ０４〜Ｂ０７のいずれかを選択する。そして１６個の選択回路８０−１Ａの各々も、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 The 16 selection circuits 80-1A select any byte of the slice S1 in the register RA, that is, any one of the bytes B04 to B07. Each of the 16 selection circuits 80-1A is also associated with bytes B32 to B47 in the target register RT.

また１６個の選択回路８０−２Ａは、レジスタＲＡにおけるスライスＳ２のいずれかのバイト、すなわちバイトＢ０８〜Ｂ１１のいずれかを選択する。そして１６個の選択回路８０−２Ａの各々も、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 The 16 selection circuits 80-2A select any byte of the slice S2 in the register RA, that is, any one of the bytes B08 to B11. Each of the 16 selection circuits 80-2A is also associated with bytes B32 to B47 in the target register RT.

更に１６個の選択回路８０−３Ａは、レジスタＲＡにおけるスライスＳ３のいずれかのバイト、すなわちバイトＢ１２〜Ｂ１５のいずれかを選択する。そして１６個の選択回路８０−３Ａの各々も、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 Further, the 16 selection circuits 80-3A select any byte of the slice S3 in the register RA, that is, any one of the bytes B12 to B15. Each of the 16 selection circuits 80-3A is also associated with bytes B32 to B47 in the target register RT.

選択回路８０−０Ｂ〜８０−３Ｂも同様である。すなわち、１６個の選択回路８０−０Ｂは、レジスタＲＢにおけるスライスＳ０のいずれかのバイト、すなわちバイトＢ１６〜Ｂ１９のいずれかを選択する。そして１６個の選択回路８０−０Ｂの各々は、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 The same applies to the selection circuits 80-0B to 80-3B. That is, the 16 selection circuits 80-0B select any byte of the slice S0 in the register RB, that is, any one of the bytes B16 to B19. Each of the 16 selection circuits 80-0B is associated with bytes B32 to B47 in the target register RT.

また１６個の選択回路８０−１Ｂは、レジスタＲＢにおけるスライスＳ１のいずれかのバイト、すなわちバイトＢ２０〜Ｂ２３のいずれかを選択する。そして１６個の選択回路８０−１Ｂの各々も、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 The 16 selection circuits 80-1B select any byte of the slice S1 in the register RB, that is, any one of the bytes B20 to B23. Each of the 16 selection circuits 80-1B is also associated with bytes B32 to B47 in the target register RT.

また１６個の選択回路８０−２Ｂは、レジスタＲＢにおけるスライスＳ２のいずれかのバイト、すなわちバイトＢ２４〜Ｂ２７のいずれかを選択する。そして１６個の選択回路８０−２Ｂの各々も、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 The 16 selection circuits 80-2B select any byte of the slice S2 in the register RB, that is, any one of the bytes B24 to B27. Each of the 16 selection circuits 80-2B is also associated with bytes B32 to B47 in the target register RT.

更に１６個の選択回路８０−３Ｂは、レジスタＲＢにおけるスライスＳ３のいずれかのバイト、すなわちバイトＢ２８〜Ｂ３１のいずれかを選択する。そして１６個の選択回路８０−３Ｂの各々も、ターゲットレジスタＲＴにおけるバイトＢ３２〜Ｂ４７に対応づけられている。 Further, the 16 selection circuits 80-3B select any byte of the slice S3 in the register RB, that is, any one of the bytes B28 to B31. Each of the 16 selection circuits 80-3B is also associated with bytes B32 to B47 in the target register RT.

選択回路８１−０〜８１−３は、それぞれ１６個ずつ設けられると共に、それぞれスライスＳ０〜Ｓ３に対応づけられている。１６個の選択回路８１−０の各々は、バイトＢ３２〜Ｂ４７の各々に対応づけられている。そして選択回路８１−０の各々は、それぞれバイトＢ３２〜Ｂ４７に対応づけられた選択回路８０−０Ａ、８０−０Ｂのいずれか一方の出力を選択する。すなわち、Ｂ００〜Ｂ０３、Ｂ１６〜Ｂ１９のいずれかを選択する。 Sixteen each of the selection circuits 81-0 to 81-3 are provided, and they are associated with the slices S0 to S3, respectively. Each of the 16 selection circuits 81-0 is associated with one of the bytes B32 to B47. Each selection circuit 81-0 selects the output of either the selection circuit 80-0A or the selection circuit 80-0B associated with the same byte among B32 to B47. That is, it selects one of B00 to B03 and B16 to B19.

１６個の選択回路８１−１の各々は、バイトＢ３２〜Ｂ４７の各々に対応づけられている。そして選択回路８１−１の各々は、それぞれバイトＢ３２〜Ｂ４７に対応づけられた選択回路８０−１Ａ、８０−１Ｂのいずれか一方の出力を選択する。すなわち、Ｂ０４〜Ｂ０７、Ｂ２０〜Ｂ２３のいずれかを選択する。 Each of the 16 selection circuits 81-1 is associated with one of the bytes B32 to B47. Each selection circuit 81-1 selects the output of either the selection circuit 80-1A or the selection circuit 80-1B associated with the same byte among B32 to B47. That is, it selects one of B04 to B07 and B20 to B23.

１６個の選択回路８１−２の各々は、バイトＢ３２〜Ｂ４７の各々に対応づけられている。そして選択回路８１−２の各々は、それぞれバイトＢ３２〜Ｂ４７に対応づけられた選択回路８０−２Ａ、８０−２Ｂのいずれか一方の出力を選択する。すなわち、Ｂ０８〜Ｂ１１、Ｂ２４〜Ｂ２７のいずれかを選択する。 Each of the 16 selection circuits 81-2 is associated with one of the bytes B32 to B47. Each selection circuit 81-2 selects the output of either the selection circuit 80-2A or the selection circuit 80-2B associated with the same byte among B32 to B47. That is, it selects one of B08 to B11 and B24 to B27.

１６個の選択回路８１−３の各々は、バイトＢ３２〜Ｂ４７の各々に対応づけられている。そして選択回路８１−３の各々は、それぞれバイトＢ３２〜Ｂ４７に対応づけられた選択回路８０−３Ａ、８０−３Ｂのいずれか一方の出力を選択する。すなわち、Ｂ１２〜Ｂ１５、Ｂ２８〜Ｂ３１のいずれかを選択する。 Each of the 16 selection circuits 81-3 is associated with one of the bytes B32 to B47. Each selection circuit 81-3 selects the output of either the selection circuit 80-3A or the selection circuit 80-3B associated with the same byte among B32 to B47. That is, it selects one of B12 to B15 and B28 to B31.

フリップフロップ６３−０〜６３−３はそれぞれ、１６個ずつ設けられると共に、それぞれがスライスＳ０〜Ｓ３に対応づけられている。すなわち１６個のフリップフロップ６３−０の各々は、１６個の選択回路８１−０の各々で選択されたデータを、それぞれ格納する。また１６個のフリップフロップ６３−１の各々は、１６個の選択回路８１−１の各々で選択されたデータを、それぞれ格納する。更に１６個のフリップフロップ６３−２の各々は、１６個の選択回路８１−２の各々で選択されたデータを、それぞれ格納する。そして１６個のフリップフロップ６３−３の各々は、１６個の選択回路８１−３の各々で選択されたデータを、それぞれ格納する。 Sixteen each of the flip-flops 63-0 to 63-3 are provided, and they are associated with the slices S0 to S3, respectively. That is, each of the 16 flip-flops 63-0 stores the data selected by the corresponding one of the 16 selection circuits 81-0. Each of the 16 flip-flops 63-1 stores the data selected by the corresponding one of the 16 selection circuits 81-1. Further, each of the 16 flip-flops 63-2 stores the data selected by the corresponding one of the 16 selection circuits 81-2. Each of the 16 flip-flops 63-3 stores the data selected by the corresponding one of the 16 selection circuits 81-3.

選択回路８２は、それぞれバイトＢ３２〜Ｂ４７に対応づけられて１６個、設けられている。そして１６個の選択回路８２の各々は、バイトＢ３２〜Ｂ４７に対応づけられたフリップフロップ６３−０〜６３−３に保持されるデータのいずれかを、それぞれ選択する。 Sixteen selection circuits 82 are provided in association with the bytes B32 to B47, respectively. Each of the 16 selection circuits 82 selects any of the data held in the flip-flops 63-0 to 63-3 associated with the bytes B32 to B47.

マスク論理８３は、それぞれバイトＢ３２〜Ｂ４７に対応づけられて１６個、設けられている。そして１６個のマスク論理８３の各々は、１６個の選択回路８２の各々において選択されたデータを、必要に応じて固定値にフォースし、レジスタＲＴのバイトＢ３２〜Ｂ４７に代入する。 Sixteen mask logics 83 are provided in association with the bytes B32 to B47, respectively. Each of the 16 mask logics 83 forces the data selected in each of the 16 selection circuits 82 to a fixed value as necessary, and substitutes it into bytes B32 to B47 of the register RT.

上記構成において、１６個の選択回路８０−０Ａ、８０−０Ｂ及び１６個の選択回路８１−０がスライスＳ０に関する処理を行うユニットであり、図１２における第１演算部６２−０に相当する。また、１６個の選択回路８０−１Ａ、８０−１Ｂ及び１６個の選択回路８１−１がスライスＳ１に関する処理を行うユニットであり、図１２における第１演算部６２−１に相当する。更に１６個の選択回路８０−２Ａ、８０−２Ｂ及び１６個の選択回路８１−２がスライスＳ２に関する処理を行うユニットであり、図１２における第１演算部６２−２に相当する。そして１６個の選択回路８０−３Ａ、８０−３Ｂ及び１６個の選択回路８１−３がスライスＳ３に関する処理を行うユニットであり、図１２における第１演算部６２−３に相当する。そして以上の構成によって行われる処理は、処理ステージＰＳ３において行われる。 In the above configuration, the 16 selection circuits 80-0A and 80-0B and the 16 selection circuits 81-0 are units that perform processing related to the slice S0, and correspond to the first calculation unit 62-0 in FIG. In addition, the 16 selection circuits 80-1A and 80-1B and the 16 selection circuits 81-1 are units that perform processing related to the slice S1, and correspond to the first calculation unit 62-1 in FIG. Further, the 16 selection circuits 80-2A and 80-2B and the 16 selection circuits 81-2 are units that perform processing related to the slice S2, and correspond to the first calculation unit 62-2 in FIG. The 16 selection circuits 80-3A and 80-3B and the 16 selection circuits 81-3 are units that perform processing related to the slice S3, and correspond to the first calculation unit 62-3 in FIG. And the process performed by the above structure is performed in process stage PS3.

また、選択回路８２及びマスク論理８３がスライスＳ０〜Ｓ３にまたがった処理を行うユニットであり、図１２における第２演算部６４に相当する。そしてこれらの構成によって行われる処理が、処理ステージＰＳ４において行われる。 The selection circuit 82 and the mask logic 83 are units that perform processing across slices S0 to S3, and correspond to the second arithmetic unit 64 in FIG. And the process performed by these structures is performed in process stage PS4.
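As a rough aid to reading FIG. 14, the component counts implied by the description can be tallied as follows. The function name `shuffler_inventory` and the dictionary keys are hypothetical labels introduced here; the counts themselves follow from "16 of each" per slice and per target byte as stated above.

```python
SLICES = 4          # slices S0 to S3
TARGET_BYTES = 16   # bytes B32 to B47 of the target register RT
SOURCE_REGS = 2     # source registers RA and RB


def shuffler_inventory() -> dict:
    """Count the building blocks of the shuffler 53 by pipeline stage."""
    return {
        # PS3 (first calculation units 62-0 to 62-3, no cross-slice wiring)
        "sel_80": SLICES * SOURCE_REGS * TARGET_BYTES,  # 80-0A..80-3A, 80-0B..80-3B
        "sel_81": SLICES * TARGET_BYTES,                # 81-0..81-3
        "ff_63": SLICES * TARGET_BYTES,                 # 63-0..63-3, one byte each
        # PS4 (second calculation unit 64, spans all slices)
        "sel_82": TARGET_BYTES,
        "mask_83": TARGET_BYTES,
    }


inv = shuffler_inventory()
per_slice_units = inv["sel_80"] + inv["sel_81"] + inv["ff_63"]
cross_slice_units = inv["sel_82"] + inv["mask_83"]
print(inv, per_slice_units, cross_slice_units)
```

Under this tally, only the 16 selection circuits 82 and the 16 mask logics 83 span slices; everything else is local to one slice.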

上記選択回路８０−０Ａ〜８０−３Ａ、８０−０Ｂ〜８０−３Ｂ、８１−０〜８１−３、８２の選択動作、及びマスク論理８３における固定値へのフォースの要否は、レジスタＲＤ〜ＲＧ内のデータのデコード結果に応じて制御される。前述の通り、レジスタＲＤ〜ＲＧにはレジスタＲＣ内のデータがコピーされる。そして、デコーダ８４−０〜８４−３がそれぞれ、レジスタＲＤ〜ＲＧ内のデータをデコードする。そして、それぞれのデコード結果が選択回路８０−０Ａ〜８０−３Ａ、８０−０Ｂ〜８０−３Ｂ、８１−０〜８１−３へ与えられ、更にフリップフロップ８５を介して選択回路８２及びマスク論理８３へ与えられる。 The selection operations of the selection circuits 80-0A to 80-3A, 80-0B to 80-3B, 81-0 to 81-3, and 82, and whether the mask logic 83 forces a fixed value, are controlled according to the decoding results of the data in the registers RD to RG. As described above, the data in the register RC is copied to the registers RD to RG. The decoders 84-0 to 84-3 decode the data in the registers RD to RG, respectively. The respective decoding results are given to the selection circuits 80-0A to 80-3A, 80-0B to 80-3B, and 81-0 to 81-3, and are further given, via the flip-flop 85, to the selection circuits 82 and the mask logic 83.

処理ステージＰＳ３においては、デコーダ８４−０〜８４−３におけるデコード結果はそれぞれ、第１演算部６２−０〜６２−３の各々に相当する選択回路に与えられる。つまり、選択回路８０−０Ａ、８０−０Ｂ、８１−０にはデコーダ８４−０におけるデコード結果が与えられ、選択回路８０−１Ａ、８０−１Ｂ、８１−１にはデコーダ８４−１におけるデコード結果が与えられる。以下同様である。 In the processing stage PS3, the decoding results in the decoders 84-0 to 84-3 are respectively supplied to selection circuits corresponding to the first arithmetic units 62-0 to 62-3. That is, the decoding results in the decoder 84-0 are given to the selection circuits 80-0A, 80-0B, 81-0, and the decoding results in the decoder 84-1 are given to the selection circuits 80-1A, 80-1B, 81-1. Is given. The same applies hereinafter.

処理ステージＰＳ４においては、デコーダ８４−０〜８４−３の少なくともいずれかにおけるデコード結果に応じて、選択回路８２及びマスク論理８３における動作が制御される。なおデコーダ８４−０〜８４−３は、シャッフラー５３の第１演算部６２−０〜６２−３内に含まれても良い。 In the processing stage PS4, operations in the selection circuit 82 and the mask logic 83 are controlled in accordance with the decoding result in at least one of the decoders 84-0 to 84-3. The decoders 84-0 to 84-3 may be included in the first calculation units 62-0 to 62-3 of the shuffler 53.

次に、上記シャッフラー５３の動作について説明する。 Next, the operation of the shuffler 53 will be described.
＜シャッフラー５３の動作の詳細について＞ <Details of the operation of the shuffler 53>
図１４を用いて説明した上記構成により、シャッフラー５３は第１の実施形態で説明した図９及び図１０に示される演算を行う。 With the configuration described above with reference to FIG. 14, the shuffler 53 performs the operations shown in FIGS. 9 and 10 described in the first embodiment.

まず処理ステージＰＳ３における処理について説明する。処理ステージＰＳ３においては、スライスＳ０〜Ｓ３の各々において、各バイトＢ３２〜Ｂ４７への代入候補となる一つのバイトが選ばれる。以下、その詳細につき説明する。 First, processing in the processing stage PS3 will be described. In the processing stage PS3, in each of the slices S0 to S3, one byte that is a candidate for substitution into each of the bytes B32 to B47 is selected. The details will be described below.

まず選択回路８０−０Ａ〜８０−３Ａが、レジスタＲＡにおける各スライスＳ０〜Ｓ３におけるいずれかのバイトを、ターゲットレジスタＲＴの対応するバイトへの代入候補として選択する。また選択回路８０−０Ｂ〜８０−３Ｂが、レジスタＲＢにおける各スライスＳ０〜Ｓ３におけるいずれかのバイトを候補として選択する。この選択動作は、各々１６個の選択回路８０−０Ａ〜８０−３Ａ、８０−０Ｂ〜８０−３Ｂによって、バイトＢ３２〜Ｂ４７毎に行われる。 First, the selection circuits 80-0A to 80-3A select any byte in each of the slices S0 to S3 in the register RA as an assignment candidate to the corresponding byte in the target register RT. Further, the selection circuits 80-0B to 80-3B select any byte in each of the slices S0 to S3 in the register RB as a candidate. This selection operation is performed for each byte B32 to B47 by 16 selection circuits 80-0A to 80-3A and 80-0B to 80-3B.

以上の選択動作により、バイトＢ３２〜Ｂ４７毎に、１つのスライスあたり２つの代入候補が決定する。そこで次に選択回路８１−０〜８１−３が、これらの２つの代入候補のいずれかを選択し、選択した１つの代入候補をフリップフロップ６３−０〜６３−３へ代入する。 With the above selection operation, two substitution candidates are determined per slice for each of bytes B32 to B47. Therefore, next, the selection circuits 81-0 to 81-3 select one of these two substitution candidates, and substitute the selected substitution candidate into the flip-flops 63-0 to 63-3.

以上の選択動作により、各スライスあたり１つの代入候補が、バイトＢ３２〜Ｂ４７毎に決定する。つまり、各バイトＢ３２〜Ｂ４７あたり、代入候補が４つまで絞られる。 Through the above selection operation, one assignment candidate per slice is determined for each of the bytes B32 to B47. In other words, the assignment candidates for each of the bytes B32 to B47 are narrowed down to four.

次に処理ステージＰＳ４において処理が行われる。１６個の選択回路８２の各々は、１６個ずつ設けられたフリップフロップ６３−０〜６３−３の各々から、いずれかのバイトを候補として選択する。これにより、各バイトＢ３２〜Ｂ４７あたり、１つの代入候補が決定される。そして、次にマスク論理８３が、対応するバイトＢ３２〜Ｂ４７について、代入すべき値を固定値とするか、または選択回路８２で選択したデータとするかを決定する。 Next, processing is performed in the processing stage PS4. Each of the 16 selection circuits 82 selects, as the candidate, one of the bytes held in the flip-flops 63-0 to 63-3, of which 16 each are provided. As a result, one assignment candidate is determined for each of the bytes B32 to B47. The mask logic 83 then determines, for the corresponding byte among B32 to B47, whether the value to be assigned is a fixed value or the data selected by the selection circuit 82.
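The two-stage narrowing described above (eight source candidates per slice pair reduced to four in PS3, then to one in PS4) can be modeled in software as follows. This is a behavioral sketch, not the circuit: the function names are hypothetical, and the way the 5-bit selector is split (top bit picks RA or RB, middle two bits pick the slice, low two bits pick the byte within the slice) is an assumption chosen to match the byte numbering B00 to B31; the text says only that the decode result of bits 3 to 7 is used. The mask logic 83 is omitted here.

```python
from typing import List


def ps3_select(ra: bytes, rb: bytes, idx: List[int]) -> List[List[int]]:
    """Stage PS3: each slice picks one candidate per target byte, using only its own bytes."""
    ff = [[0] * 16 for _ in range(4)]          # models flip-flops 63-0..63-3
    for s in range(4):                         # slice S0..S3
        for j in range(16):                    # target bytes B32..B47
            lane = idx[j] & 0b11               # assumed: byte within the slice
            cand_a = ra[4 * s + lane]          # selection circuit 80-sA
            cand_b = rb[4 * s + lane]          # selection circuit 80-sB
            use_rb = (idx[j] >> 4) & 1         # assumed: RA or RB (selection circuit 81-s)
            ff[s][j] = cand_b if use_rb else cand_a
    return ff


def ps4_select(ff: List[List[int]], idx: List[int]) -> bytes:
    """Stage PS4: selection circuit 82 picks one slice's candidate per target byte."""
    return bytes(ff[(idx[j] >> 2) & 0b11][j] for j in range(16))


def shuffle(ra: bytes, rb: bytes, idx: List[int]) -> bytes:
    return ps4_select(ps3_select(ra, rb, idx), idx)


ra = bytes(range(16))        # stands in for B00..B15
rb = bytes(range(16, 32))    # stands in for B16..B31
idx = [(7 * j + 3) % 32 for j in range(16)]
print(shuffle(ra, rb, idx).hex())
```

With this split, the composed result equals a direct lookup of byte `idx[j]` in the concatenation of RA and RB, while only `ps4_select` reads data from more than one slice.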

以上の処理の具体例について、バイトＢ３２の代入候補を決定する場合を例に、以下説明する。この場合は、図９においてｊ＝０である場合に相当する。 A specific example of the above processing will be described below by taking as an example the case of determining a substitution candidate for byte B32. This case corresponds to the case where j = 0 in FIG.

まず、処理ステージＰＳ３における処理が行われる。すなわち、選択回路８０−０Ａ、８０−０Ｂは、デコーダ８４−０における、レジスタＲＤのビット３〜７のデコード結果に応じて、バイトＢ００〜Ｂ０３のいずれか、及びバイトＢ１６〜Ｂ１９のいずれかを、それぞれ選択する。また選択回路８０−１Ａ、８０−１Ｂは、デコーダ８４−１における、レジスタＲＥのビット３〜７のデコード結果に応じて、バイトＢ０４〜Ｂ０７のいずれか、及びバイトＢ２０〜Ｂ２３のいずれかを、それぞれ選択する。更に選択回路８０−２Ａ、８０−２Ｂは、デコーダ８４−２における、レジスタＲＦのビット３〜７のデコード結果に応じて、バイトＢ０８〜Ｂ１１のいずれか、及びバイトＢ２４〜Ｂ２７のいずれかを、それぞれ選択する。そして選択回路８０−３Ａ、８０−３Ｂは、デコーダ８４−３における、レジスタＲＧのビット３〜７のデコード結果に応じて、バイトＢ１２〜Ｂ１５のいずれか、及びバイトＢ２８〜Ｂ３１のいずれかを、それぞれ選択する。 First, processing in the processing stage PS3 is performed. That is, according to the decoding result of bits 3 to 7 of the register RD in the decoder 84-0, the selection circuits 80-0A and 80-0B select one of the bytes B00 to B03 and one of the bytes B16 to B19, respectively. According to the decoding result of bits 3 to 7 of the register RE in the decoder 84-1, the selection circuits 80-1A and 80-1B select one of the bytes B04 to B07 and one of the bytes B20 to B23, respectively. Further, according to the decoding result of bits 3 to 7 of the register RF in the decoder 84-2, the selection circuits 80-2A and 80-2B select one of the bytes B08 to B11 and one of the bytes B24 to B27, respectively. According to the decoding result of bits 3 to 7 of the register RG in the decoder 84-3, the selection circuits 80-3A and 80-3B select one of the bytes B12 to B15 and one of the bytes B28 to B31, respectively.

次に、選択回路８１−０は、デコーダ８４−０における、レジスタＲＤのビット３〜７のデコード結果に応じて、選択回路８０−０Ａで選択されたいずれかのバイトと、選択回路８０−０Ｂで選択されたバイトとのいずれか一方を選択する。そして、選択したバイトをフリップフロップ６３−０に格納する。また選択回路８１−１は、デコーダ８４−１における、レジスタＲＥのビット３〜７のデコード結果に応じて、選択回路８０−１Ａで選択されたいずれかのバイトと、選択回路８０−１Ｂで選択されたバイトとのいずれか一方を選択する。そして、選択したバイトをフリップフロップ６３−１に格納する。更に選択回路８１−２は、デコーダ８４−２における、レジスタＲＦのビット３〜７のデコード結果に応じて、選択回路８０−２Ａで選択されたいずれかのバイトと、選択回路８０−２Ｂで選択されたバイトとのいずれか一方を選択する。そして、選択したバイトをフリップフロップ６３−２に格納する。そして選択回路８１−３は、デコーダ８４−３における、レジスタＲＧのビット３〜７のデコード結果に応じて、選択回路８０−３Ａで選択されたいずれかのバイトと、選択回路８０−３Ｂで選択されたバイトとのいずれか一方を選択する。そして、選択したバイトをフリップフロップ６３−３に格納する。 Next, according to the decoding result of bits 3 to 7 of the register RD in the decoder 84-0, the selection circuit 81-0 selects either the byte selected by the selection circuit 80-0A or the byte selected by the selection circuit 80-0B, and stores the selected byte in the flip-flop 63-0. According to the decoding result of bits 3 to 7 of the register RE in the decoder 84-1, the selection circuit 81-1 selects either the byte selected by the selection circuit 80-1A or the byte selected by the selection circuit 80-1B, and stores the selected byte in the flip-flop 63-1. Further, according to the decoding result of bits 3 to 7 of the register RF in the decoder 84-2, the selection circuit 81-2 selects either the byte selected by the selection circuit 80-2A or the byte selected by the selection circuit 80-2B, and stores the selected byte in the flip-flop 63-2. According to the decoding result of bits 3 to 7 of the register RG in the decoder 84-3, the selection circuit 81-3 selects either the byte selected by the selection circuit 80-3A or the byte selected by the selection circuit 80-3B, and stores the selected byte in the flip-flop 63-3.

次に処理ステージＰＳ４において処理が行われる。バイトＢ３２に対応する選択回路８２は、デコーダ８４−０〜８４−３のいずれかにおける、レジスタＲＤのビット３〜７のデコード結果に応じて、フリップフロップ６３−０〜６３−３のいずれかに保持されるバイトを選択する。 Next, processing is performed in the processing stage PS4. According to the decoding result of bits 3 to 7 of the register RD in one of the decoders 84-0 to 84-3, the selection circuit 82 corresponding to the byte B32 selects the byte held in one of the flip-flops 63-0 to 63-3.

次にマスク論理８３が、デコーダ８４−０〜８４−３のいずれかにおける、レジスタＲＤ〜ＲＧのビット０〜２のデコード結果に応じて、バイトＢ３２に代入すべきデータを固定値にフォースする。すなわち、図９で説明したように、ビット０〜１が“０ｂ１０”であれば、バイトＢ３２には“０ｂ００００００００”が代入され、ビット０〜２が“０ｂ１１０”であれば、バイトＢ３２には“０ｂ１１１１１１１１”が代入され、ビット０〜２が“０ｂ１１１”であれば、バイトＢ３２には“０ｂ１０００００００”が代入される。ビット０〜２が上記以外の値であれば、バイトＢ３２には選択回路８２で選択されたバイトが代入される。 Next, the mask logic 83 forces the data to be assigned to the byte B32 to a fixed value according to the decoding result of bits 0 to 2 of the registers RD to RG in one of the decoders 84-0 to 84-3. That is, as described with reference to FIG. 9, if bits 0 to 1 are “0b10”, “0b00000000” is assigned to the byte B32; if bits 0 to 2 are “0b110”, “0b11111111” is assigned to the byte B32; and if bits 0 to 2 are “0b111”, “0b10000000” is assigned to the byte B32. If bits 0 to 2 have any other value, the byte selected by the selection circuit 82 is assigned to the byte B32.
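The fixed-value rule of the mask logic 83 can be written as a small function. This is a sketch under one stated assumption: bit 0 is taken to be the most significant bit of the 8-bit control field, so that "bits 0 to 2" are the top three bits. The function name `mask_logic` and its parameters are hypothetical.

```python
def mask_logic(ctrl: int, selected: int) -> int:
    """Return the byte assigned to the target byte for an 8-bit control field.

    ctrl     -- control field for this target byte (bit 0 assumed to be the MSB)
    selected -- byte chosen by the selection circuit 82
    """
    top3 = (ctrl >> 5) & 0b111       # bits 0 to 2
    if top3 >> 1 == 0b10:            # bits 0 to 1 == 0b10
        return 0b00000000
    if top3 == 0b110:
        return 0b11111111
    if top3 == 0b111:
        return 0b10000000
    return selected                  # any other pattern: pass the selected byte through


for c in (0x05, 0x80, 0xC0, 0xE0):
    print(f"ctrl={c:#04x} -> {mask_logic(c, 0x55):#04x}")
```

Because the three patterns are mutually exclusive, the order of the checks does not affect the result.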

以上がバイトＢ３２に代入すべきデータの選択方法である。バイトＢ３３に代入すべきデータの選択時は、図９におけるｊ＝１の場合に相当するので、各選択回路及びマスク論理の制御にあたっては、レジスタＲＤ〜ＲＧにおけるビット８〜１５のデコード結果が使用される。またバイトＢ３４の場合はｊ＝２の場合に相当するので、この場合にはビット１６〜２３のデコード結果が使用される。以下同様である。 The above is the method for selecting the data to be assigned to the byte B32. Selecting the data to be assigned to the byte B33 corresponds to the case of j = 1 in FIG. 9, so the decoding results of bits 8 to 15 in the registers RD to RG are used to control each selection circuit and the mask logic. The byte B34 corresponds to the case of j = 2, so the decoding result of bits 16 to 23 is used in that case. The same applies to the remaining bytes.
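The mapping from the target byte index j to its control bits (bits 8j to 8j+7) can be sketched as below. It assumes a 128-bit control register with bit 0 as the most significant bit, which is consistent with the bit ranges quoted above but is not stated in this passage; `control_field` and `select_index` are hypothetical helper names.

```python
def control_field(rc: int, j: int) -> int:
    """Extract the 8-bit control field for target byte B(32+j) from a 128-bit value."""
    if not 0 <= j < 16:
        raise ValueError("j must be in 0..15")
    # With MSB-first numbering, bits 8j..8j+7 form the j-th byte from the left.
    return (rc >> (8 * (15 - j))) & 0xFF


def select_index(field: int) -> int:
    """Bits 3 to 7 of the field: the 5-bit source byte selector."""
    return field & 0x1F


rc = int.from_bytes(bytes(range(0x10, 0x20)), "big")  # example control data
for j in (0, 1, 2):
    f = control_field(rc, j)
    print(j, hex(f), select_index(f))
```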

上記のように、第１の実施形態で説明したＬＳＩは、例えば本実施形態で説明した構成によって実現出来る。本実施形態に係る構成であると、シャッフラー５３は、スライス間でのデータ通信を要する処理と要しない処理とを、処理ステージＰＳ３と処理ステージＰＳ４とに分離している。 As described above, the LSI described in the first embodiment can be realized by, for example, the configuration described in the present embodiment. In the configuration according to the present embodiment, the shuffler 53 separates processing that does not require data communication between slices from processing that does, assigning the former to the processing stage PS3 and the latter to the processing stage PS4.

前述の通り、動作タイミングについて配慮すべき領域は処理ステージ毎である。そして、異なる処理ステージ間では、動作タイミングについての要求は同一処理ステージ内に比して緩やかである。その結果、各処理ステージにおいてスタンダードセルはスライス毎に集まって配置されやすい。 As described above, operation timing must be considered on a per-processing-stage basis. Between different processing stages, the timing requirements are looser than within a single processing stage. As a result, in each processing stage the standard cells tend to be placed together slice by slice.

すると、本実施形態に係る構成であると、処理ステージＰＳ３においてはスライスをまたぐ演算部が存在しない。従って、処理ステージＰＳ３を構成するスタンダードセル群は、スライス毎にまとまって配置されることが可能となる。処理ステージＰＳ４ではスライスをまたぐ演算部（第２演算部６４）が存在するが、この演算部はシャッフラー５３の一部に過ぎない。従って、ＬＳＩ全体として見たときに、スライスをまたいでスタンダードセルが配置される領域は、従来に比べて圧倒的に小さくなる。よって、上記第１の実施形態で説明した効果が得られる。 Then, in the configuration according to the present embodiment, there is no arithmetic unit that crosses slices in the processing stage PS3. Therefore, the standard cell group constituting the processing stage PS3 can be arranged together for each slice. In the processing stage PS4, there is a calculation unit (second calculation unit 64) that crosses the slice, but this calculation unit is only a part of the shuffler 53. Therefore, when viewed as the entire LSI, the area where the standard cells are arranged across the slices is overwhelmingly smaller than in the prior art. Therefore, the effect described in the first embodiment can be obtained.

なおシャッフラー５３は、本実施形態で説明した構成によって、シャッフル演算だけでなくローテート演算やシフト演算も可能である。この場合には、ローテート演算やシフト演算が可能となるよう、つまり図５及び図７が実現出来るよう、レジスタＲＣの制御データをセットすれば良い。 With the configuration described in the present embodiment, the shuffler 53 can perform not only shuffle operations but also rotate and shift operations. In this case, the control data in the register RC need only be set so that the rotate and shift operations are possible, that is, so that the operations of FIGS. 5 and 7 can be realized.
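As an illustration of how control data could turn the shuffle into a rotate or a shift, the sketch below builds byte-granular control patterns and applies them with a reference model of the shuffle. Everything here is an assumption made for illustration: the byte order (byte 0 leftmost), the rotate direction, the restriction to whole-byte amounts, and the helper names; FIGS. 5 and 7 are not reproduced in this passage and may define these operations differently.

```python
def apply_shuffle(ra: bytes, rb: bytes, ctrl: bytes) -> bytes:
    """Reference model: one 8-bit control field per target byte (bit 0 assumed MSB)."""
    src = ra + rb
    out = bytearray()
    for c in ctrl:
        if c >> 6 == 0b10:
            out.append(0x00)            # forced to zero
        elif c >> 5 == 0b110:
            out.append(0xFF)
        elif c >> 5 == 0b111:
            out.append(0x80)
        else:
            out.append(src[c & 0x1F])   # 5-bit selector into RA followed by RB
    return bytes(out)


def rotate_left_control(n: int) -> bytes:
    """Control data that rotates the 16 bytes of RA left by n bytes."""
    return bytes((j + n) % 16 for j in range(16))


def shift_left_control(n: int) -> bytes:
    """Control data that shifts RA left by n bytes, filling with zeros."""
    return bytes(j + n if j + n < 16 else 0x80 for j in range(16))


ra = bytes(range(0xA0, 0xB0))
rb = bytes(16)
print(apply_shuffle(ra, rb, rotate_left_control(3)).hex())
print(apply_shuffle(ra, rb, shift_left_control(3)).hex())
```

The shift pattern reuses the zero-forcing control code for the vacated positions, so no extra hardware path is needed in this model.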

また、本実施形態に係る構成であると、レジスタＲＣを多重化している。より具体的には、レジスタＲＣのデータ（演算内容の制御データ）のコピーを保持するレジスタＲＤ〜ＲＧを、スライスＳ０〜Ｓ３毎に設けている。そして、レジスタＲＤ〜ＲＧにおける制御データに基づいて、それぞれスライスＳ０〜Ｓ３に対応した第１演算部６２−０〜６２−３が制御される。このことも、処理ステージＰＳ３におけるスライス間でのデータ通信の発生を防止することに寄与している。 In the configuration according to the present embodiment, the register RC is multiplexed. More specifically, registers RD to RG that hold a copy of the data in the register RC (control data of calculation contents) are provided for each of the slices S0 to S3. Based on the control data in the registers RD to RG, the first arithmetic units 62-0 to 62-3 corresponding to the slices S0 to S3 are controlled. This also contributes to preventing the occurrence of data communication between slices in the processing stage PS3.

［第３の実施形態］ [Third Embodiment]
次に、この発明の第３の実施形態に係る半導体集積回路装置について説明する。本実施形態は、上記第１、第２の実施形態で説明したプロセッサを用いたプロセッサシステムに関するものである。 Next, a semiconductor integrated circuit device according to the third embodiment of the invention will be described. The present embodiment relates to a processor system using the processor described in the first and second embodiments.

＜プロセッサシステムの全体構成について＞ <Overall configuration of the processor system>
図１５は、本実施形態に係る画像処理用プロセッサシステムのブロック図である。図示するようにプロセッサシステム９０は、ホストＰＣ９１、マルチコアプロセッサ９２、及びローカルメモリ９３を備えている。 FIG. 15 is a block diagram of an image processing processor system according to the present embodiment. As shown in the figure, the processor system 90 includes a host PC 91, a multi-core processor 92, and a local memory 93.

ホストＰＣ９１は、外部から映像データを受信し、マルチコアプロセッサ９２に対して映像データのエンコードまたはデコードを命令する。マルチコアプロセッサ９２は、ホストＰＣ９１から映像データを受け取り、ローカルメモリ９３に一時的に格納すると共に、映像データのエンコード及びデコードを行う。そしてホストＰＣ９１は、マルチコアプロセッサ９２でデコードされた映像データを、外部の映像表示部に表示させる。 The host PC 91 receives video data from the outside and instructs the multi-core processor 92 to encode or decode the video data. The multi-core processor 92 receives the video data from the host PC 91, temporarily stores it in the local memory 93, and encodes and decodes the video data. The host PC 91 then displays the video data decoded by the multi-core processor 92 on an external video display unit.
次に、上記ホストＰＣ９１及びマルチコアプロセッサ９２の構成について説明する。 Next, the configurations of the host PC 91 and the multi-core processor 92 will be described.

＜ホストＰＣ９１について＞ <About the host PC 91>
上記ホストＰＣ９１は、ＣＰＵ９４、ビデオＲＡＭ（以下ＶＲＡＭと呼ぶ）９５、グラフィックプロセッサ（以下ＧＰＵ（Graphic Processing Unit）と呼ぶ）９６、メインメモリ９７、第１接続部９８、及び第２接続部９９を備えている。 The host PC 91 includes a CPU 94, a video RAM (hereinafter referred to as VRAM) 95, a graphic processor (hereinafter referred to as GPU (Graphic Processing Unit)) 96, a main memory 97, a first connection unit 98, and a second connection unit 99.

ＣＰＵ９４は、ホストＰＣ９１全体の処理を司る。メインメモリ９７は、例えばＤＲＡＭ等の半導体メモリであり、ＣＰＵ９４の作業領域として使用される。第２接続部９９は、外部より映像データ及び音声データを受信する。また、ホストＰＣ９１とマルチコアプロセッサ９２との間のデータの授受を司る。外部からのデータは、例えばＵＳＢやシリアルＡＴＡによって、第２接続部９９に転送される。第１接続部９８は、ＣＰＵ９４、メインメモリ９７、第２接続部９９、及びＧＰＵ９６の相互間の接続を司る。ＧＰＵ９６は、第１接続部９８を介して与えられる映像データを、ビデオ出力として映像表示部に表示させる。ＶＲＡＭ９５は、例えばＤＲＡＭ等の半導体メモリであり、ＧＰＵ９６の作業領域として使用される。 The CPU 94 manages the entire processing of the host PC 91. The main memory 97 is a semiconductor memory such as a DRAM, and is used as a work area for the CPU 94. The second connection unit 99 receives video data and audio data from the outside. It also manages data exchange between the host PC 91 and the multi-core processor 92. Data from the outside is transferred to the second connection unit 99 by, for example, USB or serial ATA. The first connection unit 98 governs connection among the CPU 94, the main memory 97, the second connection unit 99, and the GPU 96. The GPU 96 displays the video data provided through the first connection unit 98 on the video display unit as a video output. The VRAM 95 is a semiconductor memory such as a DRAM, for example, and is used as a work area for the GPU 96.

＜マルチコアプロセッサ９２について＞ <About the multi-core processor 92>
次に、上記マルチコアプロセッサ９２の構成について説明する。図１６はマルチコアプロセッサ９２の構成を示すブロック図である。 Next, the configuration of the multi-core processor 92 will be described. FIG. 16 is a block diagram showing the configuration of the multi-core processor 92.

図示するようにマルチコアプロセッサ９２は、第１プロセッサ１００、ＤＭＡ（Direct Memory Access）コントローラ１１０、第１デコーダ１２０、第２デコーダ１３０、複数の第２プロセッサ１４０、ホストインターフェース１６０、及びメモリコントローラ１７０を備えている。これらは、バス１８０によって相互に通信可能に接続されている。 As shown in the figure, the multi-core processor 92 includes a first processor 100, a DMA (Direct Memory Access) controller 110, a first decoder 120, a second decoder 130, a plurality of second processors 140, a host interface 160, and a memory controller 170. These are connected via a bus 180 so that they can communicate with one another.

第１プロセッサ１００は、マルチコアプロセッサ９２全体の動作を制御するメインプロセッサである。オペレーティングシステム（ＯＳ：Operating System）は、主に第１プロセッサ１００によって実行される。ＯＳの一部の機能は、第２プロセッサ１４０等で分担して実行することもできる。 The first processor 100 is a main processor that controls the operation of the entire multi-core processor 92. An operating system (OS) is mainly executed by the first processor 100. Some functions of the OS can be shared and executed by the second processor 140 or the like.

第２プロセッサ１４０の各々は、第１プロセッサ１００の管理の下で各種の処理を実行する。第１プロセッサ１００は、複数の第２プロセッサ１４０に処理を振り分けて並列に実行させるための制御を行う。これにより高速で効率よい処理を実行できる。第２プロセッサ１４０の各々は、大まかには制御部、演算部、及びメモリを備えている。制御部は、第１プロセッサ１００の命令に基づき、演算部に対して必要な演算実行を命令する。メモリは、ＤＭＡコントローラ１１０の命令に基づき、外部からデータを受け取り、または外部へデータを出力する。このデータは、例えばＯＳやアプリケーションプログラム、または映像データ等である。演算部は、制御部の命令に基づき、メモリに格納されたデータを用いて演算を行う。第２プロセッサ１４０の詳細については後述する。 Each of the second processors 140 executes various processes under the management of the first processor 100. The first processor 100 performs control for distributing the processing to the plurality of second processors 140 and executing them in parallel. Thereby, high-speed and efficient processing can be executed. Each of the second processors 140 roughly includes a control unit, a calculation unit, and a memory. The control unit instructs the calculation unit to execute necessary calculation based on the instruction of the first processor 100. The memory receives data from the outside or outputs data to the outside based on a command from the DMA controller 110. This data is, for example, an OS, an application program, or video data. The operation unit performs an operation using data stored in the memory based on an instruction from the control unit. Details of the second processor 140 will be described later.
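As a loose software analogy for the division of labor described above, the sketch below has one coordinating function hand independent data blocks to a pool of workers and collect the results in order. It does not model the hardware: the thread pool, the function names, and the byte-summing workload are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def second_processor(block: bytes) -> int:
    """Stand-in workload for one second processor: sum the bytes of a block."""
    return sum(block)


def first_processor(blocks: List[bytes], n_second: int = 4) -> List[int]:
    """Distribute blocks over n_second workers and gather results in input order."""
    with ThreadPoolExecutor(max_workers=n_second) as pool:
        return list(pool.map(second_processor, blocks))


blocks = [bytes([i] * 8) for i in range(6)]
print(first_processor(blocks))
```

The default of four workers mirrors the example of four second processors 140 given for FIG. 16; as the text notes, that number is not a limitation.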

第１デコーダ１２０は、外部から与えられたＭＰＥＧ（Moving Picture Experts Group）−２形式の映像データをデコードする。第２デコーダ１３０は、外部から与えられたＨ．２６４形式の映像データをデコードする。なお、ＭＰＥＧ−２及びＨ．２６４とは、映像を圧縮して符号化する規格の名称である。 The first decoder 120 decodes externally supplied video data in the MPEG (Moving Picture Experts Group)-2 format. The second decoder 130 decodes externally supplied video data in the H.264 format. MPEG-2 and H.264 are the names of standards for compressing and encoding video.

ＤＭＡコントローラ１１０は、マルチコアプロセッサ９２内におけるバス１８０を介したデータの通信を司る。ホストインターフェース１６０はホストＰＣ９１との間のデータの授受を司り、メモリコントローラ１７０はローカルメモリ９３との間のデータの授受を司る。 The DMA controller 110 manages data communication via the bus 180 in the multi-core processor 92. The host interface 160 is responsible for data exchange with the host PC 91, and the memory controller 170 is responsible for data exchange with the local memory 93.

なお、図１６の構成では、第１プロセッサ１００が１つであり、また第２プロセッサ１４０が４つの場合を例に挙げている。しかし、これらの回路ブロックの個数は制限されない。例えば第１プロセッサ１００が複数ある構成や、第１プロセッサ１００を有しない構成も可能である。第１プロセッサ１００を有しない構成の場合、第１プロセッサ１００の行う処理は、いずれかの第２プロセッサ１４０が担当する。つまり、仮想的な第１プロセッサ１００の役割を第２プロセッサ１４０が兼ねる。 The configuration of FIG. 16 takes as an example the case of one first processor 100 and four second processors 140. However, the number of these circuit blocks is not limited. For example, a configuration with a plurality of first processors 100 or a configuration without the first processor 100 is also possible. In a configuration without the first processor 100, one of the second processors 140 takes charge of the processing that the first processor 100 would perform. That is, the second processor 140 also serves as a virtual first processor 100.

＜第２プロセッサ１４０について＞ <About the second processor 140>
次に、上記第２プロセッサ１４０の構成について説明する。図１７は第２プロセッサ１４０の構成を示すブロック図である。 Next, the configuration of the second processor 140 will be described. FIG. 17 is a block diagram showing the configuration of the second processor 140.

図示するように第２プロセッサ１４０は、チャネルインターフェース１４１、分岐命令実行部１４２、ＤＭＡインターフェース１４３、ローカルメモリ１４４、ロードストアユニット１４５、バッファ制御部１４６、命令バッファ１４７、汎用レジスタ１４８、オペランド供給ネットワーク１４９、浮動小数点演算部１５０、及び整数演算部１５１を備えている。 As shown in the figure, the second processor 140 includes a channel interface 141, a branch instruction execution unit 142, a DMA interface 143, a local memory 144, a load store unit 145, a buffer control unit 146, an instruction buffer 147, a general-purpose register 148, an operand supply network 149, a floating point arithmetic unit 150, and an integer arithmetic unit 151.

チャネルインターフェース１４１は、図示せぬメモリフローコントローラ（Memory Flow Controller）を介して、第１プロセッサ１００から命令を受信する。 The channel interface 141 receives an instruction from the first processor 100 via a memory flow controller (not shown).

ＤＭＡインターフェース１４３は、メモリコントローラを介して、ローカルメモリ９３から映像データやプログラムを受信する。 The DMA interface 143 receives video data and programs from the local memory 93 via the memory controller.

ローカルメモリ１４４は例えばＤＲＡＭやＥＥＰＲＯＭ（Electrically Erasable and Programmable ROM）等の半導体メモリであり、ＤＭＡインターフェース１４３で受信した映像データやプログラムを保持する。また、整数演算部１５１や浮動小数点演算部１５０、または分岐命令実行部１４２における処理結果を保持する。 The local memory 144 is a semiconductor memory such as DRAM or EEPROM (Electrically Erasable and Programmable ROM), and holds video data and programs received by the DMA interface 143. In addition, the processing result in the integer arithmetic unit 151, the floating point arithmetic unit 150, or the branch instruction execution unit 142 is held.

命令バッファ１４７は、バッファ制御部１４６の制御に従って、ローカルメモリ１４４に保持されたプログラムを一時的に保持する。 The instruction buffer 147 temporarily holds a program held in the local memory 144 under the control of the buffer control unit 146.

分岐命令実行部１４２は、チャネルインターフェース１４１で受信した命令に基づき、命令バッファ１４７に読み出されたプログラムにおける分岐命令を実行する。 The branch instruction execution unit 142 executes a branch instruction in the program read to the instruction buffer 147 based on the instruction received by the channel interface 141.

Based on the instruction received by the channel interface 141, the load/store unit 145 controls the loading (reading) and storing (writing) of various data and programs from and to the local memory 144. That is, it loads the necessary data and programs from the local memory 144 and outputs them to the operand supply network 149. It also loads the program in the local memory 144 and causes the instruction buffer 147 to hold it. Further, it stores the operation results of the branch instruction execution unit 142, the integer arithmetic unit 151, or the floating-point arithmetic unit 150 in the local memory 144. The load/store unit 145 corresponds to the load/store unit 30 in FIG. 12 described in the second embodiment.

The operand supply network 149 corresponds to the buses BS-A to BS-C, BS-PS4, BS-PS5, and BS-LS in FIG. 12 described in the second embodiment. Accordingly, the data and programs loaded by the load/store unit 145 in FIG. 17 are supplied to the bus BS-LS in FIG. 12. Likewise, the data and programs transmitted on the buses BS-A to BS-C, BS-PS4, and BS-PS5 in FIG. 12 are stored in the local memory 144 via the load/store unit 145.

The general-purpose register 148 corresponds to the general-purpose register 2 in FIG. 12 described in the second embodiment. That is, the general-purpose register 148 includes registers RA to RG and RT.

The integer arithmetic unit 151 and the floating-point arithmetic unit 150 correspond to the arithmetic units in the processing stages PS3 and PS4 in FIG. 12 described in the second embodiment. The integer arithmetic unit 151 and the floating-point arithmetic unit 150 perform product-sum operations, arithmetic operations, logical operations, or shuffle operations (and/or shift and rotate operations) on integers and floating-point numbers, respectively.

<Operation of multi-core processor 92>
Next, the operation of the multi-core processor 92 will be described, taking as an example the case in which MPEG-2 format video data input from the host PC 91 is converted into the H.264 format. Of course, this conversion is only one example of the processing performed by the computer system 10. FIG. 18 is a flowchart showing the flow of processing in the multi-core processor 92, and FIGS. 19 to 21 are block diagrams of the multi-core processor 92. In FIGS. 19 to 21, white arrows indicate control and black arrows indicate the flow of data.

First, the multi-core processor 92 receives a video data conversion command from the host PC 91. Then, as shown in FIG. 19, in response to the conversion command the first processor 100 in the multi-core processor 92 instructs the DMA controller 110 to read the video data from the host PC 91. In accordance with this instruction, the DMA controller 110 reads the video data (data stream) encoded in the MPEG-2 format via the host interface 160, and stores the read video data in the local memory 93 via the memory controller 170 (step S10 in FIG. 18).

Next, as shown in FIG. 20, the first processor 100 instructs the first decoder 120 to decode the video data read in step S10. The first decoder 120 then reads the video data from the local memory 93 via the memory controller 170, decodes the MPEG-2 format video data, and stores the decoding result in the local memory 93 (step S11).

Next, upon being notified by the first decoder 120 that decoding has finished, the first processor 100 instructs the plurality of second processors 140 to execute an H.264 encoding program, as shown in FIG. 21. The H.264 encoding program is a program for compressing and encoding video in the H.264 format. The H.264 encoding program may be held by the second processors 140 themselves, or may be transferred from the local memory 93 to the second processors 140. In accordance with the H.264 encoding program, each of the second processors 140 reads the MPEG-2 decoding result from the local memory 93 via the memory controller 170 (step S12). This decoding result is the video data obtained by the first decoder 120 in step S11.

Then, in accordance with the H.264 encoding program, each second processor 140 encodes the video it has read into the H.264 format and stores the encoding result in the local memory 93 (step S13).

When the encoding process is finished, the second processors 140 notify the first processor 100 of the end of encoding, in accordance with the H.264 encoding program. Upon receiving this notification, the first processor 100 instructs the DMA controller 110 to transfer data. The DMA controller 110 then transfers the H.264 format encoding result held in the local memory 93 to the host PC 91.
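The sequence of steps S10 to S13 and the final transfer can be pictured with the following minimal Python sketch. It models only the ordering of the steps as sequenced by the first processor 100. The class and function names, the buffer keys, the placeholder "decode" and "encode" bodies, and the round-robin division of frames among four second processors are all assumptions made for illustration; the specification does not state how the work is divided.

```python
from dataclasses import dataclass, field


@dataclass
class LocalMemory:
    """Stands in for local memory 93; the buffer keys are hypothetical."""
    buffers: dict = field(default_factory=dict)


def dma_read_from_host(host_stream, mem):
    # Step S10: DMA controller 110 copies the MPEG-2 stream into local memory 93.
    mem.buffers["mpeg2_stream"] = list(host_stream)


def first_decoder_decode(mem):
    # Step S11: first decoder 120 decodes the stream and stores the result.
    # Placeholder decode: each stream element becomes one frame record.
    mem.buffers["decoded"] = [("frame", x) for x in mem.buffers["mpeg2_stream"]]


def second_processors_encode(mem, num_second_processors=4):
    # Steps S12 and S13: each second processor 140 reads decoded frames,
    # encodes them, and stores its result back in local memory 93.
    # The round-robin split is an assumption, not taken from the source.
    frames = mem.buffers["decoded"]
    encoded = [None] * len(frames)
    for p in range(num_second_processors):
        for idx in range(p, len(frames), num_second_processors):
            encoded[idx] = ("h264", frames[idx][1])  # placeholder encode
    mem.buffers["h264"] = encoded


def dma_write_to_host(mem):
    # Final transfer: DMA controller 110 returns the H.264 result to host PC 91.
    return list(mem.buffers["h264"])


def transcode(host_stream):
    # Sequencing performed by first processor 100: each step starts only
    # after the previous unit has reported completion.
    mem = LocalMemory()
    dma_read_from_host(host_stream, mem)
    first_decoder_decode(mem)
    second_processors_encode(mem)
    return dma_write_to_host(mem)
```

In this sketch every intermediate result passes through the shared memory object, mirroring the way each unit in the description reads from and writes to the local memory 93 rather than handing data to the next unit directly.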

Note that the video stream obtained in step S11 may be transferred not only to the local memory 93 but also to the host PC 91. This makes it possible to convert the video from the MPEG-2 format to the H.264 format while playing it back on a video display unit.

As described above, the processor 1 described in the first and second embodiments can be used as part of a processor system. In image processing in particular, the load on the processors is very heavy, and the operating speed of each processor is important. Therefore, using the processor 1 described in the first and second embodiments in a processor system built around a multi-core processor can greatly improve the speed of image processing.

As described above, the arithmetic unit according to the first to third embodiments of the present invention is an arithmetic unit 10, 53 that executes, in a plurality of processing cycles, a SIMD instruction in which a plurality of bits form one data unit (slices S0 to S3). It includes a plurality of first operation units 11-0 to 11-3, 62-0 to 62-3, each of which performs a first operation on its own slice without moving bits between slices, and a second operation unit 12, 64 that performs a second operation involving the movement of bits between slices. The SIMD instruction is executed by the first operation and the second operation together, and the first operation and the second operation are executed with a latency of one or more processing cycles between them.
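The split into a per-slice first operation and a cross-slice second operation can be illustrated with a byte-wise rotate over four 32-bit slices. The Python sketch below is one possible decomposition chosen for illustration, not a description of the circuits in the embodiments: the function names are hypothetical, and byte 0 of slice S0 is assumed to be the leftmost byte. A rotate by n bytes is written as n = 4q + r; the first operation rotates the bytes inside each slice by r, and the second operation then picks, for each byte lane, the slice that lies q or q + 1 positions away.

```python
SLICES = 4           # slices S0 to S3
BYTES_PER_SLICE = 4  # each slice is 32 bits wide


def first_operation(slice_bytes, r):
    """Rotate the bytes of one slice left by r (0 <= r < 4).

    Every byte stays inside its own slice, so no data crosses slices.
    """
    return slice_bytes[r:] + slice_bytes[:r]


def second_operation(stage1, q, r):
    """Cross-slice step.

    Byte lane j of output slice i takes lane j of slice i+q, or of slice
    i+q+1 when that lane wrapped around during the first operation.
    """
    out = []
    for i in range(SLICES):
        lanes = []
        for j in range(BYTES_PER_SLICE):
            carry = 1 if j + r >= BYTES_PER_SLICE else 0
            lanes.append(stage1[(i + q + carry) % SLICES][j])
        out.append(lanes)
    return out


def rotate_left_bytes(slices, n):
    """Rotate the 16 bytes held in four slices left by n byte positions."""
    q, r = divmod(n % (SLICES * BYTES_PER_SLICE), BYTES_PER_SLICE)
    stage1 = [first_operation(s, r) for s in slices]  # one processing cycle
    return second_operation(stage1, q, r)             # a later processing cycle
```

With this split, the second operation only selects whole byte lanes between slices, while all movement inside a slice is already finished in the first operation.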

The semiconductor integrated circuit device according to the first to third embodiments of the present invention is a processor 1 that performs a pipeline operation using a plurality of processing stages, and includes an arithmetic unit 10, 53 that executes, using a plurality of the processing stages, a SIMD instruction in which a plurality of bits form one data unit (slices S0 to S3). In the i-th processing stage (i is a natural number equal to or greater than 1; PS1 in FIG. 1 or PS3 in FIG. 12), the arithmetic unit 10, 53 performs a first operation on a per-slice basis without moving bits between slices, and in the (i+j)-th processing stage (j is a natural number equal to or greater than 1; PS2 in FIG. 1 or PS4 in FIG. 12), it performs a second operation that involves moving bits between slices.

The processor 1 further includes a plurality of registers RD to RG, provided one per slice, that hold control data for controlling the operation performed by the arithmetic unit 53. In the first operation, the control data held in each of the registers RD to RG is used for the first operation of the corresponding slice.

The first and second embodiments above described the case in which the arithmetic unit performs an operation that requires no data communication between slices (the first operation) in the initial processing stage, and then performs an operation that requires data communication between slices (the second operation) in the following processing stage. However, this order may be reversed: the second operation may be performed first, followed by the first operation. The embodiments above also described the case in which these two operations are performed in consecutive processing stages, that is, the case of latency = 1. However, the latency may be 2 or more. For example, in the shuffler 53 of FIG. 12, the second operation in the second operation unit 64 may be performed in processing stage PS5 or later rather than in processing stage PS4. In other words, as long as the first operation and the second operation are performed in different processing stages, the order of the operations and the like are not particularly limited and can be chosen as appropriate.
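The effect of placing the two operations in different processing stages, with a latency of one or more cycles between them, can be modelled with a small cycle-by-cycle simulation. The Python sketch below is illustrative only: `run_pipeline` and its parameters are hypothetical names, one operand is assumed to be issued per cycle, and intermediate values are assumed never to be `None`, since `None` marks an empty pipeline register here.

```python
from collections import deque


def run_pipeline(operands, first_op, second_op, latency=1):
    """Issue one operand per cycle through a two-operation pipeline.

    The operation passed as first_op runs in the issue cycle; the one
    passed as second_op runs `latency` cycles later.
    Returns (cycle, result) pairs in completion order.
    """
    if latency < 1:
        raise ValueError("latency must be at least one processing cycle")
    # Pipeline registers between the two operations; None means empty.
    in_flight = deque([None] * latency)
    pending = list(operands)
    results = []
    cycle = 0
    while pending or any(v is not None for v in in_flight):
        # The later operation consumes the value produced `latency` cycles ago.
        oldest = in_flight.popleft()
        if oldest is not None:
            results.append((cycle, second_op(oldest)))
        # The earlier operation works on the operand issued in this cycle.
        in_flight.append(first_op(pending.pop(0)) if pending else None)
        cycle += 1
    return results
```

For example, `run_pipeline([1, 2, 3], lambda x: x + 1, lambda x: x * 10, latency=2)` completes its three results in cycles 2, 3, and 4: raising the latency delays each result but still lets one instruction finish per cycle. The model does not care which of the two callables is the cross-slice operation, which matches the statement that the order may be chosen as appropriate.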

The first and second embodiments above gave byte-unit shuffle, rotate, and shift operations as examples of operations that require data communication between slices. However, these operations are not limited to byte units and may, for example, be performed in bit units. Nor are the embodiments limited to these operations: they can be applied in general to any operation that is handled as a single instruction in a program and requires data communication between slices.
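As an illustration of the bit-unit case, the Python sketch below performs a logical left shift of a 128-bit value held as four 32-bit slices, again in two steps. It is a sketch under stated assumptions rather than the circuit of the embodiments: the function names are hypothetical, `words[0]` is assumed to be the most significant slice, and the zero fill for slices shifted in from outside the value is one simple way to handle vacated positions. The shift amount is written as n = 32q + r; each slice is first rotated by r bits on its own, and the cross-slice step then merges the upper bits of one rotated slice with the wrapped lower bits of its neighbour.

```python
MASK32 = 0xFFFFFFFF
SLICES = 4  # slices S0 to S3, words[0] assumed most significant


def rotl32(word, r):
    """Rotate one 32-bit slice left by r bits (0 <= r < 32).

    No bit leaves the slice in this step.
    """
    return ((word << r) | (word >> (32 - r))) & MASK32


def shift_left_bits(words, n):
    """Logical left shift by n bits of a 128-bit value held as four slices."""
    if not 0 <= n < 32 * SLICES:
        raise ValueError("shift amount out of range")
    q, r = divmod(n, 32)
    low_mask = (1 << r) - 1  # bits that wrapped around inside a slice
    stage1 = [rotl32(w, r) for w in words]  # per-slice step
    out = []
    for i in range(SLICES):  # cross-slice step
        hi_src, lo_src = i + q, i + q + 1
        # Upper 32 - r bits come from slice i + q, if it exists.
        hi = (stage1[hi_src] & ~low_mask & MASK32) if hi_src < SLICES else 0
        # Lower r bits come from the wrapped part of slice i + q + 1.
        lo = (stage1[lo_src] & low_mask) if lo_src < SLICES else 0
        out.append(hi | lo)
    return out
```

Replacing the zero fill with a modulo on the slice index would turn the same two steps into a bit-unit rotate.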

Note that the present invention is not limited to the embodiments described above, and can be modified in various ways at the implementation stage without departing from its gist. Furthermore, the embodiments above include inventions at various stages, and various inventions can be extracted by appropriately combining the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiments, a configuration from which those constituent elements have been deleted can be extracted as an invention, provided that the problem described in the section on the problem to be solved by the invention can still be solved and the effects described in the section on the effects of the invention can still be obtained.

FIG. 1 is a block diagram of a processor according to the first embodiment of the present invention.
FIG. 2 is a timing chart showing the flow of operations of the processor according to the first embodiment of the present invention.
FIG. 3 is a schematic diagram showing the configuration of a register according to the first embodiment of the present invention.
FIG. 4 is a schematic diagram showing the configuration of a register according to the first embodiment of the present invention.
FIG. 5 is a diagram showing details of a shift instruction.
FIG. 6 is a conceptual diagram of an arithmetic unit that executes the shift instruction.
FIG. 7 is a diagram showing details of a rotate instruction.
FIG. 8 is a conceptual diagram of an arithmetic unit that executes the rotate instruction.
FIG. 9 is a diagram showing details of a shuffle instruction.
FIG. 10 is a conceptual diagram of an arithmetic unit that executes the shuffle instruction.
FIG. 11 is a schematic diagram showing the layout of an LSI.
FIG. 12 is a block diagram of a processor according to the second embodiment of the present invention.
FIG. 13 is a block diagram of a general-purpose register according to the second embodiment of the present invention.
FIG. 14 is a block diagram of a shuffler according to the second embodiment of the present invention.
FIG. 15 is a block diagram of a processor system according to the third embodiment of the present invention.
FIG. 16 is a block diagram of a multi-core processor according to the third embodiment of the present invention.
FIG. 17 is a block diagram of a processor according to the third embodiment of the present invention.
FIG. 18 is a flowchart showing the flow of processing in the multi-core processor according to the third embodiment of the present invention.
FIG. 19 is a block diagram of the multi-core processor according to the third embodiment of the present invention.
FIG. 20 is a block diagram of the multi-core processor according to the third embodiment of the present invention.
FIG. 21 is a block diagram of the multi-core processor according to the third embodiment of the present invention.

Explanation of symbols

1 ... processor; 2, 148 ... general-purpose register; 10 ... first arithmetic unit; 11-0 to 11-3, 21-0 to 21-3, 62-0 to 62-3 ... first operation unit; 12, 64 ... second operation unit; 13, 15, 17, 84-0 to 84-3 ... decoder; 14, 16, 18, 31 to 41, 45 to 48, 68, 69, 70, 80-0A to 80-3A, 80-0B to 80-3B, 81-0 to 81-3, 82 ... multiplexer; 19, 83 ... mask logic; 20 ... second arithmetic unit; 30, 145 ... load/store unit; 42 to 44, 55, 57, 59, 61, 63-0 to 63-3, 65 to 67, 72 to 75, 85 ... flip-flop; 50 ... product-sum operation unit; 51 ... arithmetic operation unit; 52 ... logic operation unit; 53 ... shuffler; 54 ... multiplier; 56 ... adder; 58, 60 ... arithmetic unit; 71 ... memory; 90 ... processor system; 91 ... host PC; 92 ... multi-core processor; 93, 144 ... local memory; 94 ... CPU; 95 ... VRAM; 96 ... GPU; 97 ... main memory; 98 ... first connection unit; 99 ... second connection unit; 100 ... first processor; 110 ... DMA controller; 120 ... first decoder; 130 ... second decoder; 140 ... second processor; 160 ... host interface; 170 ... memory controller; 141 ... channel interface; 142 ... branch instruction execution unit; 143 ... DMA interface; 146 ... buffer control unit; 147 ... instruction buffer; 149 ... operand supply network; 150 ... floating-point arithmetic unit; 151 ... integer arithmetic unit

Claims

An arithmetic unit that executes, in a plurality of processing cycles, a SIMD instruction in which a plurality of bits form one data unit, the arithmetic unit comprising:
a plurality of first operation units, each of which performs a first operation on a respective one of the data units without moving bits between the data units; and
a second operation unit that performs a second operation involving the movement of bits between the data units,
wherein the SIMD instruction is executed by the first operation and the second operation, and the first operation and the second operation are executed with a latency of one or more processing cycles between them.

The arithmetic unit according to claim 1, wherein the SIMD instruction is at least one of a shuffle instruction, a rotate instruction, and a shift instruction.

A semiconductor integrated circuit device that performs a pipeline operation using a plurality of processing stages, the device comprising:
an arithmetic unit that executes, using a plurality of the processing stages, a SIMD instruction in which a plurality of bits form one data unit,
wherein the arithmetic unit performs, in an i-th processing stage (i is a natural number equal to or greater than 1), a first operation on each of the data units without moving bits between the data units, and
performs, in an (i+j)-th processing stage (j is a natural number equal to or greater than 1), a second operation involving the movement of bits between the data units.

The semiconductor integrated circuit device according to claim 3, further comprising a plurality of registers that are provided one per data unit and hold control data for controlling the operation performed by the arithmetic unit,
wherein, in the first operation, the control data held in each of the registers is used for the first operation of the corresponding data unit.

The semiconductor integrated circuit device according to claim 4, wherein the SIMD instruction executed by the arithmetic unit is at least one of a shuffle instruction, a rotate instruction, and a shift instruction.