JP3146707B2

JP3146707B2 - Computer with parallel operation function

Info

Publication number: JP3146707B2
Application number: JP34792992A
Authority: JP
Inventors: 多加志堀田; 康弘中塚; 成弥田中; 弘道山田; 英雄前島
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1992-01-06
Filing date: 1992-12-28
Publication date: 2001-03-19
Anticipated expiration: 2016-03-19
Also published as: JPH05257687A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a computer that executes parallel operations, and more particularly to a computer having a parallel operation function that executes a mix of superscalar and VLIW processing.

[0002]

[Prior Art] Computer architecture advances year by year, supported by progress in semiconductor technology and related fields. In the 1980s, RISC (Reduced Instruction Set Computer) machines, which execute simple instructions in one cycle, appeared in place of CISC (Complex Instruction Set Computer) machines, which process complex instructions over multiple cycles using microinstructions.

[0003] Furthermore, the superscalar scheme and the VLIW (Very Long Instruction Word) scheme have been proposed as techniques for speeding up computation.

[0004] The superscalar scheme checks for conflicts between instructions in hardware at execution time and, if there are none, executes multiple instructions in one machine cycle. It is described in Japanese Patent Application No. 63-283673 (Prior Art 1) and in J. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach", Morgan Kaufmann Publishers, Inc., 1990, p. 318 (Prior Art 2).

[0005] The VLIW scheme uses long instructions that contain fields controlling the operation of multiple arithmetic units. Whereas the instruction length of an ordinary RISC processor is 32 bits, VLIW instructions are 64, 128, 256, or more bits long. This scheme is also explained in the above-mentioned document by J. Hennessy and D. A. Patterson (Prior Art 1).

[0006] As an improvement on the VLIW scheme, a technique that reduces code size by mixing one-word and three-word instructions and processing them in the VLIW manner is described in Robert Cohn et al., "Architecture and Compiler Tradeoffs for a Long Instruction Word Microprocessor", Third International Conference on Architectural Support for Programming Languages and Operating Systems, 1989, pp. 2-14 (Prior Art 3).

[0007]

[Problems to Be Solved by the Invention] The characteristics of the superscalar and VLIW schemes are described below.

[0008] An advantage of the superscalar scheme is that code size can be kept small, because it uses short instructions that each specify a single operation, and only valid operations are specified.

[0009] A further advantage is that compatibility with earlier models is maintained, because no instructions need to be added.

[0010] In contrast, the first problem of the superscalar scheme is that conflicts among the operations to be executed in parallel must be detected. The more operations are executed in parallel, the more hardware is required for conflict detection.
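The growth in detection hardware can be illustrated with a minimal Python sketch. The patent does not say which kinds of conflict are checked; the sketch assumes register read-after-write and write-after-write conflicts, and the names `has_conflict` and `comparisons_needed` are hypothetical.

```python
from itertools import combinations


def has_conflict(first, second):
    """Return True if `second` cannot issue in the same cycle as `first`.

    Each instruction is modelled as (target_register, source_registers).
    Assumption: only read-after-write and write-after-write are checked.
    """
    target1, _ = first
    target2, sources2 = second
    return target1 in sources2 or target1 == target2


def comparisons_needed(width):
    """Number of instruction pairs to compare when `width` instructions issue together."""
    return len(list(combinations(range(width), 2)))
```

Under this model a two-way machine compares one pair per cycle, while a four-way machine compares six, which is one way to read the statement that the hardware grows with the degree of parallelism.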

[0011] The second problem is that detecting conflicts, and making instructions wait, between instructions executed before the current cycle and instructions executed in the current cycle is complicated. The more operations are executed in parallel, the more instructions can conflict with the instructions of the current cycle, and the more complex the hardware for this conflict detection and waiting becomes.

[0012] The third problem is that, because the instructions are short, the number of registers an instruction can specify is small; 16 to 32 registers is typical. As shown on p. 325 of the document by J. Hennessy and D. A. Patterson, registers run short when loop unrolling or software pipelining is used as a software technique to increase the number of operations that can be executed in parallel. Put another way, optimization is possible only within the limits of the registers that exist.

[0013] As a remedy, pages E-21 to E-22 of the Prior Art 1 document state that the shortage of registers can be alleviated by not reflecting an operation result immediately in the next instruction.

[0014] In addition, David Callahan et al., "Software Prefetching", Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 40-52, describes prefetching data from main memory into cache memory by means of an instruction in a superscalar machine.

[0015] From the above, the superscalar scheme has the problem that, when the degree of parallelism of instruction execution is increased, the machine cycle rate cannot be raised because of the complexity of conflict detection described in the first and second problems, and processing speed therefore does not improve.

[0016] Next, the first advantage of the VLIW scheme is that the instructions are long, so multiple operations can be specified in one instruction, and because there are no conflicts among the operations within an instruction, the hardware need not detect conflicts among operations executed in parallel at run time.

[0017] The second advantage is that, because the instructions are long, many registers can be specified. The first problem of the VLIW scheme, on the other hand, is that a valid operation cannot always be specified in every field, because conflicts among the operations within an instruction must be avoided as described above, and code size therefore grows.

[0018] The second problem is that conflict detection and waiting between instructions executed before the current cycle and instructions executed in the current cycle is complicated. This is the same as the second problem of the superscalar scheme.

[0019] In this regard, a technique in which the hardware performs no conflict detection and the compiler avoids conflicts in advance is described in Andrew Wolf and John P. Shen, "A Variable Instruction Stream Extension to the VLIW Architecture", Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 2-14. The third problem of the VLIW scheme is that compatibility with earlier models cannot be maintained. This is because the superscalar scheme executes conventional one-word instructions in parallel in hardware, whereas the VLIW scheme requires the instructions to be redefined.

[0020] As described so far, no computer has existed that exploits the advantages of the superscalar and VLIW schemes while compensating for the disadvantages of both.

[0021] The first object of the present invention is to provide a computer that can execute operations using a mix of the superscalar and VLIW schemes. The aim is to improve processing speed while maintaining upward compatibility with computers of a conventional architecture, which consists of short instructions that each specify a single operation.

[0022]

[Means for Solving the Problems] To achieve the above object, the present invention first provides the following. A computer with a parallel operation function has registers, a memory, and a program counter, reads the instruction stored in the memory at the location indicated by the program counter, and executes the operation specified by that instruction on the registers, the memory, and the program counter. Each instruction is either a short instruction that specifies a single operation or a long instruction that specifies multiple operations. The computer comprises instruction word length determining means for determining whether the instruction indicated by the program counter is a short instruction or a long instruction, and instruction selecting means that sets the instruction in the above register if the instruction word length determining means finds it to be a long instruction, and sets the instruction in a predetermined register if it is found to be a short instruction.

[0023] According to a second feature of the present invention, a computer with a parallel operation function has registers, a memory, and a program counter, reads the instruction stored in the memory at the location indicated by the program counter, and executes the operation specified by that instruction on the registers, the memory, and the program counter. Each instruction is either a short instruction that specifies a single operation or a long instruction that specifies multiple operations. The computer comprises instruction word length determining means for determining whether the instruction indicated by the program counter is a short instruction or a long instruction; conflict detecting means for detecting conflicts between short instructions; and instruction selecting means that sets the instruction in the above register when the instruction word length determining means determines it to be a short instruction, and sets the instruction in a predetermined register when it is determined to be a short instruction and the conflict detecting means determines that there is no conflict.

[0024] According to a third feature of the present invention, a computer with a parallel operation function has registers, a memory, and a program counter, reads the instruction stored in the memory at the location indicated by the program counter, and executes the operation specified by that instruction on the registers, the memory, and the program counter. Each instruction is either a short instruction that specifies a single operation or a long instruction that specifies multiple operations. The computer comprises instruction word length determining means for determining whether the instruction indicated by the program counter is a short instruction or a long instruction; conflict detecting means that detects conflicts between short instructions when the instruction word length determining means determines the instruction to be a short instruction; and arithmetic means that executes a predetermined number of short instructions in one machine cycle, according to the result of the conflict detecting means, when the instruction is determined to be a short instruction, and executes a predetermined number of long instructions in one machine cycle when the instruction is determined to be a long instruction.

[0025]

[Operation] According to the present invention, either a plurality of short instructions that each specify a single operation, or one long instruction that specifies multiple operations, can be executed in one machine cycle, so operations are processed in parallel and performance is improved.

[0026] According to one aspect of the present invention, long instructions are used only when many operations can be executed in parallel, which reduces the number of invalid fields in long instructions and thus reduces code size. This raises the utilization efficiency of main memory and cache memory and improves processing speed.
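The code-size argument can be made concrete with a rough Python sketch. It uses the 4-byte and 16-byte instruction lengths of the embodiment described later; the break-even threshold of four operations, the function names, and the sample workload are assumptions made for illustration, and the padding instructions required when switching between instruction lengths are ignored.

```python
SHORT_BYTES = 4    # short instruction: one operation
LONG_BYTES = 16    # long instruction: several operations


def vliw_only_size(ops_per_group):
    """Code size when every group of parallel operations uses a long instruction."""
    return LONG_BYTES * len(ops_per_group)


def mixed_size(ops_per_group, threshold=4):
    """Code size when long instructions are used only for groups with
    at least `threshold` parallel operations (hypothetical policy)."""
    total = 0
    for ops in ops_per_group:
        if ops >= threshold:
            total += LONG_BYTES
        else:
            total += SHORT_BYTES * ops
    return total
```

For a hypothetical program whose groups contain 1, 7, 2, 6, and 1 parallel operations, the long-only encoding takes 80 bytes and the mixed encoding takes 48 bytes under these assumptions.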

[0027] According to another aspect of the present invention, conflicts cannot occur among the multiple operations specified in a long instruction, so the hardware need not detect them. The hardware need only detect conflicts between short instructions executed in the same cycle. By making the number of short instructions executed in one machine cycle smaller than the number of operations specified in a long instruction, conflict detection among operations executed in parallel can be kept simple even though the average number of operations executed per machine cycle is high.

[0028] According to another aspect of the present invention, the compiler can generate the instruction sequence so that a long instruction does not conflict with preceding long instructions, and the hardware need not detect such conflicts.

[0029] According to another aspect of the present invention, when a valid long instruction is executed after a valid short instruction, or a valid short instruction is executed after a valid long instruction, conflicts between the two can be resolved in software by inserting as many invalid instructions as necessary between them, so the hardware need not detect conflicts between the two. All the hardware must detect are conflicts between short instructions executed before the current cycle and short instructions executed in the current cycle. Therefore, by making the number of short instructions executed in one machine cycle smaller than the number of operations specified in a long instruction, hardware conflict detection between instructions executed before the current cycle and instructions executed in the current cycle can be kept simple even though the average number of operations executed per machine cycle is high.

[0030] According to another aspect of the present invention, the result of an operation specified by an instruction is not reflected immediately in the next instruction, but only from the instruction a fixed number of instructions later. Instructions executed after that instruction and before its result is reflected can therefore read the register value from before the write. This effectively increases the number of registers available to software and allows software optimization to raise the degree of parallelism of operations.

[0031] According to another aspect of the present invention, the hardware for conflict detection is simplified, the machine cycle rate is raised, and processing speed can be increased.

[0032] According to another aspect of the present invention, a new architecture can be formed by adding long instructions that specify multiple operations to the short, single-operation instructions of a conventional architecture. The new architecture thus includes the conventional architecture, and upward compatibility is maintained.

[0033]

[Embodiment] A preferred embodiment of the present invention is described next. Details unrelated to the essence of the invention are omitted.

[0034] FIG. 1 shows the overall block diagram. Reference numeral 1200 denotes a memory, 1300 an instruction cache, 1303 an instruction control unit, 160 an arithmetic unit, 150 instruction length determining means, and 109 conflict detecting means. The instruction control unit 1303 reads instructions from the instruction cache through interface 170, decodes them, and controls the arithmetic unit 160 through interface 180. The arithmetic unit 160 can process multiple operations in parallel. This computer has 4-byte instructions that specify a single operation and 16-byte instructions that specify multiple operations. In the instruction cache 1300, 16-byte and 4-byte instructions are stored intermixed in such a way that there are no conflicts between 16-byte instructions or between a 16-byte instruction and a 4-byte instruction. The conflict detecting means 109 detects conflicts only between 4-byte instructions. The instruction control unit 1303 includes the instruction length determining means 150. When a 16-byte instruction is executed, the output of the conflict detecting means 109 is ignored. When 4-byte instructions are executed, selector 110 selects the operations that can be executed in parallel according to the output of the conflict detecting means 109; these are decoded, and the arithmetic unit 160 is controlled through interface 180. Although the case of two arithmetic units is shown here, there may be more.

[0035] Below, the register configuration and instruction formats are described first, then the pipeline and operation timing, and finally the details of the overall block diagram of FIG. 1.

[0036] FIG. 2 shows the register configuration. FR0 to FR31 are 64-bit floating-point registers, and R0 to R31 are 32-bit integer registers. For simplicity, all floating-point data is assumed to be double precision and 64 bits long, and addresses are assumed to be assigned in units of 32 bits.

[0037] In this embodiment, short instructions are one word long and long instructions are four words long.

[0038] FIG. 3 shows the instruction formats. One word is 32 bits. Basic instructions, branch instructions, and load/store instructions are one-word instructions; compound instructions are four-word instructions. All basic instructions are register-to-register operations. In this embodiment the long instructions are four words long, but in other embodiments they may be longer or shorter.

[0039] In this embodiment, for simplicity, a four-word instruction is assumed always to occupy four words aligned on a four-word boundary, but this assumption is easy to remove.

[0040] The basic instructions are described first. The OP field indicates the operation code, the S1 and S2 fields indicate the numbers of the two source registers, the T field indicates the number of the target register, and the CC field indicates how flags are set. That is, the operation indicated by OP is applied to the registers indicated by S1 and S2, and the result is written to the register indicated by T. Details are shown in FIG. 4.

[0041] The branch instruction is described next. The d field is the displacement. A branch instruction adds the value of d to the program counter PC.

[0042] The load/store instructions are described next. The F field indicates whether the data to be loaded or stored is floating-point data or integer data.

[0043] As shown in FIG. 4, the SIZE field indicates the length in words of the data to be loaded or stored. Only one word is defined for integers, and 2 to 16 words are defined for floating point. As shown in FIG. 4, the FST instruction writes FR(S1) to address R(S2). When SIZE is 16 words, FR(S1) through FR(S1+7) are written to the 16 consecutive words starting at address R(S2). The FLD instruction writes the data at address R(S1) + R(S2) to FR(T). When SIZE is 16 words, the 16 consecutive words starting at address R(S1) + R(S2) are written to FR(T) through FR(T+7).
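The multi-word FLD and FST behaviour can be modelled with a short Python sketch. It assumes a word-addressed memory held in a list and represents each 64-bit FR register as a pair of 32-bit words; the order of the two words within a register is not stated in the text and is an assumption, as are the function signatures.

```python
def fld(mem, R, FR, s1, s2, t, size_words):
    """Load `size_words` words starting at address R[s1] + R[s2]
    into FR[t], FR[t+1], ... (two words per 64-bit register)."""
    base = R[s1] + R[s2]
    for k in range(size_words // 2):
        FR[t + k] = (mem[base + 2 * k], mem[base + 2 * k + 1])


def fst(mem, R, FR, s1, s2, size_words):
    """Store FR[s1], FR[s1+1], ... to `size_words` consecutive words
    starting at address R[s2]."""
    base = R[s2]
    for k in range(size_words // 2):
        mem[base + 2 * k], mem[base + 2 * k + 1] = FR[s1 + k]
```

With SIZE equal to 16 words, each call moves eight registers, which matches the ranges FR(T) to FR(T+7) and FR(S1) to FR(S1+7) given above.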

[0044] Next, the four-word compound instruction is described with reference to FIGS. 3 and 5. This instruction can specify a total of seven operations: a load/store operation given by the I1, I2, IT, SIZE, and F fields; an integer operation given by the J1, J2, JT, and J fields; a first floating-point operation given by the M1, M2, and MT fields; a second floating-point operation given by the A1, A2, AT, and A fields; a third floating-point operation given by the N1, N2, and NT fields; a fourth floating-point operation given by the B1, B2, BT, and B fields; and flow control given by the CC, d, and N fields. Details of each field are shown in FIG. 5. The first and third floating-point operations are multiplications, and the second and fourth are additions or subtractions. The N field indicates the number of invalid cycles to be inserted following this instruction. Its use is described later.

[0045] The integer operation is described next. When the J field is not 1111, an ordinary operation is performed as shown in FIG. 5. When the J field is 1111, however, data is prefetched from the memory in which it is stored into the cache memory. That is, the cache memory is accessed with R(J1) + R(J2) as the address, and if the corresponding data is not present, it is transferred from memory to the cache.
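The prefetch case can be sketched in Python as below. The cache is modelled as a dictionary keyed by word address, which ignores cache lines and replacement; that simplification, and the function name and return value, are assumptions rather than details from the patent.

```python
def prefetch(R, j1, j2, cache, memory):
    """Model of the J = 1111 case: access the cache at R[j1] + R[j2] and,
    on a miss, copy the data from memory into the cache.

    Returns True if a transfer from memory took place.
    """
    addr = R[j1] + R[j2]
    if addr in cache:
        return False  # data already cached, nothing to transfer
    cache[addr] = memory[addr]
    return True
```

No register is written in this model, consistent with the description that the operation only brings data into the cache.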

[0046] In a one-word instruction, the operation result is reflected immediately in the next instruction, whereas in a four-word instruction the operation result is first reflected in the third following instruction. A pipeline configuration that makes good use of this specification, and a program example, are described below.

[0047] As shown in FIG. 6, the pipeline has five stages: IF, D, E, F, and S. The IF stage reads the instruction, the D stage decodes it, the E stage reads registers and performs part of the operation, the F stage performs the operation, and the S stage performs the remainder of the operation and writes the result to the register. The pipeline configuration is the same for integer and floating-point operations.

[0048] FIG. 7 shows the processing flow of one-word instructions in this embodiment. This is a superscalar scheme in which two instructions are processed per machine cycle. Instructions 1 and 2, 3 and 4, 5 and 6, and 7 and 8 are each processed in parallel unless there is a conflict. This superscalar scheme is described in detail in Japanese Patent Application No. 63-283673.

[0049] FIG. 8 shows the processing when instruction 3 uses the result of instruction 2. The E stages of instructions 3 and 4 are extended until the S stage of instruction 2 completes. To satisfy the instruction specification that the result of an instruction is reflected in the next instruction, the hardware must detect this conflict and perform the processing shown in FIG. 8.

[0050] FIG. 9 shows how four-word instructions are processed. Four-word instructions are processed one per machine cycle. Under the specification described above, the result of instruction 1 is not reflected in instructions 2 and 3 and is first reflected in instruction 4. The S stage of instruction 1, which is its register write stage, completes exactly one stage before the D stage of instruction 4, which is its register read stage, so hardware conflict control of the kind described with reference to FIG. 8 is unnecessary. In this embodiment there are three operation stages, E, F, and S. In general, hardware conflict control is unnecessary if N ≥ M - 1, where N is the number of following instructions executed before the operation result is written and M is the number of pipeline stages. In this embodiment N = 2 and M = 3.

【００５１】また、１語長命令と次の有効な４語長命令
の間には必ず無効な４語長命令を２つおくものとする。
同様に有効な４語長命令と次の１語長命令の間には必ず
無効な４語長命令を２つおくものとする。Two invalid four-word instructions must always be placed between a one-word instruction and the next valid four-word instruction. Likewise, two invalid four-word instructions must always be placed between a valid four-word instruction and the next one-word instruction.

【００５２】次にこの４語長命令の好ましいプログラム
例について図１０，図１１，図１２を用いて述べる。次
式の計算をする場合を考える。Next, a preferred program example using four-word instructions is described with reference to FIGS. 10, 11, and 12. Consider computing the following expression.

【００５３】Ａ(ｉ)＝Ａ(ｉ)＋Ｃ×Ｂ(ｉ) ，１＜ｉ＜２４但し、Ｃは定数、Ａ(ｉ)，Ｂ(ｉ)は、メモリ上に、図１
０のように配置されている６４ビットの浮動小数点デー
タである。A(i) = A(i) + C × B(i), 1 < i < 24, where C is a constant and A(i) and B(i) are 64-bit floating-point data arranged in memory as shown in FIG. 10.
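
As a point of reference, the expression can be written as an ordinary element-wise loop. This is only an illustrative sketch: the function name `update` and the sample data are assumptions, and the memory layout of FIG. 10 is not modelled.

```python
def update(a, b, c):
    """Return a new list holding a[i] + c * b[i] for every index i."""
    return [ai + c * bi for ai, bi in zip(a, b)]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(update(a, b, 0.5))  # [6.0, 12.0, 18.0, 24.0]
```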

【００５４】図１１は、Ａ(ｉ)の計算をするのに、各サ
イクルに、どんな演算がされるかを説明する図である。
横軸は時刻で、単位はマシンサイクルである。図に書か
れた横長の箱は、処理されるデータの通る演算器のＥ，
Ｆ，Ｓの３ステージを示す。(１)〜(１０)について説明
する。演算は、インデックスｉについて４つずつ行われ
る。(１)〜(１０)はＡ(９)〜Ａ(１２)を計算するための
処理である。以下、各処理について説明する。定数Ｃは
ＦＲ３１にあるものとする。FIG. 11 illustrates which operations are performed in each cycle to compute A(i). The horizontal axis is time, in units of machine cycles. Each wide box in the figure represents the three stages E, F, and S of the arithmetic unit through which the data being processed passes. Items (1) to (10) are explained below. Operations are performed four at a time with respect to the index i. Items (1) to (10) are the processing for computing A(9) to A(12). The constant C is assumed to be in FR31.

【００５５】（１）Ａ(１)〜Ａ(４)をＦＲ４〜ＦＲ７，
Ｂ(７)〜Ｂ(１０)をＦＲ０〜ＦＲ３にロードする。(1) A(1) to A(4) are loaded into FR4 to FR7, and B(7) to B(10) into FR0 to FR3.

【００５６】（２）ＦＲ０×ＦＲ３１をＦＲ８に格納。(2) Store FR0 × FR31 in FR8.

【００５７】（３）ＦＲ１×ＦＲ３１をＦＲ９に格納。(3) FR1 × FR31 is stored in FR9.

【００５８】（４）ＦＲ４＋ＦＲ８をＦＲ１２に格納。(4) FR4 + FR8 is stored in FR12.

【００５９】（５）ＦＲ５＋ＦＲ９をＦＲ１３に格納。(5) FR5 + FR9 is stored in FR13.

【００６０】（６）ＦＲ２×ＦＲ３１をＦＲ１０に格
納。(6) Store FR2 × FR31 in FR10.

【００６１】（７）ＦＲ３×ＦＲ３１をＦＲ１１に格
納。(7) FR3 × FR31 is stored in FR11.

【００６２】（８）ＦＲ６＋ＦＲ１０をＦＲ１４に格
納。(8) Store FR6 + FR10 in FR14.

【００６３】（９）ＦＲ７＋ＦＲ１１をＦＲ１５に格
納。(9) FR7 + FR11 is stored in FR15.

【００６４】（１０）ＦＲ１２〜ＦＲ１５をＡ(９)〜Ａ
(１２)にストアする。(10) FR12 to FR15 are stored to A(9) to A(12).
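
The register usage in steps (1) to (10) can be followed with a small sequential model. This is a sketch under stated assumptions: the function name `process_group` is hypothetical, the caller chooses which four A and B values are loaded, and the three-cycle pipeline timing of FIG. 11 is not modelled.

```python
def process_group(a4, b4, c):
    """Run steps (1)-(10) for one group of four elements.

    a4, b4 -- the four A values and four B values loaded in step (1)
    c      -- the constant C, held in FR31
    Returns the four values stored in step (10).
    """
    assert len(a4) == 4 and len(b4) == 4
    fr = [0.0] * 32           # floating-point registers FR0-FR31
    fr[31] = c
    fr[4:8] = a4              # (1) A values into FR4-FR7
    fr[0:4] = b4              # (1) B values into FR0-FR3
    fr[8] = fr[0] * fr[31]    # (2)
    fr[9] = fr[1] * fr[31]    # (3)
    fr[12] = fr[4] + fr[8]    # (4)
    fr[13] = fr[5] + fr[9]    # (5)
    fr[10] = fr[2] * fr[31]   # (6)
    fr[11] = fr[3] * fr[31]   # (7)
    fr[14] = fr[6] + fr[10]   # (8)
    fr[15] = fr[7] + fr[11]   # (9)
    return fr[12:16]          # (10) FR12-FR15 are stored


print(process_group([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], 0.5))
```

Only FR0 to FR15 and FR31 are touched, which matches the 17 registers the description counts for this program.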

【００６５】（１)〜(１０）の演算スケジューリングに
関しては、１つの演算に、図１０を用いて説明した様に
３サイクルかかることが考慮されている。Ａ(９)〜Ａ
(１２)の処理について説明したが、Ａ(１３)〜Ａ(１
６)，Ａ(１７)〜Ａ(２０)…の処理も全く同様に行われ
る。各演算器は１サイクルピッチでパイプラインされて
おり、Ａ(１３)〜Ａ(１６)，Ａ(１７)〜Ａ(２０)…の処
理をＡ(９)〜Ａ(１２)の処理に重ねて、図１１のように
処理可能である。The scheduling of operations (1) to (10) takes into account that one operation takes three cycles, as described with reference to FIG. 10. Although the processing for A(9) to A(12) has been described, the processing for A(13) to A(16), A(17) to A(20), and so on is performed in exactly the same way. Each arithmetic unit is pipelined at a one-cycle pitch, so the processing for A(13) to A(16), A(17) to A(20), and so on can be overlapped with the processing for A(9) to A(12), as shown in FIG. 11.

【００６６】さて、図１１に示した処理を実現する４語
長命令列を示したのが図１２である。Ａ(１)〜Ａ(２４)
の演算は、図１２に示す命令１〜命令２２の２２命令で
実現できる。使用するレジスタはＦＲ０〜ＦＲ１５，Ｆ
Ｒ３１の１７本である。命令１，３，５，７，９，１
１，１３，１５でどのデータがロードされるか、又、命
令１２，１４，１６，１８，２０，２２でどのデータが
ストアされるかを、図１０に示した。命令３はＦＲ０〜
ＦＲ３に値を書き込むか、その結果が反映するのは命令
６以降であるので、命令１でロードしたＦＲ０〜ＦＲ３
の値を命令４で使用することが可能である。命令１の結
果が、命令２にすぐに反映する従来の方式で図１１と同
じ処理を行おうとすると、命令３でＦＲ０〜ＦＲ３に書
き込みを行うことはできず、ＦＲ１６〜ＦＲ１９等、新
たなレジスタが必要となる。ところが、使用できるレジ
スタの数には限りがあり、レジスタの数がネックにな
り、処理サイクル数が伸びてしまう。図１２のプログラ
ムは、現命令の結果が３命令後にしか反映しないという
遅延書き込みを生かした結果、１７本のレジスタを使う
だけで、演算が可能となったのである。FIG. 12 shows the four-word instruction sequence that realizes the processing shown in FIG. 11. The operations for A(1) to A(24) can be realized with the 22 instructions, instructions 1 to 22, shown in FIG. 12. Seventeen registers are used: FR0 to FR15 and FR31. FIG. 10 shows which data are loaded by instructions 1, 3, 5, 7, 9, 11, 13, and 15 and which data are stored by instructions 12, 14, 16, 18, 20, and 22. Instruction 3 writes values to FR0 to FR3, but because its result is reflected only from instruction 6 onward, the values of FR0 to FR3 loaded by instruction 1 can still be used by instruction 4. If the same processing as in FIG. 11 were attempted with the conventional scheme, in which the result of instruction 1 is reflected immediately in instruction 2, instruction 3 could not write to FR0 to FR3, and new registers such as FR16 to FR19 would be needed. The number of usable registers is limited, however, so the register count becomes a bottleneck and the number of processing cycles grows. The program of FIG. 12 exploits delayed writing, in which the result of the current instruction is reflected only three instructions later, and as a result the computation is possible with only 17 registers.
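
The delayed-write behaviour described above can be illustrated with a toy register file in which a write issued by instruction k becomes visible to instruction k+3. This is a minimal sketch, not the hardware of the embodiment; the class name `DelayedRegisterFile` and its method are hypothetical, and only the visibility rule is modelled.

```python
class DelayedRegisterFile:
    """Register file whose writes become visible `delay` instructions later."""

    def __init__(self, n_regs=32, delay=3):
        self.regs = [0.0] * n_regs
        self.delay = delay
        self.issued = 0      # number of instructions issued so far
        self.pending = []    # (visible_from_instruction, reg, value)

    def issue(self, writes):
        """Issue one instruction; return a snapshot of the values it reads."""
        self.issued += 1
        still_pending = []
        for visible_from, reg, value in self.pending:
            if visible_from <= self.issued:
                self.regs[reg] = value          # result becomes visible now
            else:
                still_pending.append((visible_from, reg, value))
        self.pending = still_pending
        snapshot = list(self.regs)
        for reg, value in writes.items():
            self.pending.append((self.issued + self.delay, reg, value))
        return snapshot


rf = DelayedRegisterFile()
rf.issue({0: 1.0})           # instruction 1 loads FR0
rf.issue({})                 # instruction 2
seen3 = rf.issue({0: 2.0})   # instruction 3 writes FR0 again
seen4 = rf.issue({})         # instruction 4 still reads instruction 1's value
seen5 = rf.issue({})
seen6 = rf.issue({})         # instruction 6 reads instruction 3's value
print(seen3[0], seen4[0], seen5[0], seen6[0])  # 0.0 1.0 1.0 2.0
```

In this model instruction 3 can reuse FR0 as a destination while instruction 4 still consumes the value loaded by instruction 1, which is the register saving the paragraph describes.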

【００６７】本例で示した様に遅延書き込みは、レジス
タを指定するオペコードのフィールドを増加させずに、
実質的に使用できるレジスタの数を増やす効果がある。As this example shows, delayed writing has the effect of increasing the number of registers that are effectively usable without enlarging the opcode fields that specify registers.

【００６８】図１２上で、“Ｘ”印は何も演算するもの
がない為、空いてしまったフィールドであるが、特に、
処理の開始時の命令１〜６，終了時の命令１７〜２２に
空フィールドが多い。しかし、これらの空フィールド
は、コンパイラにより、次の一連の処理の開始処理と、
現在の一連の処理の終了処理を重ねることにより、減ら
すことができる。また、全くすることのない、命令２，
４，２１は、命令１，３，２０のＮフィールドを“０
１”にすることにより、省略することができる。In FIG. 12, an "X" marks a field left empty because there is nothing to compute. Empty fields are especially numerous in instructions 1 to 6 at the start of the processing and in instructions 17 to 22 at the end. The compiler can reduce these empty fields, however, by overlapping the start of the next series of processing with the end of the current one. In addition, instructions 2, 4, and 21, which do nothing at all, can be omitted by setting the N fields of instructions 1, 3, and 20 to "01".

【００６９】本発明では、演算結果を書き込むレジスタ
の指定は、演算を指定する命令で行うので、命令４，２
１が初めて省略できる。演算結果を書き込むレジスタの
指定を、書き込みが行われるステージに発行される命令
で指定する方式では、命令４，２１では命令１，１８で
演算した結果の書き込み指定が必要であり、省略不可で
ある。In the present invention, the register to which an operation result is written is specified by the instruction that specifies the operation, and it is only for this reason that instructions 4 and 21 can be omitted. In a scheme in which the destination register is specified by the instruction issued at the stage where the write takes place, instructions 4 and 21 would have to specify the writes of the results computed by instructions 1 and 18, and so could not be omitted.

【００７０】また、図１２のプログラムの前に１語長命
令があった時に、図１２の命令１の前に２つの４語長の
無効命令を挿入しなくてはならないが、これは、Ｎフィ
ールドが“０１”である４語長の無効命令を１つ入れれ
ばよい。また、図１２のプログラムの後に１語長命令が
来る場合、図１２の命令２２のＮフィールドを“10”に
すればよい。When a one-word instruction precedes the program of FIG. 12, two invalid four-word instructions must be inserted before instruction 1 of FIG. 12; for this it suffices to insert a single invalid four-word instruction whose N field is "01". When a one-word instruction follows the program of FIG. 12, the N field of instruction 22 of FIG. 12 may be set to "10".

【００７１】このように命令を長くしてＮフィールドを
設け、有効に使用することにより、コードサイズを削減
することが可能である。また、１語長命令では４語で４
つの演算しか指示できないのに比べ、４語長命令では４
語で、図５に示すように７つの演算が指示できる。By lengthening the instruction, providing the N field, and using it effectively in this way, the code size can be reduced. Furthermore, whereas one-word instructions can specify only four operations in four words, a four-word instruction can specify seven operations in four words, as shown in FIG. 5.

【００７２】次に、プログラム作成方法について述べ
る。プログラムは FORTRAN，Ｃなどの高級言語で記述
し、コンパイラにより命令列に変換する。Next, a program creation method will be described. The program is described in a high-level language such as FORTRAN or C, and is converted into an instruction sequence by a compiler.

【００７３】図２６に本発明に関するコンパイラの処理
フローを示す。高級言語によるプログラムは、字句解析
部，構文解析部，意味解析部を経て、中間コードに変換
される。中間コードは、最適化部によって最適化され、
コード生成部により図３に示した命令列に変換される。
ここで、最適化部とコード生成部を合わせて合成部とい
う。この合成部に本願発明に関する特徴がある。すなわ
ち、この合成部では、中間コードを見て並列してできる
演算ができるだけ多くなるように命令列を生成する並列
化部を有する。この並列化部では、並列してできる演算
数が多いときには４語長命令を用い、並列してできる演
算数の少ないときには１語長命令を用いる。ここで、４
語長命令を用いるか１語長命令を用いるかの判断基準
は、並列して実行できる演算数により決まるが、この演
算数はシステムによって異なるので、プログラムを作成
するときにパラメータとしてコンパイラに指定できるよ
うになっている。上述のようにすることにより、コード
サイズが小さくなり、主メモリ、及び、キャッシュメモ
リの使用効率が上がり、処理速度が高められる。FIG. 26 shows the processing flow of a compiler according to the present invention. A program written in a high-level language passes through a lexical analysis section, a syntax analysis section, and a semantic analysis section and is converted into intermediate code. The intermediate code is optimized by an optimization section and converted by a code generation section into the instruction sequences shown in FIG. 3. Here, the optimization section and the code generation section are together called the synthesis section. The features of the present invention lie in this synthesis section. Specifically, the synthesis section has a parallelization section that examines the intermediate code and generates the instruction sequence so that as many operations as possible are performed in parallel. The parallelization section uses four-word instructions when many operations can be performed in parallel and one-word instructions when few can. The criterion for choosing between four-word and one-word instructions is the number of operations that can be executed in parallel; because this number differs from system to system, it can be given to the compiler as a parameter when the program is created. This reduces the code size, improves the utilization of the main memory and the cache memory, and increases the processing speed.
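
The selection rule of the parallelization section can be sketched as a simple threshold test. The function name `choose_word_length`, the use of a single integer threshold, and the "greater than or equal" comparison are all assumptions made for illustration; the description says only that the criterion is a system-dependent parameter given to the compiler.

```python
def choose_word_length(parallel_ops: int, threshold: int) -> int:
    """Return 4 to emit a four-word instruction, 1 to emit one-word instructions.

    parallel_ops -- number of operations that can be executed in parallel
    threshold    -- compiler parameter (hypothetical form of the criterion)
    """
    return 4 if parallel_ops >= threshold else 1


# Hypothetical system in which 3 or more parallel operations justify
# a four-word instruction.
counts = [1, 2, 5, 7, 1]
print([choose_word_length(n, threshold=3) for n in counts])  # [1, 1, 4, 4, 1]
```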

【００７４】次に、合成部の特徴としてレジスタの割当
てがある。４語長命令はその演算結果が３つ後の命令に
初めて反映されるので、特別な配慮が必要になる。例え
ば、図１２において命令１の結果は命令４に初めて反映
されるので、命令２，３では命令１の結果を使用せずに
できる演算をできるだけたくさん割り付けておく。この
時、演算できる命令が１つも無いときは、無効命令生成
部によって無効命令を挿入しておく。また、合成部で
は、１語長命令と次の有効な４語長命令との間には必ず
無効な４語長命令を２つ挿入しておく。逆に、有効な４
語長命令と次の１語長命令との間にも必ず無効な４語長
命令を２つ挿入しておく。ここで、先に述べたように無
効命令はＮフィールドを用いることによって省略でき
る。すなわち、本実施例のコンパイラは、長い命令間の
競合を検出し、Ｎフィールドを用いて、命令実行後に挿
入すべき無効サイクルの数を指定できるので、ハードウ
ェアで長い命令間の競合を検出したり、処理をする必要
が無い。Another feature of the synthesis section is register allocation. Because the operation result of a four-word instruction is first reflected three instructions later, special care is required. For example, in FIG. 12 the result of instruction 1 is first reflected in instruction 4, so as many operations as possible that do not use the result of instruction 1 are assigned to instructions 2 and 3. If there is no operation at all that can be assigned, an invalid instruction is inserted by the invalid instruction generation section. The synthesis section also always inserts two invalid four-word instructions between a one-word instruction and the next valid four-word instruction and, conversely, between a valid four-word instruction and the next one-word instruction. As described earlier, invalid instructions can be omitted by using the N field. That is, the compiler of this embodiment detects conflicts between long instructions and can use the N field to specify the number of invalid cycles to be inserted after an instruction is executed, so the hardware does not need to detect or handle conflicts between long instructions.
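
The insertion of invalid instructions at one-word/four-word transitions, and their subsequent folding into N fields, can be sketched as two passes over an instruction stream. This is an illustrative model only: the tags 'W1', 'W4', and 'NOP4', the function name `insert_and_fold`, and the limit of two invalid cycles per N field (the largest value shown in the description) are assumptions.

```python
def insert_and_fold(stream, max_n=2):
    """stream: list of 'W1' (one-word) or 'W4' (four-word) instruction tags.

    Returns a list of (tag, n_field) pairs, where 'NOP4' is an invalid
    four-word instruction and n_field is the number of invalid cycles
    requested after that instruction.
    """
    # Pass 1: place two invalid four-word instructions at every transition.
    padded = []
    for tag in stream:
        if padded and padded[-1] != tag:
            padded += ['NOP4', 'NOP4']
        padded.append(tag)

    # Pass 2: fold an invalid instruction into the N field of the preceding
    # four-word instruction (one-word instructions have no N field here).
    out = []
    for tag in padded:
        if (tag == 'NOP4' and out
                and out[-1][0] in ('W4', 'NOP4') and out[-1][1] < max_n):
            prev_tag, n = out[-1]
            out[-1] = (prev_tag, n + 1)
        else:
            out.append((tag, 0))
    return out


print(insert_and_fold(['W1', 'W4', 'W4', 'W1']))
# [('W1', 0), ('NOP4', 1), ('W4', 0), ('W4', 2), ('W1', 0)]
```

The output reproduces the two cases discussed for FIG. 12: one invalid four-word instruction with N field "01" ahead of the first four-word instruction, and N field "10" on the last four-word instruction before a one-word instruction.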

【００７５】次に、これまでに説明した命令の処理を行
うハードウェアの一実施例について説明する。図１３は
図１を詳細化した全体構成である。１３００は命令キャ
ッシュ、１３０１は命令キャッシュコントローラ、１３
０２は命令処理フローを制御する分岐ユニット、１３０
３は、命令をデコードする命令制御ユニット、1304は整
数演算ユニット、１３０７は浮動小数点演算ユニット、
１３０６はデータキャッシュ、１３０５はデータキャッ
シュコントローラ、１３０８はメモリインタフェースユ
ニットである。An embodiment of the hardware that processes the instructions described so far is explained next. FIG. 13 is an overall configuration that shows FIG. 1 in more detail. 1300 is an instruction cache, 1301 is an instruction cache controller, 1302 is a branch unit that controls the instruction processing flow, 1303 is an instruction control unit that decodes instructions, 1304 is an integer operation unit, 1307 is a floating-point operation unit, 1306 is a data cache, 1305 is a data cache controller, and 1308 is a memory interface unit.

【００７６】命令制御ユニット１３０３は、命令キャッ
シュ１３００より実行すべき命令をバス１３１０を通し
て受け取り、命令をデコードし、整数演算ユニット制御
信号１３１８を整数演算ユニット１３０４に、浮動小数
点演算ユニット制御信号1314を浮動小数点演算ユニット
１３０７に、分岐ユニット制御信号１３１２を分岐ユニ
ット１３０２に送出する。さらにプログラムカウンタ３
５００の制御のためにモード信号１１０も、分岐ユニッ
ト１３０２に送出する。また、整数演算ユニット１３０
４より、フラグ１３１７を、浮動小数点演算ユニットよ
りフラグ1315を受け取る。The instruction control unit 1303 receives the instruction to be executed from the instruction cache 1300 over the bus 1310, decodes it, and sends an integer operation unit control signal 1318 to the integer operation unit 1304, a floating-point operation unit control signal 1314 to the floating-point operation unit 1307, and a branch unit control signal 1312 to the branch unit 1302. It also sends the mode signal 110 to the branch unit 1302 for control of the program counter 3500. In addition, it receives flags 1317 from the integer operation unit 1304 and flags 1315 from the floating-point operation unit.

【００７７】整数演算ユニット１３０４は、オペランド
アドレス１３１９をデータキャッシュ１３０６と、デー
タキャッシュコントローラ１３０５に送出する。データ
キャッシュより読み出されたデータは、データバス１３
２０を通して、整数演算ユニット１３０４、又は、浮動
小数点演算ユニット１３０７に送出される。データキャ
ッシュの中に所望のデータが無い時には、データキャッ
シュコントローラ1305が、メモインタフェースユニット
１３０８にインタフェース信号１３２１を通して起動を
かけ、主メモリよりデータを読み出す。この間の待合わ
せ制御を、信号１３１６を通して、命令制御ユニット１
３０３と行う。The integer operation unit 1304 sends an operand address 1319 to the data cache 1306 and the data cache controller 1305. Data read from the data cache is sent over the data bus 1320 to the integer operation unit 1304 or the floating-point operation unit 1307. When the desired data is not in the data cache, the data cache controller 1305 activates the memory interface unit 1308 through the interface signal 1321 and reads the data from main memory. The wait control during this time is carried out with the instruction control unit 1303 through the signal 1316.

【００７８】分岐ユニットは、次に読み出すべき命令の
アドレス１３０９を、命令キャッシュ１３００と、命令
キャッシュコントローラ１３０１に送出する。所望の命
令が命令キャッシュ１３００の中に無い時には、命令キ
ャッシュコントローラ1301は、メモリインタフェースユ
ニット１３０８にインタフェース信号１３１３を通して
起動をかけ、主メモリより、命令を読み出す。この間の
待合わせ制御を信号１３１１を通して、命令制御ユニッ
ト１３０３と行う。The branch unit sends the address 1309 of the next instruction to be read to the instruction cache 1300 and the instruction cache controller 1301. When the desired instruction is not in the instruction cache 1300, the instruction cache controller 1301 activates the memory interface unit 1308 through the interface signal 1313 and reads the instruction from the main memory. The waiting control during this time is performed with the instruction control unit 1303 through the signal 1311.

【００７９】整数演算ユニット１３０４の詳細を示した
のが、図１４である。１４００はデコーダ、１４０１は
第１ＡＬＵ、１４０２は第２ＡＬＵ、１４０３は整数レ
ジスタである。第１ＡＬＵにはソースバス１４０６，１
４０７を通して、整数レジスタファイル１４０３からデ
ータが送られ、演算結果は、ターゲットバス１３２２を
通して、整数レジスタファイル１４０３に返される。ま
た、第２ＡＬＵには、ソースバス１４０８，１４０９を
通して、整数レジスタファイル１４０３からデータが送
られ、演算結果は、ターゲットバス１３１９を通して整
数レジスタファイル１４０３に返される。１３１７−１
は、第１ＡＬＵより出力されるフラグ、１３１７−２は
第２ＡＬＵより出力されるフラグである。バス１３１
９，1322は、ロードストア及びプリフェッチ時のアドレ
スとしてデータキャッシュ１３０６に送出される。FIG. 14 shows the details of the integer operation unit 1304. 1400 is a decoder, 1401 is a first ALU, 1402 is a second ALU, and 1403 is an integer register file. Data is sent to the first ALU from the integer register file 1403 over source buses 1406 and 1407, and the result is returned to the integer register file 1403 over target bus 1322. Likewise, data is sent to the second ALU from the integer register file 1403 over source buses 1408 and 1409, and the result is returned to the integer register file 1403 over target bus 1319. 1317-1 is the flag output by the first ALU, and 1317-2 is the flag output by the second ALU. Buses 1319 and 1322 are also sent to the data cache 1306 as addresses for load, store, and prefetch operations.

【００８０】図１３の浮動小数点演算ユニット１３０７
の詳細を示したのが、図１５である。１５０１はデコー
ダ、１５０２は浮動小数点レジスタファイル、１５０３
は第１乗算器、１５０４は第２乗算器、１５０５は第１
加算器、１５０６は第２加算器である。第１乗算器１５
０３には、ソースバス１５１７，１５１８を通して、第
２乗算器１５０４には、ソースバス１５１５，１５１６
を通して、第１加算器１５０５には、ソースバス１５１
３，１５１４を通して、第２加算器１５０６には、ソー
スバス１５１１，１５１２を通して、浮動小数点レジス
タファイル1502よりデータが送られ、演算結果はそれぞ
れ、ターゲットバス１５０７，１５０８，１５０９，１
５１０を通して浮動小数点レジスタファイルに書き込ま
れる。FIG. 15 shows the details of the floating-point operation unit 1307 of FIG. 13. 1501 is a decoder, 1502 is a floating-point register file, 1503 is a first multiplier, 1504 is a second multiplier, 1505 is a first adder, and 1506 is a second adder. Data is sent from the floating-point register file 1502 to the first multiplier 1503 over source buses 1517 and 1518, to the second multiplier 1504 over source buses 1515 and 1516, to the first adder 1505 over source buses 1513 and 1514, and to the second adder 1506 over source buses 1511 and 1512. The results are written to the floating-point register file over target buses 1507, 1508, 1509, and 1510, respectively.

【００８１】１３１５−１は、第１乗算器１５０３のフ
ラグ、１３１５−２は第２乗算器１５０４のフラグ、１
３１５−３は第１加算器１５０５のフラグ、１３１５−
４は第２加算器１５０６のフラグである。1315-1 is the flag of the first multiplier 1503, 1315-2 is the flag of the second multiplier 1504, 1315-3 is the flag of the first adder 1505, and 1315-4 is the flag of the second adder 1506.

【００８２】図１５の浮動小数点レジスタファイル１５
０２の詳細を示したのが、図１６である。１６００〜１
６０８は浮動小数点レジスタ、１３１４−１〜１３１４
−９はそれぞれ浮動小数点レジスタ１６００〜１６０８
の制御信号である。１６１０はロードアライナ、１６０
９はストアアライナである。１６１１〜１６１８は、浮
動小数点レジスタ１６００〜１６０８を、メモリと結ぶ
バスである。バス1611は、ＦＲ０，１６，２４に、バス
１６１２は、ＦＲ１，９，１７，２５に接続されてい
る。バス１６１３，１６１４，１６１５，１６１６，１
６１７も同様で、バス１６１８にはＦＲ７，１５，２
３，３１が接続されている。ロード命令実行時には、バ
ス１３２０を通して、送られてきたデータを、ロードア
ライナ1610により、１６１１〜１６１８の内、所望のバ
スに乗せかえ、所望のレジスタに書き込む。またストア
命令実行時には、１６１１〜１６１８にレジスタよりデ
ータが読み出され、ストアアライナ１６０９により、バ
ス１３２０の所望の位置にデータが出力される。FIG. 16 shows the details of the floating-point register file 1502 of FIG. 15. 1600 to 1608 are floating-point registers, and 1314-1 to 1314-9 are the control signals of floating-point registers 1600 to 1608, respectively. 1610 is a load aligner and 1609 is a store aligner. 1611 to 1618 are buses that connect the floating-point registers 1600 to 1608 to memory. Bus 1611 is connected to FR0, 16, and 24, and bus 1612 to FR1, 9, 17, and 25. Buses 1613, 1614, 1615, 1616, and 1617 are connected in the same way, and bus 1618 is connected to FR7, 15, 23, and 31. When a load instruction is executed, the load aligner 1610 transfers the data arriving over bus 1320 onto the desired bus among 1611 to 1618, and the data is written to the desired register. When a store instruction is executed, data is read from the registers onto 1611 to 1618, and the store aligner 1609 outputs the data at the desired position on bus 1320.
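
The register-to-bus assignment listed above follows a modulo-8 pattern, which can be written down as a one-line mapping. This is a sketch: the function name `memory_bus_for` is hypothetical, and the placement of FR8 on bus 1611 is inferred from the pattern, since the description lists only FR0, 16, and 24 for that bus.

```python
def memory_bus_for(fr_index: int) -> int:
    """Return the number of the memory bus (1611-1618) serving register FRn."""
    if not 0 <= fr_index <= 31:
        raise ValueError("floating-point register index must be 0..31")
    return 1611 + fr_index % 8


print([memory_bus_for(n) for n in (1, 9, 17, 25)])   # [1612, 1612, 1612, 1612]
print([memory_bus_for(n) for n in (7, 15, 23, 31)])  # [1618, 1618, 1618, 1618]
```

Because consecutive register numbers fall on different buses, a group such as FR0 to FR3 or FR12 to FR15 can be moved to or from memory at once without two registers sharing a bus.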

【００８３】図１６の浮動小数点レジスタ１６００の第
１の実施例を記したのが、図１７である。レジスタ１６
００について示したが、１６０１〜１６０８も同様であ
る。図１７に示すようにレジスタ１６００は６４ビット
のレジスタの集まりである。１７００〜１７６３はそれ
ぞれ１ビットのレジスタである。１５１１−００〜１５
１８−００はレジスタ１７００の読み出しバス、１５０
７−００〜１５１０−００はレジスタ１７００の書き込
みバス、１６１１−００はレジスタ１７００の読み書き
バスである。レジスタ１７６３のバス構成も同様であ
る。FIG. 17 shows a first embodiment of the floating-point register 1600 of FIG. 16. Register 1600 is shown, but 1601 to 1608 are the same. As shown in FIG. 17, register 1600 is a 64-bit register built from the 1-bit registers 1700 to 1763. 1511-00 to 1518-00 are the read buses of register 1700, 1507-00 to 1510-00 are the write buses of register 1700, and 1611-00 is the read/write bus of register 1700. The bus configuration of register 1763 is the same.

【００８４】図２８は、図１３のデータキャッシュの詳
細を図示したものである。２８０１はデータを保持する
データアレイ、２８００はロードストア演算用のアドレ
スアレイ、２８０２はプリフェッチ用のアドレスアレイ
である。アドレスアレイ2800と２８０２は同じ内容のデ
ータを保持している。１語長命令のロードストア命令実
行時には、バス１３１９または１３２２を用いてアドレ
スアレイ２８００とデータアレイ２８０１がアクセスさ
れる。４語長命令実行時には、バス１３２２を用いてア
ドレスアレイ２８００とデータアレイ２８０１がアクセ
スされる。また、バス１３１９を用いてプリフェッチの
ためにアドレスアレイ２８０２がアクセスされる。FIG. 28 shows the details of the data cache of FIG. 13. 2801 is a data array that holds data, 2800 is an address array for load/store operations, and 2802 is an address array for prefetch. Address arrays 2800 and 2802 hold the same contents. When a one-word load/store instruction is executed, address array 2800 and data array 2801 are accessed using bus 1319 or 1322. When a four-word instruction is executed, address array 2800 and data array 2801 are accessed using bus 1322, and address array 2802 is accessed for prefetch using bus 1319.

【００８５】図２９は、ロードストア演算でキャッシュ
ミスを生じたときのパイプラインを示す図である。パイ
プラインは、データがメモリからキャッシュメモリへ転
送される間ロックされる。図２９でφで示されるのがロ
ックされる期間である。FIG. 29 is a diagram showing a pipeline when a cache miss occurs in the load store operation. The pipeline is locked while data is transferred from memory to cache memory. The period indicated by φ in FIG. 29 is the locked period.

【００８６】一方、プリフェッチ演算のときには、アド
レスアレイ２８０２がヒットすれば何も行わない。ミス
すれば、そのアドレスを含むブロックがメモリよりデー
タアレイ２８０１にバスを通して転送される。但し、こ
の時はパイプラインはロックされない。コンパイラによ
り、ミスする可能性のあるロードストア演算の前にプリ
フェッチを設定しておけば、メモリからキャッシュメモ
リへの転送を他の演算と並列に行うことができる。その
ために、図２９で示したパイプラインロックによる性能
低下を避けることができる。On the other hand, during the prefetch operation, nothing is performed if the address array 2802 hits. If a miss occurs, the block containing the address is transferred from the memory to the data array 2801 via the bus. However, at this time, the pipeline is not locked. If a prefetch is set by the compiler before a load / store operation that may cause a miss, the transfer from the memory to the cache memory can be performed in parallel with other operations. Therefore, performance degradation due to the pipeline lock shown in FIG. 29 can be avoided.
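
The benefit of issuing a prefetch ahead of a load that may miss can be illustrated with a toy cycle count. This is a sketch under strong assumptions: the function name `run`, the miss penalty of 10 cycles, and the 32-byte block size are arbitrary illustration values, not figures from the description.

```python
def run(ops, miss_penalty=10, block=32):
    """Count cycles for a list of (kind, address) operations.

    kind is 'load', 'prefetch', or 'op' (any other operation; address unused).
    A load that misses locks the pipeline until the block arrives;
    a prefetch starts the transfer without locking the pipeline.
    """
    ready_at = {}   # block number -> cycle at which the block is in the cache
    cycle = 0
    for kind, addr in ops:
        blk = addr // block
        if kind == 'prefetch':
            ready_at.setdefault(blk, cycle + miss_penalty)
        elif kind == 'load':
            arrival = ready_at.setdefault(blk, cycle + miss_penalty)
            cycle = max(cycle, arrival)   # pipeline locked until data arrives
        cycle += 1
    return cycle


work = [('op', 0)] * 10
print(run(work + [('load', 64)]))                     # 21
print(run([('prefetch', 64)] + work + [('load', 64)]))  # 12
```

In the second run the block transfer overlaps the ten other operations, so the load finds its data already present and no locked cycles occur.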

【００８７】図１７のレジスタ１７００の回路構成例を
記したのが、図１８である。1816と１８１７はインバー
タ、１８０２〜１８１５はクロックドインバータであ
る。FIG. 18 shows an example of a circuit configuration of the register 1700 in FIG. 1816 and 1817 are inverters, and 1802-1815 are clocked inverters.

【００８８】１３１４−１−１〜８がhighになると、そ
れぞれバス１５１１−００〜１５１８−００にレジスタ
の値が出力される。また、１３１４−１−１０〜１４が
highになると、バス１５１０−００〜１５０７−００の
値がレジスタに書き込まれる。また、１３１４−１−９
がhighになるとレジスタの値がバス１６１１−００に出
力され、１３１４−１−１０がhighになると、バス１６
１１−００がレジスタに書き込まれる。信号１８００は
予備の読み出しポート、１８０１は予備の書き込みポー
トである。１８００と１８０１の用途については後に説
明する。When signals 1314-1-1 to 1314-1-8 go high, the register value is output to buses 1511-00 to 1518-00, respectively. When signals 1314-1-10 to 1314-1-14 go high, the values on buses 1510-00 to 1507-00 are written to the register. When 1314-1-9 goes high, the register value is output to bus 1611-00, and when 1314-1-10 goes high, the value on bus 1611-00 is written to the register. Signal 1800 is a spare read port and 1801 is a spare write port. The uses of 1800 and 1801 are explained later.

【００８９】図１９は、図１６の浮動小数点レジスタ１
６００の第２の実施例を示したものである。図１９の実
施例は、図１７の実施例と比較して１９００〜１９６３
の第１シャドウレジスタ，２０００〜２０６３の第２シ
ャドウレジスタが付加されている点が異なっている。第
１シャドウレジスタ１９００は信号１８００を通して、
レジスタ１７００の値を読み取ることができる。また、
信号１９６４を通して第２シャドウレジスタ２０００
に、第１シャドウレジスタ１９００の値を送出する。第
２シャドウレジスタ２０００は、自分の値を信号１８０
１を通してレジスタ１７００に送出する。即、レジスタ
１７００〜１７６３，第１シャドウレジスタ１９００〜
１９６３，第２シャドウレジスタ２０００〜２０６３は
リング状のシフトレジスタを構成している。更に第１シ
ャドウレジスタ１９００〜１９６３，第２シャドウレジ
スタ２０００〜２０６３は、レジスタ１７００〜１７６
３と同様に、バス１６１１−００〜１６１１−６３を通
して、読み書きができる。FIG. 19 shows a second embodiment of the floating-point register 1600 of FIG. 16. The embodiment of FIG. 19 differs from that of FIG. 17 in that first shadow registers 1900 to 1963 and second shadow registers 2000 to 2063 are added. The first shadow register 1900 can read the value of register 1700 through signal 1800, and it sends its own value to the second shadow register 2000 through signal 1964. The second shadow register 2000 sends its value to register 1700 through signal 1801. In other words, registers 1700 to 1763, first shadow registers 1900 to 1963, and second shadow registers 2000 to 2063 form a ring-shaped shift register. In addition, like registers 1700 to 1763, the first shadow registers 1900 to 1963 and second shadow registers 2000 to 2063 can be read and written over buses 1611-00 to 1611-63.

【００９０】１３１４−１−１５は、第１シャドウレジ
スタ１９００〜１９６３の制御信号、１３１４−１−１
６は、第２シャドウレジスタ２０００〜２０６３の制御
信号である。1314-1-15 is the control signal of the first shadow registers 1900 to 1963, and 1314-1-16 is the control signal of the second shadow registers 2000 to 2063.

【００９１】シャドウレジスタの目的は、４語長命令実
行時の割込みからの復帰を可能にすることである。図２
０〜図２２を用いてその動作を説明する。Ｗ′ステージ
は、レジスタから、第１シャドウレジスタＦＲＳ１に書
き込むステージ、Ｗ″は第１シャドウレジスタＦＲＳ１
から第２シャドウレジスタＦＲＳ２に書き込むステージ
である。The purpose of the shadow registers is to make it possible to return from an interrupt taken while four-word instructions are executing. The operation is explained with reference to FIGS. 20 to 22. The W′ stage is the stage in which the register is written to the first shadow register FRS1, and the W″ stage is the stage in which the first shadow register FRS1 is written to the second shadow register FRS2.

【００９２】図２０は割込みのない通常時の４語長命令
の動作である。ＦＲ，ＦＲＳ１，ＦＲＳ２のタイムチャ
ート上の数字は、どの命令の演算結果が各レジスタに入
っているかを示す。図２０の通り、通常時は、ＦＲから
ＦＲＳ１へ、ＦＲＳ１からＦＲＳ２へと１サイクルピッ
チで演算結果がシフトされる。FIG. 20 shows the operation of a four-word instruction at normal time without interruption. The numbers on the time charts of FR, FRS1, and FRS2 indicate which instruction operation result is stored in each register. As shown in FIG. 20, the calculation result is normally shifted from FR to FRS1 and from FRS1 to FRS2 at one cycle pitch.

【００９３】図２１は、命令３と命令４の間に割込みが
入った時の動作を示す図である。命令４，５，６，７は
無効化される。各レジスタは割込み発生後値の更新を止
め、ＦＲは命令３の結果を、ＦＲＳ１は命令２の結果
を、ＦＲＳ２は命令１の結果を保持する。また、プログ
ラムカウンタには割込みベクタがセットされる。割込み
ベクタから始まる割込み処理プログラムで、ＦＲ，ＦＲ
Ｓ１，ＦＲＳ２の値をメモリ上に退避する図２２は、割
込み処理からの復帰時の動作を説明する図である。ま
ず、割込み処理プログラムの最後に、図２２に示すよう
に命令１の結果をＦＲに、命令２の結果をＦＲＳ２に、
命令３の結果をＦＲＳ１に復帰する。こうすることによ
り、命令４のレジスタ読み出しステージのＥステージ
で、命令１の結果を見ることができる。命令４のＥステ
ージ終了後、ＦＲの値をＦＲＳ１に、ＦＲＳ１の値をＦ
ＲＳ２に、ＦＲＳ２の値をＦＲへコピーする。この結
果、命令５のＥステージでは命令２の結果を見ることが
できる。命令５のレジスタ読み出しステージのＥステー
ジ終了後も、同じ動作をさせることにより、命令６のレ
ジスタ読み出しステージのＥステージで、命令３の結果
を見ることができる。以後の処理は通常通りで、１命令
実行毎にＦＲの値をＦＲＳ１へ、ＦＲＳ１の値をＦＲＳ
２へコピーし、ＦＲＳ２の値は捨てる。FIG. 21 shows the operation when an interrupt occurs between instruction 3 and instruction 4. Instructions 4, 5, 6, and 7 are invalidated. After the interrupt occurs, each register stops updating its value: FR holds the result of instruction 3, FRS1 the result of instruction 2, and FRS2 the result of instruction 1. An interrupt vector is set in the program counter. The interrupt handling program that starts at the interrupt vector saves the values of FR, FRS1, and FRS2 to memory. FIG. 22 explains the operation on return from interrupt handling. First, at the end of the interrupt handling program, the result of instruction 1 is restored to FR, the result of instruction 2 to FRS2, and the result of instruction 3 to FRS1, as shown in FIG. 22. As a result, instruction 4 can see the result of instruction 1 in its E stage, which is its register read stage. After the E stage of instruction 4, the value of FR is copied to FRS1, the value of FRS1 to FRS2, and the value of FRS2 to FR. Consequently, instruction 5 can see the result of instruction 2 in its E stage. The same operation is performed after the E stage of instruction 5, so that instruction 6 can see the result of instruction 3 in its E stage. Processing thereafter is as usual: each time an instruction is executed, the value of FR is copied to FRS1 and the value of FRS1 to FRS2, and the value of FRS2 is discarded.
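
The restore-and-rotate sequence on return from an interrupt can be traced for a single register with a few lines of code. This is a minimal sketch of the sequence as described for FIG. 22; the function name `restore_after_interrupt` and the use of plain Python values in place of register contents are assumptions.

```python
def restore_after_interrupt(r1, r2, r3):
    """Replay the return sequence for one register.

    r1, r2, r3 -- saved results of instructions 1, 2, 3 (held in FRS2, FRS1,
                  and FR respectively when the interrupt occurred)
    Returns the values seen in FR by instructions 4, 5, 6 and the final
    (FR, FRS1, FRS2) contents.
    """
    # End of the interrupt handler: r1 -> FR, r3 -> FRS1, r2 -> FRS2.
    fr, frs1, frs2 = r1, r3, r2
    seen = [fr]                          # instruction 4 reads FR
    # Ring rotation: FR -> FRS1, FRS1 -> FRS2, FRS2 -> FR.
    fr, frs1, frs2 = frs2, fr, frs1
    seen.append(fr)                      # instruction 5 reads FR
    fr, frs1, frs2 = frs2, fr, frs1
    seen.append(fr)                      # instruction 6 reads FR
    return seen, (fr, frs1, frs2)


seen, state = restore_after_interrupt('r1', 'r2', 'r3')
print(seen)   # ['r1', 'r2', 'r3']
print(state)  # ('r3', 'r2', 'r1')
```

After the two rotations the three registers hold exactly what they held when the interrupt occurred, so normal shifting can resume from instruction 6 onward.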

【００９４】以上、説明した様に、シャドウレジスタを
設けることにより、遅延書き込み命令実行時でも割込み
を受け付け復帰することができる。シャドウレジスタの
無い場合は、図２１で命令３の結果しか退避できず、図
２２の割込み復帰時の命令４で、命令１の結果を見るこ
とができない。これは、命令２，３が命令１と同じレジ
スタに値を書き込む場合があるからである。例えば図１
２のプログラムでも、命令１と同じレジスタに命令３で
書き込む。As explained above, providing shadow registers makes it possible to accept an interrupt and return from it even while delayed-write instructions are executing. Without shadow registers, only the result of instruction 3 could be saved in FIG. 21, and instruction 4 could not see the result of instruction 1 on return from the interrupt in FIG. 22. This is because instructions 2 and 3 may write to the same register as instruction 1. In the program of FIG. 12, for example, instruction 3 writes to the same registers as instruction 1.

【００９５】次に、シャドウレジスタ付加によるハード
ウェアの増加量について述べる。レジスタの大きさは、
ポート数にほぼ比例する。図１７と図１９を比べれば分
かるように、シャドウレジスタのポート数は３と、レジ
スタのポート数１３に比べてずっと小さいので、シャド
ウレジスタ付加によるハードウェアの増加量は小さい。The increase in hardware caused by adding the shadow registers is considered next. The size of a register is roughly proportional to its number of ports. As a comparison of FIG. 17 and FIG. 19 shows, a shadow register has 3 ports, far fewer than the 13 ports of a register, so the hardware added by the shadow registers is small.

【００９６】図４１は、図１３の命令制御ユニット１３
０３の実施例である。１５０は命令語長判定手段、１０
１は第１命令レジスタ、１０２は第２命令レジスタ、１
０３は第３命令レジスタ、１０４は第４命令レジスタで
ある。４１００はモードレジスタである。１００はモー
ド制御回路、１０５はレジスタ読み出し制御回路、１０
６はレジスタ書き込み制御回路、１０７はファンクショ
ン制御回路、１０８はパイプライン制御回路、１０９は
競合検出回路である。FIG. 41 shows an embodiment of the instruction control unit 1303 of FIG. 13. 150 is an instruction word length determining means, 101 is a first instruction register, 102 is a second instruction register, 103 is a third instruction register, and 104 is a fourth instruction register. 4100 is a mode register. 100 is a mode control circuit, 105 is a register read control circuit, 106 is a register write control circuit, 107 is a function control circuit, 108 is a pipeline control circuit, and 109 is a conflict detection circuit.

【００９７】４語長命令は、必ず４語境界をまたがない
様に配置されているものとする。また、１語長命令は、
２語境界に囲まれた２語が同時に実行されるものとす
る。本実施例では命令語長判定は、図３で説明した様に
オペコードの中の最も左のビット、即、図４１の信号１
３１０−１−１(Ｃ０００)そのものを見ることにより行
われる。Four-word instructions are assumed always to be placed so that they do not straddle a four-word boundary. For one-word instructions, the two words enclosed by two-word boundaries are executed simultaneously. In this embodiment, the instruction word length is determined, as explained with reference to FIG. 3, by examining the leftmost bit of the opcode, that is, signal 1310-1-1 (C000) of FIG. 41 itself.

【００９８】モード制御回路１００の詳細を示したのが
図２７である。２７００は制御回路、２７０１はＮフィ
ールド１３１０−４−１を保持するレジスタ、２７０２
はディクリメンタ、２７０３はコンパレータである。コ
ンパレータの出力信号VALID（２７０４）は、制御回路
２７００に送り出される。レジスタ２７０１にセットさ
れたＮフィールドの値は、ディクリメンタ２７０２によ
り１サイクルごとに１減算され、００になったときに信
号ＶＡＬＩＤ(２７０４)がアサートされる。信号ＶＡＬ
ＩＤはネゲート時に無効サイクルの挿入を指示し、アサ
ート時に命令の実行を指示する信号である。FIG. 27 shows the details of the mode control circuit 100. 2700 is a control circuit, 2701 is a register that holds the N field 1310-4-1, 2702 is a decrementer, and 2703 is a comparator. The comparator output signal VALID (2704) is sent to the control circuit 2700. The N field value set in register 2701 is decremented by 1 every cycle by the decrementer 2702, and the signal VALID (2704) is asserted when the value reaches 00. When negated, VALID directs the insertion of an invalid cycle; when asserted, it directs the execution of an instruction.
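
The behaviour of the N-field register, decrementer, and comparator can be sketched as a generator that reports VALID cycle by cycle. This is an illustrative model only: the function name `valid_signal` is hypothetical, and the exact cycle in which the register is loaded relative to the first decrement is an assumption.

```python
def valid_signal(n_field: int):
    """Yield the VALID level for each cycle after an instruction whose
    N field is n_field, stopping at the first cycle in which VALID is asserted.
    """
    count = n_field          # value latched in register 2701
    while count != 0:
        yield False          # VALID negated: an invalid cycle is inserted
        count -= 1           # decrementer 2702 subtracts 1 each cycle
    yield True               # count is 00: comparator 2703 asserts VALID


print(list(valid_signal(2)))  # [False, False, True]
print(list(valid_signal(0)))  # [True]
```

Under this model an N field of "10" produces two invalid cycles and an N field of "01" produces one, matching the way those values are used for the program of FIG. 12.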

【００９９】制御回路２７００は、競合検出回路出力１
１６(ＢＵＢ)と、命令アドレスの下位から２ビット目１
３０９−１(ＣＡ３０)と、オペコード中の４語長命令か
どうかを示すビット１３１０−１−１(Ｃ０００)と信号
２７０４（ＶＡＬＩＤ）を見て５つのモードの内、どれ
であるかを判定し、図２３に示すように第１〜４命令レ
ジスタへのオペコードのセット，プログラムカウンタの
インクリメントを信号１１０により行う。また、現サイ
クルがどのモードであるかを示す信号１１０は、モード
レジスタ４１００にラッチされ、その出力信号１３０は
レジスタ読み出し制御回路１０５，レジスタ書き込み制
御回路１０６，ファンクション制御回路１０７，パイプ
ライン制御回路１０８，競合検出回路１０９に送出され
る。図２３で、Ｃ０〜３は、４語境界内にある４語で、
若いアドレスよりＣ０，Ｃ１，Ｃ２，Ｃ３と命令が並ん
でいるものとする。Ｃ０の最左ビットＣ０００により、
図３に示すように、１語長命令か４語長命令かが判定で
きる。１語命令モード１は、４語境界内の左の２つの命
令(Ｃ０，Ｃ１)を実行するモードで、第１命令レジスタ
にＣ０が、第２命令レジスタにＣ１がセットされプログ
ラムカウンタＰＣは＋２される。また、１語命令モード
２は、４語長境界内の右の２つの命令(Ｃ２，Ｃ３)を実
行するモードで、第１命令レジスタにＣ２，第２命令レ
ジスタにＣ３がセットされ、プログラムカウンタＰＣは
＋２される。即ち、１語長命令実行時には、第１命令レ
ジスタと、第２命令レジスタのみ用い、第３命令レジス
タと第４命令レジスタは用いない。４語長命令モード
は、４語長命令（Ｃ０，Ｃ１，Ｃ２，Ｃ３）実行するモ
ードで、第１〜４命令レジスタにＣ０〜Ｃ３がセットさ
れ、プログラムカンタＰＣは＋４される。競合モードと
は、競合検出回路１０９が競合を検出した場合で、第１
〜４命令レジスタ及び、モードレジスタ４１００は前サ
イクルの値を保持する。また、プログラムカウンタＰＣ
の更新は行わない。無効命令モードは、現サイクル以前
に実行した４語長命令のＮフィールドで、現サイクルに
ハードウェアで無効命令(ＮＯＰ)を挿入することを指示
されている場合で、命令レジスタには無効命令がセット
され、プログラムカウンタＰＣは更新されない。これに
より、無効サイクルが１サイクル挿入されることにな
る。The control circuit 2700 looks at the conflict detection circuit output 116 (BUB), the second bit from the low-order end of the instruction address 1309-1 (CA30), the bit 1310-1-1 (C000) in the operation code that indicates whether the instruction is a four-word instruction, and the signal 2704 (VALID), and determines which of five modes applies. As shown in FIG. 23, it then uses the signal 110 to set operation codes in the first to fourth instruction registers and to increment the program counter. The signal 110, which indicates the mode of the current cycle, is latched in the mode register 4100, and the register output signal 130 is sent to the register read control circuit 105, the register write control circuit 106, the function control circuit 107, the pipeline control circuit 108, and the conflict detection circuit 109. In FIG. 23, C0 to C3 are the four words within a four-word boundary, arranged as C0, C1, C2, C3 from the lowest address. The leftmost bit C000 of C0 determines, as shown in FIG. 3, whether the instruction is a one-word instruction or a four-word instruction. One-word instruction mode 1 executes the left two instructions (C0, C1) within the four-word boundary: C0 is set in the first instruction register, C1 is set in the second instruction register, and the program counter PC is incremented by 2. One-word instruction mode 2 executes the right two instructions (C2, C3) within the four-word boundary: C2 is set in the first instruction register, C3 is set in the second instruction register, and the program counter PC is incremented by 2. That is, when one-word instructions are executed, only the first and second instruction registers are used; the third and fourth instruction registers are not. The four-word instruction mode executes a four-word instruction (C0, C1, C2, C3): C0 to C3 are set in the first to fourth instruction registers, and the program counter PC is incremented by 4. The conflict mode applies when the conflict detection circuit 109 detects a conflict: the first to fourth instruction registers and the mode register 4100 hold their values from the previous cycle, and the program counter PC is not updated. The invalid instruction mode applies when the N field of a four-word instruction executed before the current cycle specifies that the hardware insert an invalid instruction (NOP) in the current cycle: an invalid instruction is set in the instruction registers and the program counter PC is not updated. As a result, one invalid cycle is inserted.
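The five modes and their effect on the instruction registers and program counter can be sketched in Python as follows. This is an illustrative model only. The names `Mode`, `decide_mode`, `step`, and `NOP` are hypothetical. The priority order (conflict, then invalid cycle, then instruction length) and the polarity of C000 (true meaning a four-word instruction) are assumptions, since the paragraph does not state them, and the program counter is assumed to count in words.

```python
from enum import Enum

NOP = "NOP"  # placeholder for the hardware-inserted invalid instruction


class Mode(Enum):
    ONE_WORD_1 = 1   # execute C0 and C1
    ONE_WORD_2 = 2   # execute C2 and C3
    FOUR_WORD = 3    # execute C0..C3 as one long instruction
    CONFLICT = 4     # hold registers and PC
    INVALID = 5      # insert one invalid (NOP) cycle


def decide_mode(bub: bool, ca30: bool, c000: bool, valid: bool) -> Mode:
    """Pick the mode from BUB, CA30, C000 and VALID (priority order assumed)."""
    if bub:
        return Mode.CONFLICT
    if not valid:
        return Mode.INVALID
    if c000:                      # assumed polarity: set means four-word instruction
        return Mode.FOUR_WORD
    return Mode.ONE_WORD_2 if ca30 else Mode.ONE_WORD_1


def step(mode: Mode, words, prev_regs, pc: int):
    """Return (instruction registers 1..4, next PC) for one cycle."""
    c0, c1, c2, c3 = words
    if mode is Mode.ONE_WORD_1:
        return (c0, c1, None, None), pc + 2   # registers 3 and 4 unused
    if mode is Mode.ONE_WORD_2:
        return (c2, c3, None, None), pc + 2
    if mode is Mode.FOUR_WORD:
        return (c0, c1, c2, c3), pc + 4
    if mode is Mode.CONFLICT:
        return prev_regs, pc                  # hold previous values
    return (NOP, NOP, NOP, NOP), pc           # invalid instruction mode
```

For example, `step(Mode.ONE_WORD_2, ("a", "b", "c", "d"), None, 8)` loads `c` and `d` into the first two registers and advances the counter by two words.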

【０１００】１語長命令実行時にはＣ０、又は、Ｃ２の
実行の為に、第１ALU1401(図１４)，第１乗算器１５０
３(図１５)，第１加算器１５０５(図１５)を用いる。一
方、Ｃ１、又は、Ｃ３の実行の為に、第２ALU1402(図１
４)，第２乗算器１５０４(図１５)，第２加算器１５０
６(図１５)を用いる。また、４語長命令実行時には、ロ
ードストア演算のアドレス計算を第１ALU1401（図１４)
で、整数演算を第２ALU1402（図１４)で、第１浮動小数
点演算を第１乗算器１５０３(図１５)で、第２浮動小数
点演算を第１加算器１５０５(図１５)で、第３浮動小数
点演算を第２乗算器１５０４(図１５)で、第４浮動小数
点演算を第２加算器１５０６(図１５)で行う。When a one-word instruction is executed, the first ALU 1401 (FIG. 14), the first multiplier 1503 (FIG. 15), and the first adder 1505 (FIG. 15) are used to execute C0 or C2, while the second ALU 1402 (FIG. 14), the second multiplier 1504 (FIG. 15), and the second adder 1506 (FIG. 15) are used to execute C1 or C3. When a four-word instruction is executed, the address calculation of the load/store operation is performed by the first ALU 1401 (FIG. 14), the integer operation by the second ALU 1402 (FIG. 14), the first floating-point operation by the first multiplier 1503 (FIG. 15), the second floating-point operation by the first adder 1505 (FIG. 15), the third floating-point operation by the second multiplier 1504 (FIG. 15), and the fourth floating-point operation by the second adder 1506 (FIG. 15).
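The arithmetic unit assignment rule above can be written down as a small lookup table. The Python sketch below is only a restatement of the paragraph; the constant and function names are hypothetical, and the units are identified by their reference numerals.

```python
ALU1, ALU2 = 1401, 1402   # integer ALUs (FIG. 14)
MUL1, MUL2 = 1503, 1504   # floating-point multipliers (FIG. 15)
ADD1, ADD2 = 1505, 1506   # floating-point adders (FIG. 15)


def one_word_units(slot: int) -> set[int]:
    """Units available to a one-word instruction in slot 0..3 (C0..C3)."""
    if slot not in range(4):
        raise ValueError("slot must be 0, 1, 2 or 3")
    # C0 and C2 use the first set of units, C1 and C3 the second set.
    return {ALU1, MUL1, ADD1} if slot % 2 == 0 else {ALU2, MUL2, ADD2}


# Unit used by each operation of a four-word instruction.
FOUR_WORD_UNITS = {
    "load/store address calculation": ALU1,
    "integer operation": ALU2,
    "floating-point operation 1": MUL1,
    "floating-point operation 2": ADD1,
    "floating-point operation 3": MUL2,
    "floating-point operation 4": ADD2,
}
```

The table makes one design point visible: a four-word instruction drives the same six units that a pair of one-word instructions shares, so no additional arithmetic hardware is needed for the long format.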

【０１０１】図４１のレジスタ読み出し制御回路１０
５，レジスタ書き込み回路１０６，ファンクション制御
回路１０７は、モード制御回路出力のモード指定信号１
１０と第１〜４命令レジスタの値により、上述の演算器
割当て規則に従い、整数演算ユニット１３０４(図１３)
の制御信号１３１８を生成する。レジスタ読み出し制御
回路について、更に詳細に説明したのが図２４である。
６つの演算器のそれぞれの２つの入力に入れるレジスタ
の指定を、オペコードのどのフィールドで行うかを示し
ている。フィールドの略号については、図３の複合命令
の欄に示す。４語長命令のＪ１とＡ１は、１語長命令の
Ｓ２と、４語長命令のＪ２とＡ２は、１語長命令のＳ２
と同じ位置であることを利用し、図２４上のフィールド
指定には、１語長命令時も、Ｊ１，Ｊ２，Ａ１，Ａ２を
用いて述べている。これは、Ｃ０とＣ１の区別の為であ
る。The register read control circuit 105, the register write circuit 106, and the function control circuit 107 of FIG. 41 generate the control signal 1318 for the integer operation unit 1304 (FIG. 13) from the mode designation signal 110 output by the mode control circuit and the values of the first to fourth instruction registers, following the arithmetic unit assignment rule described above. FIG. 24 describes the register read control circuit in more detail. It shows which field of the operation code specifies the register supplied to each of the two inputs of each of the six arithmetic units. The field abbreviations are given in the compound instruction column of FIG. 3. Because J1 and A1 of a four-word instruction occupy the same position as S2 of a one-word instruction, and J2 and A2 of a four-word instruction occupy the same position as S2 of a one-word instruction, the field designations in FIG. 24 use J1, J2, A1, and A2 for one-word instructions as well. This is to distinguish C0 from C1.

【０１０２】次に図４１の競合検出回路１０９について
述べる。図７〜図９を用いて説明したように、本実施例
では４語長命令間の競合を検出する必要はない。また、
本実施例では、１語長命令実行用の全ての演算器を２重
化している為、同時に実行する２つの１語長命令間の演
算器による競合はあり得ない。簡単のためにレジスタ競
合も無いものとする。本実施例をレジスタ競合がある場
合に拡張することは、例えば、特願昭63−283673号のよ
うに容易である。前に述べたように、１語長命令と次の
有効な４語長命令の間に、必ず無効な４語長命令が２つ
入り、同様に、有効な４語長命令と次の１語長命令との
間には必ず無効な４語長命令が２つ入っているので、４
語長命令と１語長命令間の競合も有り得ない。故に、競
合検出回路１０９は、現サイクルの１語長命令と、それ
以前に実行された１語長命令間の競合のみを検出すれば
よい。モード制御回路１００の制御により、１語長命令
は、第１命令レジスタ１０１と第２命令レジスタ１０２
にのみセットされるので、競合検出回路１０９は、第１
命令レジスタ１０１と第２命令レジスタ１０２のみを見
ればよく、第３命令レジスタ１０３，第４命令レジスタ
１０４は見る必要はない。Next, the conflict detection circuit 109 of FIG. 41 will be described. As explained with reference to FIGS. 7 to 9, this embodiment does not need to detect conflicts between four-word instructions. In addition, because all arithmetic units for executing one-word instructions are duplicated in this embodiment, two one-word instructions executed at the same time cannot conflict over an arithmetic unit. For simplicity, it is assumed that they have no register conflict either. Extending this embodiment to the case where register conflicts exist is easy, for example as in Japanese Patent Application No. 63-283673. As stated earlier, two invalid four-word instructions always lie between a one-word instruction and the next valid four-word instruction, and likewise two invalid four-word instructions always lie between a valid four-word instruction and the next one-word instruction, so a conflict between a four-word instruction and a one-word instruction cannot occur. Therefore, the conflict detection circuit 109 only needs to detect conflicts between the one-word instructions of the current cycle and one-word instructions executed earlier. Under the control of the mode control circuit 100, one-word instructions are set only in the first instruction register 101 and the second instruction register 102, so the conflict detection circuit 109 needs to look only at the first instruction register 101 and the second instruction register 102, and need not look at the third instruction register 103 or the fourth instruction register 104.

【０１０３】図２５は、競合検出回路１０９の実施例の
ブロック図である。２５０１〜2504はレジスタ、２５０
５はマスク回路、２５０６〜２５２１はコンパレータで
ある。図７において、命令７，８を現命令とし、命令３
〜６との競合検出を考える。命令１，２のＳステージの
次に命令７，８のＥステージが来るので、命令１，２と
命令７，８の間の競合はない。図２５のレジスタ２５０
１には、命令５が書き込むレジスタの番号が、レジスタ
２５０３には、命令６が書き込むレジスタの番号が、レ
ジスタ２５０２には、命令３が書き込むレジスタの番号
が、レジスタ２５０４には命令４が書き込むレジスタの
番号が記憶されている。上記４つのレジスタと、命令７
及び命令８が読み出す４つのレジスタの番号を、２５０
６〜２５２１の１６個のコンパレータで比較し、結果を
マスク回路２５０５に送出する。マスク回路２５０５
は、モード制御回路１００の出力１３０や、パイプライ
ン制御回路１０８の出力１１５を見て、コンパレータの
ヒット信号が有効であるかどうかを判定し、有効であれ
ば競合を示す信号１１６をアサートする。即、コンパレ
ータの出力がレジスタの一致を示していてもその命令が
無効化される場合は、信号１１６をネゲートする。この
マスク回路２５０５により、モード信号１３０が４語長
モードであることを示している時には、信号１１６をネ
ゲートする。FIG. 25 is a block diagram of an embodiment of the conflict detection circuit 109. Reference numerals 2501 to 2504 denote registers, 2505 a mask circuit, and 2506 to 2521 comparators. In FIG. 7, take instructions 7 and 8 as the current instructions and consider detecting conflicts with instructions 3 to 6. Because the E stage of instructions 7 and 8 comes right after the S stage of instructions 1 and 2, there is no conflict between instructions 1, 2 and instructions 7, 8. In FIG. 25, the register 2501 stores the number of the register written by instruction 5, the register 2503 the number of the register written by instruction 6, the register 2502 the number of the register written by instruction 3, and the register 2504 the number of the register written by instruction 4. The contents of these four registers are compared with the numbers of the four registers read by instructions 7 and 8 using the 16 comparators 2506 to 2521, and the results are sent to the mask circuit 2505. The mask circuit 2505 looks at the output 130 of the mode control circuit 100 and the output 115 of the pipeline control circuit 108, decides whether each comparator hit signal is valid, and asserts the signal 116 indicating a conflict if it is. That is, even if a comparator output indicates a register match, the signal 116 is negated when the instruction concerned is invalidated. The mask circuit 2505 also negates the signal 116 when the mode signal 130 indicates the four-word mode.
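A behavioural sketch of this conflict check in Python is given below. It is not the circuit of FIG. 25: the function name `conflict_signal` and the per-writer `write_invalidated` flags are hypothetical stand-ins for the information the mask circuit derives from signals 130 and 115.

```python
def conflict_signal(pending_writes, current_reads, four_word_mode,
                    write_invalidated=(False, False, False, False)):
    """Model of signal 116: True when a valid register match exists.

    pending_writes: 4 register numbers written by earlier one-word
                    instructions (contents of registers 2501..2504).
    current_reads:  4 register numbers read by the two current instructions.
    """
    if len(pending_writes) != 4 or len(current_reads) != 4:
        raise ValueError("expected four write and four read register numbers")
    # 4 x 4 = 16 comparisons, one per comparator 2506..2521.
    hits = [[w == r for r in current_reads] for w in pending_writes]
    if four_word_mode:
        return False          # mask circuit negates 116 in four-word mode
    # Ignore hits that belong to an invalidated earlier instruction.
    return any(hit
               for row, dead in zip(hits, write_invalidated) if not dead
               for hit in row)
```

For instance, `conflict_signal([5, 6, 7, 8], [1, 6, 3, 4], False)` reports a conflict on register 6, while the same inputs in four-word mode do not.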

【０１０４】次にパイプライン制御回路１０８について
説明する。パイプライン制御回路は、モード信号１３
０，図１３の整数演算ユニット１３０４からのフラグ信
号1317，図１３の浮動小数点演算ユニット１３０７から
のフラグ信号１３１５，図１３のデータキャッシュコン
トローラとのインタフェース１３１６，図１３の命令キ
ャッシュコントローラとのインタフェース１３１１を用
いて、図１３の分岐ユニット１３０２制御信号１３１２
を送出し、分岐ユニットを制御する。即ち、有効な分岐
命令が来た時には、分岐を行い、それ以外の時には、モ
ード信号１１０を用いて、図２３のように分岐ユニット
内にあるプログラムカウンタを制御する。またパイプラ
イン制御回路１０８は、信号１１５を、レジスタ読み出
し制御回路１０５，パイプライン書き込み制御回路１０
６，ファンクション制御回路１０７，競合検出回路１０
９を送出し、パイプラインの状態を制御する。即ち、命
令キャッシュ、あるいはデータキャッシュのアクセスに
際してミスを生じた時に図２９に示すようにパイプライ
ンをロックする。Next, the pipeline control circuit 108 will be described. The pipeline control circuit uses the mode signal 130, the flag signal 1317 from the integer operation unit 1304 of FIG. 13, the flag signal 1315 from the floating-point operation unit 1307 of FIG. 13, the interface 1316 with the data cache controller of FIG. 13, and the interface 1311 with the instruction cache controller of FIG. 13 to send out the control signal 1312 for the branch unit 1302 of FIG. 13 and thereby control the branch unit. That is, when a valid branch instruction arrives it performs the branch; otherwise it uses the mode signal 110 to control the program counter in the branch unit as shown in FIG. 23. The pipeline control circuit 108 also sends the signal 115 to the register read control circuit 105, the pipeline write control circuit 106, the function control circuit 107, and the conflict detection circuit 109 to control the state of the pipeline. That is, when a miss occurs on an access to the instruction cache or the data cache, it locks the pipeline as shown in FIG. 29.

【０１０５】次に上記実施例の第１の変形例について述
べる。上記実施例において、４語長命令は演算結果を３
つ後の命令に初めて反映することにより、４語長命令間
の競合検出部が不要になった。同様の効果を達成するた
めに、４語長命令でも演算結果を次の命令に反映する
が、４語長命令間での競合をコンパイラで避けるように
することもできる。具体的には、ある４語長命令が書き
込むレジスタを次の４語長命令と次の次の４語長命令と
で読まないようにするのである。このようにすることに
よって、第１の実施例のレジスタ数を実質的に増やす効
果は失われるが、図１９で示したシャドウレジスタは不
要になる。Next, a first modification of the above embodiment will be described. In the above embodiment, the result of a four-word instruction is first reflected in the instruction three instructions later, which made a conflict detection unit between four-word instructions unnecessary. To achieve the same effect, a four-word instruction may instead reflect its result in the very next instruction, with the compiler avoiding conflicts between four-word instructions. Specifically, a register written by a given four-word instruction must not be read by the next four-word instruction or by the one after that. This loses the first embodiment's effect of substantially increasing the number of registers, but makes the shadow register shown in FIG. 19 unnecessary.
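The compiler-side rule of the first modification (a register written by one four-word instruction is not read by either of the next two) can be checked with a few lines of Python. This is a hypothetical checker for illustration, not the compiler flow of FIG. 26; each instruction is represented simply as a pair of register-number sets.

```python
def breaks_rule(program) -> bool:
    """Return True if some four-word instruction reads a register written
    by one of the two four-word instructions immediately before it.

    program: list of (writes, reads) pairs, each a set of register numbers.
    """
    for i, (writes, _reads) in enumerate(program):
        for j in (i + 1, i + 2):              # the next two instructions
            if j < len(program) and writes & program[j][1]:
                return True
    return False


ok = [({1}, {2}), ({3}, {4}), ({5}, {6}), ({7}, {1})]   # r1 read three later
bad = [({1}, {2}), ({3}, {4}), ({5}, {1})]              # r1 read two later
print(breaks_rule(ok), breaks_rule(bad))
```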

【０１０６】さらに、上記実施例の第２の変形例につい
て述べる。図３で示した実施例では、命令の中に１語長
命令か４語長命令かを示すビットを持っていたが、計算
機の中に１語長命令か４語超命令かを示すフラグを持
ち、このフラグを命令で制御することも可能である。こ
れによって、フラグを制御する命令が必要になるが、１
度フラグを切り換えれば毎回命令の中で語長を示す必要
が無くなるという利点がある。Further, a second modification of the above embodiment will be described. In the embodiment shown in FIG. 3, each instruction contains a bit indicating whether it is a one-word instruction or a four-word instruction. Alternatively, the computer can hold a flag indicating whether instructions are one-word or four-word instructions, and this flag can be controlled by an instruction. This requires an instruction to control the flag, but has the advantage that, once the flag has been switched, the word length need not be indicated in every instruction.

【０１０７】図３０〜図３２を用いて、上記実施例の第
３の変形例について述べる。A third modification of the above embodiment will be described with reference to FIGS.

【０１０８】本変形例では、図３０の様に浮動小数点レ
ジスタを３２本から１２８本に拡張している。図３１に
命令フォーマット，図３２に命令の説明を示す。ＦＲ０
〜３１は基本命令，複合命令の両者で使用可であるが、
ＦＲ３２〜ＦＲ１２７は、複合命令でのみ使用するレジ
スタである。レジスタを指定するＩ１，ＩＴ，ＭＴ，Ａ
１，ＡＴ，Ｎ１，ＮＴ，Ｂ１，ＢＴの各フィールドは図
３１に示すように、各７ビットと増加する。本変形例で
は、ソースレジスタの片方をターゲットレジスタと一致
させることにより、全体を４語に収めているが、そうし
たくなければ、全体の語長をさらに長くすることも可能
である。In this modification, the floating-point registers are expanded from 32 to 128 as shown in FIG. 30. FIG. 31 shows the instruction format and FIG. 32 describes the instructions. FR0 to FR31 can be used by both basic instructions and compound instructions, whereas FR32 to FR127 are registers used only by compound instructions. The register-specifying fields I1, IT, MT, A1, AT, N1, NT, B1, and BT each grow to 7 bits, as shown in FIG. 31. In this modification the whole instruction is kept within four words by making one of the source registers coincide with the target register; if that is not desired, the overall word length can be made longer.

【０１０９】本変形例では、基本命令に複合命令を追加
し、複合命令で基本命令が使えるレジスタの数よりも、
多くのレジスタを使えるようにすることにより、全体と
して、使用可能なレジスタ数を増やせるという効果があ
る。本変形例ではＦＲ０〜３１は基本命令と複合命令の
両方でアクセス可であるが代案として、基本命令用の３
２本と、複合命令用の１２８本のレジスタを独立のもの
とすることも可能である。In this modification, compound instructions are added to the basic instructions, and the compound instructions can use more registers than the basic instructions can, which has the effect of increasing the total number of usable registers. In this modification FR0 to FR31 are accessible from both basic instructions and compound instructions; as an alternative, the 32 registers for basic instructions and the 128 registers for compound instructions can be made independent of each other.

【０１１０】さらに本変形例では、図３２に示すよう
に、メモリからキャッシュへのプリフェッチの際にプリ
フェッチする語数をＪＴフィールドで指定可となってい
る。これにより一度に複数のブロック転送を行うよう命
令で指示でき、効率があがるという利点がある。Further, in this modification, as shown in FIG. 32, the number of words to be prefetched at the time of prefetching from the memory to the cache can be specified in the JT field. Thus, there is an advantage that efficiency can be increased by instructing to perform a plurality of block transfers at once by an instruction.

【０１１１】図３３，図３４を用いて上記実施例の第４
の変形例について述べる。本変形例では、データのプリ
フェッチをＪ１，Ｊ２，ＪＴといった整数演算フィール
ドで行わずに、１ビットのＰフィールドで行っている点
が異なる。Ｐ＝１の時には、ロード・ストア演算に用い
られたアドレスを含むブロックの次のブロックがプリフ
ェッチされる。こうすることにより、プリフェッチに要
するフィールドが節約でき、ロード・ストア演算，整数
演算，プリフェッチの３動作が並列指定可となるという
利点が生まれる。A fourth modification of the above embodiment will be described with reference to FIGS. 33 and 34. This modification differs in that data prefetch is performed not with integer operation fields such as J1, J2, and JT but with a 1-bit P field. When P = 1, the block following the block that contains the address used for the load/store operation is prefetched. This saves the fields needed for prefetch and yields the advantage that three operations, namely a load/store operation, an integer operation, and a prefetch, can be specified in parallel.
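The address selection implied by the P field can be sketched as follows. The function name `prefetch_address` and the 32-byte default block size are assumptions made for illustration; the patent text here does not give the cache block size.

```python
def prefetch_address(load_store_address: int, p_bit: int, block_size: int = 32):
    """Return the base address of the block to prefetch, or None if P = 0.

    block_size is a hypothetical value; the source does not specify it.
    """
    if not p_bit:
        return None
    block_base = load_store_address - (load_store_address % block_size)
    return block_base + block_size            # the block after the accessed one


print(hex(prefetch_address(0x1004, 1)))
```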

【０１１２】図３５はもう１つの実施例について説明す
る全体ブロック図である。FIG. 35 is an overall block diagram for explaining another embodiment.

【０１１３】３５００はプログラムカウンタ、３５０１
は命令を格納するメモリ手段、3502はマスク・スイッチ
回路、３５０３〜３５０６はＭ個のｎバイト長の命令レ
ジスタ、３５０７はデコーダ、３５０８，３５０９はＬ
個（Ｌ＞１）の演算ユニット、１５０は命令長判定手
段、１０９は競合検出回路、１００はモード制御回路、
４１００はモードレジスタである。Reference numeral 3500 denotes a program counter, 3501 memory means for storing instructions, 3502 a mask switch circuit, 3503 to 3506 M instruction registers each n bytes long, 3507 a decoder, 3508 and 3509 L (L > 1) arithmetic units, 150 instruction length determining means, 109 a conflict detection circuit, 100 a mode control circuit, and 4100 a mode register.

【０１１４】プログラムカウンタ３５００は命令アドレ
ス３５１３を、命令を格納するメモリ手段３５０１に送
出する。３５０１の中には、ｎバイト長の命令とｎ×Ｍ
バイト長（Ｍ＞１）の命令が混在しており、命令アドレ
ス３５１３で指定された命令を含む複数の命令をマスク
・スイッチ回路３５０２に送出する。マスク・スイッチ
回路３５０２はｎバイト命令であればＭ個の命令レジス
タの内Ｎ個（１＜Ｎ＜Ｍ）の命令レジスタ３５０３〜３
５０４の内の少なくとも１つにセットし、ｎ×Ｍバイト
令命であれば命令レジスタ３５０３〜３５０６にセット
する。デコーダ３５０７は命令レジスタ３５０３〜３５
０６よりの命令３５１９〜３５２２をデコードし、Ｌ個
の演算ユニットを制御信号３５２３，３５２４を用いて
制御する。命令長判定手段１５０は、少なくとも命令３
５１４の一部を見て、命令長を示す信号３５２６をモー
ド制御回路１００に送出する。競合検出回路１０９は、
命令レジスタ３５０３〜３５０４を見て、ｎバイト命令
間の競合の有無を知らせる信号１１６をモード制御回路
１００に送出する。モード制御回路は、命令長，競合の
有無，プログラムカウンタの値によりモードを判定し、
制御信号１１０により、プログラムカウンタ，マスク・
スイッチ回路，デコーダを制御する。The program counter 3500 sends the instruction address 3513 to the memory means 3501 that stores instructions. The memory means 3501 holds a mixture of n-byte instructions and n × M-byte instructions (M > 1), and sends a plurality of instructions, including the one specified by the instruction address 3513, to the mask switch circuit 3502. For an n-byte instruction, the mask switch circuit 3502 sets it in at least one of the N (1 < N < M) instruction registers 3503 to 3504 among the M instruction registers; for an n × M-byte instruction, it sets it in the instruction registers 3503 to 3506. The decoder 3507 decodes the instructions 3519 to 3522 from the instruction registers 3503 to 3506 and controls the L arithmetic units using the control signals 3523 and 3524. The instruction length determining means 150 looks at at least part of the instruction 3514 and sends the signal 3526 indicating the instruction length to the mode control circuit 100. The conflict detection circuit 109 looks at the instruction registers 3503 to 3504 and sends to the mode control circuit 100 the signal 116 indicating whether there is a conflict between n-byte instructions. The mode control circuit determines the mode from the instruction length, the presence or absence of a conflict, and the value of the program counter, and controls the program counter, the mask switch circuit, and the decoder with the control signal 110.

【０１１５】本実施例と、図１〜図２９及び図４１の実
施例との対応を説明する。図１〜図２９の実施例は、ｎ
＝４，Ｍ＝４，Ｎ＝２，Ｌ＝２の場合である。また演算
ユニットは整数演算ユニットと浮動小数点演算ユニット
であった。また、図３５のマスク・スイッチ回路は、図
４１の第１〜４命令レジスタ１０１〜１０４にセットす
る命令を生成しているセレクタや、ＮＯＰによるマスク
等に対応する。また図３５の競合検出回路１０９は、図
４１の競合検出回路１０９に対応する。図３５の命令長
判定手段１５０は図４１の命令長判定手段１５０に対応
する。図３５のモード制御回路１００は、図４１のモー
ド制御回路１００に対応する。The correspondence between this embodiment and the embodiment of FIGS. 1 to 29 and FIG. 41 is as follows. The embodiment of FIGS. 1 to 29 is the case n = 4, M = 4, N = 2, L = 2, and its arithmetic units were an integer operation unit and a floating-point operation unit. The mask switch circuit of FIG. 35 corresponds to the selectors that generate the instructions set in the first to fourth instruction registers 101 to 104 of FIG. 41, to the masking by NOP, and so on. The conflict detection circuit 109 of FIG. 35 corresponds to the conflict detection circuit 109 of FIG. 41. The instruction length determining means 150 of FIG. 35 corresponds to the instruction length determining means 150 of FIG. 41. The mode control circuit 100 of FIG. 35 corresponds to the mode control circuit 100 of FIG. 41.
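The parameters n, M, N, and L of the generalized embodiment, and the constraints stated for them (M > 1, 1 < N < M, L > 1), can be captured in a small Python record. The class name `MachineShape` and its property names are hypothetical; the values for the first embodiment are the ones given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineShape:
    n: int   # length of a short instruction in bytes
    M: int   # number of instruction registers
    N: int   # registers used by short instructions
    L: int   # number of arithmetic units

    def __post_init__(self):
        # Constraints as stated for the embodiment of FIG. 35.
        if not (self.M > 1 and 1 < self.N < self.M and self.L > 1):
            raise ValueError("requires M > 1, 1 < N < M and L > 1")

    @property
    def short_bytes(self) -> int:
        return self.n

    @property
    def long_bytes(self) -> int:
        return self.n * self.M


# The embodiment of FIGS. 1 to 29: 4-byte and 16-byte instructions.
FIRST_EMBODIMENT = MachineShape(n=4, M=4, N=2, L=2)
```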

【０１１６】図３，図５，図１０，図１１，図１２で示
した実施例では、図１０の様にメモリ上でのデータ配置
が制約されるという欠点があった。これを解決したのが
図３６〜図４０に示す実施例である。本実施例の複合命
令では、図３６，図３７に示すように、１命令でロード
・ストア等のメモリ演算を２個実行できる。ハードウェ
アとしてはキャッシュを２ポート化するか、あるいは、
１マシンサイクルに２度アクセス可能な構成にすればよ
い。図３９，図４０に図１１，図１２に示したものと同
じ問題を解くプログラムを示す。The embodiment shown in FIGS. 3, 5, 10, 11, and 12 has the drawback that the data arrangement in memory is constrained as shown in FIG. 10. The embodiment shown in FIGS. 36 to 40 solves this. With the compound instruction of this embodiment, two memory operations such as load and store can be executed by one instruction, as shown in FIGS. 36 and 37. In hardware, the cache may be given two ports, or it may be configured so that it can be accessed twice in one machine cycle. FIGS. 39 and 40 show a program that solves the same problem as the one shown in FIGS. 11 and 12.

【０１１７】これまでに説明した実施例では、命令長の
短い命令と長い命令を混在した計算機において、長い命
令の演算結果を、以後の任意の命令から反映する。ある
いは、長い命令は、次命令以降に続く無効命令の数を任
意に指定できる。あるいは、長い命令は、メモリまたは
キャッシュメモリからレジスタへデータを転送する第１
のフィールドと、メモリからキャッシュメモリへデータ
を転送する第２のフィールドを設けるといった工夫を行
ったが、これらの工夫は、長い命令のみを有するＶＬＩ
Ｗ型計算機に対しても有効である。In the embodiments described so far, for a computer that mixes short instructions and long instructions, several devices were introduced: the operation result of a long instruction is reflected starting from an arbitrary subsequent instruction; a long instruction can arbitrarily specify the number of invalid instructions that follow it; and a long instruction is provided with a first field for transferring data from memory or cache memory to a register and a second field for transferring data from memory to cache memory. These devices are also effective for a VLIW computer that has only long instructions.

【０１１８】[0118]

【発明の効果】以上説明したように、本実施例では、４
語長命令を用いて、４語で７つの演算を指定できるが、
競合検出は、４×４＝１６個のコンパレータで行うこと
ができる。４語長命令間の競合検出をハードウェアで行
おうとすると、前サイクルの分岐演算を除く、６個の演
算の書き込みレジスタと前々サイクルの同じく６個の書
き込みレジスタと、現サイクルの１２個の読み出しレジ
スタとの間の競合検出が必要で、(６＋６)×１２＝１４
４個のコンパレータが必要となる。本実施例では、これ
に比して１６／１４４のハードウェアで済むという利点
がある。As described above, in this embodiment a four-word instruction can specify seven operations in four words, yet conflict detection can be done with 4 × 4 = 16 comparators. If conflict detection between four-word instructions were done in hardware, conflicts would have to be detected between the write registers of the six operations of the previous cycle (excluding the branch operation), the six write registers of the cycle before that, and the twelve read registers of the current cycle, which would require (6 + 6) × 12 = 144 comparators. Compared with this, the present embodiment has the advantage of needing only 16/144 of the hardware.
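The comparator counts quoted above follow from multiplying the number of pending write registers by the number of read registers of the current cycle. The short Python check below reproduces the arithmetic; the function name `comparator_count` is hypothetical.

```python
from fractions import Fraction


def comparator_count(write_regs: int, read_regs: int) -> int:
    """One comparator per (pending write register, current read register) pair."""
    return write_regs * read_regs


short_case = comparator_count(4, 4)        # one-word instructions only
long_case = comparator_count(6 + 6, 12)    # hardware check between four-word instructions
ratio = Fraction(short_case, long_case)    # 16/144 reduces to 1/9

print(short_case, long_case, ratio)
```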

【０１１９】本実施例では、４語長命令で指示される演
算数７に対して、１マシンサイクルで処理される命令長
の短い命令が２である。これにより、２演算分の競合検
出回路で最大７演算の並列処理が得られるという効果が
ある。In this embodiment, a four-word instruction specifies seven operations, whereas two short instructions are processed per machine cycle. Consequently, a conflict detection circuit sized for two operations suffices to obtain parallel processing of up to seven operations.

【０１２０】本発明によれば、並列に実行する演算の数
を増やし、性能を高めることができる。According to the present invention, the number of operations executed in parallel can be increased, and the performance can be improved.

【０１２１】本発明によれば、コードサイズを小さくす
ることができる。これによりコードキャッシュのヒット
率が高まり、性能を高めることができる。According to the present invention, the code size can be reduced. As a result, the hit ratio of the code cache is increased, and the performance can be improved.

【０１２２】本発明によれば、並列に実行する演算内の
ハードウェアによる競合検出を容易にすることができ
る。これにより、マシンサイクルを高めること、ハード
物量を減らし、コストを下げることができる。特に、長
い命令の中で指定する演算数が大の時、この効果は著し
い。According to the present invention, it is possible to easily detect a conflict in hardware among operations executed in parallel. As a result, the machine cycle can be increased, the amount of hardware can be reduced, and the cost can be reduced. This effect is particularly remarkable when the number of operations specified in a long instruction is large.

【０１２３】本発明によれば、現サイクル以前に実行し
た命令と、現サイクルに実行する命令との間のハードウ
ェアによる競合検出，待合わせを容易にすることができ
る。これにより、マシンサイクルを高めること、ハード
物量を減らし、コストを下げることができる。According to the present invention, hardware conflict detection and stalling between instructions executed before the current cycle and instructions executed in the current cycle can be simplified. This makes it possible to raise the machine cycle rate, reduce the amount of hardware, and lower cost.

【０１２４】本発明によれば、ソフトウェアが用いられ
るレジスタの数を実質的に多くし、ソフトウェア上の最
適化により演算の並列度をあげ、性能を高めることがで
きる。According to the present invention, the number of registers used by software can be substantially increased, and the degree of parallelism of operation can be increased by software optimization, thereby improving performance.

【０１２５】本発明によれば、従来アーキテクチャとの
上位互換性を保つことができる。According to the present invention, upward compatibility with the conventional architecture can be maintained.

【図面の簡単な説明】[Brief description of the drawings]

【図１】命令制御ユニットの全体図である。FIG. 1 is an overall view of an instruction control unit.

【図２】レジスタ構成を示す図である。FIG. 2 is a diagram showing a register configuration.

【図３】命令形式を説明する図である。FIG. 3 is a diagram illustrating an instruction format.

【図４】１語長命令の動作を説明する図である。FIG. 4 is a diagram illustrating the operation of a one-word length instruction.

【図５】４語長命令の動作を説明する図である。FIG. 5 is a diagram illustrating the operation of a four-word length instruction.

【図６】パイプラインステージを説明する図である。FIG. 6 is a diagram illustrating a pipeline stage.

【図７】競合無の時の１語長命令処理のパイプラインを
示す図である。FIG. 7 is a diagram showing a pipeline of one-word length instruction processing when there is no contention;

【図８】競合有の時の１語長命令処理のパイプラインを
示す図である。FIG. 8 is a diagram showing a pipeline of one-word-length instruction processing when there is contention;

【図９】４語長命令のパイプラインを示す図である。FIG. 9 is a diagram showing a pipeline of a 4-word length instruction.

【図１０】データのメモリ上での配置を示す図である。FIG. 10 is a diagram showing an arrangement of data on a memory.

【図１１】４語長命令を用いた時の演算の様子を説明す
る図である。FIG. 11 is a diagram illustrating a state of an operation when a 4-word length instruction is used.

【図１２】４語長命令を用いたプログラムを説明する図
である。FIG. 12 is a diagram illustrating a program using a 4-word length instruction.

【図１３】全体ブロック図である。FIG. 13 is an overall block diagram.

【図１４】整数演算ユニットのブロック図である。FIG. 14 is a block diagram of an integer operation unit.

【図１５】浮動小数点演算ユニットのブロック図であ
る。FIG. 15 is a block diagram of a floating-point operation unit.

【図１６】浮動小数点レジスタファイルのブロック図で
ある。FIG. 16 is a block diagram of a floating-point register file.

【図１７】浮動小数点レジスタのブロック図である。FIG. 17 is a block diagram of a floating-point register.

【図１８】浮動小数点レジスタ１ビット分の回路図であ
る。FIG. 18 is a circuit diagram of one bit of a floating-point register.

【図１９】浮動小数点レジスタのブロック図である。FIG. 19 is a block diagram of a floating-point register.

【図２０】シャドウレジスタの動作を説明する図であ
る。FIG. 20 is a diagram illustrating the operation of a shadow register.

【図２１】シャドウレジスタの動作を説明する図であ
る。FIG. 21 is a diagram illustrating the operation of a shadow register.

【図２２】シャドウレジスタの動作を説明する図であ
る。FIG. 22 is a diagram illustrating the operation of a shadow register.

【図２３】モード制御回路の動作を説明する図である。FIG. 23 is a diagram illustrating the operation of the mode control circuit.

【図２４】レジスタ読み出し制御回路の動作を説明する
図である。FIG. 24 is a diagram illustrating the operation of the register read control circuit.

【図２５】競合検出回路のブロック図である。FIG. 25 is a block diagram of a conflict detection circuit.

【図２６】コンパイラの処理フローである。FIG. 26 is a processing flow of a compiler.

【図２７】モード制御回路の詳細を示した図である。FIG. 27 is a diagram showing details of a mode control circuit.

【図２８】データキャッシュの詳細を示したものであ
る。FIG. 28 shows details of the data cache.

【図２９】ロードストア演算でキャッシュミスを生じた
ときのパイプラインを示す図である。FIG. 29 is a diagram showing a pipeline when a cache miss occurs in a load store operation.

【図３０】その他の実施例を示す図である。FIG. 30 is a diagram showing another embodiment.

【図３１】その他の実施例を示す図である。FIG. 31 is a diagram showing another embodiment.

【図３２】その他の実施例を示す図である。FIG. 32 is a diagram showing another embodiment.

【図３３】その他の実施例を示す図である。FIG. 33 is a diagram showing another embodiment.

【図３４】その他の実施例を示す図である。FIG. 34 is a diagram showing another embodiment.

【図３５】その他の実施例を示す図である。FIG. 35 is a diagram showing another embodiment.

【図３６】その他の実施例を示す図である。FIG. 36 is a view showing another embodiment.

【図３７】その他の実施例を示す図である。FIG. 37 is a diagram showing another embodiment.

【図３８】その他の実施例を示す図である。FIG. 38 is a diagram showing another embodiment.

【図３９】その他の実施例を示す図である。FIG. 39 is a view showing another embodiment.

【図４０】その他の実施例を示す図である。FIG. 40 is a diagram showing another embodiment.

【図４１】図１を詳細化した図である。FIG. 41 is a detailed view of FIG. 1;

【符号の説明】[Explanation of symbols]

１００…モード制御部、１０１…第１命令レジスタ、１
０２…第２命令レジスタ、１０３…第３命令レジスタ、
１０４…第４命令レジスタ、１０５…レジスタ読み出し
制御、１０６…レジスタ書き込み制御、１０７…ファン
クション制御、１０８…パイプライン制御、１０９…競
合検出部。100: mode control unit, 101: first instruction register, 102: second instruction register, 103: third instruction register, 104: fourth instruction register, 105: register read control, 106: register write control, 107: function control, 108: pipeline control, 109: conflict detection unit.

フロントページの続き (72)発明者山田弘道茨城県日立市久慈町4026番地株式会社日立製作所日立研究所内 (72)発明者前島英雄茨城県日立市久慈町4026番地株式会社日立製作所日立研究所内 (56)参考文献特開平２−130634（ＪＰ，Ａ) 特開昭52−102648（ＪＰ，Ａ) 特開平３−129433（ＪＰ，Ａ) 特開平２−132524（ＪＰ，Ａ) 特開平２−130635（ＪＰ，Ａ) 弘中哲夫、外３名、“ストリームＦＩＦＯ方式に基づくベクトル・プロセッサ『順風』”、電子情報通信学会技術研究報告、社団法人電子情報通信学会、1989 年８月４日、Ｖｏｌ．89，Ｎｏ．167、ｐ．49−54 (58)調査した分野(Int.Cl.⁷，ＤＢ名) G06F 9/38
Continued from the front page. (72) Inventor: Hiromichi Yamada, Hitachi Research Laboratory, Hitachi, Ltd., 4026 Kuji-cho, Hitachi-shi, Ibaraki. (72) Inventor: Hideo Maejima, Hitachi Research Laboratory, Hitachi, Ltd., 4026 Kuji-cho, Hitachi-shi, Ibaraki. (56) References: JP-A-2-130634 (JP, A); JP-A-52-102648 (JP, A); JP-A-3-129433 (JP, A); JP-A-2-132524 (JP, A); JP-A-2-130635 (JP, A); Tetsuo Hironaka and three others, "Vector processor 『順風』 based on the stream FIFO method", IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, August 4, 1989, Vol. 89, No. 167, pp. 49-54. (58) Field surveyed (Int. Cl.7, DB name): G06F 9/38

Claims

(57)【特許請求の範囲】(57) [Claims]

【請求項１】入力された命令の命令長を検出し、１つの
命令で複数の演算を実行するための第１の命令か、１つ
の命令で１の演算を実行するための複数の第２の命令か
を判定する命令長判定手段と、上記第２の命令が入力さ
れた時、上記第２の命令と上記第２の命令の次に入力さ
れた第２の命令との間の競合の検出を行う競合検出手段
とを有し、入力された上記命令をデコードする命令制御
部と、上記命令制御部からのデコードされた上記第１の命令に
基づいて並列に複数の演算を実行し、または少なくとも
２つのデコードされた上記第２の命令に基づいて並列に
演算を実行し、または上記第２の命令間の競合が検出さ
れたときに１つのデコードされた上記第２の命令に基づ
いて演算を実行する演算部とを有する情報処理装置。1. An information processing apparatus comprising: an instruction control unit that decodes an input instruction, the instruction control unit having instruction length determining means for detecting the instruction length of the input instruction and determining whether it is a first instruction for executing a plurality of operations with one instruction or one of a plurality of second instructions each for executing one operation with one instruction, and conflict detection means for detecting, when the second instruction is input, a conflict between that second instruction and the second instruction input next after it; and an operation unit that executes a plurality of operations in parallel based on the decoded first instruction from the instruction control unit, or executes operations in parallel based on at least two decoded second instructions, or executes an operation based on one decoded second instruction when a conflict between the second instructions is detected.

【請求項２】請求項１記載の情報処理装置において、上記第１の命令によって並列に演算できる数は、上記第
２の命令によって並列に演算できる数より大きい情報処
理装置。2. The information processing apparatus according to claim 1, wherein the number of operations that can be executed in parallel by the first instruction is larger than the number of operations that can be executed in parallel by the second instructions.

3. An information processing apparatus comprising: a storage unit that stores instructions consisting of a first instruction, which executes a plurality of operations with one instruction, and a plurality of second instructions, each of which executes one operation with one instruction; an instruction control unit that decodes input instructions, the instruction control unit having instruction length determining means for detecting the instruction length of an instruction input from the storage unit and determining whether the instruction is the first instruction or a second instruction, and conflict detection means for detecting, when a second instruction is input, a conflict between that second instruction and the second instruction input next after it; and an operation unit that executes a plurality of operations in parallel based on the decoded first instruction sent from the instruction control unit, or executes operations in parallel based on at least two decoded second instructions, or, when a conflict between the second instructions is detected, executes an operation based on one decoded second instruction.

4. The information processing apparatus according to claim 3, wherein the storage unit is a cache memory.

5. The information processing apparatus according to claim 1 or 3, wherein the operation unit has a plurality of arithmetic units.

6. An information processing apparatus comprising: an instruction control unit that decodes an input short instruction or long instruction, the instruction control unit having instruction length determining means for determining, when a short instruction specifying a single operation or a long instruction specifying a plurality of operations is input, whether the input instruction is the short instruction or the long instruction, and conflict detection means for detecting a conflict between short instructions when the input instruction is determined to be a short instruction; and an operation unit that performs a plurality of operations in parallel based on the long instruction when the instruction control unit has decoded the long instruction, or performs operations in parallel based on at least two decoded short instructions when short instructions have been decoded, or performs an operation based on one decoded short instruction when the conflict detection means detects a conflict between a short instruction and the short instruction input next after it.

7. The information processing apparatus according to claim 6, wherein, among the operations executed by the operation unit, the long instruction is used when many operations can be executed in parallel, and the short instruction is used when few operations can be executed in parallel.

8. The information processing apparatus according to claim 6, wherein the criterion for deciding whether to use the long instruction or the short instruction depends on the number of operations that can be executed in parallel, and that number is specified to a compiler as a parameter.

9. The information processing apparatus according to claim 6, wherein the number of operations that can be specified in the long instruction is larger than the number of short instructions processed in one machine cycle.
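The selection rule of claims 7 and 8, together with the width relation of claim 9, can be sketched as follows. This is an illustration only: the function name `choose_format`, the constants `SHORT_ISSUE_WIDTH` and `LONG_OP_SLOTS`, their values, and the use of "greater than or equal to" as the comparison are assumptions; the claims state only that the choice depends on the number of operations executable in parallel and that this number is given to the compiler as a parameter.

```python
SHORT_ISSUE_WIDTH = 2   # assumed: short instructions processed per machine cycle
LONG_OP_SLOTS = 4       # assumed: operations one long instruction can specify

# Claim 9: a long instruction can specify more operations than the number of
# short instructions processed in one machine cycle.
assert LONG_OP_SLOTS > SHORT_ISSUE_WIDTH


def choose_format(parallel_ops: int, threshold: int) -> str:
    """Pick the instruction format for a code region.

    parallel_ops: number of operations in the region that can run in parallel.
    threshold:    the operation count specified to the compiler as a parameter
                  (hypothetical name; the comparison direction is assumed).
    """
    return "long" if parallel_ops >= threshold else "short"


def plan_regions(parallelism: list, threshold: int) -> list:
    """Apply the rule to each region's parallelism estimate."""
    return [choose_format(p, threshold) for p in parallelism]
```

For example, with a threshold of 3, regions whose parallelism estimates are 1, 2, 3 and 4 would be compiled as short, short, long and long respectively under these assumptions.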