JP2011503676A

JP2011503676A - Compound instructions in multithreaded processors

Info

Publication number: JP2011503676A
Application number: JP2010520626A
Authority: JP
Inventors: ピーターリーバック; モリスバーグラス
Original assignee: Imagination Technologies Limited
Priority date: 2007-08-14
Filing date: 2008-08-14
Publication date: 2011-01-27
Anticipated expiration: 2028-08-14
Also published as: GB0715824D0; US7904702B2; GB2451845B; JP5425074B2; GB2451845A; US20090063824A1; EP2179350B1; EP2179350A1; WO2009022142A1

Abstract

Provided are compound instructions for use in a multithreaded processor, and a processor that uses such instructions. The invention discloses a multithreaded processor for executing a plurality of threads in dependence on the availability of the resources each thread requires for its execution. The processor comprises: means for determining which thread should execute; means for switching between execution of threads in dependence on the result of that determination, each thread being coupled to a respective register means for storing the state of the thread and for use in executing instructions on the thread; a further register means, shared by all the threads, which an executing thread uses to improve execution performance; and means for preventing execution from switching to another thread while the internal register means is in use.
[Selected drawing] FIG. 2

Description

The present invention relates to compound instructions for use in a multithreaded processor, and to a processor that uses such instructions.

An example of a multithreaded processor is described in the applicant's U.S. Pat. No. 5,968,167. That patent discloses a processor that executes each of a plurality of threads in dependence on the availability of the resources each thread requires for its execution. Selection between threads for execution is performed by a media control core, or arbiter, which determines which thread should execute and switches between threads as appropriate.
Such a multithreaded processor has a separate set of registers storing the program state for each of the programs, or threads, in flight. When a resource required by one of the threads is not available, for example because the thread is waiting for a memory access, that thread is prevented from continuing and the processor switches to another thread for which all required resources are available and which can therefore continue execution. Arbitration between threads is organized so that, whenever possible, the processor executes useful instructions rather than idling, thereby optimizing use of the processor. While a thread is not executing, its set of registers holds its current state.

One critical factor in achieving optimal use of the processor is the time overhead required to swap execution between threads. If this overhead is similar to the latency the stalled thread is waiting out, such as that of a memory access, switching between threads gives no net gain in processor efficiency. It has therefore been recognized that rapid swapping between threads is required to optimize processor efficiency. Rapid thread swapping is aided by having a separate set of registers holding the stored program state of each thread.

As described above, the state of an executing thread is stored in a set of registers. To obtain maximum performance, these registers are usually read at least twice and written at least once within each clock cycle. This follows from the structure of machine code instructions. One example is the "ADD" instruction, which takes the contents of two source registers, adds them, and stores the result back in the register store. For this to execute in one clock cycle, the register storage needs two read ports and one write port: the two read ports supply the two items of data to be added, and the write port allows the result to be written back to a register. The problem is that as the number of ports on a register store increases, the silicon area required to build the store increases significantly and its operating speed falls as a result. The cost of the device also increases.

The multi-port register storage must also grow in depth with the number of threads that require rapid switching. For example, if the processor has 16 registers and four threads must be switched efficiently, 4 × 16 registers of storage are required, that is, one 16-register store per thread. The silicon area required for register storage is therefore a function of both the number of ports and the number of threads.
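The arithmetic in the example above can be written out as a short Python sketch. The function name is illustrative only and does not come from the specification; the sketch counts registers and says nothing about silicon area, which also depends on the number of ports.

```python
def main_register_count(registers_per_thread: int, threads: int) -> int:
    """Registers needed when every thread has its own copy of the register store."""
    return registers_per_thread * threads


# The example from the description: 16 registers, four rapidly switchable threads.
total = main_register_count(16, 4)
print(total)  # 64
```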

In one embodiment, a multithreaded processor is disclosed for executing a plurality of threads in dependence on the availability of the resources each thread requires for its execution. The processor comprises means for determining which thread should execute; means for switching between execution of threads in dependence on the result of that determination, each thread being coupled to a respective register means for storing the state of the thread and for use in executing instructions on the thread; and a further register means shared by all the threads. An executing thread uses the further register means to improve execution performance, and the processor further comprises means for preventing execution from switching to another thread while the internal register means is in use.

A preferred embodiment of the invention provides a small register store that is separate from the main register store of the multithreaded processor.
This is called the internal register store. It differs from the main register store in that the internal registers are not duplicated per thread: only one internal register store is provided, and it is shared by all the threads. The internal register store can be used by whichever thread is executing.

The internal registers in the internal register store are shared between all the threads, and the processor is prevented from switching to execution of another thread while the internal registers are in use. The internal registers provide additional registers that can be used during execution of instructions, which increases simultaneous access to data and so allows functionally richer instructions to be executed. If the same number of extra registers and read/write ports were added to the main register store instead, they would have to be duplicated for each thread, adding considerably to the silicon cost.

Preferably, an executing thread groups a small number of instructions into a compound instruction. Provided the compound instruction contains no instruction that would cause the thread to stall, preventing a switch away from the executing thread costs nothing in CPU efficiency.
The preferred embodiment thus provides a processor with more read/write access without the cost of adding more ports to the main register store. The use of compound instructions helps ensure that processor utilization remains optimized.

FIG. 1 is a simplified block diagram of a prior-art central processing unit.
FIG. 2 is a block diagram of a processor embodying the invention.
FIG. 3 shows an example of the compilation of instructions in an embodiment of the invention.
FIG. 4 shows the layout of an instruction format for use in an embodiment of the invention.
FIG. 5 gives further detail of the format of FIG. 4.

FIG. 1 shows a central processing unit (CPU) 2. It is coupled to an external memory 4 by a memory bus 6. The bus 6 is used to transfer data and instructions to and from the external memory.
Processing performed by the CPU 2 takes place in an arithmetic logic unit (ALU) 8. It sends memory and instruction requests to the external memory 4 over the external memory bus 6 and receives the responses over the same bus.
The ALU has a set of read/write ports 10 coupled to register stores 12. In this example there are four register stores 12. This allows the CPU 2 to process four threads of instructions, switching between them as required and retrieving the state of each thread from the corresponding register store 12.

FIG. 2 shows the arrangement of FIG. 1 modified by the addition of an internal register store 14, coupled to the ALU 8 by a further set of read/write ports 16. This set of read/write ports is separate from the read/write ports that couple the ALU 8 to the register stores 12. There is only one copy of the internal register store 14, and it can be used by whichever thread is executing on the ALU 8. For the purposes of this example, each register store 12 is assumed to have two read ports and one write port, although other numbers of read and write ports can be provided. The internal register store 14 likewise has two read ports and one write port. Different numbers of read and write ports can be provided if a different level of CPU performance is required.

The operation of a CPU having the internal register store 14 will now be described with reference to a common mathematical operation that the CPU must perform, namely a vector dot product. The three-dimensional version of this operation is given by the equation below.
Dot product = Ax * Bx + Ay * By + Az * Bz
Evaluating this equation requires three multiplications and two additions. The ALU 8 is provided with single-cycle multiply and add logic, so it should be possible to compute the dot product above in three cycles. This is shown by the following theoretical machine code instructions.
MUL R6, R0, R1
MLA R6, R2, R3, R6
MLA R6, R4, R5, R6

"MUL R6, R0, R1" means multiply the contents of register R0 by the contents of register R1 and store the result in register R6. Register R0 holds Ax and register R1 holds Bx.
"MLA R6, R2, R3, R6" means multiply the contents of register R2 by the contents of register R3 and add the product to the contents of register R6. The sum is stored back in register R6. Register R2 holds Ay and register R3 holds By.
"MLA R6, R4, R5, R6" means multiply the contents of register R4 by the contents of register R5 and add the product to the contents of register R6. The sum is stored back in register R6. Register R4 holds Az and register R5 holds Bz.
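The semantics just described can be checked with a minimal Python sketch. The tuple encoding, the `run` function and the sample register values are illustrative assumptions; only the MUL and MLA behavior follows the description.

```python
def run(program, regs):
    """Execute MUL/MLA instructions on a dict mapping register names to values."""
    for op, dest, *srcs in program:
        if op == "MUL":
            a, b = srcs
            regs[dest] = regs[a] * regs[b]
        elif op == "MLA":
            a, b, acc = srcs
            regs[dest] = regs[a] * regs[b] + regs[acc]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


DOT_PRODUCT = [
    ("MUL", "R6", "R0", "R1"),
    ("MLA", "R6", "R2", "R3", "R6"),
    ("MLA", "R6", "R4", "R5", "R6"),
]

# Sample values: A = (1, 2, 3) in R0, R2, R4 and B = (4, 5, 6) in R1, R3, R5.
regs = {"R0": 1, "R1": 4, "R2": 2, "R3": 5, "R4": 3, "R5": 6, "R6": 0}
run(DOT_PRODUCT, regs)
print(regs["R6"])  # 1*4 + 2*5 + 3*6 = 32
```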

It can be seen from this that the "MLA" instruction needs to read from three registers and write to one. It therefore requires one more read port than was assumed above for the register store 12. If only the register store 12 were available, there would be too few read ports for the operation to complete within three cycles. This problem can be overcome by using the internal register store 14, whose additional read/write ports can be used for the purpose. The machine code instructions that perform the operation using the internal register store are as follows.
MUL I0, R0, R1
MLA I0, R2, R3, I0
MLA R6, R4, R5, I0
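The read-port requirement of the two sequences can be compared with a short Python sketch. It assumes, for illustration only, that operand names beginning with "R" live in the per-thread register store 12 and names beginning with "I" live in the internal register store 14; the helper name is hypothetical.

```python
def main_store_reads(instruction):
    """Count source operands read from the per-thread register store ('R' names)."""
    _op, _dest, *srcs = instruction
    return sum(1 for s in srcs if s.startswith("R"))


WITHOUT_INTERNAL = [
    ("MUL", "R6", "R0", "R1"),
    ("MLA", "R6", "R2", "R3", "R6"),
    ("MLA", "R6", "R4", "R5", "R6"),
]

WITH_INTERNAL = [
    ("MUL", "I0", "R0", "R1"),
    ("MLA", "I0", "R2", "R3", "I0"),
    ("MLA", "R6", "R4", "R5", "I0"),
]

# Peak number of main-store reads needed by any single instruction.
peak_without = max(main_store_reads(i) for i in WITHOUT_INTERNAL)
peak_with = max(main_store_reads(i) for i in WITH_INTERNAL)
print(peak_without, peak_with)  # 3 2
```

With the accumulator held in I0, no instruction reads more than two operands from the register store 12, which matches the two read ports assumed for it.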

This differs from the first example in that the intermediate results of the dot product are not stored in register R6 of the register store but in the internal register store 14 instead. Only the result of the final summation is written back to R6. This arrangement guarantees that only two read ports and one write port are needed on the register store 12, which is the limit for this particular example of the CPU 2.
As shown, the internal register I0 is used in place of an external register, which reduces the number of memory accesses required by the lines of code that the CPU must execute.

While the internal register store 14 is being used to execute the machine code instructions above, it is essential that the CPU is prevented from swapping to a different thread. Another thread might need the internal register store and would overwrite and corrupt the results already written there. The preferred embodiment of the invention is therefore arranged to prevent swapping by means of a single bit in the instruction, called the no-reschedule bit. When this bit is set in an instruction, the CPU is prevented from swapping threads between the end of that instruction and the start of the next. In this example the no-reschedule bit is therefore set in the first two instructions of the dot product, that is, in the MUL and in the first occurrence of MLA. It is not set in the second occurrence of MLA, but the CPU is nevertheless prevented from swapping to a different thread until after the second MLA has executed.
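The effect of the no-reschedule bit can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not the arbiter of the embodiment: the scheduler here tries to swap threads after every instruction (a deliberately hostile policy), the instruction tuples carry the bit as a Boolean, and all names and register values are hypothetical.

```python
# Dot product using the shared internal register I0.
# Each instruction: (opcode, destination, sources, no_reschedule_bit)
DOT = [
    ("MUL", "I0", ("R0", "R1"), True),
    ("MLA", "I0", ("R2", "R3", "I0"), True),
    ("MLA", "R6", ("R4", "R5", "I0"), False),
]


def execute(instr, regs, internal):
    """Run one instruction against a thread's own registers and the shared store."""
    op, dest, srcs, _bit = instr
    vals = [internal[s] if s.startswith("I") else regs[s] for s in srcs]
    result = vals[0] * vals[1] + (vals[2] if op == "MLA" else 0)
    if dest.startswith("I"):
        internal[dest] = result
    else:
        regs[dest] = result


def run_threads(programs, reg_files, respect_bit):
    """Swap threads after every instruction unless the no-reschedule bit forbids it."""
    internal = {"I0": 0}  # one copy, shared by all threads
    pcs = [0] * len(programs)
    current = 0
    while any(pc < len(prog) for pc, prog in zip(pcs, programs)):
        if pcs[current] < len(programs[current]):
            instr = programs[current][pcs[current]]
            execute(instr, reg_files[current], internal)
            pcs[current] += 1
            if respect_bit and instr[3]:
                continue  # bit set: the next instruction comes from the same thread
        current = (current + 1) % len(programs)
    return [regs["R6"] for regs in reg_files]


def fresh_register_files():
    a = {"R0": 1, "R1": 4, "R2": 2, "R3": 5, "R4": 3, "R5": 6, "R6": 0}
    b = {"R0": 1, "R1": 1, "R2": 1, "R3": 1, "R4": 1, "R5": 1, "R6": 0}
    return [a, b]


safe = run_threads([DOT, DOT], fresh_register_files(), respect_bit=True)
unsafe = run_threads([DOT, DOT], fresh_register_files(), respect_bit=False)
print(safe)    # [32, 3]
print(unsafe)  # differs from the line above: I0 was overwritten mid-sequence
```

When the bit is ignored, the two threads interleave and each overwrites the other's partial sum in I0, which is the corruption the paragraph above describes.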

A compound instruction is created by setting the no-reschedule bit in several consecutive instructions. When this set of consecutive instructions, or compound instruction, executes cycle by cycle, the read/write ports of the internal register store give access to more data paths than would be available if only the register store 12 were accessed. This is a significant advantage over a standard processor architecture. Achieving the same performance with a standard architecture would require a third read port to be added to each of the four copies of the register store 12, which would be considerably more expensive than the silicon area required for the internal register store.

A compound instruction is a concept that exists only in the context of the compiler/assembler used to produce the instructions that are read from memory and executed by the CPU. The CPU does not distinguish between ordinary instructions and compound instructions. Likewise, the compiler/assembler does not receive any input program that contains compound instructions.
The additional functionality implemented by the compiler/assembler is to analyze the input program and find places where using a compound instruction would improve the performance of the program when it runs on the CPU. Once such a place has been found, the compiler/assembler can produce a sequence of CPU instructions that uses the internal register store and can set the no-reschedule flag, which prevents that thread's execution from being suspended until the internal register store is no longer in use.
A compound instruction can be as simple as two consecutive CPU instructions or as complex as several tens of CPU instructions. When the CPU encounters a compound instruction, execution of the compound instruction continues for as long as the no-reschedule flag is set.

The compiler/assembler can operate in two main ways to determine whether a compound instruction can be used. The first is to compile the input program, then search for situations in which the internal register store could be used to reduce the number of instructions executed, and then modify the compiled instructions to use the internal registers. The second is for the compiler/assembler to analyze the input program itself to identify constructs suited to use of the internal register system. An example of the first is shown in FIG. 3. Here the input program is received at 30 and, after a first pass, is compiled/assembled at 32. At 33, the compiler/assembler searches the compiled/assembled program for optimizations. In this example it detects two multiply-and-add operations followed by a further multiply-and-add whose result is stored in register R6. At 36, the compiled/assembled output for the CPU contains instructions with the no-reschedule bit set. As can be seen, the first two multiply-and-add operations at 34 in FIG. 3 are performed in the first two lines of the compound instruction at 36. The subsequent multiply-and-add at 34 is then performed in the third line of the compound instruction.
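The first approach, optimizing already compiled code, might look like the following Python peephole sketch. It is only an illustration: it is applied to the dot-product sequence from earlier in the description rather than to the exact contents of FIG. 3, the tuple encoding and function name are hypothetical, and it assumes that no instruction in the chain can stall the thread and that the intermediate values are not needed elsewhere.

```python
def use_internal_register(program, internal="I0"):
    """Rewrite MUL/MLA accumulation chains so intermediates live in `internal`.

    Input instructions are (op, dest, src...) tuples. Output instructions are
    (op, dest, sources, no_reschedule_bit) tuples.
    """
    out = []
    i = 0
    while i < len(program):
        op, dest, *srcs = program[i]
        j = i + 1
        if op == "MUL":
            # Extend over MLAs that accumulate into the MUL's destination and
            # do not also use that destination as a multiplicand.
            while (j < len(program)
                   and program[j][0] == "MLA"
                   and program[j][1] == dest
                   and program[j][4] == dest
                   and dest not in program[j][2:4]):
                j += 1
        if j - i < 2:
            # No chain found: emit unchanged, rescheduling allowed afterwards.
            out.append((op, dest, tuple(srcs), False))
            i += 1
            continue
        chain = program[i:j]
        for k, (c_op, _c_dest, *c_srcs) in enumerate(chain):
            last = k == len(chain) - 1
            if k == 0:
                new_srcs = tuple(c_srcs)
            else:
                new_srcs = (c_srcs[0], c_srcs[1], internal)
            # Only the final result goes back to the per-thread register, and
            # only the final instruction leaves the no-reschedule bit clear.
            out.append((c_op, dest if last else internal, new_srcs, not last))
        i = j
    return out


compiled = [
    ("MUL", "R6", "R0", "R1"),
    ("MLA", "R6", "R2", "R3", "R6"),
    ("MLA", "R6", "R4", "R5", "R6"),
]
for instr in use_internal_register(compiled):
    print(instr)
```

Applied to the three-instruction dot product, the pass yields the I0-based sequence given earlier, with the bit set on the first two instructions and clear on the last.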

The CPU itself does not decide whether to use the internal registers or to disable thread scheduling. Instead, the compiler/assembler program detects the situations in which it can use the internal register resources provided by the CPU. The CPU's instruction set provides a mechanism by which the assembler indicates that it has chosen to use the internal registers and also to disable thread rescheduling.

FIG. 4 shows an instruction set format, which a suitable compiler/assembler can target, that supports use of the internal registers and also provides the no-reschedule flag. The data carried in each part of the instruction format is shown in FIG. 5. As can be seen there, the no-reschedule bit is bit 20.
To generate compound instructions, the compiler/assembler is designed to identify when they can be used. This can be built in when the compiler/assembler is created. Consider, for example, a processor that supports a typical instruction set: the data path instructions it provides will include functions such as add, multiply, and multiply-accumulate. The instructions that the set can offer are limited entirely by the number of source and destination arguments that the hardware implementation of the processor can support. A processor that supports only two source arguments will have no multiply-accumulate instruction, because implementing one would require three source arguments. These limits follow from decisions made when the processor hardware is designed, since that is what determines the instruction set. For example, someone designing a processor with only two read ports into its registers would not put multiply-accumulate support into the arithmetic logic unit.
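Setting and testing the no-reschedule bit at bit 20, as described above with reference to FIG. 5, comes down to ordinary bit masking. The Python sketch below takes only the bit position from the description; the helper names are hypothetical, and the remaining fields of the instruction word (including its width) are not modeled.

```python
NO_RESCHEDULE_BIT = 20  # bit position given in the description of FIG. 5
NO_RESCHEDULE_MASK = 1 << NO_RESCHEDULE_BIT


def set_no_reschedule(word: int) -> int:
    """Return the instruction word with the no-reschedule bit set."""
    return word | NO_RESCHEDULE_MASK


def clear_no_reschedule(word: int) -> int:
    """Return the instruction word with the no-reschedule bit cleared."""
    return word & ~NO_RESCHEDULE_MASK


def has_no_reschedule(word: int) -> bool:
    """Report whether thread swapping is blocked after this instruction."""
    return bool(word & NO_RESCHEDULE_MASK)


word = set_no_reschedule(0)  # all other fields left as zero for illustration
print(hex(word))                                     # 0x100000
print(has_no_reschedule(word))                       # True
print(has_no_reschedule(clear_no_reschedule(word)))  # False
```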

Embodiments of the invention increase the number of source and destination arguments available in a typical processor, usually for a short duration and with some restrictions. This allows additional instructions and operations to be implemented that take advantage of the improved input/output data bounds. The processor itself is designed to support some instructions that use the additional arguments, but it clearly cannot anticipate every possible instruction that might make use of them.

Certain instructions in an instruction set may be instructions supported by the processor hardware that implicitly use the extra arguments, and these will be known to anyone designing a program to run on the processor. The compiler/assembler takes the user's input and maps it onto the instruction set, so it is designed to understand the operation behind every instruction. The mapping can be made directly, for example by the user telling the compiler/assembler to use a particular instruction. In another example, the compiler/assembler examines the instructions supplied by the user and, where appropriate, maps them onto compound instructions.

Taking the first of these examples, the hardware implementation might support a filter instruction. The filter instruction can read filter data and coefficients from five source arguments in parallel and filter the data down to one scalar output, which can then be used in a video decoding algorithm. The video decoder would be written in assembly language and would use the filter instruction directly. The assembler translates this into machine code understood by the hardware. Where the internal registers are present, they are exploited and a compound instruction is used.

In the second example, suppose the user's intent, as conveyed to the compiler, is first to multiply two values and store the result in a third location, and then to multiply two different values and store that result in another new location. The compiler is configured to identify these consecutive operations and convert them into a single double-multiply operation which, again using the additional internal register store to improve performance, reads the four values in parallel, performs the multiplications, and returns the results.

It will therefore be appreciated that use of the internal register store significantly improves the performance of a multithreaded processor, and that better performance can be obtained by creating compound instructions during whose execution no thread swap can occur. This further improves performance and guarantees that no data is corrupted as a result of a thread swap.

Claims

A multithreaded processor for executing a plurality of threads in dependence on the availability of the resources required for execution by each thread, comprising:
means for determining which thread should execute; and
means for switching between execution of the plurality of threads in dependence on the result of the determination,
each thread being coupled to a respective register means for storing the state of the thread and for use in executing instructions on the thread,
the processor further comprising:
a further register means shared by all of the plurality of threads and used by executing threads to improve execution performance;
means for preventing execution from switching to another thread while the internal register means is in use; and
means for detecting a no-reschedule bit in an instruction,
wherein the means for preventing execution from switching to another thread operates in response to detection of the no-reschedule bit.

The multithreaded processor of claim 1, wherein a compound instruction comprising a sequence of a plurality of instructions uses the further register means in its execution, and the means for preventing execution from switching to another thread operates in response to such a sequence of instructions until the sequence has completed execution.

The multithreaded processor of claim 2, wherein use of the further register means ends before the sequence of instructions completes execution.

The multithreaded processor of claim 2 or claim 3, wherein a no-reschedule bit is set for each instruction in the sequence, and the means for preventing execution from switching to another thread operates in response to detection of the no-reschedule bit in an instruction.

A method of compiling/assembling a thread of instructions for execution on a multithreaded processor having register means for storing the state of each thread and a further register means used by all the threads, the method comprising the steps of:
determining which instructions in the thread can use the further register means to improve execution performance; and
setting a no-reschedule bit for the executing thread when an instruction that uses the internal register means is executed.

The method of claim 5, wherein the step of determining which instructions in the thread can use the further register means comprises: compiling an input program; searching for situations in which the further register means can be used to reduce the number of instructions required; and including instructions that contain a no-reschedule bit where the further register means is used.

The method of claim 5, wherein the step of determining which instructions in the thread can use the further register means comprises: analyzing an input program for instructions suited to execution using the further register means; and, when such instructions are found, compiling instructions that use the further register means and contain a no-reschedule bit.