JP4799171B2

JP4799171B2 - Drawing apparatus and data transfer method

Info

Publication number: JP4799171B2
Application number: JP2005371738A
Authority: JP
Inventors: 清太郎八木
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2005-12-26
Filing date: 2005-12-26
Publication date: 2011-10-26
Anticipated expiration: 2025-12-26
Also published as: JP2007172455A

Description

The present invention relates to a drawing apparatus and a data transfer method, for example to an image processing LSI that processes a plurality of pixels simultaneously in parallel.

In recent years, as CPUs (Central Processing Units) have become faster, demand for faster image drawing apparatuses has also grown.

An image drawing apparatus generally includes graphic decomposing means that decomposes an input graphic into pixels, pixel processing means that applies drawing processing to the pixels, and storage means to and from which drawing results are written and read. With recent advances in CG (Computer Graphics) technology, complex pixel processing techniques have come into frequent use. Because this increases the load on the pixel processing means, the pixel processing means are parallelized (see, for example, Patent Document 1).

With the conventional configuration described above, however, various buffer memories are required in order to read data from the main memory into the cache. As a result, the hardware cost increases.
Patent Document 1: US Pat. No. 5,982,211

The present invention has been made in view of the above circumstances, and its object is to provide a drawing apparatus and a data transfer method that can reduce hardware cost.

A drawing apparatus according to one aspect includes: a main memory that holds image data; a cache memory that exchanges the image data with the main memory; a transfer control device that manages data transfer between the main memory and the cache memory and holds information on the state of the cache memory; and a program execution unit that executes an image processing program using the image data in the cache memory and causes the cache memory to hold the image data obtained by executing the image processing program. The cache memory includes a plurality of entries, each capable of holding the image data. The transfer control device holds, for each entry, identification information of the image data transferred from the main memory to that entry, and data update information indicating whether image data obtained by the program execution unit is held in that entry. When the data update information corresponding to any entry is asserted, the transfer control device writes the image data in that entry to the main memory.
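The per-entry bookkeeping described in this aspect can be pictured with a minimal Python sketch. The class name `TransferController`, its method names, and the use of a dictionary as a stand-in for the main memory are all hypothetical, introduced only for illustration. Clearing the update flag after a write-back, and writing back before an entry is reloaded, are assumptions that the summary above does not state.

```python
class TransferController:
    """Hypothetical model: per-entry identification info (tag) and update flag (dirty)."""

    def __init__(self, main_memory, num_entries):
        self.main_memory = main_memory        # dict: address -> data (stand-in for main memory)
        self.cache = [None] * num_entries     # the cache entries themselves
        self.tag = [None] * num_entries       # identification info of the data in each entry
        self.dirty = [False] * num_entries    # data update info for each entry

    def load(self, entry, address):
        """Transfer data from main memory into an entry (assumed to write back first)."""
        self.write_back(entry)
        self.cache[entry] = self.main_memory[address]
        self.tag[entry] = address
        self.dirty[entry] = False

    def store(self, entry, data):
        """The program execution unit places a result in an entry; assert its update flag."""
        if self.tag[entry] is None:
            raise ValueError("entry holds no data transferred from main memory yet")
        self.cache[entry] = data
        self.dirty[entry] = True

    def write_back(self, entry):
        """If the update flag is asserted, write the entry's data to main memory."""
        if self.dirty[entry]:
            self.main_memory[self.tag[entry]] = self.cache[entry]
            self.dirty[entry] = False         # assumption: the flag is cleared afterwards


memory = {0x10: "a", 0x20: "b"}
controller = TransferController(memory, num_entries=2)
controller.load(0, 0x10)
controller.store(0, "A")      # memory[0x10] is still "a" at this point
controller.load(0, 0x20)      # the updated entry is written back before being reused
```

In this sketch the main memory is touched only when an entry whose flag is asserted is about to be reused, which is the usual reason for keeping such a flag per entry.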

According to the present invention, a drawing apparatus and a data transfer method that can reduce hardware cost can be provided.

Embodiments of the present invention will be described below with reference to the drawings. In this description, common parts are denoted by common reference numerals throughout the drawings.

A graphic processor according to a first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram of the graphic processor according to this embodiment.

As illustrated, the graphic processor 10 includes a rasterizer 11, a plurality of pixel shaders 12-0 to 12-3, and a local memory 13. In this embodiment there are four pixel shaders 12, but this is only an example; there may be 8, 16, 32, or any other number.

The rasterizer 11 generates pixels in accordance with the input graphic information. A pixel is the smallest unit of area handled when a given graphic is drawn, and a graphic is drawn as a set of pixels. The generated pixels are supplied to the pixel shaders 12-0 to 12-3.

The pixel shaders 12-0 to 12-3 perform arithmetic processing on the pixels supplied from the rasterizer 11 and generate image data in the local memory 13. Each of the pixel shaders 12-0 to 12-3 includes a data distribution unit 20, a texture unit 23, and a plurality of pixel shader units 24.

The data distribution unit 20 receives data from the rasterizer 11 and allocates the received data to the pixel shaders 12-0 to 12-3.

The texture unit 23 reads texture data from the local memory 13 and performs the processing necessary for texture mapping. Texture mapping is the process of pasting texture data onto pixels processed by the pixel shader unit 24, and it is performed in the pixel shader unit 24.

The pixel shader unit 24 is a shader engine and executes a shader program on pixel data. Each pixel shader unit 24 performs SIMD (Single Instruction Multiple Data) operation to process a plurality of pixels simultaneously. Each pixel shader unit 24 includes an instruction control unit 25, a drawing processing unit 26, and a data control unit 27. Details of these circuit blocks 25 to 27 will be described later.

The local memory 13 is, for example, an eDRAM (embedded DRAM), and stores the pixel data drawn by the pixel shaders 12-0 to 12-3.

Next, the concept of graphic drawing in the graphic processor according to this embodiment will be described. FIG. 2 is a conceptual diagram showing the entire space in which a graphic is to be drawn. The drawing area shown in FIG. 2 corresponds to the memory space in the local memory that holds pixel data (hereinafter referred to as the frame buffer).

As illustrated, the frame buffer includes, for example, (40 × 15) blocks BLK0 to BLK599 arranged in a matrix. The number of blocks is only an example and is not limited to (40 × 15). The pixel shaders 12-0 to 12-3 generate pixels in the order of the blocks BLK0 to BLK599. Each of the blocks BLK0 to BLK599 includes 32 stamps arranged in a matrix. FIG. 3 shows how each block shown in FIG. 2 contains a plurality of stamps.

Each stamp is a set of pixels drawn by the same pixel shader. In this embodiment one stamp contains (4 × 4) = 16 pixels, but this number is not limiting and may be, for example, 1 or 4. In FIG. 3, the number (0 to 31) written in each stamp is hereinafter referred to as the stamp ID (StID), and the number (0 to 15) written in each pixel is hereinafter referred to as the pixel ID (PixID). A set of (2 × 2) pixels within a stamp is referred to as a quad. That is, one stamp contains (2 × 2) quads. These four quads are hereinafter referred to as quads Q0 to Q3, and the numbers 0 to 3 are referred to as quad IDs. Each of the blocks BLK0 to BLK599 contains (4 × 8) = 32 stamps. As a whole, therefore, the space in which a graphic is to be drawn is formed of (640 × 480) pixels.
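The 640 × 480 figure follows from the block, stamp, and pixel counts given above. The short Python check below works through the arithmetic. It assumes that the first factor of each pair, (40 × 15), (4 × 8), and (4 × 4), is the horizontal count, which is the reading that reproduces 640 × 480; the constant names are hypothetical.

```python
# Counts taken from the description; "X" is assumed to be the horizontal direction.
BLOCKS_X, BLOCKS_Y = 40, 15                  # blocks BLK0 to BLK599 in the frame buffer
STAMPS_PER_BLOCK_X, STAMPS_PER_BLOCK_Y = 4, 8
PIXELS_PER_STAMP_X, PIXELS_PER_STAMP_Y = 4, 4
QUAD_SIDE = 2                                # a quad is 2 x 2 pixels

block_width = STAMPS_PER_BLOCK_X * PIXELS_PER_STAMP_X    # pixels per block, horizontally
block_height = STAMPS_PER_BLOCK_Y * PIXELS_PER_STAMP_Y   # pixels per block, vertically
frame_width = BLOCKS_X * block_width
frame_height = BLOCKS_Y * block_height
quads_per_stamp = (PIXELS_PER_STAMP_X // QUAD_SIDE) * (PIXELS_PER_STAMP_Y // QUAD_SIDE)

print(block_width, block_height)    # 16 32
print(frame_width, frame_height)    # 640 480
print(quads_per_stamp)              # 4
```

Under this reading a block covers 16 × 32 pixels.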

If the pixel shader units 24 are numbered in the order of the pixel shaders 12-0 to 12-3, each pixel shader unit 24 processes the stamps whose stamp ID equals its number. In other words, the pixel shader unit that processes the pixels in each stamp is determined in advance according to their position.

Next, a graphic drawn in the frame buffer will be described. To draw a graphic, graphic information is first input to the rasterizer 11. The graphic information is, for example, the vertex coordinates and color information of the graphic. Drawing a triangle is described here as an example. Assume that the triangle input to the rasterizer 11 occupies the position shown in FIG. 4 in the drawing space. That is, assume that the three vertices of the triangle are located in the stamp with StID = 31 in block BLK2, the stamp with StID = 15 in block BLK41, and the stamp with StID = 4 in block BLK43. The rasterizer 11 generates the stamps corresponding to the positions occupied by the triangle to be drawn. FIG. 5 shows this situation. The generated stamp data is sent to the pixel shaders 12-0 to 12-3 with which each stamp is associated in advance.

The pixel shaders 12-0 to 12-3 then perform drawing processing on the pixels they are responsible for, based on the input stamp data. As a result, the triangle shown in FIG. 5 is drawn by a plurality of pixels. The pixel data drawn by the pixel shaders 12-0 to 12-3 is stored in the local memory in units of stamps.

FIG. 6 is an enlarged view of block BLK2 in FIG. 5. As illustrated, the rasterizer 11 generates eight stamps for block BLK2. Their stamp IDs are StID = 16, 17, 19, 21, 25 to 27, and 31. As described above, each stamp generated by the rasterizer 11 contains (4 × 4) = 16 pixels. However, even when a stamp is issued, depending on the graphic it is not always necessary to perform drawing processing on all of its pixels. In FIG. 6, for example, the stamps with StID = 17 and 27 lie inside the triangle, so drawing processing must be performed on all the pixels they contain. In the stamp with StID = 21, on the other hand, the pixels with PixID = 0 to 7, 9, and 12 to 15 lie outside the triangle and need no drawing processing. Only the pixels with PixID = 8, 10, and 11 require drawing processing. Hereinafter, a pixel that must be drawn is said to be "valid", and a pixel that need not be drawn is said to be "invalid".

Returning to FIG. 1, the configuration of the pixel shader unit 24 will now be described. As illustrated, the pixel shader unit 24 includes the instruction control unit 25, the drawing processing unit 26, and the data control unit 27. The instruction control unit 25 performs task execution management, reception of stamp data, quad merge, sub-pass execution management, and the like. The drawing processing unit 26 performs arithmetic processing on pixels. The data control unit 27 includes a cache memory and controls data access to the cache memory and to the local memory 13.

The operation of the instruction control unit 25 will be described below. The instruction control unit 25 performs pipeline operation. It receives a plurality of data items from the data distribution unit 20 and holds them. The data includes, for example, the XY coordinates of a stamp, the drawing direction, polygon face information, representative values of the parameters of the graphic to be drawn, depth information of the graphic, and information indicating whether each pixel is valid. The instruction control unit 25 also performs quad merge. Quad merge is the merging of two consecutive stamps having the same XY coordinates into one stamp. By performing quad merge, the valid quads of the two stamps can be combined into one stamp and processed at once, so the amount of data to be drawn can be reduced. FIG. 7 shows how quad merge is performed.

Assume that two temporally consecutive stamps are as shown in FIG. 7, for example. The four quads contained in one stamp are referred to as quads Q0 to Q3. Consider the case where stamp 1, in which quads Q0 and Q2 are valid and quads Q1 and Q3 are invalid, is first input to the instruction control unit 25, followed by stamp 2, in which quads Q1 and Q2 are valid and quads Q0 and Q3 are invalid. In this case, merging the two stamps 1 and 2 generates a new stamp containing quads Q0 and Q2 of stamp 1 and quads Q1 and Q2 of stamp 2. To distinguish it from the stamps before quad merge, this new stamp is hereinafter referred to as a thread. The threads generated by quad merge are numbered, and the number is hereinafter referred to as the thread ID (TdID). The instruction control unit 25 holds information on the generated threads. This information includes, for example, the sub-pass ID, the thread ID, and the position that each of the four quads in the thread occupied in its stamp before quad merge. The sub-pass ID is the number of the sub-pass that is currently being executed or is to be executed next. Sub-passes are described below.
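The quad merge example above can be sketched in Python as follows. The function name `quad_merge` and the representation of a thread as a list of (stamp number, quad ID) pairs are hypothetical. Refusing to merge when the two stamps together hold more than four valid quads is an assumption, since the description only covers the case where the valid quads fit into one thread.

```python
from typing import List, Optional, Tuple

# (source stamp number, quad ID that the quad had in its original stamp)
QuadRef = Tuple[int, int]


def quad_merge(valid1: List[bool], valid2: List[bool]) -> Optional[List[QuadRef]]:
    """Combine the valid quads of two consecutive stamps into one thread.

    valid1[q] and valid2[q] tell whether quad Qq of stamp 1 or stamp 2 is valid.
    Returns the quads placed in the thread, each with its pre-merge position,
    or None when the valid quads do not fit into four slots (assumed behavior).
    """
    quads = [(1, q) for q, valid in enumerate(valid1) if valid]
    quads += [(2, q) for q, valid in enumerate(valid2) if valid]
    if len(quads) > 4:
        return None
    return quads


# Stamp 1: Q0 and Q2 valid. Stamp 2: Q1 and Q2 valid.
thread = quad_merge([True, False, True, False], [False, True, True, False])
print(thread)   # [(1, 0), (1, 2), (2, 1), (2, 2)]
```

Keeping the original stamp number and quad ID with each quad corresponds to the position information that the instruction control unit holds for each thread.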

The instruction control unit 25 executes predetermined instructions for each thread until it detects an end signal. As shown in FIG. 8, the instruction sequence to be executed can be divided into at most X instruction sequences, and each of the resulting instruction sequences is a sub-pass. A yield instruction Yield is placed at the end of each sub-pass, except that the final sub-pass ends with an end instruction End instead of a yield instruction.

FIG. 9 is a conceptual diagram showing how sub-passes are executed over time. In FIG. 9, threads 5, 6, and 7 are processed by the same pixel shader unit. As illustrated, processing of a thread is suspended by a yield instruction, and instructions for another thread are executed instead. The suspended thread is started again later, when it becomes issuable. That is, the instructions executed between two yield instructions constitute a sub-pass. Threads are executed in units of sub-passes, and the processing within a sub-pass is executed continuously.
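The interleaving of threads at yield points can be imitated with Python generators, as in the sketch below. The names `thread_program` and `run` are hypothetical, and the round-robin order is an assumption made for simplicity: in the description a suspended thread resumes when it becomes issuable, which need not be strictly round-robin.

```python
from collections import deque


def thread_program(thread_id, num_subpasses, log):
    """One thread: each loop iteration stands for one sub-pass executed without interruption."""
    for subpass_id in range(num_subpasses):
        log.append((thread_id, subpass_id))     # the instructions of this sub-pass
        if subpass_id < num_subpasses - 1:
            yield                               # Yield instruction: suspend this thread
    # Returning from the generator plays the role of the End instruction.


def run(thread_ids, num_subpasses):
    """Run the threads on one unit, switching to another thread at every yield."""
    log = []
    ready = deque(thread_program(t, num_subpasses, log) for t in thread_ids)
    while ready:
        thread = ready.popleft()
        try:
            next(thread)            # execute one sub-pass
            ready.append(thread)    # suspended at Yield; resume later
        except StopIteration:
            pass                    # reached End; the thread is finished
    return log


print(run([5, 6, 7], num_subpasses=2))
# [(5, 0), (6, 0), (7, 0), (5, 1), (6, 1), (7, 1)]
```

Each tuple in the log is (thread ID, sub-pass ID), so the output shows threads 5, 6, and 7 each completing sub-pass 0 before any of them starts sub-pass 1.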

The instruction control unit 25 controls the sub-passes. It holds each thread together with its corresponding sub-pass ID and manages which threads are issuable.

The instruction control unit 25 also interpolates pixel data based on the information given from the data distribution unit 20. Normally the rasterizer generates only one pixel per stamp. The drawing processing unit 26 therefore obtains the information on the other pixels in the same stamp by calculation based on the pixel data generated by the rasterizer 11.

Next, the data control unit 27 will be described with reference to FIGS. 10 and 11. FIG. 10 is a block diagram of the data control unit 27. The data control unit 27 performs pipeline operation. FIG. 11 is a block diagram of the data control unit 27 shown in association with the stages of the pipeline operation.

The processing in the circuit blocks of the pixel shader unit has at least three stages, a first to a third stage. In outline, in the first stage the instruction control unit 25 reads necessary data, prefetches instructions, and so on, while the data control unit 27 generates the address signals necessary for data access and performs control related to preload (described later). In the second stage, the instruction control unit 25 interpolates pixel data and the data control unit 27 generates the instructions necessary for data access. In the third stage, the drawing processing unit 26 performs drawing processing based on the processing results of the instruction control unit 25 and the data control unit 27. Reception of data from the data distribution unit 20 by the instruction control unit 25 takes place before the first stage.

The configuration of the data control unit 27 will be described. As illustrated, the data control unit 27 includes an address generation unit 40, a cache memory 41, a cache control unit 42, and a preload control unit 43. When a load/store instruction is issued from the instruction control unit 25, the address generation unit 40 generates the address in the local memory 13 of the data to be read or of the data to be written (hereinafter referred to as the load/store address). A load/store instruction is an instruction that reads data needed when the drawing processing unit 26 performs pixel processing (load) or an instruction that causes processed data to be held (store). More specifically, when a load instruction is issued, the data necessary for pixel drawing processing is read from the cache memory 41 into a register in the drawing processing unit 26. If the necessary data is not in the cache memory 41, it is read from the local memory 13. When a store instruction is issued, the data held in a register in the drawing processing unit 26 is temporarily written to the cache memory 41 and is subsequently written to the local memory 13.

The cache memory 41 temporarily holds pixel data. The drawing processing unit 26 performs pixel processing using the data held in the cache memory 41.

The cache control unit 42 controls access to the cache memory 41 when a load/store instruction is issued. The cache control unit 42 includes a cache access control unit 44, a cache management unit 45, and a request issue control unit 46.

The preload control unit 43 controls access to the cache memory 41 when a preload instruction is issued. The preload control unit 43 includes a preload address generation unit 47, a preload holding unit 48, a sub-pass information management unit 49, and an address holding unit 50. A preload instruction is an instruction for prefetching, from the local memory into the cache memory 41, the data that will be used in the sub-pass of the thread expected to be executed next.

The data control unit 27 also includes a configuration register in one of the circuit blocks described above. The configuration register holds the signals WIDTH, BASE, and PRELOAD. WIDTH indicates the size of the frame buffer in terms of pixels. BASE indicates the base address (start address) of the data held in the local memory 13, for both the frame buffer mode and the memory register mode. PRELOAD sets preload ON or OFF.

The internal configuration of the data control unit 27 will be described in detail below. First, the address generation unit 40 will be described. FIG. 12 is a block diagram of the address generation unit 40 showing its input and output signals. As illustrated, the address generation unit 40 receives offset data, the XY coordinates of a thread, a thread ID, a quad ID, a sub-pass ID, and a buffer mode signal. The XY coordinates are given from the instruction control unit 25. The thread ID, quad ID, and sub-pass ID are given from the drawing processing unit 26. The address generation unit 40 calculates the load/store address based on the X and Y coordinates of the thread and on WIDTH held in the configuration register. It is sufficient that the load/store address can be calculated from this information, and the calculation formula itself is not particularly limited. As an example, the following shows how the load/store address is calculated when there are four pixel shader units and one block contains 32 stamps.

Block ID = X/16 + (Y/32) × (WIDTH/16)
Xr = (X/4) mod 16
Yr = (Y/4) mod 16
PUID[0] = Xr[1] ^ Yr[1] = StID[0]
PUID[1] = (Xr[1] AND ~(Yr[1] ^ Yr[2]) | (~Xr[1] AND Xr[2])) ^ Xr[0] ^ Yr[0] = StID[1]
PUID[2] = (Xr[1] AND ~(Yr[1] ^ Xr[2]) | (~Xr[1] AND Yr[2])) ^ Xr[0] ^ Yr[0] = StID[2]
PUID[3] = Xr[3] = StID[3]
PUID[4] = Yr[3] = StID[4]

In the above equations, Block ID is the number of one of the blocks BLK0 to BLK599 described with reference to FIG. 2. X and Y are the X coordinate and the Y coordinate. PUID is the pixel shader unit number, that is, the number given to a pixel shader unit 24 when the units are numbered in the order of the pixel shaders 12-0 to 12-3. The pixel shader unit number is a 5-bit signal, and PUID[0] to PUID[4] denote its individual bits. As for the operators, mod denotes the remainder operation, AND the logical product, ^ the exclusive OR, ~ the logical negation, and | the logical sum.
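As a reading aid, the example formulas can be transcribed into Python as below. This is only a sketch under several assumptions that the text does not spell out: divisions are integer divisions, index [0] denotes the least significant bit, AND binds more tightly than |, and WIDTH is the frame buffer width in pixels (640 in the earlier example). The function names are hypothetical.

```python
def bit(value, index):
    """Return bit `index` of value, with index 0 taken as the least significant bit."""
    return (value >> index) & 1


def inv(b):
    """Logical negation of a single bit."""
    return b ^ 1


def block_id(x, y, width):
    """Block ID = X/16 + (Y/32) x (WIDTH/16), using integer division."""
    return x // 16 + (y // 32) * (width // 16)


def pixel_shader_unit_id(x, y):
    """5-bit PUID (equal to the stamp ID in the formulas) for pixel coordinates (x, y)."""
    xr = (x // 4) % 16
    yr = (y // 4) % 16
    x0, x1, x2, x3 = (bit(xr, i) for i in range(4))
    y0, y1, y2, y3 = (bit(yr, i) for i in range(4))
    p0 = x1 ^ y1
    p1 = ((x1 & inv(y1 ^ y2)) | (inv(x1) & x2)) ^ x0 ^ y0
    p2 = ((x1 & inv(y1 ^ x2)) | (inv(x1) & y2)) ^ x0 ^ y0
    p3 = x3
    p4 = y3
    return p0 | (p1 << 1) | (p2 << 2) | (p3 << 3) | (p4 << 4)


print(block_id(0, 0, 640))        # 0
print(block_id(639, 479, 640))    # 599
```

With WIDTH = 640 the two corner pixels map to block numbers 0 and 599, matching the range BLK0 to BLK599 described for FIG. 2.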

The address generation unit 40 then generates a 32-bit load/store address by arranging the above calculation results, the offset data, the quad ID, and the pixel ID in the order shown in FIG. 13 or FIG. 14. The local memory 13 can store data in two modes, referred to as the frame buffer mode and the memory register mode. When the local memory is used in the frame buffer mode, the load/store address is obtained from the XY coordinates of the thread and is formed by the arrangement shown in FIG. 13. When the local memory is used in the memory register mode, the address is obtained from the thread ID and is formed by the arrangement shown in FIG. 14. The offset data is given from the instruction control unit 25. Which of the frame buffer mode and the memory register mode is used is indicated by the buffer mode signal given from the instruction control unit 25. The pixel ID can be determined from the XY coordinates because, as described with reference to FIG. 3, the position within a stamp of the pixel having each pixel ID is determined in advance. The quad ID can be determined for the same reason.

After generating the address shown in FIG. 13 or FIG. 14, the address generation unit 40 outputs parts of it as a cache data address, a cache index entry, and a cache entry. These signals indicate addresses within the cache memory 41; details will be described later.

Next, the cache memory 41 will be described with reference to FIG. 15. FIG. 15 is a block diagram of the cache memory 41. As illustrated, the cache memory 41 includes, for example, two memories 51-0 and 51-1. The memories 51-0 and 51-1 are, for example, SRAMs or DRAMs. Each of the memories 51-0 and 51-1 includes M entries 0 to (M-1). The entries 0 to (M-1) are independent memories 53-0 to 53-(M-1), respectively. Furthermore, each of the entries 0 to (M-1) includes L sub-entries 0 to (L-1), where L is a natural number of 2 or more. When data is read from the cache memory 41, data is read as cache read data from one sub-entry of one entry in the memory 51-0 and from one sub-entry of one entry in the memory 51-1.

In FIG. 15, each of the entries 0 to (M-1) has L sub-entries 0 to (L-1) because the data size that can be transferred over the bus connecting the cache memory 41 to the outside is (1/L) of the entry size of the memories 51-0 and 51-1. Accordingly, if the transferable data size of the bus is equal to or larger than the entry size, the entries need not have sub-entries, and in that case data is read out to the outside in units of the entry size.
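The relationship between the entry size, the bus transfer size, and L can be written as a small Python helper. The function name and the byte sizes in the example are hypothetical, since the description gives no concrete sizes. Requiring the entry size to be a whole multiple of the bus transfer size is an assumption implied by the (1/L) relationship.

```python
def sub_entries_per_entry(entry_size_bytes, bus_transfer_bytes):
    """Number of sub-entries L needed so that one sub-entry fits one bus transfer.

    Returns 1 (meaning the entry needs no sub-entries) when the bus can carry
    a whole entry at once.
    """
    if bus_transfer_bytes >= entry_size_bytes:
        return 1
    if entry_size_bytes % bus_transfer_bytes != 0:
        raise ValueError("entry size is assumed to be a multiple of the bus transfer size")
    return entry_size_bytes // bus_transfer_bytes


print(sub_entries_per_entry(64, 16))   # 4: the bus carries 1/4 of an entry per transfer
print(sub_entries_per_entry(64, 64))   # 1: no sub-entries needed
```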

FIG. 15 shows the case where the cache memory 41 has two memories 51-0 and 51-1, but this number is only an example; there may be only one, or three or more. Index 0 and index 1 are assigned as identification numbers to the two memories 51-0 and 51-1 in the cache memory 41, respectively. Of the address signals described with reference to FIGS. 12 to 14, the cache index entry and the cache data address contain information indicating which of index 0 and index 1, assigned to the memories 51-0 and 51-1, is to be selected. The cache entry contains information indicating which of the sub-entries 0 to (L-1) is to be selected. A cache enable signal, a cache write enable signal, cache write data, and a cache address are input to the cache memory 41 from the cache access control unit 44. The cache enable signal places the cache memory 41 in the enabled state, the cache write enable signal enables write operations to the cache memory 41, the cache write data is the data to be written to the cache memory 41, and the cache address indicates the address to be accessed in the cache memory.

次にキャッシュ制御部４２が備えるキャッシュアクセス制御部４４、キャッシュ管理部４５、及びリクエスト発行制御部４６について説明する。まずリクエスト発行制御部４６について図１６を用いて説明する。図１６はリクエスト発行制御部４６のブロック図であり、入出力信号を示している。図示するようにリクエスト発行制御部４６には、プリロード要求イネーブル信号、リフィル要求イネーブル信号、リフィルアドレス、リフィル要求ＩＤ、及びリフィルアクノリッジ信号が入力される。プリロード要求イネーブル信号はキャッシュ管理部４５から与えられ、プリロード要求が出力されるとアサートされる。リフィル要求イネーブル信号、リフィルアドレス、リフィル要求ＩＤはキャッシュ管理部４５から与えられ、それぞれリフィル要求のイネーブル信号、アドレス、リクエストＩＤを示す。ロード／ストア命令が発行された際に、該当するデータがキャッシュメモリ４１内に存在しなかった場合、該当データをローカルメモリからキャッシュメモリ４１へ読み出す必要がある。これをリフィル（refill）と呼ぶ。リフィルアクノリッジ信号はローカルメモリ１３から与えられ、リフィル要求に関するアクノリッジ信号である。 Next, the cache access control unit 44, the cache management unit 45, and the request issuance control unit 46 included in the cache control unit 42 will be described. First, the request issuance control unit 46 will be described with reference to FIG. 16. FIG. 16 is a block diagram of the request issuance control unit 46 and shows its input and output signals. As shown in the figure, the request issuance control unit 46 receives a preload request enable signal, a refill request enable signal, a refill address, a refill request ID, and a refill acknowledge signal. The preload request enable signal is supplied from the cache management unit 45 and is asserted when a preload request is output. The refill request enable signal, the refill address, and the refill request ID are supplied from the cache management unit 45 and indicate the enable signal, the address, and the request ID of the refill request, respectively. If the corresponding data does not exist in the cache memory 41 when a load/store instruction is issued, the data must be read from the local memory into the cache memory 41. This is called a refill. The refill acknowledge signal is supplied from the local memory 13 and is the acknowledge signal for a refill request.

リクエスト発行制御部４６は、リフィル要求とプリロード要求の発行を制御する。具体的にはまず、ローカルメモリ１３へのリフィル要求とプリロード要求の総数をカウントする。ローカルメモリ１３からリフィルアクノリッジ信号が返ってくると、これらの要求数をカウントダウンする。これはローカルメモリ１３が受け付けることの出来るリクエスト数に上限があるからである。またプリロードとリフィルとでは、優先度はリフィルの方が高い。従って、リフィル要求とプリロード要求とが同時に発行待ちとなっている場合は、リフィル要求が優先して発行される。そして適切なタイミングで、リフィル要求信号をローカルメモリ１３へ出力する。またリクエスト発行制御部４６は、ローカルメモリ１３に対して発行待ちをしているリフィル要求の有無を、リフィルレディ信号としてアドレス保持部５０へ出力する。更に、ローカルメモリ１３におけるリクエストキューの有無、すなわちローカルメモリ１３に対してリフィル要求及びプリロード要求を発行出来るか否かを、要求状況信号としてアドレス保持部５０へ出力する。 The request issuance control unit 46 controls the issuance of refill requests and preload requests. Specifically, first, the total number of refill requests and preload requests to the local memory 13 is counted. When a refill acknowledge signal is returned from the local memory 13, the number of requests is counted down. This is because there is an upper limit on the number of requests that the local memory 13 can accept. In addition, in the preload and the refill, the priority is higher in the refill. Therefore, when the refill request and the preload request are waiting to be issued simultaneously, the refill request is issued with priority. Then, the refill request signal is output to the local memory 13 at an appropriate timing. Further, the request issuance control unit 46 outputs the presence or absence of a refill request waiting for issuance to the local memory 13 to the address holding unit 50 as a refill ready signal. Further, the presence / absence of a request queue in the local memory 13, that is, whether or not a refill request and a preload request can be issued to the local memory 13 is output to the address holding unit 50 as a request status signal.
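The counting and priority behavior described above can be sketched as follows. This is a minimal behavioral model under stated assumptions, not the circuit of the request issuance control unit 46: the class name, method names, and the `max_outstanding` parameter are hypothetical, and pending requests are modeled as simple FIFO queues.

```python
from collections import deque


class RequestIssueController:
    """Toy model: count outstanding requests, give refills priority."""

    def __init__(self, max_outstanding: int):
        # Assumed upper limit on requests the local memory can accept.
        self.max_outstanding = max_outstanding
        self.outstanding = 0  # refill + preload requests not yet acknowledged
        self.pending_refills = deque()
        self.pending_preloads = deque()

    def request_refill(self, address: int) -> None:
        self.pending_refills.append(address)

    def request_preload(self, address: int) -> None:
        self.pending_preloads.append(address)

    def refill_ready(self) -> bool:
        """True while a refill request is waiting to be issued."""
        return bool(self.pending_refills)

    def can_issue(self) -> bool:
        """True while the local memory can still accept a request."""
        return self.outstanding < self.max_outstanding

    def issue_next(self):
        """Issue one request if possible; refill wins over preload."""
        if not self.can_issue():
            return None
        if self.pending_refills:
            request = ("refill", self.pending_refills.popleft())
        elif self.pending_preloads:
            request = ("preload", self.pending_preloads.popleft())
        else:
            return None
        self.outstanding += 1  # count up on issue
        return request

    def on_acknowledge(self) -> None:
        """Count down when an acknowledge returns from the local memory."""
        if self.outstanding > 0:
            self.outstanding -= 1
```

With a limit of one outstanding request, a refill queued after a preload is still issued first, and the preload is issued only after the acknowledge for the refill has been counted down.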

次にキャッシュアクセス制御部４４について図１７を用いて説明する。図１７はキャッシュアクセス制御部４４のブロック図であり、入出力信号を示している。図示するようにキャッシュアクセス制御部４４には、ストアデータ、キャッシュインデックスエントリ、キャッシュエントリ、ヒットエントリ番号、ロードイネーブル信号、ストアイネーブル信号、リフィルアクノリッジ信号、リフィル要求ＩＤ、リフィルデータ、ライトバックアクノリッジ信号、ライトバックＩＤ、及びキャッシュリードデータが入力される。 Next, the cache access control unit 44 will be described with reference to FIG. FIG. 17 is a block diagram of the cache access control unit 44, showing input / output signals. As shown in the figure, the cache access control unit 44 includes store data, cache index entry, cache entry, hit entry number, load enable signal, store enable signal, refill acknowledge signal, refill request ID, refill data, write back acknowledge signal, Write back ID and cache read data are input.

ストアデータはキャッシュメモリ４１にストアすべきデータであり、描画処理部２６から与えられる。ヒットエントリ番号はキャッシュ管理部４５から与えられる。そしてロード／ストア命令が発行された際、該当データがキャッシュメモリ４１にあるか否か、ある場合いずれのエントリにあるかを示す。ヒットエントリ番号については後に詳細に説明する。ロードイネーブル信号、ストアイネーブル信号はそれぞれ、キャッシュ管理部４５及び描画処理部２６から与えられ、ロード要求及びストア要求が発行された際にアサートされる。リフィルアクノリッジ信号、リフィル要求ＩＤ、リフィルデータはローカルメモリ１３から与えられる。ライトバックアクノリッジ信号、ライトバックＩＤはライトバック動作に関する信号であり、それぞれアクノリッジ信号及びＩＤを示し、ローカルメモリ１３から与えられる。ライトバックとは、キャッシュメモリ４１内のデータをローカルメモリへ書き込む動作のことであり、詳細は第２の実施形態で説明する。 The store data is data to be stored in the cache memory 41 and is supplied from the drawing processing unit 26. The hit entry number is supplied from the cache management unit 45. When a load/store instruction is issued, it indicates whether the corresponding data is in the cache memory 41 and, if so, in which entry. The hit entry number will be described in detail later. The load enable signal and the store enable signal are supplied from the cache management unit 45 and the drawing processing unit 26, respectively, and are asserted when a load request and a store request are issued. The refill acknowledge signal, the refill request ID, and the refill data are supplied from the local memory 13. The write-back acknowledge signal and the write-back ID are signals related to the write-back operation; they indicate the acknowledge signal and the ID, respectively, and are supplied from the local memory 13. Write-back is the operation of writing data in the cache memory 41 to the local memory; details will be described in the second embodiment.

またキャッシュアクセス制御部４４は、ロードイネーブル信号、ライトバックデータ、キャッシュイネーブル信号、キャッシュライトデータ、キャッシュアドレス、及びリフィルアクノリッジＩＤを出力する。ロードイネーブル信号は描画処理部２６に与えられる。ライトバックデータは、ライトバック時にローカルメモリ１３へ書き込むべきデータであり、ローカルメモリ１３へ与えられる。リフィルアクノリッジＩＤはリフィルのアクノリッジＩＤを示す信号であり、キャッシュ管理部４５へ与えられる。 The cache access control unit 44 outputs a load enable signal, write-back data, a cache enable signal, cache write data, a cache address, and a refill acknowledge ID. The load enable signal is supplied to the drawing processing unit 26. The write-back data is the data to be written to the local memory 13 at the time of write-back and is supplied to the local memory 13. The refill acknowledge ID is a signal indicating the acknowledge ID of a refill and is supplied to the cache management unit 45.

キャッシュアクセス制御部４４は、キャッシュメモリ４１へのデータの書き込み、及びキャッシュメモリ４１からのデータの読み出しを制御する。キャッシュメモリ４１へのアクセスは、ロード、ストア、リフィル、及びライトバックの４種類がある。キャッシュメモリ４１へアクセスがなされる際、キャッシュアクセス制御部４４はキャッシュイネーブル信号をアサートする。 The cache access control unit 44 controls data writing to the cache memory 41 and data reading from the cache memory 41. There are four types of access to the cache memory 41: load, store, refill, and write back. When the cache memory 41 is accessed, the cache access control unit 44 asserts a cache enable signal.

リフィルを行う場合、リフィルアクノリッジ信号がキャッシュアクセス制御部４４に到達してから一定時間後に、リフィルデータがローカルメモリ１３からキャッシュアクセス制御部４４に到達する。キャッシュアクセス制御部４４はリフィルデータを一旦保持した後、キャッシュメモリ４１へ書き込む。キャッシュメモリ４１へリフィルデータを書き込む際には、キャッシュアクセス制御部４４はキャッシュライトイネーブル信号をアサートし、キャッシュライトデータ及びキャッシュアドレスをキャッシュメモリ４１に対して出力する。更にキャッシュアクセス制御部４４は、ローカルメモリ１３からリフィルアクノリッジ信号を受け取ると、リフィルアクノリッジＩＤをキャッシュ管理部４５へ出力する。 When refilling is performed, the refill data reaches the cache access control unit 44 from the local memory 13 after a predetermined time from when the refill acknowledge signal reaches the cache access control unit 44. The cache access control unit 44 once holds the refill data and then writes it to the cache memory 41. When writing the refill data to the cache memory 41, the cache access control unit 44 asserts the cache write enable signal and outputs the cache write data and the cache address to the cache memory 41. Further, when receiving a refill acknowledge signal from the local memory 13, the cache access control unit 44 outputs a refill acknowledge ID to the cache management unit 45.

ライトバックを行う場合、キャッシュアクセス制御部４４は、キャッシュメモリ４１から読み出されたキャッシュリードデータを一旦保持した後、これをライトバックデータとしてローカルメモリ１３へ出力する。 When performing a write back, the cache access control unit 44 temporarily holds the cache read data read from the cache memory 41 and then outputs this to the local memory 13 as write back data.

ストアを行う場合、ストアイネーブル信号がアサートされると共に、描画処理部２６からストアデータが与えられる。そしてキャッシュアクセス制御部４４は、このストアデータをキャッシュメモリ４１に書き込む。 When storing, a store enable signal is asserted and store data is supplied from the drawing processing unit 26. Then, the cache access control unit 44 writes the store data in the cache memory 41.

ロードを行う場合、ロードイネーブル信号がアサートされる。そしてキャッシュアクセス制御部４４は、キャッシュメモリ４１からキャッシュリードデータを読み出す。このデータは同時に描画処理部２６にも与えられる。 When loading, the load enable signal is asserted. Then, the cache access control unit 44 reads the cache read data from the cache memory 41. This data is also given to the drawing processing unit 26 at the same time.
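The four access types handled by the cache access control unit 44 can be summarized in a small sketch. The enum, the function name, and the dictionary keys are hypothetical. The description states explicitly that the cache write enable signal is asserted for a refill; asserting it for a store as well is an assumption of this sketch, based on the store also writing to the cache memory 41.

```python
from enum import Enum


class Access(Enum):
    LOAD = "load"
    STORE = "store"
    REFILL = "refill"
    WRITE_BACK = "write_back"


# Where the transferred data ends up for each access type.
_DESTINATION = {
    Access.LOAD: "drawing_processing_unit_26",
    Access.STORE: "cache_memory_41",
    Access.REFILL: "cache_memory_41",
    Access.WRITE_BACK: "local_memory_13",
}


def cache_control_signals(access: Access) -> dict:
    """Return the control signals driven for one access to the cache."""
    # Store and refill write into the cache; load and write-back read from it.
    writes_cache = access in (Access.STORE, Access.REFILL)
    return {
        "cache_enable": True,  # asserted on every access to the cache
        "cache_write_enable": writes_cache,
        "data_destination": _DESTINATION[access],
    }
```

In this model every access asserts the cache enable signal, and only the two writing accesses assert the cache write enable signal.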

次にキャッシュ管理部４５について図１８を用いて説明する。図１８はキャッシュ管理部４５のブロック図であり、入出力信号を示している。図示するようにキャッシュ管理部４５には、ストール信号、キャッシュデータアドレス信号、ロード要求信号、ストア要求信号、エンド命令、イールド命令、サブパススタート信号、スレッドエントリ番号、フラッシュ要求信号、プリロードアドレス、プリロードスレッドＩＤ、プリロードイネーブル信号、リフィルアクノリッジ信号、ライトバックアクノリッジ信号、ライトバックアクノリッジＩＤ、リフィルアクノリッジＩＤが入力される。 Next, the cache management unit 45 will be described with reference to FIG. 18. FIG. 18 is a block diagram of the cache management unit 45 and shows its input and output signals. As shown in the figure, the cache management unit 45 receives a stall signal, a cache data address signal, a load request signal, a store request signal, an end instruction, a yield instruction, a subpath start signal, a thread entry number, a flush request signal, a preload address, a preload thread ID, a preload enable signal, a refill acknowledge signal, a write-back acknowledge signal, a write-back acknowledge ID, and a refill acknowledge ID.

ストール信号は描画処理部２６から与えられる。ストールとは、何らかの原因によって命令が実行できず、実行を待っている状態のことである。ロード要求信号、ストア要求信号は描画処理部２６から与えられる。エンド命令及びイールド命令は描画処理部２６から与えられる。サブパススタート信号はサブパスが開始されたことを示す信号であり、描画処理部２６から与えられる。フラッシュ要求信号は、キャッシュメモリ４１のフラッシュを要求するための信号であり、描画処理部２６から与えられる。 The stall signal is supplied from the drawing processing unit 26. A stall is a state in which an instruction cannot be executed for some reason and is waiting to be executed. The load request signal and the store request signal are supplied from the drawing processing unit 26. The end instruction and the yield instruction are supplied from the drawing processing unit 26. The subpath start signal indicates that a subpath has been started and is supplied from the drawing processing unit 26. The flush request signal requests a flush of the cache memory 41 and is supplied from the drawing processing unit 26.

プリロードアドレス、プリロードスレッドＩＤ、及びプリロードイネーブル信号はプリロードに関する信号であり、プリロード制御部４３のアドレス保持部５０から与えられる。 The preload address, preload thread ID, and preload enable signal are signals related to preload, and are given from the address holding unit 50 of the preload control unit 43.

またキャッシュ管理部４５には、リフィルアクノリッジ信号、及びリフィルアクノリッジＩＤが、それぞれローカルメモリ１３及びキャッシュアクセス制御部４４から与えられる。更にライトバックアクノリッジ信号及びライトバックアクノリッジＩＤが、それぞれローカルメモリ１３及びキャッシュアクセス制御部４４から与えられる。 Further, the refill acknowledge signal and the refill acknowledge ID are given to the cache management unit 45 from the local memory 13 and the cache access control unit 44, respectively. Further, the write back acknowledge signal and the write back acknowledge ID are given from the local memory 13 and the cache access control unit 44, respectively.

キャッシュ管理部４５は、キャッシュメモリ４１のヒット判定、エントリのステータス管理、リクエスト発行エントリの決定、ＬＲＦの管理、及びキャッシュメモリ４１のフラッシュ制御を行う。 The cache management unit 45 performs cache memory 41 hit determination, entry status management, request issue entry determination, LRF management, and cache memory 41 flush control.

キャッシュメモリ４１のヒット判定について説明する。例えばロード命令が発行された場合、必要なデータをキャッシュメモリ４１から描画処理部２６へロードする必要がある。この時、必要なデータがキャッシュメモリ４１に保持されていればよいが、保持されていない場合には当該データをローカルメモリからキャッシュメモリ４１へ読み出す（リフィルする）必要がある。このように、必要なデータがキャッシュメモリ４１内に保持されているか否かを判定することをヒット判定と呼ぶ。そしてヒット判定結果をヒットエントリ番号として、キャッシュアクセス制御部４４へ出力する。 The hit determination of the cache memory 41 will be described. For example, when a load instruction is issued, it is necessary to load necessary data from the cache memory 41 to the drawing processing unit 26. At this time, it is sufficient that necessary data is held in the cache memory 41. However, when the data is not held, it is necessary to read (refill) the data from the local memory to the cache memory 41. In this way, determining whether or not necessary data is held in the cache memory 41 is called hit determination. The hit determination result is output to the cache access control unit 44 as a hit entry number.

ロード／ストア命令やプリロード命令がキャッシュミスした場合（キャッシュメモリ４１に保持されていない場合）、キャッシュ管理部４５はリフィル要求イネーブル信号及びリフィルアドレスをリクエスト発行制御部４６へ出力する。 When a load / store instruction or a preload instruction causes a cache miss (when it is not held in the cache memory 41), the cache management unit 45 outputs a refill request enable signal and a refill address to the request issuance control unit 46.

またキャッシュ管理部４５は、キャッシュメモリ４１の各エントリのステータス管理を行う。そのためにキャッシュ管理部４５は、キャッシュメモリ４１の各エントリに対応して設けられ、ステータスフラグを保持するメモリ６１を備えている。ステータスフラグは、キャッシュメモリ４１において対応する各エントリの状態を示す。図１９はメモリ６１の概念図である。メモリ６１は例えばＳＲＡＭやフリップフロップ等であり、メモリ５１−０、５１−１それぞれに対応して設けられる。図１９では、メモリ５１−０、５１−１のいずれかに対応するステータスフラグのみを示している。 The cache management unit 45 manages the status of each entry in the cache memory 41. For this purpose, the cache management unit 45 includes a memory 61 provided corresponding to each entry in the cache memory 41 and holding a status flag. The status flag indicates the state of each corresponding entry in the cache memory 41. FIG. 19 is a conceptual diagram of the memory 61. The memory 61 is, for example, an SRAM or a flip-flop, and is provided corresponding to each of the memories 51-0 and 51-1. In FIG. 19, only the status flag corresponding to one of the memories 51-0 and 51-1 is shown.

図示するように、メモリ６１はメモリ５１−０、５１−１と同様にＭ個のエントリ０〜（Ｍ−１）を備えている。そして各エントリはステータスフラグとして、タグＴ（Tag）、バリッドフラグＶ（Valid flag）、及びリフィルフラグＲ（Refill flag）を保持する。タグＴは、対応するエントリに保持されるデータのアドレス信号に関する。より具体的には、図１３で説明したアドレス信号に含まれるブロックＩＤと、ピクセルシェーダユニット番号の一部に対応する。また図１４で説明したアドレス信号に含まれるスレッドＩＤに対応する。 As shown in the drawing, the memory 61 includes M entries 0 to (M−1) as in the memories 51-0 and 51-1. Each entry holds a tag T (Tag), a valid flag V (Valid flag), and a refill flag R (Refill flag) as status flags. The tag T relates to an address signal of data held in the corresponding entry. More specifically, it corresponds to a block ID included in the address signal described in FIG. 13 and a part of the pixel shader unit number. Further, it corresponds to the thread ID included in the address signal described in FIG.

バリッドフラグＶは、対応するエントリに保持されるデータが有効（バリッド）か否かを示すフラグである。エントリは、リフィル要求が発行されるとバリッドとなり、フラッシュ（flush）されるとインバリッド（invalid）となる。 The valid flag V is a flag indicating whether or not the data held in the corresponding entry is valid (valid). An entry becomes valid when a refill request is issued, and becomes invalid when flushed.

リフィルフラグＲは、リフィル要求を発行中であることを示すフラグである。リフィルフラグＲは、リフィル要求を発行してから、実際にローカルメモリからキャッシュメモリ４１へのデータ転送（これをリプレイス（replace）と呼ぶ）が完了されるまでアサートされ続ける。 The refill flag R is a flag indicating that a refill request is being issued. The refill flag R continues to be asserted after the refill request is issued until the data transfer from the local memory to the cache memory 41 is actually completed (this is called “replace”).
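The status flags T, V, and R, and the way they could be used for the hit determination, can be sketched as below. The class and function names are hypothetical. The hit vector is modeled as one bit per entry, set when the entry is valid and its tag matches, so that an all-zero vector means a miss; the hardware uses comparators and AND gates rather than a loop.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatusFlags:
    tag: int = 0          # tag T: identifies the data held in the entry
    valid: bool = False   # valid flag V: set when a refill request is issued
    refill: bool = False  # refill flag R: set while a refill is in flight


def hit_vector(flags: List[StatusFlags], tag: int) -> List[int]:
    """One bit per entry: 1 where the entry is valid and its tag matches."""
    return [1 if (f.valid and f.tag == tag) else 0 for f in flags]


def is_miss(hits: List[int]) -> bool:
    """A hit entry number with all bits zero indicates a cache miss."""
    return not any(hits)


def data_ready(flags: List[StatusFlags], hits: List[int]) -> bool:
    """Data is usable only when the hit entry is not still being refilled."""
    return any(h and not f.refill for f, h in zip(flags, hits))
```

An entry whose tag matches but whose refill flag R is still set therefore counts as a hit whose data is not yet ready.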

リクエスト発行エントリの決定とは、リフィルやプリロードを行う際に、キャッシュメモリ４１においてデータを保持させるべきエントリを決定することであり、最も古くリフィルされたエントリから順に使用される。この点について図２０を用いて説明する。発行エントリを決定するために、キャッシュ管理部４５は、各々がＭビットのエントリをＭ個有するメモリ６２を備えている。メモリ６２にＬＲＦキュー（Least Recently Filled Queue）が保持される。ＬＲＦキューは、キャッシュメモリ４１においてリフィルが行われた順序を示す。そしてメモリ６２のエントリ０〜（Ｍ−１）の各ビットは、上位ビットから順にキャッシュメモリ４１の各エントリ０〜（Ｍ−１）に対応し、メモリ６２のエントリ０〜（Ｍ−１）の順にリフィルが行われた順序が古くなっていく。従って図２０の例の場合、最近リフィルが行われたキャッシュメモリ４１のエントリは、メモリ６２のエントリ（Ｍ−１）に示されるようにエントリ３であり、次にエントリ１、エントリ５、…である。キャッシュ管理部４５は、図１９に示したステータスフラグに基づいて、リクエスト発行可能エントリ信号を生成する。リクエスト発行可能エントリ信号は、現在リクエスト発行可能なエントリがいずれであるかを示す信号である。そして、上位ビットから順に、キャッシュメモリ４１のエントリ０〜（Ｍ−１）に対応する。従って図２０の例であると、キャッシュメモリ４１のエントリ１、２、３がリクエスト発行可能であると分かる。 Determining the request issue entry means determining which entry in the cache memory 41 is to hold the data when a refill or preload is performed; entries are used in order starting from the one that was refilled longest ago. This will be described with reference to FIG. 20. To determine the issue entry, the cache management unit 45 includes a memory 62 having M entries of M bits each. The memory 62 holds an LRF queue (Least Recently Filled Queue). The LRF queue indicates the order in which refills were performed in the cache memory 41. The bits of each of entries 0 to (M−1) of the memory 62 correspond, in order from the most significant bit, to entries 0 to (M−1) of the cache memory 41, and the refill order becomes older in the order of entries 0 to (M−1) of the memory 62. Accordingly, in the example of FIG. 20, the most recently refilled entry of the cache memory 41 is entry 3, as indicated by entry (M−1) of the memory 62, followed by entry 1, entry 5, and so on. The cache management unit 45 generates a request issuable entry signal based on the status flags shown in FIG. 19. The request issuable entry signal indicates which entries can currently have a request issued. Its bits correspond, in order from the most significant bit, to entries 0 to (M−1) of the cache memory 41. Accordingly, in the example of FIG. 20, it can be seen that entries 1, 2, and 3 of the cache memory 41 are request-issuable.

そしてキャッシュ管理部４５は、ＬＲＦキューとリクエスト発行可能エントリ信号との論理積演算を行う。（Ｍ−１）個のＬＲＦキューとリクエスト発行可能エントリ信号との論理演算結果を順に並べることで、リクエスト発行キュー信号が得られる。リクエスト発行キュー信号は、ＬＲＦキューのいずれのエントリに基づいて発行エントリを決定すればよいかを示しており、上位ビットから順に、メモリ６２のエントリ０〜（Ｍ−１）に対応している。従って図２０の例であると、メモリ６２のエントリ３、６、（Ｍ−１）に保持されたＬＲＦキューに基づいて決めれば良いことが分かる。すると、キャッシュメモリ４１において発行可能なエントリはエントリ１、２、３であるところ、これらのうちで最も昔にリフィルが行われたキャッシュメモリ４１のエントリはエントリ２であることがＬＲＦキューから分かる。従って、キャッシュメモリ４１においてリクエスト発行エントリはエントリ２と決定される。これを示しているのがリクエスト発行エントリ信号である。この信号も、上位ビットから順にキャッシュメモリ４１のエントリ０〜（Ｍ−１）に対応しており、“１”とされたビットに対応するエントリがリクエスト発行エントリである。なお図２０に示す回路は、キャッシュメモリ４１に含まれるメモリ５１−０、５１−１毎に設けられている。 Then, the cache management unit 45 performs an AND operation between the LRF queue and the request issuable entry signal. By arranging the logical operation results of (M−1) LRF queues and the request issuable entry signal in order, a request issuance queue signal is obtained. The request issue queue signal indicates which entry in the LRF queue should determine the issue entry, and corresponds to entries 0 to (M−1) of the memory 62 in order from the upper bit. Therefore, in the example of FIG. 20, it can be understood that the determination may be made based on the LRF queue held in the entries 3, 6, and (M-1) of the memory 62. Then, the entries that can be issued in the cache memory 41 are entries 1, 2, and 3. It can be seen from the LRF queue that the entry of the cache memory 41 that has been refilled earliest among these is the entry 2. Therefore, the request issue entry is determined as entry 2 in the cache memory 41. This is a request issue entry signal. This signal also corresponds to entries 0 to (M−1) in the cache memory 41 in order from the upper bit, and the entry corresponding to the bit set to “1” is the request issue entry. The circuit shown in FIG. 20 is provided for each of the memories 51-0 and 51-1 included in the cache memory 41.
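The selection of the request issue entry by ANDing the LRF queue with the request issuable entry signal can be sketched as follows. This is an illustrative model only: the function names are hypothetical, bit vectors are represented as Python lists indexed by cache entry number, each LRF queue row is assumed to be one-hot, and the rows are assumed to be supplied from least recently filled to most recently filled. The queue contents in the usage note are invented for illustration and are not the contents of FIG. 20.

```python
from typing import List, Optional


def one_hot(entry: int, m: int) -> List[int]:
    """M-bit row with a single 1 at the position of the given cache entry."""
    return [1 if i == entry else 0 for i in range(m)]


def select_request_entry(lrf_queue: List[List[int]],
                         issuable: List[int]) -> Optional[int]:
    """Pick the least recently filled entry that can accept a request.

    lrf_queue: one-hot rows, assumed ordered oldest refill first.
    issuable:  request issuable entry signal, one bit per cache entry.
    """
    # Request issue queue signal: bit k is set when row k, ANDed with the
    # issuable mask, is non-zero, i.e. row k names an issuable entry.
    issue_queue = [int(any(r and e for r, e in zip(row, issuable)))
                   for row in lrf_queue]
    for row, bit in zip(lrf_queue, issue_queue):
        if bit:
            return row.index(1)  # cache entry named by the oldest such row
    return None  # no entry can accept a request: the caller must wait
```

As a usage example with M = 6, suppose the refill order from oldest to newest is entries 0, 4, 2, 5, 1, 3 and entries 1, 2, and 3 are issuable. The rows naming entries 2, 1, and 3 survive the AND, and the oldest of them names entry 2, which becomes the request issue entry.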

次に図１０におけるプリロード制御部４３について説明する。プリロードアドレス発生部４７は、プリロード時のアドレス信号を生成する。プリロード保持部４８は、プリロード要求のなされたスレッドの管理を行う。サブパス情報管理部４９は、サブパスでアクセスしたバッファに関する情報を記憶する。アドレス保持部５０は、プリロードアドレス発生部４７で生成されたアドレス信号を保持する。上記構成において、生成されたプリロードアドレスがキャッシュ管理部４５に与えられる。プリロードに関しては第３の実施形態で詳細を説明する。 Next, the preload control unit 43 in FIG. 10 will be described. The preload address generator 47 generates an address signal at the time of preload. The preload holding unit 48 manages threads for which a preload request has been made. The subpath information management unit 49 stores information regarding the buffer accessed in the subpath. The address holding unit 50 holds the address signal generated by the preload address generation unit 47. In the above configuration, the generated preload address is given to the cache management unit 45. Details of the preload will be described in the third embodiment.

次に、上記データ制御部２７の動作について説明する。データ制御部２７は、キャッシュメモリ４１、ローカルメモリ１３、及び描画処理部２６との間のデータの授受を管理する。これらの間のデータの授受は、図２１に示すようにプリロード、ロード／ストア、リフィル、及びライトバックの４種類がある。本実施形態ではロード／ストア及びリフィルについて説明する。 Next, the operation of the data control unit 27 will be described. The data control unit 27 manages data exchange between the cache memory 41, the local memory 13, and the drawing processing unit 26. As shown in FIG. 21, there are four types of data exchange between these: preload, load / store, refill, and write back. In this embodiment, load / store and refill will be described.

まず、ロード／ストア命令が発行された際のロード動作について図２２を用いて説明する。図２２はピクセルシェーダユニットのブロック図である。ロードは、キャッシュメモリ４１から描画処理部２６へデータを転送する動作である。 First, a load operation when a load / store instruction is issued will be described with reference to FIG. FIG. 22 is a block diagram of the pixel shader unit. The load is an operation of transferring data from the cache memory 41 to the drawing processing unit 26.

まず描画処理部２６からロード要求信号がキャッシュ管理部４５に与えられる。またアドレス発生部４０は、図１３、図１４で説明した方法によりアドレスを生成し、キャッシュデータアドレス信号をキャッシュ管理部４５に与え、キャッシュインデックスエントリ信号及びキャッシュエントリ信号をキャッシュアクセス制御部４４に与える。するとキャッシュ管理部４５はヒット判定を行い、ヒットエントリ番号をキャッシュアクセス制御部４４に与え、またロードイネーブル信号をキャッシュアクセス制御部４４に与える。 First, a load request signal is supplied from the drawing processing unit 26 to the cache management unit 45. The address generation unit 40 generates an address by the method described with reference to FIGS. 13 and 14, supplies a cache data address signal to the cache management unit 45, and supplies a cache index entry signal and a cache entry signal to the cache access control unit 44. The cache management unit 45 then performs the hit determination, and supplies the hit entry number and a load enable signal to the cache access control unit 44.

そしてキャッシュアクセス制御部４４が、キャッシュイネーブル信号を発生してキャッシュメモリ４１をイネーブルにする。更に、キャッシュメモリ４１における、キャッシュインデックスエントリ信号及びキャッシュエントリ信号に対応したアドレスにアクセスし、キャッシュメモリ４１からデータを読み出す。またキャッシュアクセス制御部４４は、ロードイネーブル信号を描画処理部２６に返す。キャッシュメモリ４１から読み出されたキャッシュリードデータは描画処理部２６へ転送される。 Then, the cache access control unit 44 generates a cache enable signal to enable the cache memory 41. It further accesses the address in the cache memory 41 that corresponds to the cache index entry signal and the cache entry signal, and reads data from the cache memory 41. The cache access control unit 44 also returns a load enable signal to the drawing processing unit 26. The cache read data read from the cache memory 41 is transferred to the drawing processing unit 26.
以上のようにして、キャッシュメモリ４１内のデータ（キャッシュリードデータ）が描画処理部２６へロードされる。 As described above, the data (cache read data) in the cache memory 41 is loaded into the drawing processing unit 26.

次にストア動作について図２３を用いて説明する。図２３はピクセルシェーダユニットのブロック図である。ストアは、描画処理部２６で処理したデータをキャッシュメモリ４１に保持させる動作である。 Next, the store operation will be described with reference to FIG. FIG. 23 is a block diagram of the pixel shader unit. The store is an operation for causing the cache memory 41 to hold the data processed by the drawing processing unit 26.

まず描画処理部２６からストア要求信号がキャッシュ管理部４５に与えられる。またアドレス発生部４０はアドレスを生成し、キャッシュインデックスエントリ信号及びキャッシュエントリ信号をキャッシュアクセス制御部４４に与える。更に描画処理部２６からキャッシュアクセス制御部４４へ、ストアイネーブル信号及びストアデータが与えられる。 First, a store request signal is given from the drawing processing unit 26 to the cache management unit 45. The address generator 40 generates an address, and provides the cache index entry signal and the cache entry signal to the cache access controller 44. Further, a store enable signal and store data are given from the drawing processing unit 26 to the cache access control unit 44.

そしてキャッシュアクセス制御部４４が、キャッシュイネーブル信号を発生してキャッシュメモリ４１をイネーブルにする。更にキャッシュアクセス制御部４４は、キャッシュメモリ４１にストアデータをキャッシュライトデータとして与える。またキャッシュアクセス制御部４４は、キャッシュインデックスエントリ信号及びキャッシュエントリ信号によって示されるアドレスを、キャッシュアドレスとしてキャッシュメモリ４１に与える。これにより、キャッシュメモリ４１においてキャッシュアドレスに対応するエントリに、ストアデータが書き込まれる。 Then, the cache access control unit 44 generates a cache enable signal to enable the cache memory 41. The cache access control unit 44 further supplies the store data to the cache memory 41 as cache write data. The cache access control unit 44 also supplies the address indicated by the cache index entry signal and the cache entry signal to the cache memory 41 as the cache address. As a result, the store data is written to the entry in the cache memory 41 that corresponds to the cache address.
以上のようにして、描画処理部２６内のデータがキャッシュメモリ４１にストアされる。 As described above, the data in the drawing processing unit 26 is stored in the cache memory 41.

次にリフィル動作について図２４を用いて説明する。図２４はピクセルシェーダユニットのブロック図である。リフィルは、描画処理部２６からキャッシュメモリ４１に対して要求されたデータがキャッシュメモリ４１に存在しない場合に、該データをローカルメモリからキャッシュメモリ４１に読み出す動作である。 Next, the refill operation will be described with reference to FIG. 24. FIG. 24 is a block diagram of the pixel shader unit. A refill is the operation of reading data from the local memory into the cache memory 41 when the data that the drawing processing unit 26 requested from the cache memory 41 does not exist in the cache memory 41.

まず、キャッシュ管理部４５においてヒット判定がミスした場合、換言すれば、ヒットエントリ番号が全ビットゼロであった場合、すなわち必要なデータがキャッシュメモリ４１に無かった場合、キャッシュ管理部４５はリフィル要求イネーブル信号、リフィルアドレス、及びリフィル要求ＩＤをリクエスト発行制御部４６へ出力する。これらの信号を受けて、リクエスト発行制御部４６はリクエスト数をカウントアップする。またリクエスト発行制御部４６は、ローカルメモリ１３に対してリフィルをリクエストする（リフィル要求信号を出力する）。 First, when the hit determination in the cache management unit 45 results in a miss, in other words, when all bits of the hit entry number are zero, that is, when the necessary data is not in the cache memory 41, the cache management unit 45 outputs the refill request enable signal, the refill address, and the refill request ID to the request issuance control unit 46. Upon receiving these signals, the request issuance control unit 46 counts up the number of requests. The request issuance control unit 46 also requests a refill from the local memory 13 (outputs a refill request signal).

リフィル要求を受けたローカルメモリ１３は、キャッシュ管理部４５、キャッシュアクセス制御部４４、及びリクエスト発行制御部４６に対して、リフィルアクノリッジ信号を出力する。リフィルアクノリッジ信号を受けたキャッシュアクセス制御部４４は、キャッシュ管理部４５に対してリフィルアクノリッジＩＤを出力する。これによりキャッシュ管理部４５は、リフィル要求が確かに受け取られたことを認識する。リフィルアクノリッジ信号が出力された後、ローカルメモリ１３からキャッシュアクセス制御部４４に対してリフィルデータが出力される。するとキャッシュアクセス制御部４４はストア動作と同じ要領により、リフィルデータをキャッシュメモリ４１にリプレイスする。但し、リフィルに使用されるエントリは、図２０で説明したＬＲＦキューによって決定される。 The local memory 13 that has received the refill request outputs a refill acknowledge signal to the cache management unit 45, the cache access control unit 44, and the request issuance control unit 46. Upon receiving the refill acknowledge signal, the cache access control unit 44 outputs a refill acknowledge ID to the cache management unit 45. The cache management unit 45 thereby recognizes that the refill request has indeed been received. After the refill acknowledge signal is output, the refill data is output from the local memory 13 to the cache access control unit 44. The cache access control unit 44 then replaces the refill data into the cache memory 41 in the same manner as in the store operation. However, the entry used for the refill is determined by the LRF queue described with reference to FIG. 20.
以上のようにして、ローカルメモリ１３からデータがキャッシュメモリ４１にリフィルされる。 As described above, data is refilled from the local memory 13 into the cache memory 41.

上記のように、ロード／ストア命令が発行されると、キャッシュ管理部４５がヒット判定を行ってキャッシュメモリ４１のエントリをチェックする。ヒット判定がヒットした場合にはロード／ストア動作を行い、ミスした場合にはリフィルを行う。リフィルを行うエントリはＬＲＦキューによって決定される。ミスした場合であっても、例えばローカルメモリ１３のリクエストキューがフル（full）の場合や、キャッシュメモリ４１に空きエントリが無い場合にはリフィル要求を発行することが出来ず、「待ち」の状態となる。従って、ロード／ストア命令が発行された場合、データ制御部２７には、図２５に示すように３つの状態を取り得る。図２５はデータ制御部２７の状態遷移図である。 As described above, when a load/store instruction is issued, the cache management unit 45 performs the hit determination and checks the entries of the cache memory 41. On a hit, the load/store operation is performed; on a miss, a refill is performed. The entry to be refilled is determined by the LRF queue. Even on a miss, a refill request cannot be issued if, for example, the request queue of the local memory 13 is full or the cache memory 41 has no free entry, and the data control unit 27 enters a "waiting" state. Therefore, when a load/store instruction is issued, the data control unit 27 can take three states, as shown in FIG. 25. FIG. 25 is a state transition diagram of the data control unit 27.

図示するように、データ制御部２７は、「実行状態（Ｅｘｅｃ）」、「待ち状態（Ｗａｉｔ）」、及び「フィル状態（Ｆｉｌｌ）」の３つの状態を取る。実行状態は、ヒット判定の結果ロード／ストア命令がヒットした場合であり、ピクセルシェーダユニットが動作している状態である。待ち状態は、ヒット判定の結果ロード／ストア命令がミスした場合であり、リフィル要求を発行しようとしている状態である。そしてこの状態ではピクセルシェーダユニットはストールしている。フィル状態は、ローカルメモリ１３に対してリフィル要求が発行されている状態である。この状態でもピクセルシェーダユニットはストールしている。 As shown in the figure, the data control unit 27 takes three states of “execution state (Exec)”, “waiting state (Wait)”, and “fill state (Fill)”. The execution state is when the load / store instruction hits as a result of the hit determination, and is a state where the pixel shader unit is operating. The wait state is a state in which a load / store instruction misses as a result of hit determination, and is a state in which a refill request is about to be issued. In this state, the pixel shader unit is stalled. The fill state is a state in which a refill request has been issued to the local memory 13. Even in this state, the pixel shader unit is stalled.

上記３つの状態が変化するトリガは下記の通りである。各番号は、図２５に記した状態遷移の番号に一致する。 The triggers that change the above three states are as follows. Each number corresponds to the number of the state transition shown in FIG. 25.
１．実行状態から遷移しない：ロード／ストア命令がヒット
２．実行状態から待ち状態へ：ロード／ストア命令がミス
３．待ち状態からフィル状態へ：リフィル要求が発行される
４．フィル状態から実行状態へ：リフィルアクノリッジ信号が返される
５．待ち状態から遷移しない：ロード／ストア命令がミスしたが、リフィル要求を発行できない
６．フィル状態から遷移しない：リフィルアクノリッジ信号が返されない。
1. No transition from the execution state: the load/store instruction hits.
2. From the execution state to the wait state: the load/store instruction misses.
3. From the wait state to the fill state: a refill request is issued.
4. From the fill state to the execution state: a refill acknowledge signal is returned.
5. No transition from the wait state: the load/store instruction missed, but a refill request cannot be issued.
6. No transition from the fill state: no refill acknowledge signal is returned.
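The six transitions listed above can be written as a small transition function. This is a minimal sketch: the enum, the function name, and the keyword arguments are hypothetical, and each call is assumed to represent one evaluation of the state while a load/store instruction is outstanding.

```python
from enum import Enum


class State(Enum):
    EXEC = "Exec"
    WAIT = "Wait"
    FILL = "Fill"


def next_state(state: State, *, hit: bool = False,
               can_issue_refill: bool = False,
               refill_ack: bool = False) -> State:
    """Apply transitions 1 to 6 of the state transition diagram."""
    if state is State.EXEC:
        # 1: hit, stay in Exec.  2: miss, go to Wait.
        return State.EXEC if hit else State.WAIT
    if state is State.WAIT:
        # 3: refill request issued, go to Fill.  5: cannot issue, stay.
        return State.FILL if can_issue_refill else State.WAIT
    # state is State.FILL
    # 4: acknowledge returned, go to Exec.  6: no acknowledge, stay.
    return State.EXEC if refill_ack else State.FILL
```

A miss therefore always passes through the wait state and the fill state before execution resumes, and the pixel shader unit is stalled in both of those states.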

次にロード／ストア命令が発行された際の動作の詳細について、図２６及び図２７を用いて説明する。図２６はデータ制御部２７の動作のフローチャートであり、図２７は各種信号のタイミングチャートである。 Next, details of the operation when a load / store instruction is issued will be described with reference to FIGS. FIG. 26 is a flowchart of the operation of the data control unit 27, and FIG. 27 is a timing chart of various signals.

まずロードストア命令が描画処理部２６から発行される（ステップＳ１０）。すなわち図２７の時刻ｔ０においてロード要求信号が発行される。 First, a load store instruction is issued from the drawing processing unit 26 (step S10). That is, a load request signal is issued at time t0 in FIG.

すると、ロード要求信号に応答してキャッシュ管理部４５がヒット判定を行う（ステップＳ１１）。より具体的には、要求されたアドレスと、ステータスフラグ内のタグＴとを比較する。 Then, the cache management unit 45 makes a hit determination in response to the load request signal (step S11). More specifically, the requested address is compared with the tag T in the status flag.

タグとアドレスとが一致すると（ステップＳ１２）、次にキャッシュ管理部４５はステータスフラグ内のリフィルフラグＲをチェックする（ステップＳ１３）。リフィルフラグＲが“０”の場合（ステップＳ１４）、当該エントリについてのリプレイスは完了しているから、そのデータを用いてロード／ストア命令を実行する（ステップＳ１５）。 When the tag matches the address (step S12), the cache management unit 45 next checks the refill flag R in the status flag (step S13). When the refill flag R is “0” (step S14), since the replacement for the entry has been completed, a load / store instruction is executed using the data (step S15).

ステップＳ１２でアドレスとタグＴとが不一致だった場合、すなわちロード／ストア命令がミスした場合、リフィル要求発行可能なエントリがあるか否かをチェックする（ステップＳ１６）。リフィル要求発行可能なエントリがある場合、キャッシュ管理部４５はリフィル要求（リフィル要求イネーブル信号、時刻ｔ２）を発行する（ステップＳ１８）。またリクエスト発行制御部４６もリフィル要求信号をローカルメモリ１３に対して出力する。 If the address does not match the tag T in step S12, that is, if the load / store instruction misses, it is checked whether there is an entry that can issue a refill request (step S16). If there is an entry that can issue a refill request, the cache management unit 45 issues a refill request (refill request enable signal, time t2) (step S18). The request issuance control unit 46 also outputs a refill request signal to the local memory 13.

次のサイクルでキャッシュ管理部４５は、対応するエントリのステータスフラグ内のタグＴをリフィルデータに関する情報に書き換えると共に、リフィルフラグＲを“１”とする（ステップＳ１９、時刻ｔ２）。そして、このロード／ストア命令はストール（stall）する（ステップＳ２０）。ストールは、ローカルメモリ１３からリフィルアクノリッジ信号が返ってくるまで続く。ストールした状態で、再度ロード／ストア命令が発行される（ステップＳ２１）。すると、ヒット判定（ステップＳ１１）ではアドレスとタグＴとが一致するので（ステップＳ１２）、次にリフィルフラグＲをチェックする（ステップＳ１４）。ローカルメモリ１３からリフィルアクノリッジ信号が返ってきていればリフィルフラグＲは“０”となる。従ってステップＳ１５に進む。しかしローカルメモリ１３からリフィルアクノリッジ信号が返ってきていなければリフィルフラグＲは“１”のままなので、ステップＳ２０に進んでストールが継続される。 In the next cycle, the cache management unit 45 rewrites the tag T in the status flag of the corresponding entry with information related to the refill data, and sets the refill flag R to “1” (step S19, time t2). This load / store instruction is stalled (step S20). The stall continues until a refill acknowledge signal is returned from the local memory 13. In a stalled state, a load / store instruction is issued again (step S21). Then, since the address and the tag T match in the hit determination (step S11) (step S12), the refill flag R is checked next (step S14). If the refill acknowledge signal is returned from the local memory 13, the refill flag R is "0". Accordingly, the process proceeds to step S15. However, if the refill acknowledge signal is not returned from the local memory 13, the refill flag R remains "1", so the process proceeds to step S20 and the stall is continued.

If no entry can accept a refill request in step S17, the stall continues until an entry becomes available (step S22), and the load/store instruction is issued again (step S23). As the stall continues, one of the entries eventually becomes able to accept a refill request, and a refill request is issued for that entry (step S18).
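The flow of steps S11 to S23 can be sketched in software as follows. This is only an illustrative model, not the circuit of the embodiment: the names `Entry`, `CacheManager`, `access`, and `refill_acknowledge` are hypothetical, and two points are assumptions made for the sketch, namely that an entry can accept a refill request whenever it is not already being refilled, and that the valid flag V is set when the tag T is rewritten.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    tag: int = -1          # tag T
    valid: bool = False    # valid flag V
    refill: bool = False   # refill flag R: True while replacement is pending


class CacheManager:
    """Hypothetical software model of the load/store flow (steps S11 to S23)."""

    def __init__(self, num_entries):
        self.entries = [Entry() for _ in range(num_entries)]

    def access(self, tag):
        """Return "execute" or "stall" for one issue of a load/store."""
        for e in self.entries:                      # S11, S12: hit determination
            if e.valid and e.tag == tag:
                # S13, S14: check R; stall (S20) or execute (S15)
                return "stall" if e.refill else "execute"
        victim = self._refillable_entry()           # S16, S17
        if victim is None:
            return "stall"                          # S22: wait for an entry
        # S18, S19: issue the refill request, rewrite T at once, set R = 1.
        # Setting V here is an assumption of this sketch.
        victim.tag, victim.valid, victim.refill = tag, True, True
        return "stall"                              # S20

    def _refillable_entry(self):
        # Assumption: any entry not currently being refilled may be reused.
        for e in self.entries:
            if not e.refill:
                return e
        return None

    def refill_acknowledge(self, tag):
        """Model the refill acknowledge from local memory: R returns to 0."""
        for e in self.entries:
            if e.tag == tag and e.refill:
                e.refill = False
```

In this model the re-issued instruction needs no separate record of the outstanding miss: it hits on the rewritten tag and simply waits on R.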

Next, the configuration used for hit determination in the cache management unit 45, and the method, are described with reference to FIG. 28. FIG. 28 is a block diagram of part of the cache management unit 45 and of the cache memory 41.

As illustrated, in addition to the memory 61, the cache management unit 45 includes a selection circuit 65, a comparison circuit 66, and an AND gate 67 provided for each of the memories 53-0 to 53-(M−1). The cache memory 41 includes selection circuits 68 and 69 and a memory 70.

For hit determination, a cache data address signal is input to the cache management unit 45. In the frame buffer mode, the cache data address signal contains a block ID, offset data, and a pixel shader unit number; the block ID and the pixel shader unit number serve as the tag information for the target data, and the offset data serves as the index information. In the memory register mode, the cache data address signal contains a thread ID and offset data; the thread ID serves as the tag information, and the offset data serves as the index information. The index information is a signal indicating which of the memories 51-0 and 51-1 is to be accessed.

First, each selection circuit 65 selects one of the memories 51-0 and 51-1 in the cache memory 41 based on the index information in the address signal. Next, each comparison circuit 66 compares the tag T corresponding to its memory 53-0 to 53-(M−1), that is, entry 0 to (M−1), in the memory 51-0 or 51-1 selected by the corresponding selection circuit 65, with the tag information obtained from the cache data address signal. The comparison circuit 66 outputs "1" on a match and "0" on a mismatch. Each AND gate 67 then ANDs the output of its comparison circuit 66 with the valid flag V corresponding to the same memory 53-0 to 53-(M−1) in the selected memory 51-0 or 51-1. The results of these AND operations form the hit entry number signal. A bit of "1" in the hit entry number means that the memory 53-0 to 53-(M−1) corresponding to that bit holds the requested data.

The selection circuit 68 selects one of the memories 0 to (M−1), that is, one of the entries 0 to (M−1), based on the hit entry number. For example, a hit entry number of (10000…) means that entry 0 holds the requested data, so entry 0 is selected. As described earlier, in this embodiment the cache memory 41 exchanges data with the outside in units of sub-entries. The selection circuit 69 therefore selects one of the L sub-entries 0 to (L−1) contained in the entry selected by the selection circuit 68, based on the cache entry signal. As described earlier, the cache entry signal contains the quad ID and offset data, and serves as entry information indicating which sub-entry 0 to (L−1) is to be accessed within each entry 0 to (M−1). The data of the one sub-entry selected by the selection circuit 69 becomes the cache read data.
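The hit determination and read path described for FIG. 28 can be illustrated with a small software sketch. The function names and the list-based data layout below are hypothetical and chosen only for illustration; the comments indicate which circuit each step stands in for.

```python
def hit_entry_number(tags, valids, tag_info):
    """One bit per entry: comparator output (tag match) ANDed with valid flag V."""
    return [int(t == tag_info) & int(v) for t, v in zip(tags, valids)]


def read_cache(memories, status, index, tag_info, sub_entry):
    """Return the cache read data, or None on a miss.

    memories[index][entry][sub_entry] holds the data (illustrative layout);
    status[index] is a (tags, valids) pair for the memory chosen by the index.
    """
    tags, valids = status[index]                       # selection circuit 65
    hits = hit_entry_number(tags, valids, tag_info)    # comparators 66, AND gates 67
    if 1 not in hits:
        return None                                    # miss
    entry = hits.index(1)                              # selection circuit 68
    return memories[index][entry][sub_entry]           # selection circuit 69
```

Here list position 0 corresponds to entry 0, so a result of `[1, 0, 0, ...]` plays the role of the hit entry number (10000…) in the text.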

As described above, the graphic processor according to the first embodiment of the invention provides the following effect (1).
(1) Hardware in the graphic processor can be reduced (part 1).

According to this embodiment, the cache management unit 45 holds the refill flag R and the tag T as status flags. When a load/store instruction misses in the hit determination, the cache management unit 45 first issues a refill request and, at the same time, rewrites the tag T. At this point the replacement has not yet started, so the information indicated by the tag T does not match the data in the corresponding entry of the cache memory 41. The cache management unit 45 therefore uses the refill flag R to track whether the two match. As a result, the hardware of the graphic processor can be reduced, and so can the manufacturing cost. This point is explained in detail below.

FIG. 29 is a block diagram showing a conceivable configuration of the cache management unit 45 when the refill flag R is not used. In addition to the configuration of this embodiment, this cache management unit 45 includes a load/store miss queue 71 and a comparator 72. The load/store miss queue 71 holds load/store instructions whose replacement has not completed.

In FIG. 29, when a load/store instruction is issued, hit determination is performed first: the comparator 66 compares the input address with the tag T. If they do not match, the comparator 72 further compares the input address with the contents of the load/store miss queue 71. If the comparator 72 also finds no match, the load/store instruction is held in the load/store miss queue 71 and a refill request is issued. In other words, a refill request is issued only when both comparators 66 and 72 report a miss. The tag T is rewritten only once the refill request has been issued and the replacement has completed. Consequently, the information indicated by the tag T always matches the data in the cache memory 41.

In contrast, FIG. 30 shows a simplified configuration of the cache management unit 45 according to this embodiment. In this embodiment, if the address and the tag T do not match in the comparator 66, a refill request is issued, the tag T is rewritten at that point, and the refill flag R is set to "1". The replacement is then carried out at some later time. When the replacement completes, the refill flag R returns to "0". If the address and the tag T match in the comparator 66, the refill flag R is checked to determine whether the entry is being replaced. If the replacement has not completed, the load/store instruction is stalled; if it has completed, the load/store instruction is executed.

Because the tag T is rewritten as soon as the refill request is issued, the load/store miss queue 71 of FIG. 29 is unnecessary. In addition, whether the replacement has completed is tracked by the refill flag R, so the comparator 72 of FIG. 29 is also unnecessary. As a result, hardware can be reduced compared with the configuration shown in FIG. 29, and the manufacturing cost can be reduced.
In this embodiment, as shown in FIG. 27, a load/store instruction is issued only once every two cycles. This is what makes it possible to rewrite the tag T together with the refill request: as shown in FIG. 27, the tag is read and the hit determination is made in the first cycle, and the tag must be rewritten in the following cycle.

As noted earlier, the method of calculating the address signal is not limited to the one described in the above embodiment and may vary with the number of stamps contained in a block, the number of pixel shader units 24, and so on. Likewise, the internal structure of the address signal is not limited to that shown in FIGS. 13 and 14. As shown in FIG. 28, it is sufficient for the address signal to contain tag information, index information, and entry information. Furthermore, if the cache memory 41 contains only one of the memories 51-0 and 51-1, the index information is unnecessary, and if data can be transferred in units of the entry size of the cache memory 41, the entry information is also unnecessary; in such cases the address generation unit 40 need only generate the tag information.

The address generation unit 40 must be supplied with the information needed to generate such an address signal. This embodiment has described the case where, as shown in FIG. 12, that information consists of the offset data, XY coordinates, thread ID, quad ID, sub-pass ID, and buffer mode signal. These are only examples, however; any signals may be used as long as the tag information and the other required addresses can be generated from them. This embodiment has also described, as an example, the case where the tag T is information corresponding to part of the thread ID or the pixel shader unit number. The information used as the tag T need only be able to identify the data, and may be information other than the thread ID and the pixel shader unit number.

Next, a graphic processor according to a second embodiment of the invention is described. This embodiment concerns the write-back operation in the graphic processor described in the first embodiment.

In addition to the control described in the first embodiment, the cache management unit 45 according to this embodiment controls the write-back operation. As described with reference to FIG. 21, write-back means writing data held in the cache memory 41 to the local memory. When the drawing processing unit 26 issues a store instruction, the data is written only to the cache memory 41; that is, only the data in the cache memory 41 is updated. The data in the cache memory 41 and the data in the local memory therefore no longer match. Write-back is performed to prevent the data in the cache memory 41 from being lost in this state. In what follows, the state in which updated data is held only in the cache memory 41 is called dirty.

FIG. 31 is a conceptual diagram of the memory 61 in the cache management unit 45. In addition to the tag T, the valid flag V, and the refill flag R, the memory 61 holds a dirty flag D and a write-back flag W as status flags. The dirty flag D indicates whether the corresponding entry is dirty, that is, whether the drawing processing unit 26 has written data to the entry. It remains asserted until reading of the write-back data begins. The write-back flag W indicates whether a write-back request is outstanding for the corresponding entry. It is asserted from the time the write-back request is issued until reading of the write-back data begins.

FIG. 32 is a block diagram of the configuration used to issue write-back requests in the cache management unit 45. As illustrated, the cache management unit 45 includes a counter 73 and a selection circuit 74. The selection circuit 74 selects the dirty flag D of the entry corresponding to the count value of the counter 73.

Next, the write-back operation is described with reference to FIG. 33. FIG. 33 is a block diagram of the pixel shader unit.
First, the cache management unit 45 outputs a write-back request signal to the local memory 13. When the write-back request has been entered into the local memory 13, the local memory 13 outputs a write-back acknowledge signal to the cache management unit 45 and the cache access control unit 44, and outputs a write-back ID to the cache access control unit 44.

The cache access control unit 44 then reads data (cache read data) from the cache memory 41 based on the write-back ID. Having read the data from the cache memory 41, the cache access control unit 44 returns a write-back acknowledge ID to the cache management unit 45 and writes the read data to the local memory 13 as write-back data. The cache management unit 45 then responds to the write-back acknowledge ID by deasserting (setting to "0") the dirty flag D and the write-back flag W of the corresponding entry.

Next, the method by which the cache management unit 45 selects the entry to write back is described with reference to the flowchart of FIG. 34. First, the cache management unit 45 checks the dirty flag D of the entry corresponding to the current value of the counter 73 (step S30). If the dirty flag D is "1" (step S31), a write-back request is issued for that entry (step S32). If the dirty flag D is "0", no request is issued. If the counter value corresponds to the last entry (step S33), the counter value is reset (step S30) and processing returns to step S30. If the counter value does not correspond to the last entry (step S33), the counter 73 counts up and processing returns to step S30.
In other words, the dirty flag D is checked in turn for all entries 0 to 2(M−1) in the cache memory 41, and a write-back request is issued whenever the dirty flag D is asserted.
The other configurations and operations are the same as in the first embodiment.
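The entry selection of FIG. 34, together with the flags D and W, can be modeled roughly as follows. This is a sketch under stated assumptions rather than the circuit itself: the class and method names are hypothetical, and skipping an entry whose write-back flag W is already asserted is an assumption added here so that the model does not issue duplicate requests.

```python
class WriteBackScanner:
    """Hypothetical model of counter 73 and selection circuit 74."""

    def __init__(self, num_entries):
        self.dirty = [False] * num_entries       # dirty flag D per entry
        self.writeback = [False] * num_entries   # write-back flag W per entry
        self.counter = 0                         # counter 73

    def store(self, entry):
        """A store from the drawing processing unit makes the entry dirty."""
        self.dirty[entry] = True

    def step(self):
        """One pass of S30 to S33; returns the entry requested, or None."""
        entry = self.counter
        issued = None
        # S30, S31: check D (the W check is an assumption of this sketch)
        if self.dirty[entry] and not self.writeback[entry]:
            self.writeback[entry] = True         # S32: issue write-back request
            issued = entry
        # S33: reset after the last entry, otherwise count up
        self.counter = 0 if entry == len(self.dirty) - 1 else entry + 1
        return issued

    def acknowledge(self, entry):
        """Write-back acknowledge ID received: deassert D and W."""
        self.dirty[entry] = False
        self.writeback[entry] = False
```

Calling `step()` once per cycle visits every entry in round-robin order without any external trigger, which is the behavior the effects (2) and (3) below rely on.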

As described above, the graphic processor according to the second embodiment of the invention provides the following effects (2) and (3) in addition to effect (1) described in the first embodiment.
(2) Hardware in the graphic processor can be reduced (part 2).

In conventional write-back schemes, the write-back data was usually held temporarily in a buffer memory and then written from the buffer memory to the local memory at a suitable time. This was done to avoid the situation in which a refill request becomes necessary during a write-back but the refill cannot proceed until the write-back finishes. By saving the data in a buffer, the entry can issue a refill request even while the write-back is in progress. In addition, write-back was performed either in response to some trigger from outside the cache management unit 45 or at the same time as data was stored in the cache memory.

In this embodiment, by contrast, the cache management unit 45 holds the dirty flag D as a status flag and tracks which cache entries are dirty. It monitors the dirty flags D continuously and performs a write-back whenever some entry is dirty and a write-back request can be issued. The probability that a dirty entry exists is therefore far lower than in conventional schemes. Consequently, even while one entry is being written back, other entries that can accept a refill request are likely to exist. There is thus no need to save data in a buffer as before, and the buffer becomes unnecessary. Hardware can therefore be reduced, and the manufacturing cost lowered.

(3) The cache memory can be used efficiently (part 1).
As explained in (2) above, a write-back is performed as soon as a write-back request can be issued, even without any request from outside. The entries of the cache memory 41 can therefore be used effectively.

Furthermore, when an eDRAM (embedded DRAM) is used for the local memory 13 and its latency is long, performing write-back whenever possible, as in this embodiment, effectively reduces the number of dirty entries and improves the performance of the graphic processor.

The effect of this embodiment is particularly pronounced when the entry size of the cache memory 41 is large. The larger the entry size, the larger the buffer that the conventional scheme requires, so the area saved by eliminating the buffer is correspondingly greater.

As shown in FIG. 35, the cache management unit 45 may also receive the bus status as data from a bus control circuit 75. The bus control circuit 75 controls the bus connections between the circuit blocks. To perform a write-back, the bus between the data control unit 27 and the local memory must not be in use. The cache management unit 45 therefore receives the current bus usage status from the bus control circuit 75 and issues a write-back request when it recognizes that the bus is not in use. This improves the utilization of the bus.

Next, a graphic processor according to a third embodiment of the invention is described. This embodiment concerns the preload operation in the graphic processor described in the first and second embodiments.

The preload operation is controlled by the preload control unit 43 shown in FIG. 10. The preload control unit 43 includes a preload address generation unit 47, a preload holding unit 48, a sub-pass information management unit 49, and an address holding unit 50. The preload holding unit 48 manages the threads for which a preload has been requested. It receives preload requests from the instruction control unit 25 on a per-thread basis, and at the same time receives and holds the XY coordinates of the thread, the thread ID, and the number of the sub-pass to be executed. The preload holding unit 48 contains a memory with a plurality of entries and stacks preload requests into those entries. Preload requests are issued with priority given to the lowest-numbered entry. Once the entry whose preload request is to be issued has been determined, the preload holding unit 48 outputs a preload start signal and a preload sub-pass number to the sub-pass information management unit 49. The preload start signal indicates the start of a preload for a new thread, and the preload sub-pass number is the number of the sub-pass to be preloaded.
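As a rough illustration of how the preload holding unit 48 queues requests and issues them from the lowest-numbered entry first, consider the following sketch. The class name, method names, and tuple layout are hypothetical; the text does not specify how a full memory is handled, so returning `None` in that case is an assumption of this model.

```python
class PreloadHolder:
    """Hypothetical model of the entry memory in the preload holding unit."""

    def __init__(self, num_entries):
        self.slots = [None] * num_entries

    def request(self, xy, thread_id, subpass):
        """Store a per-thread preload request; return its entry number.

        Returns None if every entry is occupied (assumed behavior).
        """
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = (xy, thread_id, subpass)
                return i
        return None

    def issue(self):
        """Issue the pending request held in the lowest-numbered entry."""
        for i, slot in enumerate(self.slots):
            if slot is not None:
                self.slots[i] = None
                return slot
        return None
```

Note that priority follows the entry number, not arrival order: a request placed into a freed low-numbered entry is issued before an older request in a higher-numbered entry.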

Next, the sub-pass information management unit 49 is described. The sub-pass information management unit 49 both holds information on the buffers used in each sub-pass and controls the output of parameters for preloading. For buffer information management, it has an instruction table as shown in FIG. 36. Each entry of the instruction table corresponds to one sub-pass. Each time a load/store instruction is issued, the sub-pass information management unit 49 writes the mode information corresponding to that instruction into the instruction table. This information is supplied by the instruction control unit 25 as a buffer bank select signal and a buffer mode signal. These signals contain information such as whether the local memory is being used as a frame buffer or as a memory register, and the base address (start address) of the data storage area.

When a preload instruction is issued, the sub-pass information management unit 49 reads from the instruction table the information on the sub-pass designated by the preload start signal and the preload sub-pass number. It outputs the data read from the instruction table to the preload address generation unit 47 as a preload bank signal, and asserts a preload enable signal.

Next, the preload address generation unit 47 is described. The preload address generation unit 47 generates the address signal required for a preload. The address is generated in the same way as in the address generation unit 40 described in the first embodiment (see FIGS. 13 and 14). The signals needed for this address calculation (the XY coordinates for the preload, the thread ID for the preload, and the preload bank signal) are supplied continuously by the preload holding unit 48 and the sub-pass information management unit 49. In this state, when the preload enable signal is asserted, the preload address generation unit 47 responds by starting the address calculation. The resulting preload address and the preload enable signal are output to the address holding unit 50.

Next, the address holding unit 50 is described. The address holding unit 50 is a queue that holds the address associated with a preload instruction when issuance of that instruction stalls. A preload instruction stalls, and the preload enable signal is deasserted, in any of the following cases: the request queue of the local memory 13 has no free space, the cache memory 41 has no entry for which a preload request can be issued, or the request issuance control unit 46 has a refill request waiting to be issued. This information is supplied by the request issuance control unit 46 as a refill ready signal and a request status signal.
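The three stall conditions can be summarized in a small sketch of the address holding queue. The class `AddressHolder`, its method names, and the boolean parameters are hypothetical stand-ins for the refill ready signal and the request status signal; the sketch only illustrates that a held address is released when none of the stall conditions applies.

```python
from collections import deque


class AddressHolder:
    """Hypothetical model of the queue in the address holding unit."""

    def __init__(self):
        self.queue = deque()

    def push(self, address):
        """Hold a preload address computed by the address generation step."""
        self.queue.append(address)

    def try_issue(self, request_queue_full, cache_entry_available, refill_waiting):
        """Return the oldest held address, or None if the preload stalls."""
        stalled = (
            request_queue_full             # no room in the local memory queue
            or not cache_entry_available   # no cache entry can take a preload
            or refill_waiting              # a refill request is waiting to issue
        )
        if stalled or not self.queue:
            return None
        return self.queue.popleft()
```

While any condition holds, the address stays at the head of the queue, so the preload can be retried later without recomputing it.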

The address holding unit 50 also outputs to the cache management unit 45 the data needed for hit determination on the preload instruction.

Next, the preload operation of the graphic processor according to this embodiment is described with reference to FIGS. 37 and 38. FIG. 37 is a flowchart of the preload operation, and FIG. 38 is a block diagram of the data control unit 27 annotated with the corresponding steps of FIG. 37. First, the instruction control unit 25 issues a preload request to the preload holding unit 48 (step S40). Along with the preload request signal, the preload holding unit 48 receives thread information (XY coordinates, thread ID, sub-pass ID) from the instruction control unit 25 (step S41).

The preload holding unit 48 then outputs the preload start signal and the preload sub-pass number to the sub-pass information management unit 49. Based on the received preload start signal and preload sub-pass number, the sub-pass information management unit 49 reads the information on the load/store instruction from the instruction table (step S42). The information read out (the preload bank signal) is output to the preload address generation unit 47. This information on the load/store instruction was stored in the instruction table of the sub-pass information management unit 49 when the instruction control unit 25 issued the load/store instruction. The sub-pass information management unit 49 also asserts the preload enable signal, and the preload holding unit 48 outputs the thread information (XY coordinates, thread ID) to the preload address generation unit 47.

Next, the preload address generation unit 47 calculates the preload address using the information on the load/store instruction supplied by the sub-pass information management unit 49 and the thread information supplied by the preload holding unit 48 (step S43). The preload address generation unit 47 outputs the calculated preload address to the address holding unit 50. It also asserts the preload enable signal and outputs it to the address holding unit 50.

This information is then output from the address holding unit 50 to the cache management unit 45, where hit determination is performed (step S44). The hit determination in step S44 determines whether the data to be preloaded already exists in the cache memory 41. As explained for the refill operation in the first embodiment, if the preload hit determination results in a miss, the cache management unit 45 issues a preload request signal. The cache management unit 45 also issues a refill ID and a refill address and outputs them, together with the preload request signal, to the request issuance control unit 46 (step S45). When the hit determination finishes, the cache management unit 45 asserts a preload hit determination signal regardless of whether the result was a hit or a miss, and deasserts the preload information in the address holding unit 50. The preload hit determination signal indicates whether the hit determination in the cache management unit 45 has finished.

The request issuance control unit 46 then formally issues the preload request to the local memory 25 by outputting a refill request signal (step S46). After that, the data in the local memory is preloaded into the cache memory 41 in the same manner as a refill.
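The hit determination and request issuance in steps S44 to S46 can be pictured with the following minimal sketch. It is an illustration only: the function name `handle_preload`, the result fields, the set-of-lines cache model, and the 64-byte line size are all assumptions and do not appear in this description.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class PreloadResult:
    hit: bool                           # outcome of the hit determination (step S44)
    hit_determination_done: bool        # asserted whether the result is a hit or a miss
    request: Optional[Tuple[int, int]]  # (refill ID, refill address) on a miss, else None


def handle_preload(preload_address: int,
                   cached_lines: Set[int],
                   refill_id: int,
                   line_size: int = 64) -> PreloadResult:
    """Hypothetical model of steps S44 to S46 for one preload address."""
    line = preload_address // line_size   # assumed mapping from address to cache line
    hit = line in cached_lines
    # A preload request is only issued on a miss (step S45); the request
    # issuance control unit would then forward it to local memory (step S46).
    request = None if hit else (refill_id, line * line_size)
    return PreloadResult(hit=hit, hit_determination_done=True, request=request)
```

In this sketch a hit produces no request at all, which mirrors the statement that the preload request signal is issued only when the hit determination results in a miss.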

As described above, the graphic processor according to the third embodiment of the present invention provides the following effect (4) in addition to the effects (1) to (3) described in the first and second embodiments.
(4) The cache memory can be used efficiently (part 2).
In the graphic processor according to the present embodiment, the preload address is calculated using thread information and information on the load/store instruction. As thread information, the X coordinate, the Y coordinate, and the thread ID are received from the preload holding unit 48. As information on the load/store instruction, the data to be referenced in the configuration register, the offset, and the base address are received from the subpath information management unit 49. Using this information allows the preload address to be calculated more accurately than in the prior art. More specifically, the value of WIDTH is known from the information on the load/store instruction. Depending on the value of WIDTH, the block ID changes even for the same XY coordinates. The head address of the address signal is also known, as are the offset value and the memory use mode (frame buffer mode or memory register mode). The preload address generation unit 47 therefore obtains all the information needed for the address calculation formula described in the first embodiment.
Preloading is a process of reading data that the drawing processing unit 27 is expected to need from the local memory into the cache memory 41 in advance. Consequently, data that has been preloaded may in fact never be used.

In the present embodiment, however, the preload address is calculated, that is, the data to preload is chosen, using the information supplied when a load/store instruction was issued. This raises the probability that the preloaded data is actually used. In other words, the probability that the preloaded data hits during the hit determination described in the first embodiment improves. This is because an instruction sequence is used to process a plurality of threads, so once the instruction (subpath) to be executed is known, the address holding the data used by any given thread can be determined. Accordingly, when a different thread that executes the same subpath as one already executed is started, prefetching is performed based on the previously traced information. As shown in FIG. 39, calculating the preload address by the method of this embodiment therefore requires that a load/store instruction has already been issued in some thread. In FIG. 39, subpath 0 of thread 0 cannot be preloaded. When a load/store instruction is issued in subpath 0 of thread 0, the instruction table is updated at that point, so preloading becomes possible for the next thread, thread 1.
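The dependence on a previously traced load/store instruction can be sketched as follows. This is a simplified illustration under assumptions: the class name `InstructionTable`, its methods, and the choice to store only a base address and an offset per subpath are hypothetical, and the real table also carries configuration register data.

```python
from typing import Dict, Optional, Tuple


class InstructionTable:
    """Hypothetical per-subpath record of traced load/store information."""

    def __init__(self) -> None:
        self._traced: Dict[int, Tuple[int, int]] = {}  # subpath -> (base, offset)

    def record_load_store(self, subpath: int, base: int, offset: int) -> None:
        # Called when a thread actually issues a load/store in this subpath.
        self._traced[subpath] = (base, offset)

    def preload_info(self, subpath: int) -> Optional[Tuple[int, int]]:
        # None means no thread has traced this subpath yet, so no preload
        # address can be computed for it.
        return self._traced.get(subpath)
```

With an empty table, `preload_info(0)` returns `None`, which corresponds to thread 0 being unable to preload subpath 0. After `record_load_store(0, ...)` is called for thread 0, the same query succeeds for thread 1.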

Useless preload operations are therefore reduced, and at the same time entries of the cache memory 41 are kept from being occupied needlessly. The cache memory 41 can thus be used efficiently, and the performance of the graphic processor can be improved.
Next, a graphic processor according to the fourth embodiment of the present invention will be described. In this embodiment, the cache management unit 45 of the graphic processor described in the first to third embodiments additionally restricts request issuance for entries.

FIG. 40 is a conceptual diagram of the memory 61 and shows the status flags held by the cache management unit 45. As shown, the cache management unit 45 according to the present embodiment holds a lock flag L as a status flag in addition to the tag T, the valid flag V, the refill flag R, and the write-back flag W. The lock flag L is 2-bit data. L = “00” indicates that the corresponding entry of the cache memory 41 is free. In this state, the entry can issue either a preload request or a refill request. L = “01” indicates that the entry has issued a preload request. In this state, the entry can issue a refill request but not a preload request. L = “10” indicates that the executing thread is using the entry. In this state, the entry can issue neither a refill request nor a preload request.

Accordingly, when a refill request or a preload request is made, the cache management unit 45 checks the lock flag L among the status flags, as shown in FIG. 41 (step S50). If L = “00” (step S51), either request is issued (step S52). If L = “01” (step S53), a refill request can be issued, but a preload request stalls (step S54). If L = “10” (step S55), both kinds of request stall (step S56).
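The check of steps S50 to S56 amounts to a small decision function. The sketch below is one possible reading; the constant names, the function name `may_issue`, and the string request labels are assumptions made for illustration.

```python
# Lock flag encodings taken from the description; the names are hypothetical.
FREE = 0b00            # entry is free
PRELOAD_ISSUED = 0b01  # entry has issued a preload request
IN_USE = 0b10          # executing thread is using the entry


def may_issue(lock_flag: int, request: str) -> bool:
    """Return True if the request may be issued, False if it must stall."""
    if request not in ("preload", "refill"):
        raise ValueError(f"unknown request type: {request!r}")
    if lock_flag == FREE:
        return True                  # steps S51, S52: either request is issued
    if lock_flag == PRELOAD_ISSUED:
        return request == "refill"   # steps S53, S54: preload stalls
    if lock_flag == IN_USE:
        return False                 # steps S55, S56: both stall
    raise ValueError(f"undefined lock flag value: {lock_flag:#04b}")
```

The value “11” is not described in the text, so the sketch rejects it rather than guessing a behavior.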

As described above, the lock flag L, the refill flag R, and the write-back flag WB allow a cache entry to take the following eight states.
1. Initial state (Init: L = “00”, R = “0”, WB = “0”)
The entry is free and can accept both a preload request and a refill request.
2. Ready state (Rdy: L = “01”, R = “0”, WB = “0”)
Preloading has completed and the entry is waiting for a thread that uses it to be executed.
3. Execution state (Exec: L = “10”, R = “0”, WB = “0”)
The thread being executed is using the entry.
4. Non-use state (NoWake: L = “00”, R = “1”, WB = “0”)
This state arises when the corresponding thread was executed during preloading, but the subpath ended without any access to the entry.
5. Preload state (PreLd: L = “01”, R = “1”, WB = “0”)
A preload request is being issued.
6. Fill state (Fill: L = “10”, R = “1”, WB = “0”)
A refill request caused by a cache miss is being issued, or a thread that uses the entry was executed while a preload request was being issued.
7. Write-back state (WrB: L = “00” or “01”, R = “0”, WB = “1”)
A write-back request is being issued.
8. Use state (WrBExec: L = “10”, R = “0”, WB = “1”)
When an access occurs in the write-back state, or when a thread that uses the entry is executed, the entry transitions to the use state. In the use state, the executing thread was switched while the write-back request was being issued, and the entry is used by the executing thread.
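The encoding of the eight states can be written out as a lookup table. The sketch below only restates the flag combinations listed above; the dictionary name `STATES` and the function `entry_state` are hypothetical, and combinations that the text does not list are treated as errors rather than assigned a meaning.

```python
# (L, R, WB) -> state name, as listed in the description.
STATES = {
    (0b00, 0, 0): "Init",
    (0b01, 0, 0): "Rdy",
    (0b10, 0, 0): "Exec",
    (0b00, 1, 0): "NoWake",
    (0b01, 1, 0): "PreLd",
    (0b10, 1, 0): "Fill",
    (0b00, 0, 1): "WrB",      # WrB covers both L = "00" ...
    (0b01, 0, 1): "WrB",      # ... and L = "01"
    (0b10, 0, 1): "WrBExec",
}


def entry_state(lock: int, refill: int, write_back: int) -> str:
    """Map the status flags of a cache entry to its state name."""
    try:
        return STATES[(lock, refill, write_back)]
    except KeyError:
        raise ValueError("flag combination not described in the text") from None
```

Nine flag combinations map onto eight state names because the write-back state is defined for two values of L.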

Next, the transition conditions between the states will be described with reference to FIG. 42. In FIG. 42, the rows give the state before the transition and the columns give the state after the transition. The numbers in the table refer to the state change events listed below.

1. When a preload hits the entry
2. When a load/store instruction hits
3. When a preload misses and a preload request is issued
4. When a load/store instruction misses and a refill request is issued
5, 10. When execution of a write-back starts
6. When execution of a subpath starts at the same time as execution of a write-back starts
7. When a preload was performed for the executing thread, but the subpath ended without any load/store access
8. When a thread that uses a preloaded entry starts executing, or a load/store instruction hits
9. When a preloaded entry is refilled because of a load/store instruction miss
11. When execution of a subpath starts at the same time as execution of a write-back starts, or a load/store instruction hits
12. When an end instruction or yield instruction is executed and there is no preload request from another thread
13. When an end instruction or yield instruction is executed and there is a preload request from another thread
14. When an end instruction or yield instruction and a write-back are executed at the timing immediately after subpath start
15. When a write-back starts immediately after the subpath starts
16, 22. When a preload completes
17. When a preload completes and another preload hits at the same time
18. When a preload completes and a load/store instruction hits at the same time
19. When a preload instruction hits (while a preload request is being issued)
20. When a load/store instruction hits (while a preload request is being issued)
21. When a preload was performed for the executing thread, but the subpath ended without any load/store access and the preload ended at the same time
23. When preload completion and subpath start occur at the same time
24. When a preload was performed for the executing thread, but the subpath ended without any load/store access while the preload request was still being issued
25. When a thread that uses the entry being preloaded starts executing, or a load/store instruction hits
26. When the state changed from the preload state to the fill state, but the subpath ended without any load/store access at the same time as the preload completed, and there is no preload request from another thread
27. When the state changed from the preload state to the fill state, but the subpath ended without any load/store access at the same time as the preload completed, and there is a preload request from another thread
28. When a refill completes
29. When the state changed from the preload state to the fill state, but the subpath ends without any load/store access while the preload has not yet completed, and there is no preload request from another thread
30. When the state changed from the preload state to the fill state, but the subpath ends without any load/store access while the preload has not yet completed, and there is a preload request from another thread
31. When a write-back completes with L = “00”
32. When a write-back completes with L = “01”
33. When a load/store instruction hits at the same time as a write-back ends
34. When a thread that uses the entry being written back is executed
35. When write-back completion and an end instruction or yield instruction occur at the same time, and there is no preload request from another thread
36. When write-back completion and an end instruction or yield instruction occur at the same time, and there is a preload request from another thread
37. When a write-back completes with L = “10”
38. When a subpath is ended by an end instruction or yield instruction
The cache entry changes state according to the above conditions.

As described above, the graphic processor according to the fourth embodiment of the present invention provides the following effect (5) in addition to the effects (1) to (4) described in the first to third embodiments.
(5) The cache memory can be used efficiently (part 3).
In the graphic processor according to the present embodiment, the status flags include a lock flag L having a plurality of levels, and the lock flag L restricts request issuance for entries of the cache memory 41. More specifically, the lock flag L has three levels (“00”, “01”, “10”). L = “00” means that the entry is not locked, and the entry of the cache memory 41 can freely issue a preload request or a refill request. L = “01” is a weak lock state, in which the entry of the cache memory 41 is prohibited from issuing a preload request. L = “10” is a strong lock state, in which the entry of the cache memory 41 is prohibited from issuing not only a preload request but also a refill request.

As described earlier, preloaded data is data read into the cache memory 41 ahead of the actual processing. Refilled data, by contrast, is data that a load/store instruction actually required. Data placed in the cache memory 41 by a refill is therefore more important than data read by a preload, and has a greater need for protection.

In the present embodiment, therefore, a lock flag L is provided in the status register, and an entry that has been refilled is placed in the strong lock state to prevent its data from being overwritten by a preload or a further refill. This prevents necessary data from being lost from the cache memory 41, so the cache memory 41 can be used efficiently.

Data read by a preload is also protected: unless, for example, the corresponding subpath ends, the entry is kept in the weak lock state so that the preloaded data is not overwritten. The preloaded data can therefore be used efficiently. As a result, the cache memory 41 can be used efficiently and the performance of the graphic processor can be improved.

Next, a graphic processor according to the fifth embodiment of the present invention will be described. In this embodiment, the cache management unit 45 of the graphic processor described in the first to fourth embodiments additionally holds information about the data in each entry.

FIG. 43 is a conceptual diagram of the memory 61 and shows the status flags held by the cache management unit 45. As shown, the cache management unit 45 according to the present embodiment holds a thread entry flag TE as a status flag in addition to the tag flag T, the valid flag V, the refill flag R, the write-back flag W, and the lock flag L. The thread entry flag TE indicates which threads' data the corresponding entry of the cache memory holds. The number of bits in the thread entry flag TE equals the number of threads that can be issued simultaneously.

The relationship between the thread entry flag TE and the cache memory 41 will be described with reference to FIG. 44. FIG. 44 is a conceptual diagram of the thread entry flag TE and the cache memory.

As shown, the thread entry flag TE is, for example, N bits wide. At most N threads are therefore generated at the same time. The bits correspond, from the most significant bit downward, to thread 0 through thread (N-1). For example, entry (M-1) of the cache memory 41 holds data of threads 1, 2, 4, and 6. Bits 1, 2, 4, and 6 of the thread entry flag TE corresponding to entry 1 of the cache memory 41 are therefore “1”. Entry 4 of the cache memory 41 holds no data. Accordingly, all bits of the thread entry flag TE corresponding to entry 4 of the cache memory 41 are “0”.

Next, the timing at which the thread entry flag TE is written, and the state of the entry at that time, will be described with reference to FIG. 45. When a preload instruction, refill instruction, or load/store instruction for the corresponding entry is issued (step S50), the bit of the thread entry flag TE corresponding to the thread in which the instruction is executed is set to “1” (step S51). Once a bit of the thread entry flag TE is “1”, both data replacement and flushing (erasure) are prohibited for the corresponding entry (step S52). When the corresponding thread executes an end instruction or yield instruction (step S53), its bit in the thread entry flag TE is set to “0” (step S54). If all bits of the thread entry flag TE are “0” (step S55), replacement and flushing are permitted for the corresponding entry. If even one bit of the thread entry flag TE is “1” (step S55), replacement and flushing remain prohibited.
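The set, clear, and check rules of steps S50 to S55 map naturally onto a bitmask. The following is a minimal sketch under assumptions: the class and method names are hypothetical, and thread 0 is placed at the most significant bit, following the bit ordering described for FIG. 44.

```python
class ThreadEntryFlag:
    """Hypothetical model of the thread entry flag TE for one cache entry."""

    def __init__(self, num_threads: int) -> None:
        self.num_threads = num_threads  # N: width of TE in bits
        self.bits = 0

    def _mask(self, thread_id: int) -> int:
        if not 0 <= thread_id < self.num_threads:
            raise ValueError("thread id out of range")
        # Thread 0 maps to the most significant bit of the N-bit flag.
        return 1 << (self.num_threads - 1 - thread_id)

    def mark_in_use(self, thread_id: int) -> None:
        # Preload, refill, or load/store issued for this entry (steps S50, S51).
        self.bits |= self._mask(thread_id)

    def release(self, thread_id: int) -> None:
        # Thread executed an end or yield instruction (steps S53, S54).
        self.bits &= ~self._mask(thread_id)

    def replace_allowed(self) -> bool:
        # Replacement and flushing are permitted only when every bit is 0 (step S55).
        return self.bits == 0
```

An entry shared by two threads stays protected until both have released it, which is the behavior the flag is meant to guarantee.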

As described above, the graphic processor according to the fifth embodiment of the present invention provides the following effect (6) in addition to the effects (1) to (5) described in the first to fourth embodiments.
(6) The cache memory can be used efficiently (part 4).
In the graphic processor according to the present embodiment, the thread entry flag TE restricts preload requests and refill requests for entries. Cache entries can therefore be used efficiently, and the performance of the graphic processor can be improved. This effect is described in detail below.

The cache memory 41 and the local memory 13 basically exchange data in units of the entry size of the cache memory 41, although this of course also depends on the bus size. The same applies to erasing data. Therefore, when the entry size of the cache memory 41 is large, for example because an SRAM is used for the cache memory 41, data relating to a plurality of threads is read into a single entry of the cache memory 41.

In that case, even if the subpath has finished executing for one thread in an entry, data of other threads in the same entry may still be used afterwards. That is, even if the data for one thread is no longer needed once its subpath completes, the data for another thread in the same entry may be needed later. It is therefore inefficient to erase the data of the other threads merely because one thread has ended.

In the present embodiment, therefore, the thread entry flag TE is used to prohibit data replacement and write-back (or flushing) for any entry that holds data of a thread whose subpath has not finished executing. This prevents data from being erased needlessly, so the entries of the cache memory 41 can be used efficiently and the performance of the graphic processor can be improved.

The thread entry flag TE need not be asserted only after data has actually been replaced in the entry; it may also be asserted before the replacement. That is, it may be asserted after a load/store instruction misses and a refill request is issued but before the replacement, or after a preload request is issued but before the data transfer. In this case, the thread entry flag TE reserves the entry to be used so that another thread does not overwrite it.

Next, a graphic processor according to the sixth embodiment of the present invention will be described. This embodiment relates to a data management technique for the case where a stage stalls. FIG. 46 is a circuit diagram illustrating the concept of the data management technique according to this embodiment.

As shown, assume that an instruction is executed in stages A to F in order and that stages A to F operate as a pipeline. Each stage has a flip-flop (F/F), which holds the instruction that has reached that stage. In addition, stage D is provided with buffer memories D1 and D2. When a stall occurs, the buffer memories D1 and D2 hold the data from stage C, and stage D holds the data from stage E. When the stall is resolved and processing restarts, the data in the buffer memories D1 and D2 is output to stage D.

Next, the operation of these stages will be described. First, normal operation with no stall is described with reference to FIG. 47. FIG. 47 is a table showing how the instruction executed in each stage changes over time. The instructions to be executed are assumed to be instruction 0 through instruction 7.

As shown, assume that at time t0 instructions 0 to 5 are being executed in stages F to A, respectively. In the next cycle (time t1), instructions 1 to 5 are executed in the next stages, F to B, and a new instruction 6 is input to stage A and executed. Instruction 0 finished executing in the last stage F at time t0, so its processing ends. In this way, instructions 0 to 7 are pipelined through stages A to F in order.
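The stall-free case can be reproduced with a one-line shift per cycle. In this sketch the pipeline is modeled as a list ordered from stage A to stage F; the function name `advance` and that list layout are assumptions for illustration.

```python
from typing import List, Optional, Tuple

STAGES = "ABCDEF"  # index 0 is stage A, index 5 is stage F


def advance(pipeline: List[Optional[int]],
            next_instruction: Optional[int]) -> Tuple[List[Optional[int]], Optional[int]]:
    """Advance a stall-free pipeline by one cycle.

    Returns the new pipeline contents and the instruction that left stage F.
    """
    retired = pipeline[-1]                       # instruction finishing in stage F
    return [next_instruction] + pipeline[:-1], retired


# Time t0: instructions 0 to 5 occupy stages F to A respectively.
t0 = [5, 4, 3, 2, 1, 0]
t1, finished = advance(t0, 6)
```

After the call, `t1` holds instruction 6 in stage A and instructions 5 to 1 in stages B to F, and `finished` is instruction 0, matching the transition from t0 to t1 described above.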

Next, the case where a stall occurs will be described with reference to FIG. 48. FIG. 48 is also a table showing how the instruction executed in each stage changes over time. As an example, the case where instruction 3 stalls in stage E is described here.

As shown, assume that instructions 0 to 5 are executed in stages F to A, respectively, at time t0, that instructions 1 to 6 are executed in stages F to A at time t1, and that instructions 2 to 7 are executed in stages F to A at time t2. Then assume that instruction 3 stalls in stage E at time t3. At time t3, instructions 3 to 7 would normally be executed in stages F to B. Because a stall has occurred, however, instruction 5, which was held in stage C at time t2, is sent to the buffer memory D1, and instruction 3, which was held in stage E at time t2, is fed back to stage D.

If the stall persists in the next cycle (time t4), instruction 5, which was held in the buffer memory D1 at time t3, is sent to the buffer memory D2; instruction 6, which was held in stage C at time t3, is sent to the buffer memory D1; and instruction 4, which was held in stage E at time t3, is fed back to stage D. From then until time t6, while the stall continues, instruction 5 remains in the buffer memory D2 and instruction 6 remains in the buffer memory D1. Instructions 3 and 4 loop between stage D and stage E.

When the stall is resolved at time t7, instructions 3, 4, 5, and 7, which were held at time t6 in stage E, stage D, the buffer memory D2, and stage C, respectively, are executed in stages F, E, D, and C, respectively. Instruction 6, which was held in the buffer memory D1 at time t6, is sent to the buffer memory D2 at time t7 and is then executed in stage D at time t8.

The application of the above data management technique to the graphic processor described in the first to fifth embodiments will be described with reference to FIG. 49. FIG. 49 is a circuit diagram of part of the cache management unit 45. As described with reference to FIG. 22, when a load instruction is issued, the cache management unit 45 receives a cache data address signal from the address generation unit 40. As described with reference to FIG. 38, a preload address is supplied at the time of a preload.

The cache management unit 45 operates in the second stage, as described with reference to FIG. 11. The second stage internally includes at least four operation stages, (2-1) to (2-4). In stage (2-1), the cache management unit 45 performs load/store and preload hit determination. In stage (2-2), entries for refill and preload are selected using the LRF queue. In stage (2-3), a stall signal is asserted in cases such as a cache miss or a hit on an entry that is being refilled. In stage (2-4), signals are transferred to the cache control unit.

In FIG. 49, the loop path from stage (2-2) to stage (2-1) is used when a stall occurs in stage (2-2) or stage (2-1). In this situation, the stall signal is asserted by the drawing processing unit 26.

The loop path from stage (2-4) to stage (2-3) is used when a stall occurs in stage (2-2) or stage (2-1) while a stall has already occurred in stage (2-4) or stage (2-3). In this case, therefore, both the loop path from stage (2-2) to stage (2-1) and the loop path from stage (2-4) to stage (2-3) are enabled.

The loop path from stage (2-4) to stage (2-1) is used when a stall occurs in stage (2-4) or stage (2-3). In this case the stall signal is asserted, and that signal enables the loop path from stage (2-4) to stage (2-1). In addition, if the loop path between stage (2-2) and stage (2-1) and the loop path from stage (2-4) to stage (2-3) are enabled, this loop path is also enabled at the timing when the stall signal is deasserted.

The buffer memory 80 has, for example, five entries, and holds the addresses that are input after the stall signal is asserted. This is needed because the address generation unit 40 stops issuing addresses only after the stall signal has propagated to the third stage (see FIG. 11). The buffer memory 80 is therefore used so that the addresses input during the stall period are retained.
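The role of the buffer memory 80 can be illustrated with a small behavioral model. It is a sketch under stated assumptions, not the actual circuit: `StallBuffer`, `simulate`, and the parameters `stall_at` and `stop_latency` (the number of addresses still issued before the address generation unit stops) are hypothetical names introduced here.

```python
from collections import deque


class StallBuffer:
    """Hypothetical model of buffer memory 80 (five entries in the example)."""

    def __init__(self, capacity=5):
        self.capacity = capacity
        self.entries = deque()

    def capture(self, address):
        """Hold an address that arrives after the stall signal is asserted."""
        if len(self.entries) >= self.capacity:
            raise OverflowError("more in-flight addresses than buffer entries")
        self.entries.append(address)

    def drain(self):
        """Yield the held addresses in arrival order once the stall is resolved."""
        while self.entries:
            yield self.entries.popleft()


def simulate(addresses, stall_at, stop_latency, capacity=5):
    """Return the order in which addresses are accepted downstream."""
    buf = StallBuffer(capacity)
    pending = deque(addresses)
    accepted = []
    # Normal operation until the stall signal is asserted.
    for _ in range(stall_at):
        if pending:
            accepted.append(pending.popleft())
    # The generator keeps issuing until the stall signal reaches it.
    for _ in range(stop_latency):
        if pending:
            buf.capture(pending.popleft())
    # Stall resolved: replay the buffered addresses, then resume normal issue.
    accepted.extend(buf.drain())
    accepted.extend(pending)
    return accepted


print(simulate(list(range(10)), stall_at=3, stop_latency=4))
```

In this model no address is lost or reordered as long as the number of addresses issued after the stall does not exceed the buffer capacity. Under these assumptions, that is the sizing argument for giving the buffer as many entries as the stall signal takes cycles to reach the address generation unit.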

As described above, the graphic processor according to the sixth embodiment of the present invention provides the following effect (7) in addition to the effects (1) to (6) described in the first to fifth embodiments.
(7) The processing efficiency of the graphic processor after a stall can be improved.
The graphic processor according to the present embodiment has a buffer memory that holds instructions as an emergency measure when an instruction to be executed stalls. After the stall is resolved, processing can therefore resume using the data in the buffer memory, which improves the processing efficiency of the graphic processor. This point is described below.

FIG. 50 is a table showing the relationship between instructions and stages when instructions are executed as in FIG. 47 but without the buffer memory. As in FIG. 48, consider the case where instruction 3 stalls at stage E. When a stall occurs, it is difficult to stop the pipeline at that very instant. In the case of FIG. 50, although the pipeline state needs to be held at time t3, instructions 7 to 4 in stages A to D overrun into stages B to E. As a result, instruction 4 from stage D is input to the stalled stage E even though stage E is holding instruction 3, and instruction 3 is overwritten. To avoid this, all instructions in stages A to F must be flushed at time t3, or at least the instructions in the stages upstream of the stalled stage (stages A to D when the stall occurs at stage E). However, because the instructions are flushed, they must be issued again from the beginning when processing is restarted at time t4. Instructions then have to be reissued every time a stall occurs, which greatly degrades the performance of the graphic processor.
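The overrun described above can be reproduced with a toy model of the six stages. The contents of stages A to E follow the text. The instruction in stage F (2), the newly issued instruction (8), and the function name `overrun` are assumptions made only for this illustration.

```python
STAGES = "ABCDEF"


def overrun(pipeline, stalled, incoming):
    """Advance the stages upstream of `stalled` by one step although `stalled` should hold.

    `pipeline` maps a stage name to the instruction number it holds.
    The stalled stage is overwritten by the instruction from the stage before it.
    """
    s = STAGES.index(stalled)
    nxt = dict(pipeline)
    for i in range(s, 0, -1):
        nxt[STAGES[i]] = pipeline[STAGES[i - 1]]
    nxt["A"] = incoming  # assumed: one more instruction is issued before the generator stops
    return nxt


before = {"A": 7, "B": 6, "C": 5, "D": 4, "E": 3, "F": 2}
after = overrun(before, stalled="E", incoming=8)
lost = set(before.values()) - set(after.values())
print(after)
print(lost)
```

Stage E ends up holding instruction 4, and instruction 3 no longer exists anywhere in the pipeline, which is why a design without the buffer memory has to flush and reissue.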

With the configuration according to the present embodiment, however, processing can be restarted using the data held in the buffer memory 80 once the stall is resolved. Because instructions do not have to be issued again, the performance degradation of the graphic processor can be suppressed. This is effective when the operating frequency of the graphic processor is high (for example, several GHz) or when the pipeline is very deep, because in such cases several cycles are needed to actually stop the pipeline after a stall is detected.

In the configuration according to the present embodiment in particular, as shown in FIG. 11, the address signal output from the address generation unit 40 reaches the cache memory 41 only after passing through the second stage, which itself includes several processing stages. The pipeline is this deep because it must wait for the processing in the instruction control unit 25. One pixel shader 24 processes, for example, (4 × 4) pixels at a time, and it is the instruction control unit 25 that generates these pixels. However, the information given from the data distribution unit 20 to the instruction control unit 25 consists only of the data for one pixel serving as a representative point and, for the other pixels, their difference values from the representative point. From this information, the instruction control unit 25 generates the pixel data of the 15 pixels other than the representative point. This reduces the number of registers needed to hold data. Because the cache management unit 45 has to wait for this pixel data calculation, the pipeline becomes deep, as shown in FIG. 11.
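The generation of the 15 remaining pixels from the representative point can be sketched as below. The source does not specify the format of the pixel data or of the difference values, so this sketch assumes that each pixel value is a single number and that each difference is simply added to the representative value. The name `expand_block` is hypothetical.

```python
def expand_block(representative, differences):
    """Rebuild a (4 x 4) block from one representative value and 15 differences.

    Assumption: pixel_i = representative + difference_i for each non-representative pixel.
    The representative pixel is placed first in the returned list.
    """
    if len(differences) != 15:
        raise ValueError("a (4 x 4) block needs exactly 15 difference values")
    return [representative] + [representative + d for d in differences]


block = expand_block(100, list(range(1, 16)))
print(len(block))
print(block)
```

Only one full pixel value and 15 differences have to be passed and stored, rather than 16 full values. The saving comes at the cost of the extra computation step that the cache management unit 45 waits for.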

Even when the pipeline is this deep, however, the data held in the stages at the time of a stall is saved in the buffer memory 80, and the data in the buffer memory 80 can be used at restart, so a drop in processing efficiency can be effectively suppressed.

The graphic processor according to the first to sixth embodiments can be mounted on, for example, a game machine, a home server, a television, or a portable information terminal. FIG. 51 is a block diagram of a digital board included in a digital television equipped with the graphic processor according to the first to sixth embodiments. The digital board controls communication information such as images and sound. As illustrated, the digital board 1000 includes a front end unit 1100, an image drawing processor system 1200, a digital input unit 1200, A/D converters 1400 and 1800, a ghost reduction unit 1500, a three-dimensional YC separation unit 1600, a color decoder 1700, a LAN processing LSI 1900, a LAN terminal 2000, a bridge media controller 2100, a card slot 2200, a flash memory 1000, and a large-capacity memory (for example, DRAM) 1100. The front end unit 1100 includes digital tuner modules 1110 and 1120, an OFDM (Orthogonal Frequency Division Multiplex) demodulation unit 1120, and a QPSK (Quadrature Phase Shift Keying) demodulation unit 1140.

The image drawing processor system 1200 includes a transmission/reception circuit 1210, an MPEG2 decoder 1220, a graphic engine 1100, a digital format converter 1110, and a processor 1120. For example, the graphic engine 1100 and the processor 1120 correspond to the graphic processor described in the first to sixth embodiments.

In the above configuration, terrestrial digital broadcast waves, BS digital broadcast waves, and 110° CS digital broadcast waves are demodulated by the front end unit 1100. Terrestrial analog broadcast waves and DVD/VTR signals are decoded by the three-dimensional YC separation unit 1600 and the color decoder 1700. These signals are input to the image drawing processor system 1200 and separated into video, audio, and data by the transmission/reception circuit 1210. The video information is then input to the graphic engine 1100 via the MPEG2 decoder 1220, and the graphic engine 1100 draws graphics as described in the above embodiments.

FIG. 52 is a block diagram of a recording/playback device equipped with the graphic processor according to the first to sixth embodiments. As illustrated, the recording/playback device 3000 includes a head amplifier 3100, a motor driver 3200, a memory 3300, an image information control circuit 3400, a user I/F CPU 3500, a flash memory 3600, a display 3700, a video output unit 3800, and an audio output unit 3900.

The image information control circuit 3400 includes a memory interface 3410, a digital signal processor 3420, a processor 3430, a video processing processor 3450, and an audio processing processor 3440. For example, the video processing processor 3450 and the digital signal processor 3420 correspond to the graphic processor described in the first and second embodiments.

In the above configuration, the video data read by the head amplifier 3100 is input to the image information control circuit 3400. Graphic information is then input from the digital signal processor 3420 to the video processing processor 3450, and the video processing processor 3450 draws graphics as described in the above embodiments.

The present invention is not limited to the above embodiments, and can be modified in various ways at the implementation stage without departing from its gist. Furthermore, the above embodiments include inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in an embodiment, a configuration from which those constituent elements are deleted can be extracted as an invention, provided that the problem described in the section on the problem to be solved by the invention can be solved and the effect described in the section on the effect of the invention can be obtained.

FIG. 1 is a block diagram of a graphic processor according to a first embodiment of the present invention.
FIG. 2 is a conceptual diagram of a frame buffer in the graphic processor according to the first embodiment of the present invention.
FIG. 3 is a conceptual diagram of a frame buffer in the graphic processor according to the first embodiment of the present invention.
FIG. 4 is a conceptual diagram of a frame buffer in the graphic processor according to the first embodiment of the present invention.
FIG. 5 is a conceptual diagram of a frame buffer in the graphic processor according to the first embodiment of the present invention.
FIG. 6 is a conceptual diagram of a frame buffer in the graphic processor according to the first embodiment of the present invention.
FIG. 7 is a conceptual diagram of the quad merge performed by the graphic processor according to the first embodiment of the present invention.
FIG. 8 is a conceptual diagram of an instruction sequence executed in the graphic processor according to the first embodiment of the present invention.
FIG. 9 is a timing chart showing the subpasses executed in the graphic processor according to the first embodiment of the present invention.
FIG. 10 is a block diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 11 is a block diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 12 is a block diagram of the address generation unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 13 is a conceptual diagram of an address signal generated by the address generation unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 14 is a conceptual diagram of an address signal generated by the address generation unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 15 is a block diagram of the cache memory of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 16 is a block diagram of the request issue control unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 17 is a block diagram of the cache access control unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 18 is a block diagram of the cache management unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 19 is a conceptual diagram showing the relationship between the status flags in the cache management unit of the data control unit and the cache memory in the graphic processor according to the first embodiment of the present invention.
FIG. 20 is a circuit diagram of the cache management unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 21 is a state transition diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 22 is a block diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention, showing the state during a load.
FIG. 23 is a block diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention, showing the state during a store.
FIG. 24 is a block diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention, showing the state during a refill.
FIG. 25 is a state transition diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 26 is a flowchart showing the load/store and refill operations of the graphic processor according to the first embodiment of the present invention.
FIG. 27 is a timing chart of various signals during load/store and refill in the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 28 is a circuit diagram of the cache management unit of the data control unit included in the graphic processor according to the first embodiment of the present invention.
FIG. 29 is a block diagram of the data control unit included in a graphic processor, showing a configuration for hit determination of load/store instructions.
FIG. 30 is a block diagram of the data control unit included in the graphic processor according to the first embodiment of the present invention, showing a configuration for hit determination of load/store instructions.
FIG. 31 is a conceptual diagram of the status flags in the cache management unit of the data control unit included in the graphic processor according to a second embodiment of the present invention.
FIG. 32 is a circuit diagram of the cache management unit of the data control unit included in the graphic processor according to the second embodiment of the present invention.
FIG. 33 is a block diagram of the data control unit included in the graphic processor according to the second embodiment of the present invention, showing the state during a write-back.
FIG. 34 is a flowchart showing the write-back operation of the graphic processor according to the second embodiment of the present invention.
FIG. 35 is a block diagram of the graphic processor according to the second embodiment of the present invention.
FIG. 36 is a conceptual diagram of the instruction table in the subpass information management unit of the data control unit included in the graphic processor according to a third embodiment of the present invention.
FIG. 37 is a flowchart showing the preload operation of the graphic processor according to the third embodiment of the present invention.
FIG. 38 is a block diagram of the data control unit included in the graphic processor according to the third embodiment of the present invention, showing the state during a preload.
FIG. 39 is a timing chart showing the subpasses executed in the graphic processor according to the third embodiment of the present invention.
FIG. 40 is a conceptual diagram of the status flags in the cache management unit of the data control unit included in the graphic processor according to a fourth embodiment of the present invention.
FIG. 41 is a flowchart showing a method of controlling entries according to the lock flag in the cache management unit of the data control unit included in the graphic processor according to the fourth embodiment of the present invention.
FIG. 42 is a diagram showing the states that the data control unit included in the graphic processor according to the fourth embodiment of the present invention can take.
FIG. 43 is a conceptual diagram of the status flags in the cache management unit of the data control unit included in the graphic processor according to a fifth embodiment of the present invention.
FIG. 44 is a conceptual diagram showing the relationship between the status flags in the cache management unit of the data control unit and the cache memory in the graphic processor according to the fifth embodiment of the present invention.
FIG. 45 is a flowchart showing a method of controlling entries according to the thread entry flag in the cache management unit of the data control unit included in the graphic processor according to the fifth embodiment of the present invention.
FIG. 46 is a block diagram of a partial region of the graphic processor according to a sixth embodiment of the present invention.
FIG. 47 is a diagram showing the relationship between instructions and stages in the graphic processor according to the sixth embodiment of the present invention.
FIG. 48 is a diagram showing the relationship between instructions and stages in the graphic processor according to the sixth embodiment of the present invention, showing the state when a stall occurs.
FIG. 49 is a circuit diagram of the cache management unit of the data control unit included in the graphic processor according to the sixth embodiment of the present invention.
FIG. 50 is a diagram showing the relationship between instructions and stages in a graphic processor, showing the state when a stall occurs.
FIG. 51 is a block diagram of a digital board included in a digital television equipped with the graphic processor according to the first to sixth embodiments of the present invention.
FIG. 52 is a block diagram of a recording/playback device equipped with the graphic processor according to the first to sixth embodiments of the present invention.

Explanation of symbols

10 ... graphic processor, 11 ... rasterizer, 12-0 to 12-3 ... pixel shader, 13 ... local memory, 20 ... data distribution unit, 23 ... texture unit, 24 ... pixel shader unit, 25 ... instruction control unit, 26 ... drawing processing unit, 27 ... data control unit, 40 ... address generation unit, 41 ... cache memory, 42 ... cache control unit, 43 ... preload control unit, 44 ... cache access control unit, 45 ... cache management unit, 46 ... request issue control unit, 47 ... preload address generation unit, 48 ... preload holding unit, 49 ... subpass information management unit, 50 ... address holding unit, 51 to 62, 71 ... memory, 65, 68, 69, 74 ... selection circuit, 66, 72 ... comparator, 67 ... AND gate, 73 ... counter, 75 ... bus control unit, 80 ... buffer memory

Claims

A drawing apparatus comprising:
a main memory that holds image data;
a cache memory that exchanges the image data with the main memory;
a transfer control device that manages data transfer between the main memory and the cache memory and holds information on the state of the cache memory; and
a program execution unit that executes an image processing program using the image data in the cache memory and causes the cache memory to hold image data obtained by executing the image processing program,
wherein the cache memory includes a plurality of entries each capable of holding the image data,
the transfer control device holds, for each entry, identification information of the image data transferred from the main memory to the entry of the cache memory, and data update information indicating whether the image data obtained by the program execution unit is held in the entry, and
when the data update information corresponding to any of the entries is asserted, the transfer control device writes the image data in that entry to the main memory.

The drawing apparatus according to claim 1, wherein the transfer control device periodically executes data transfer from the cache memory to the main memory based on the data update information.

The drawing apparatus according to claim 1 or 2, wherein the transfer control device has a counter and checks whether the data update information of the entry specified by the counter value of the counter is asserted.

A data transfer method for a drawing apparatus that comprises a main memory that holds image data, a cache memory that has a plurality of entries and exchanges the image data with the main memory, a transfer control device that manages data transfer between the main memory and the cache memory and holds information on the state of the cache memory, and a program execution unit that executes an image processing program using the image data in the cache memory, the method comprising:
the program execution unit causing any of the entries to hold new image data obtained by executing the image processing program;
the transfer control device asserting update information on that entry when the new image data is held in the entry;
the transfer control device detecting whether there is an entry for which the update information is asserted; and
the transfer control device transferring the image data held in the entry to the main memory when an entry for which the update information is asserted is detected.

The data transfer method according to claim 4, wherein the transfer control device periodically transfers the image data to the main memory based on the data update information.
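
For illustration only, the write-back flow recited in the claims above (per-entry identification information and data update information, a counter that selects the entry to check, and transfer to the main memory when the update information is asserted) can be modeled as follows. This sketch is not part of the claims. The class and method names, the use of a dictionary as the main memory, and the deassertion of the update information after the transfer are assumptions introduced here.

```python
class TransferController:
    """Hypothetical model of the transfer control device."""

    def __init__(self, num_entries):
        self.tags = [None] * num_entries      # identification information per entry
        self.updated = [False] * num_entries  # data update information per entry
        self.counter = 0                      # selects the entry checked next

    def on_refill(self, entry, tag):
        """Record which main-memory data was transferred into `entry`."""
        self.tags[entry] = tag
        self.updated[entry] = False

    def on_store(self, entry):
        """Assert the update information when new image data is held in `entry`."""
        self.updated[entry] = True

    def tick(self, cache, main_memory):
        """Check the entry selected by the counter and write it back if updated."""
        entry = self.counter
        self.counter = (self.counter + 1) % len(self.updated)
        if not self.updated[entry]:
            return None
        main_memory[self.tags[entry]] = cache[entry]
        self.updated[entry] = False  # assumption: deasserted once written back
        return entry


cache = ["a", "b", "c", "d"]
main_memory = {}
ctl = TransferController(len(cache))
ctl.on_refill(2, 0x40)
cache[2] = "new"          # the program execution unit stores a result
ctl.on_store(2)
written = [ctl.tick(cache, main_memory) for _ in range(len(cache))]
print(written)
print(main_memory)
```

Calling `tick` at a fixed interval corresponds to the periodic transfer of claims 2 and 5. One entry is examined per call, so the scan cost is spread over time rather than paid when an entry has to be replaced.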