TW201137786A - System and method for improving throughput of a graphics processing unit - Google Patents

System and method for improving throughput of a graphics processing unit

Info

Publication number
TW201137786A
Authority
TW
Taiwan
Prior art keywords
instruction
memory
threads
unit
execution unit
Prior art date
Application number
TW100110084A
Other languages
Chinese (zh)
Other versions
TWI474280B (en)
Inventor
Jeff Yang Jiao
Mike Hong
Original Assignee
Via Tech Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/764,256 (US8564604B2)
Application filed by Via Tech Inc
Publication of TW201137786A
Application granted
Publication of TWI474280B

Landscapes

  • Image Generation (AREA)

Abstract

Systems and methods for improving the throughput of a graphics processing unit are disclosed. In one embodiment, a system includes a multithreaded execution unit capable of processing requests to access a constant cache, a vertex attribute cache, at least one common register file, and an execution unit data path substantially simultaneously.

Description

201137786 (0608-A43050TWF)

VI. Description of the Invention

[Technical Field]

The present invention relates to methods and systems for improving the throughput of a graphics processing unit, and more particularly to methods and systems that improve an execution unit capable of processing multiple access requests from multiple threads simultaneously.

[Prior Art]

Computer graphics renders three-dimensional (3D) objects as two-dimensional (2D) images on a display device such as a cathode ray tube (CRT) monitor or a liquid crystal display (LCD). An object may be a simple geometric primitive, such as a point, a line segment, or a polygon. More complex objects are rendered on the display device as a series of connected planar polygons, for example a series of planar triangles. Every graphics primitive can be represented by a single vertex or a set of vertices; for example, coordinates (X, Y, Z) define a point, an endpoint of a line segment, or a vertex of a polygon.

To produce the 2D projection data that represents a 3D object so that the object can be shown on the display device, the vertices of the primitives are processed through multiple stages of a graphics rendering pipeline. A typical pipeline consists of a series of connected processing units, or stages, in which the output of one stage serves as the input of the next. For a graphics processing unit, the pipeline stages include vertex operations, primitive assembly operations, pixel operations, texture assembly operations, rasterization operations, and fragment operations.

In a typical graphics display system, an image database stores descriptions of the objects in a scene. An object can be represented by many small polygons that cover its surface, much as tiles cover a wall. Each polygon is in turn described by a list of vertex coordinates and by surface material properties, and possibly by a normal vector to the surface at each vertex. The vertex coordinate list may hold XYZ coordinates in model space, and the surface material properties may include color, texture, and brightness. A 3D object with a complex curved surface is usually represented by triangles or quadrilaterals, and each quadrilateral can be split into a pair of triangles.

Once the user has chosen a viewing angle, a transformation engine converts the object coordinates to coordinates relative to that viewing angle. The user can also specify the field of view, the size of the image to be produced, and whether the region behind the visible objects includes a background or has the background removed.

After the viewing region has been selected, a clipping unit discards polygons that lie outside the viewing region and clips polygons that lie partly inside and partly outside it. A clipped polygon corresponds to the portion of the original polygon inside the viewing region, and its new edges correspond to the boundary of the viewing region. The polygon vertices are then passed to the next pipeline stage, together with each vertex's coordinates in the viewing region (X, Y) and its associated depth value (Z). A typical graphics system then applies a lighting model, and the polygons with their color values are passed to a rasterizer.

For each polygon, the rasterizer determines which pixels lie within it and attempts to write the associated values into the frame buffer. The rasterizer compares the depth value of each pixel of the polygon being processed with the depth value of the pixel already stored in the frame buffer. If the polygon pixel has the smaller depth value, it lies in front of the stored pixel, so its depth value replaces the one in the frame buffer, since the new polygon hides the polygon previously stored there. These steps are repeated until all polygons have been rasterized. An image controller then presents the contents of the frame buffer on the display device one scan line at a time.

The typical way to achieve real-time rendering is to display polygons pixel by pixel, where each pixel is treated as lying either inside or outside the polygon. The resulting polygon edges can look jagged in a static display and blurred in a dynamic one. The underlying cause is the aliasing effect, and methods that reduce it are called anti-aliasing techniques.

Screen-based anti-aliasing methods need no information about the objects being rendered, because they use only the output samples of the graphics pipeline. One typical method is a scan-line anti-aliasing technique known as Multi-Sample Anti-Aliasing (MSAA), which takes more than one sample per pixel in each pass. The number of samples, or sub-pixels, taken from each pixel is called the sampling rate. In general, the higher the sampling rate, the more memory bandwidth is consumed.

Although the above only outlines the general operation of the components of a graphics processing unit, those skilled in the art will appreciate that processing graphics data is highly demanding, so improving processing performance and reducing design complexity are common concerns. Raising the throughput of the graphics processing unit not only improves processing performance but also allows the hardware complexity to be reduced while a given level of performance is maintained.

[Summary of the Invention]

One embodiment of the invention provides a graphics processing unit comprising: an execution unit configured to process programmable shader operations and to process a plurality of threads simultaneously; a first memory unit forming a register file that accommodates the register operations of the plurality of threads, the memory unit comprising a plurality of memory banks, wherein a plurality of first memory banks are allocated to a plurality of first threads and a plurality of second memory banks are allocated to a plurality of second threads; a second memory unit forming a constant cache that accommodates the constant accesses of the shader operations corresponding to the plurality of threads of the execution unit, the constant cache storing a plurality of contexts corresponding to the shader operations and storing a plurality of versions of a plurality of constants of those contexts; and a third memory unit forming a vertex attribute cache that accommodates the vertex attribute accesses of the shader operations corresponding to the plurality of threads of the execution unit.

Another embodiment of the invention provides a graphics processing unit comprising an execution unit configured to process a plurality of threads, the execution unit including a thread controller that has a first instruction fetch arbiter and a second instruction fetch arbiter, wherein the first instruction fetch arbiter fetches instructions on behalf of a plurality of first threads of the execution unit and the second instruction fetch arbiter fetches instructions on behalf of a plurality of second threads of the execution unit.

A further embodiment of the invention provides an instruction processing method for an execution unit, comprising: fetching, on behalf of a first active thread among a plurality of active threads, a first instruction of the execution unit from an instruction cache; broadcasting the first instruction to the plurality of active threads; holding the first instruction in an instruction queue corresponding to at least one of the plurality of active threads; decoding a second instruction in the instruction queue; and issuing the data access requests of the second instruction to at least one of a constant cache, a vertex attribute cache, a common register file, and an execution unit data path.

[Embodiments]

Embodiments of the invention are described below with reference to the drawings. Although the invention is explained through these embodiments, it is not limited to the embodiments disclosed below; variations, improvements, and equivalents of the following embodiments all fall within the scope of the invention.

The invention provides systems and methods that can raise the throughput of a graphics processing unit. Before the embodiments are discussed in detail, refer to FIG. 1, a block diagram of some components of a graphics processing pipeline 100, in particular its basic components. These include a vertex shader 110, a geometry shader 120, a triangle setup unit 130, a span and tile generator 140, an attribute setup unit 150, a pixel shader 160, and a frame buffer 170. The basic functions and operation of these components are known in the art and are not described at length here. In brief, a graphics primitive can be defined by position data (X, Y, Z, and W coordinates) together with lighting and texture data, all of which can be passed to the vertex shader 110. As is known, the vertex shader 110 can apply various transformations to the graphics data received from the command list, for example from world coordinates to view coordinates, then to projection coordinates, and finally to screen coordinates. The functions the vertex shader 110 can perform are known to those skilled in the art and are not described here. The vertex shader 110 outputs geometric primitives to the geometry shader 120.

The geometry and other graphics data produced by the geometry shader 120 are passed to the triangle setup unit 130, which performs triangle setup operations; its detailed functions and implementation may vary as required. In general, the triangle setup unit 130 receives the vertex information of triangle primitives and performs operations that depend on the primitive type, such as certain geometric transformations.
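The depth comparison described in the prior-art discussion above can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the buffer layout (nested Python lists), and the convention that a smaller Z value means "nearer" are assumptions chosen only to mirror the text.

```python
import math

def make_buffers(width, height, background=(0, 0, 0)):
    """Create a color buffer and a depth buffer (hypothetical layout).

    Every depth entry starts at infinity, i.e. 'nothing drawn yet'.
    """
    color = [[background for _ in range(width)] for _ in range(height)]
    depth = [[math.inf for _ in range(width)] for _ in range(height)]
    return color, depth

def write_fragment(color, depth, x, y, z, rgb):
    """Depth test: keep the fragment only if it is nearer than the stored one."""
    if z < depth[y][x]:
        depth[y][x] = z      # the new polygon hides the one stored earlier
        color[y][x] = rgb
        return True
    return False             # the stored pixel is in front; discard the fragment

color, depth = make_buffers(4, 4)
write_fragment(color, depth, 1, 2, 0.8, (255, 0, 0))  # far red fragment is stored
write_fragment(color, depth, 1, 2, 0.3, (0, 255, 0))  # nearer green fragment replaces it
write_fragment(color, depth, 1, 2, 0.5, (0, 0, 255))  # farther blue fragment is rejected
```

After the three writes, pixel (1, 2) holds the green fragment with depth 0.3, which matches the rule that the polygon with the smaller depth value ends up in the frame buffer.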

For each vertex, the supplied geometry information includes X, Y, Z, and W coordinates, where X, Y, and Z are the geometric coordinates and W is the homogeneous coordinate. As those skilled in the art know, the relevant transformations may run, for example, from model space to world space, then to view space, projection space, homogeneous space, and normalized device coordinates (NDC), and finally to screen space. Note that, to simplify the explanation, this description omits some graphics pipeline components whose operation is known to those skilled in the art. For example, not every stage of the rasterization pipeline is disclosed here, but those skilled in the art will understand that the pipeline includes stages that are not shown.

The graphics pipeline stages described above are usually implemented in a graphics processing unit or graphics processing device. Some pipeline stages follow the specification of a published application programming interface (API) or the requirements laid down by a group of APIs. Such an API may be, for example, the Direct3D® API. The implementation of the graphics pipeline is described from another perspective below.

Refer to FIG. 2, a block diagram of some components of a graphics processing pipeline 200 according to an embodiment of the invention. The first component is a command stream processor (CSP) 252, whose main task is to receive or read vertices from memory. The vertices are used to form geometric primitives and work items for the pipeline. The command stream processor 252 reads data from memory and uses it to generate triangles, line segments, points, or other primitives for the graphics pipeline; once assembled, this geometry information is passed to the vertex shader 254. A common requirement of some graphics APIs is that shaders, such as the vertex shader, be user-programmable stages.

That is, designers who use these APIs can write their own shaders and define the operations that the programmable shaders perform. Accordingly, the pipeline stages drawn with rounded corners in FIG. 2 are the programmable stages, such as the vertex shader. These programmable stages can be implemented by executing instructions on a pool of programmable execution units in the processing core of the graphics processor. The vertex shader 254 processes vertices by performing operations such as transformation, scanning, and lighting, and then passes them to the geometry shader 256. The geometry shader 256 receives as input all the vertices of a complete primitive and outputs them in a single topology, such as a triangle strip, a line strip, or a point list. The geometry shader 256 can also perform operations such as tessellation and shadow volume generation.

The geometry shader 256 outputs data to the triangle setup stage 257, which performs, for example, triangle trivial rejection, determinant calculation, culling, pre-attribute setup, edge function calculation, and guardband clipping. These operations are known to those skilled in the art and are not detailed here. The triangle setup stage 257 outputs information to the span and tile generator 258, which culls triangles that need not appear on the screen and performs other operations. Those skilled in the art will understand that the graphics pipeline also includes other stages, such as depth testing. A depth test uses a triangle's depth values to decide whether the triangle will be visible on the screen and discards it if not. Other pipeline stages that are not discussed are known in the art and are omitted here.

If a triangle processed by the triangle setup stage 257 is not culled by the span and tile generator 258 or another pipeline stage, the attribute setup stage 259 performs attribute setup operations on it. The attribute setup stage 259 produces the list of attribute interpolation functions needed by later pipeline stages and processes the attribute values of the geometric primitives handled by the pipeline.

The pixel shader 260 is invoked for the pixels covered by each primitive that the attribute setup stage 259 outputs. As is known, the pixel shader 260 can perform interpolation and other operations to determine the pixel color values written to the frame buffer 262. The functions of the components in FIG. 2 are well known to those skilled in the art, so their internal operation is not discussed further.

Refer next to FIG. 3, a block diagram of a graphics processor environment according to an embodiment of the invention. FIG. 3 shows only the components that help in understanding the invention, not every component of the graphics processor; those skilled in the art can understand the general function and architecture of such a processor from the figure.

In this embodiment some components of the graphics processing unit 300 are omitted for brevity, and those skilled in the art will understand that it contains other hardware and logic components. The graphics processing unit 300 includes an execution unit pool 306 and an execution unit pool control unit 304. The execution unit pool 306 contains multiple programmable execution units. The execution unit pool control unit 304 handles thread management for the execution units in the pool 306, as well as communication with system users and the other components of the graphics processing unit 300. The execution unit pool control unit 304 also includes a cache subsystem with one or more caches available to the execution unit pool 306, which can be used to store data or for general memory access; for example, a vertex shader thread can store data there for later use by the triangle setup unit. In addition, each execution unit in the pool 306 may have its own execution unit buffer that stores data needed by later threads of that execution unit.

As noted above, the programmable stages of the graphics pipeline, including the vertex shader 308, the geometry shader 310, and the pixel shader 312, all execute in the execution unit pool 306. Because the execution unit pool 306 is typically a processing core capable of multithreaded operation, the execution unit pool control unit 304 is responsible for scheduling its threads. When the execution unit pool control unit 304 receives a request to run a programmable shader, it instructs one of the execution units in the pool 306 to create a new thread to serve the request. The execution unit pool control unit 304 can manage the load across the execution unit pool 306 and shift resources from one shader to another to improve overall pipeline performance; such management techniques are known and are not detailed here. For example, if the pixel shader 312 is the bottleneck for the throughput of the graphics processing unit, the execution unit pool control unit 304 can allocate more execution unit resources to the pixel shader 312.

FIG. 4 is a partial block diagram of an execution unit 400 according to an embodiment of the invention. A single execution unit 400 in this embodiment can execute multiple instructions at the same time, so a pool of execution units can run multiple shader threads simultaneously. The execution unit 400 includes a thread controller 402 that manages the tasks assigned to the execution unit and the active and sleeping threads within it. An active thread is one whose task is ready to execute: the data the thread needs is available, so the execution unit can run it. A sleeping thread is one whose task, as assigned by the thread controller 402, is not yet ready: the thread is waiting for data from other components of the graphics pipeline.

The thread controller 402 includes an instruction fetch arbiter 0 404 and an instruction fetch arbiter 1 406, and in this embodiment the threads are divided into even threads and odd threads. For example, if the execution unit 400 can run 16 threads, the 8 even threads can be assigned to instruction fetch arbiter 0 404 and the remaining 8 odd threads to instruction fetch arbiter 1 406. Dividing the threads into two groups, each with its own instruction fetch arbiter, reduces the latency of instruction fetches and so raises the throughput of the execution unit 400. Note that the threads may also be grouped or arranged in other ways.

The instruction fetch arbiters 404 and 406 each arbitrate, independently of one another, the instruction fetch requests of the active threads of the execution unit, based on the age of the requesting thread. After receiving an instruction request from a thread, the instruction fetch arbiter 404 or 406 fetches the instruction from the instruction cache 408. The instruction cache 408 may include an instruction cache controller that performs a cache hit test to determine whether the requested instruction is present in the instruction cache 408. If it is not, that is, if the hit test results in a miss, the instruction must be obtained from the level 2 (L2) cache or other memory through the L2 cache access unit 412. A fetched instruction is broadcast on the instruction broadcast bus 413 to the even threads 417 and the odd threads 419. As a result, when more than one thread requests the same instruction, at least one instruction fetch is saved and instruction latency is reduced. In other words, if more than one thread requests the same instruction from the instruction cache 408, the instruction does not have to be fetched and delivered separately for each thread, because the requested instruction is returned over the instruction broadcast bus 413.
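The arrangement described above (two fetch arbiters serving the even and odd threads, oldest request first, with each fetched instruction broadcast to every thread waiting for it) can be sketched as a small simulation. This is only an illustration under stated assumptions: the class and method names are hypothetical, the instruction cache is modeled as a plain dictionary that always hits, and the rule for skipping requests already satisfied by an earlier broadcast is an assumption, not something the text specifies.

```python
from dataclasses import dataclass, field

@dataclass
class FetchArbiter:
    """Arbitrates fetch requests for one thread group (hypothetical model)."""
    pending: list = field(default_factory=list)  # entries: (age, thread_id, pc)

    def request(self, age, thread_id, pc):
        self.pending.append((age, thread_id, pc))

    def grant(self):
        """Return the oldest outstanding request (smallest age stamp), or None."""
        if not self.pending:
            return None
        oldest = min(self.pending)
        self.pending.remove(oldest)
        return oldest

class ExecutionUnit:
    def __init__(self, icache, num_threads=16):
        self.icache = icache                               # pc -> instruction (always hits)
        self.arbiters = [FetchArbiter(), FetchArbiter()]   # index 0: even, 1: odd threads
        self.waiting = {}                                  # thread_id -> pc it waits for
        self.queues = {t: [] for t in range(num_threads)}  # per-thread instruction queue
        self.fetch_count = 0
        self.clock = 0                                     # age stamp source

    def request_fetch(self, thread_id, pc):
        self.clock += 1
        self.waiting[thread_id] = pc
        self.arbiters[thread_id % 2].request(self.clock, thread_id, pc)

    def step(self):
        """Each arbiter serves at most one request; the result is broadcast."""
        for arbiter in self.arbiters:
            granted = arbiter.grant()
            # Assumption: drop requests an earlier broadcast already satisfied.
            while granted is not None and self.waiting.get(granted[1]) != granted[2]:
                granted = arbiter.grant()
            if granted is None:
                continue
            pc = granted[2]
            self.fetch_count += 1
            self.broadcast(pc, self.icache[pc])

    def broadcast(self, pc, instruction):
        """Deliver one fetched instruction to every thread waiting for that pc."""
        for thread_id, wanted in list(self.waiting.items()):
            if wanted == pc:
                self.queues[thread_id].append(instruction)
                del self.waiting[thread_id]

eu = ExecutionUnit({0x10: "mul r0, r1, c0", 0x14: "add r2, r0, r3"})
eu.request_fetch(0, 0x10)   # even thread
eu.request_fetch(3, 0x10)   # odd thread asking for the same instruction
eu.request_fetch(5, 0x14)   # odd thread asking for a different instruction
eu.step()
```

In this run three requests are served with two cache fetches: the fetch made for thread 0 is broadcast and also satisfies thread 3, so the odd-group arbiter moves straight on to thread 5. That is the saving the broadcast bus is meant to provide.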

In the execution unit 400, both the even threads 417 and the odd threads 419 have access to the instruction broadcast bus 413.

After an instruction has been fetched, the threads among the even threads 417 and odd threads 419 determine whether the fetched instruction needs to interact with the constant cache 410, the vertex attribute cache 414, common register file 0 416, or common register file 1 418. For example, material properties may be stored in the constant cache 410, since they are parameters that do not change for a given context, as may other constants of the object being rendered. Light source data may likewise be stored in the constant cache 410, because these values remain stable while a frame is generated. The common register files are divided into an even group and an odd group, matching the division of the threads. If an instruction needs data from the constant cache 410, it is not issued until the required data has been fetched from the constant cache. Similarly, an instruction that needs data from one of the other memories is not issued before that data has been retrieved. Furthermore, if the required data resides inside the graphics processing unit but outside the execution unit, the instruction is not issued until the data has been obtained. For example, an instruction that needs texture data fetched from a component outside the execution unit and stored in a register file waits until the requested data has been fetched and returned.

Once the required data is ready, the thread controller 402 can issue the instruction for execution by the execution unit data path 420. The execution unit data path 420 includes arithmetic logic unit 0 422, arithmetic logic unit 1 424, and an interpolator 426. When the execution unit data path 420 has finished executing an instruction, the result can be output through the output buffer 428 of the execution unit 400 and delivered to components of the graphics processing unit outside the execution unit, or to other components inside the execution unit, such as the vertex attribute cache 414. For example, if executing an instruction requires updating data in the vertex attribute cache 414, that data can be sent to the vertex attribute cache 414 through the output buffer 428 after the execution unit data path 420 has finished. In another example, the execution unit data path 420 can compute texture coordinates or other parameters and output them through the output buffer 428 to a texture unit or another component outside the execution unit.

Note that, to keep the description concise, not every component and data path is shown in the embodiments. For example, the thread controller may be coupled to the execution unit pool control unit in order to receive the tasks the execution unit is to process. Likewise, certain components may need to obtain data from the L2 cache, which may be located outside the execution unit. The L2 cache access unit therefore represents the mechanism for accessing the L2 cache or other memory.

Refer next to FIG. 5, a block diagram of a constant cache according to an embodiment of the invention; this cache can be used within an execution unit. Because an execution unit can simultaneously process multiple threads corresponding to different types of shader operations, such as pixel shaders, vertex shaders, and geometry shaders, it must maintain several sets of constants for use by the execution unit data path. For example, an execution unit that is processing both pixel shader threads and vertex shader threads must hold pixel shader constants and vertex shader constants. The execution unit must also hold multiple contexts of constants and multiple versions of each constant context. For example, if two threads in the execution unit are performing pixel shader operations under different contexts, the execution unit must maintain pixel shader constants for at least two different contexts. For these reasons, the constant cache of this embodiment can maintain at least two sets of constant contexts for different types of shader threads, and the execution unit must likewise maintain multiple versions of the constants of each context. For example, if a constant of a vertex shader context in memory is changed by a vertex shader thread, the constant cache can keep both the previous version and the updated version of the constant, so that other vertex shader threads in the execution unit can access whichever version they need.

The constant cache 500 includes a header table 502 and a cache memory. The constants of a context can be stored in the cache memory as defined by the header table 502. For example, the header table 502 can group constants by shader type, context, or context identifier. In this embodiment, the base address of the constants associated with a given shader type and context identifier is recorded for the cache memory. A pixel shader can request a constant directly from the constant cache 500 without any information about where the constant is located; the thread only needs to know the constant's place within a context to make the request. In this example, if a pixel shader thread has a context whose context identifier is 0, it only has to request a constant for context identifier 0, and the constant cache 500 returns the first constant at or near the corresponding base address in the header table 502. Similarly, if a vertex shader thread has a context whose context identifier is 1, it only has to request a constant for context identifier 1, and the constant cache 500 returns the second constant at or near the corresponding base address in the header table 502.

The constant cache 500 can also store multiple versions of a constant that has been modified by threads in the execution unit. The lookup table 504 of this embodiment holds data about the constants modified by the shader threads and tracks each version of each constant. For example, the first entry of the lookup table 504 contains vertex shader constant A as modified by a vertex shader thread. The constant cache 500 can therefore keep every version of this constant in the cache memory so that it is available when other threads need it. Multiple versions of a constant value can be maintained in the manner of this example.

The constant cache 500 further includes a first-in first-out buffer, FIFO 508, used to deliver data to the shader threads or other threads processed by the execution unit. The FIFO 508 can be configured with any size and number of entries to suit the needs of the execution unit in which the constant cache 500 resides. For example, when a shader thread requests a constant from the constant cache 500, the header table 502 and the lookup table 504 are used to locate the constant, which is then passed to the FIFO 508. The FIFO 508 can then signal the other components of the execution unit that the constant is ready. Because the execution unit can process multiple instructions at once, the FIFO 508 allows other threads to issue further constant requests before the constant requested by an earlier thread has been fetched and is ready for delivery. The throughput of the constant cache 500 is thereby increased, because it can serve a larger number of thread requests. Note that the header table 502, the lookup table 504, and the FIFO 508 of the constant cache 500 can be implemented in any form; those skilled in the art will understand that this embodiment is only one possible implementation.

FIG. 6 is a block diagram of another embodiment of the execution unit of FIG. 4. In addition to a thread controller 604, instruction fetcher 0 606, and instruction fetcher 1 608, this embodiment shows active threads 610, 612, 614, and 616 in the execution unit 600, along with their corresponding instruction queues. For brevity, FIG. 6 does not show every active thread and instruction queue, and those skilled in the art will understand that the execution unit 600 may contain more or fewer active threads. In this embodiment the execution unit 600 can process at least eight active threads simultaneously, and the active threads are divided into an even group and an odd group. Put another way, the execution unit may contain at least eight instruction queues, one for each of the at least eight active threads. In this embodiment each active thread has an instruction queue that can hold four instructions. Instruction fetcher 0 606 and instruction fetcher 1 608 fetch instructions from the instruction cache 602 on behalf of the active threads: instruction fetcher 0 606 serves the even active threads 610 and 612, and instruction fetcher 1 608 serves the odd active threads 614 and 616. Note that the instruction queues corresponding to the active threads can be used according to the … latency, thereby maintaining …
The constant cache 500 includes a header table 502, a cache memory 506, and a lookup table 504. The constants of each context can be stored in the cache memory as defined by the header table 502. For example, the header table 502 can group constants by shader type and by context or context identifier. In this embodiment, the header table records the base address in the cache memory of the constants associated with each shader type and context identifier. A pixel shader can request a constant directly from the constant cache 500 without needing any information about where the constant is located. A thread only needs to know which context a constant belongs to in order to request it from the constant cache 500. In one example, if a pixel shader thread has a context whose context identifier is 0, it only needs to issue a constant request for context identifier 0, and the constant cache 500 returns a first constant located at or near the corresponding base address in the header table 502. Similarly, if a vertex shader thread has a context whose context identifier is 1, it only needs to issue a constant request for context identifier 1, and the constant cache 500 returns a second constant located at or near the corresponding base address in the header table 502.

Moreover, the constant cache 500 can store multiple versions of a constant that has been processed by threads in the execution unit. The lookup table 504 of this embodiment maintains information about the constants that have been processed by each shader thread and tracks every version of each constant. For example, the first entry of the lookup table 504 contains vertex shader constant A, which has been processed by a vertex shader thread. The constant cache 500 can therefore keep every version of this constant in the cache memory so that it is available when other threads need it. Multiple versions of constant values can be maintained in the manner of this example.
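The addressing scheme described above can be illustrated with a minimal sketch. It is only a software model under stated assumptions, not the hardware of the fifth figure: the class name `ConstantCache`, its method names, and the use of Python lists and dictionaries for the cache memory, header table, and lookup table are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConstantCache:
    """Toy model of a constant cache with a header table and a lookup table."""

    # Header table: (shader type, context id) -> base address in storage.
    header_table: dict = field(default_factory=dict)
    # Cache memory, modeled as a flat list addressed by integer slot.
    storage: list = field(default_factory=list)
    # Lookup table: (shader type, context id, index) -> addresses of the
    # successive versions of a constant that some thread has modified.
    lookup_table: dict = field(default_factory=dict)

    def load_context(self, shader_type, context_id, constants):
        """Store a context's constants contiguously and record the base address."""
        self.header_table[(shader_type, context_id)] = len(self.storage)
        self.storage.extend(constants)

    def fetch(self, shader_type, context_id, index, version=-1):
        """Return a constant; the caller never supplies a physical address."""
        key = (shader_type, context_id, index)
        if key in self.lookup_table:
            return self.storage[self.lookup_table[key][version]]
        base = self.header_table[(shader_type, context_id)]
        return self.storage[base + index]

    def update(self, shader_type, context_id, index, value):
        """Add a new version of a constant while keeping older versions readable."""
        key = (shader_type, context_id, index)
        base = self.header_table[(shader_type, context_id)]
        versions = self.lookup_table.setdefault(key, [base + index])
        versions.append(len(self.storage))
        self.storage.append(value)


cache = ConstantCache()
cache.load_context("pixel", 0, [1.0, 2.0])     # context identifier 0
cache.load_context("vertex", 1, [10.0, 20.0])  # context identifier 1
cache.update("vertex", 1, 0, 11.0)             # a vertex shader thread changes a constant
latest = cache.fetch("vertex", 1, 0)
previous = cache.fetch("vertex", 1, 0, version=0)
```

In this sketch a thread identifies a constant only by shader type, context identifier, and index; the header table supplies the base address, and the lookup table lets other threads read either the previous or the updated version.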
The constant cache 500 also includes a first-in first-out (FIFO) buffer 508, which delivers data to the shader threads or other threads processed by the execution unit. The FIFO 508 can be configured with any size and number of entries to suit the needs of the execution unit in which the constant cache 500 resides. For example, when a shader thread requests a constant from the constant cache 500, the header table 502 and the lookup table 504 are used to locate the constant, which is then passed to the FIFO 508. The FIFO 508 can then signal the other components of the execution unit that the constant is ready. Because the execution unit can process multiple instructions simultaneously, the FIFO 508 allows other threads to issue further constant requests before the constant requested by an earlier thread has been fetched and is ready for delivery. The throughput of the constant cache 500 is improved as a result, because the number of thread requests it can service increases. Note that the header table 502, the lookup table 504, and the FIFO 508 of the constant cache 500 can be implemented in any form; those skilled in the art will understand that this embodiment is only one possible implementation.

The sixth figure shows a block diagram of another embodiment of the execution unit of the fourth figure. In addition to a thread controller 604, instruction fetch arbiter 0 606, and instruction fetch arbiter 1 608, this embodiment includes active threads 610, 612, 614, and 616 within the execution unit 600, together with their corresponding instruction queues. For brevity, the sixth figure does not show all of the active threads and instruction queues; those skilled in the art will understand that the execution unit 600 may contain more or fewer active threads. In this embodiment the execution unit 600 can process at least eight active threads simultaneously, and the active threads are divided into an even group and an odd group. In other words, the execution unit 600 can contain at least eight instruction queues, one for each of the at least eight active threads. In this embodiment each active thread has an instruction queue that holds four instructions. Instruction fetch arbiter 0 606 and instruction fetch arbiter 1 608 fetch instructions from the instruction cache 602 on behalf of the active threads: instruction fetch arbiter 0 606 serves the even active threads 610 and 612, and instruction fetch arbiter 1 608 serves the odd active threads 614 and 616.
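The grouping of active threads described above can be sketched as follows. This is an illustrative model only: the numbering of threads from 0 to 7, the parity rule used to pick a group, and the helper names `fetch_arbiter_for`, `can_prefetch`, and `threads_served_by` are assumptions introduced here and do not appear in the figures.

```python
from collections import deque

QUEUE_DEPTH = 4   # each active thread's queue holds four instructions in this embodiment
NUM_THREADS = 8   # "at least eight" active threads; exactly eight assumed here


def fetch_arbiter_for(thread_id):
    """Even threads are served by fetch arbiter 0, odd threads by fetch arbiter 1."""
    return thread_id % 2


# One instruction queue per active thread.
queues = {tid: deque() for tid in range(NUM_THREADS)}


def can_prefetch(thread_id):
    """A thread can accept a prefetched instruction while its queue has a free slot."""
    return len(queues[thread_id]) < QUEUE_DEPTH


def threads_served_by(arbiter):
    """List the threads on whose behalf a given fetch arbiter fetches."""
    return [tid for tid in queues if fetch_arbiter_for(tid) == arbiter]


even_group = threads_served_by(0)
odd_group = threads_served_by(1)
```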

Note that the instruction queue of an active thread may hold more or fewer than four instructions, depending on the latency of fetching instructions from the instruction cache or other memory. The instructions of each active thread are prefetched before they are actually executed, which reduces the latency of sending instruction requests to the instruction cache as well as the latency incurred when an instruction is absent from the instruction cache and must be obtained from the L2 cache or other memory. Note also that decoupling instruction fetch for the active threads from instruction decode and execution improves the performance and throughput of the execution unit. Fetched instructions can be broadcast to the even threads and the odd threads over the instruction broadcast bus 617. As a result, when more than one thread requests the same instruction, at least one instruction fetch is saved and instruction latency is reduced. That is, if more than one thread requests the same instruction from the instruction cache 602, the instruction does not have to be fetched and delivered separately for each thread, because the requested instruction is returned from the instruction cache 602 over the instruction broadcast bus 617, which both the even threads and the odd threads in the execution unit 600 can access.

Each active thread also includes an instruction predecoder, which determines whether the next instruction to be processed involves fetching or storing a constant, involves fetching or storing vertex attribute data, or needs to interact with one of the common register files 632 and 634. If the instruction predecoder of an active thread finds that an instruction involves fetching or storing a constant, or otherwise needs to interact with the constant cache 624, the predecoder can send a request to the constant cache arbiter 618. The constant cache arbiter 618 arbitrates access to the constant cache 624 and then issues the request to the constant cache 624.
As described above, the constant cache 624 processes the constant fetch request and places the requested constant in the FIFO 626 coupled to the constant cache 624.

Similarly, if an instruction in an instruction queue requires vertex attribute data to be fetched or stored, or otherwise needs to interact with the vertex attribute cache 628, the instruction predecoder can send a request to the vertex attribute cache arbiter 622. The vertex attribute cache 628 processes the fetch request and places the requested data in the FIFO 630 coupled to the vertex attribute cache 628. If an instruction needs to interact with one of the common register files 632 and 634, the instruction predecoder of the active thread that holds the instruction can send a request to the common register file arbiter 620. The common register file arbiter 620 arbitrates access requests to common register file 0 632 and common register file 1 634; depending on whether a request comes from an even thread or an odd thread, it is directed to common register file 0 632 or to common register file 1 634.

Note that in the execution unit 600 of the sixth figure, access requests to the constant cache 624, the common register files 632 and 634, and the vertex attribute cache 628 can be issued and processed before the instruction is executed in the execution unit data path 636. This reduces processing bottlenecks and raises the throughput of the execution unit. For example, if an instruction involves a constant fetch request and the constant cache has to fetch the requested constant from the L2 cache or other memory, completing the request may take several additional clock cycles. The execution unit of the invention does not have to stall in the meantime, and can continue to process fetches for other instructions, because it can simultaneously handle requests to common register file 0 632, common register file 1 634, and the vertex attribute cache 628.
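The predecode-and-route behaviour described above can be sketched as follows. This is a simplified model under assumptions: the instruction kinds in `KIND_TO_ACCESS` are hypothetical labels rather than real opcodes, and the rule that even threads map to common register file 0 and odd threads to common register file 1 is modeled with thread-id parity.

```python
from enum import Enum, auto


class Access(Enum):
    CONSTANT_CACHE = auto()
    VERTEX_ATTRIBUTE_CACHE = auto()
    COMMON_REGISTER_FILE = auto()
    DATA_PATH = auto()  # no memory request is needed before execution


# Hypothetical instruction kinds, for illustration only.
KIND_TO_ACCESS = {
    "load_constant": Access.CONSTANT_CACHE,
    "load_vertex_attribute": Access.VERTEX_ATTRIBUTE_CACHE,
    "read_register": Access.COMMON_REGISTER_FILE,
    "execute_only": Access.DATA_PATH,
}

ARBITER_NAME = {
    Access.CONSTANT_CACHE: "constant_cache_arbiter",
    Access.VERTEX_ATTRIBUTE_CACHE: "vertex_attribute_cache_arbiter",
    Access.COMMON_REGISTER_FILE: "common_register_file_arbiter",
    Access.DATA_PATH: "execution_unit_data_path",
}


def predecode(kind):
    """Classify the next queued instruction by the kind of access it needs."""
    return KIND_TO_ACCESS[kind]


def route(thread_id, kind):
    """Return (destination, register file index or None) for a request."""
    access = predecode(kind)
    if access is Access.COMMON_REGISTER_FILE:
        # Even threads use register file 0, odd threads use register file 1.
        return (ARBITER_NAME[access], thread_id % 2)
    return (ARBITER_NAME[access], None)
```

Because each request is routed as soon as the instruction is predecoded, a slow constant fetch for one thread does not prevent register file or vertex attribute requests of other threads from being serviced.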
As described above, the execution unit 600 also includes the execution unit data path 636, which executes instructions using data from the constant cache 624, common register file 0 632, common register file 1 634, and the vertex attribute cache 628. The thread controller 604 can issue an instruction to the execution unit data path 636 once the data required to execute it is ready. For example, when an instruction requests a constant from the constant cache 624, and the requested constant has been fetched and stored in the constant cache FIFO 626, the thread controller 604 can issue the instruction to the execution unit data path 636, which can read the data from the constant cache FIFO 626 and output data through the output buffer 644. Similarly, when an instruction needs to access common register file 0 632 or common register file 1 634, the thread controller 604 can issue the instruction to the execution unit data path 636 once the instruction is ready to be executed by the data path. In other words, the thread controller 604 issues an instruction for execution when the data required to execute it is ready.
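The issue rule described above, in which an instruction goes to the data path only when its data is ready, can be sketched as follows. This is a minimal software model, not the thread controller itself: the class `ThreadController`, the request identifiers, and the assumption that the waiting instructions belong to different threads (so they may issue in any order) are introduced here for illustration.

```python
from collections import deque


class ThreadController:
    """Issues an instruction to the data path only after its data is ready."""

    def __init__(self):
        self.waiting = deque()   # (instruction name, set of pending request ids)
        self.ready_data = set()  # request ids whose data has arrived
        self.issued = []         # instructions sent to the data path, in issue order

    def submit(self, name, requests):
        """Queue an instruction together with the data requests it depends on."""
        self.waiting.append((name, set(requests)))

    def data_arrived(self, request_id):
        """Record that a cache FIFO or register file now holds the requested data."""
        self.ready_data.add(request_id)

    def issue_ready(self):
        """Issue every waiting instruction whose requests are all satisfied."""
        still_waiting = deque()
        for name, requests in self.waiting:
            if requests <= self.ready_data:
                self.issued.append(name)
            else:
                still_waiting.append((name, requests))
        self.waiting = still_waiting


controller = ThreadController()
controller.submit("thread0_mul", ["constant_c0"])   # waits on a constant fetch
controller.submit("thread1_add", ["register_r1"])   # waits on a register read
controller.data_arrived("register_r1")
controller.issue_ready()
issued_first = list(controller.issued)
controller.data_arrived("constant_c0")
controller.issue_ready()
issued_all = list(controller.issued)
```

In this sketch the instruction waiting on the slower constant fetch does not hold up the instruction of the other thread, which is the throughput benefit the passage describes.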

In addition, to further improve the throughput of the execution unit, the execution unit data path 636 can be adapted to optimize the execution of instructions. For example, two instructions can be merged to improve throughput: an arithmetic instruction that operates on two values in one common register file, and a subsequent instruction that stores the result of the operation in the other common register file. The merged instruction simply performs the arithmetic operation and stores the result in the destination register, which eliminates the separate step of storing the arithmetic result in a common register file. This can be done by analyzing the instructions in the instruction queue, or by the compiler when it translates software code into machine instructions. For example, while translating software code into machine instructions, the compiler can detect the case described above, in which an arithmetic instruction is followed by a move of its result to the other common register file. In that case the compiler can generate a single instruction that combines the arithmetic instruction and the move, instead of generating two separate instructions.

In another embodiment of the invention, a common arithmetic operation executed by a thread of the execution unit is the calculation of texture coordinates followed by storing the texture coordinates in a register of a common register file. In general, the next instruction after the texture coordinate calculation is a sample instruction, or an output instruction that sends the texture coordinates to a texture unit or another component; the texture coordinates are output through the data output buffer. With the architecture described above, these two instructions can be merged into a single instruction that calculates the texture coordinates and outputs them to the designated texture unit or other pipeline component. The execution unit of the invention therefore allows at least five operations to proceed simultaneously. For example, the execution unit can simultaneously perform a constant cache fetch, a vertex attribute cache fetch, a fetch from common register file 0, a fetch from common register file 1, and an output of data from the execution unit data path, where the output may be, as described above, texture coordinates sent to a texture unit or another component.

The seventh figure shows a flowchart of a method according to one embodiment of the invention. The steps can be performed by the corresponding units or by components of the execution unit, for example by the corresponding thread executing the instructions dispatched to it.
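The instruction merging described before the seventh figure is introduced can be sketched as a compiler-style peephole pass. This is a hedged illustration, not the patent's implementation: the `Instr` tuple, the opcode names `add`, `mul`, and `mov`, and the register naming are hypothetical, and the pass assumes the temporary register is not read again after the move.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    dst: str
    src1: str
    src2: Optional[str] = None


ARITHMETIC_OPS = {"add", "mul"}  # hypothetical opcode names


def fuse_arith_move(program):
    """Merge `arith tmp, a, b` followed by `mov dst, tmp` into `arith dst, a, b`.

    Assumes tmp is dead after the move; a real compiler would need liveness
    information to prove that before merging.
    """
    fused = []
    i = 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if (
            nxt is not None
            and cur.op in ARITHMETIC_OPS
            and nxt.op == "mov"
            and nxt.src1 == cur.dst
        ):
            # Write the arithmetic result straight to the final destination.
            fused.append(Instr(cur.op, nxt.dst, cur.src1, cur.src2))
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


program = [
    Instr("add", "crf0.t", "crf0.a", "crf0.b"),  # arithmetic in one register file
    Instr("mov", "crf1.x", "crf0.t"),            # move result to the other register file
    Instr("mul", "crf0.y", "crf0.a", "crf0.b"),
]
merged = fuse_arith_move(program)
```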
At the start, the flow branches into two parallel flows: the first describes thread-level prefetch and arbitration of instructions, and the second describes instruction-level execution arbitration and scheduling. The thread-level flow begins at step 718, which determines the active thread for which an instruction is to be prefetched. One approach is to select, according to the age of the active threads in the execution unit, the oldest thread and prefetch an instruction for it. Another approach is to select the thread that has waited the longest since its last prefetch; those skilled in the art will understand that other selection policies are possible.

In step 720, an instruction is fetched from the instruction cache for the selected active thread. If the instruction is not present in the instruction cache, it must be fetched from the L2 cache or other memory. The fetched instruction is delivered to the active threads over the broadcast bus, as shown in step 722. All active threads can obtain instructions from the broadcast bus, so when more than one thread requests the same instruction, the latency of repeatedly fetching the same instruction is reduced.
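The thread-level steps above (select a thread, fetch, broadcast) can be sketched as follows. This is a minimal model under assumptions: threads are represented as plain dictionaries, and the field names `start_time`, `last_prefetch`, and `wanted_address` are hypothetical bookkeeping introduced for the example.

```python
def select_oldest(threads):
    """Pick the thread with the smallest start time, i.e. the oldest thread."""
    return min(threads, key=lambda t: t["start_time"])


def select_longest_waiting(threads, now):
    """Pick the thread that has waited longest since its last prefetch."""
    return max(threads, key=lambda t: now - t["last_prefetch"])


def broadcast(address, instruction, threads):
    """Deliver a fetched instruction to every thread waiting on that address.

    Returns the ids of the threads that accepted the instruction.
    """
    accepted = []
    for t in threads:
        if t["wanted_address"] == address:
            t["queue"].append(instruction)
            accepted.append(t["id"])
    return accepted


threads = [
    {"id": 0, "start_time": 5, "last_prefetch": 9, "wanted_address": 0x40, "queue": []},
    {"id": 1, "start_time": 2, "last_prefetch": 8, "wanted_address": 0x40, "queue": []},
    {"id": 2, "start_time": 7, "last_prefetch": 3, "wanted_address": 0x80, "queue": []},
]
oldest = select_oldest(threads)
longest = select_longest_waiting(threads, now=10)
# One fetch of the instruction at 0x40 serves both threads that requested it.
served = broadcast(0x40, "mad", threads)
```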

In other words, when different threads request the same instruction, the instruction does not have to be fetched separately for each thread, because all threads, whether in the even group or the odd group, can obtain it from the broadcast bus.

Next, in step 724, the active thread places the instruction in its instruction queue. As described above, each active thread in the execution unit has its own instruction queue, sized to hold a certain number of instructions in order to reduce the latency of fetching instructions from the instruction cache.

The instruction-level flow begins at step 704, which decodes or predecodes the next instruction to be executed by an active thread and thereby determines the type of operation the instruction requires. Step 706 determines the type of the instruction's operation, for example a constant cache access request, a vertex attribute cache access request, a common register file access request, or an instruction that the execution unit data path can execute directly. If the instruction needs to fetch or store a constant, or otherwise interacts with the constant cache, the instruction is sent to the constant cache. In step 710, if the instruction needs to fetch or store a vertex attribute, or otherwise interacts with the vertex attribute cache, the instruction is sent to the vertex attribute cache. In step 708, if the instruction needs to interact with a common register file, the instruction is sent to the common register file. If the execution unit data path can execute the instruction directly, the instruction is sent to the execution unit data path. When the execution unit data path has finished executing the instruction, the data is output to its destination, for example a texture unit or another component outside the execution unit.

The embodiments of the invention can be implemented in hardware, software, firmware, or any combination of these. In some embodiments, data compression can be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. In other embodiments, stages such as triangle setup or attribute setup can be implemented in hardware, which may be any combination of the following: discrete logic circuits, application-specific integrated circuits (ASIC), programmable gate arrays (PGA), and field-programmable gate arrays (FPGA).

The operations or blocks in the flowchart embodiments of the invention should be understood as modules, segments, or portions of program code.
Each such module, segment, or portion includes one or more instructions for implementing a specific logical function or step. Any other variant or alternative embodiment that achieves the purposes and functions of the embodiments of the invention remains within the scope of the invention; without departing from the spirit of the invention, other embodiments may change the order of the operations or perform them concurrently, depending on the functionality involved. The operations can also be understood as modules or portions of hardware logic circuits that implement specific logical functions or steps.

Those skilled in the art will understand that the execution unit described above may include additional components to implement its functions and operations. Although the invention has been disclosed above through several embodiments, those skilled in the art will understand that various substitutions, changes, and improvements to those embodiments remain within the scope of the invention.

[Brief description of the drawings]

The first figure is a functional block diagram of components of the graphics pipeline of a prior-art computer system.
The second figure shows the graphics processing pipeline according to an embodiment of the invention.
The third figure is a block diagram of the graphics processor according to an embodiment of the invention.
The fourth figure is an internal block diagram of the execution unit according to an embodiment of the invention.
The fifth figure is a functional block diagram of the constant cache according to an embodiment of the invention.
The sixth figure is an internal block diagram of the execution unit according to another embodiment of the invention.
The seventh figure is an operational flowchart according to an embodiment of the invention.

[Description of main reference numerals]

100: graphics processing pipeline
110: vertex shader
120: geometry shader
130: triangle setup unit
140: span/tile generator
150: attribute setup unit
160: pixel shader
170: frame buffer
200: graphics processing pipeline
250: memory unit
252: command stream processor
254: vertex shader
256: geometry shader
257: triangle setup unit
258: span/tile generator
259: attribute setup unit
260: pixel shader
262: frame buffer
304: execution unit pool control unit
306: execution unit pool
308: vertex shader
310: geometry shader
312: pixel shader
318: span/tile generator
320: triangle setup unit
322: attribute setup unit
400, 600: execution unit
402, 604: thread controller
404, 606: instruction fetch arbiter 0
406, 608: instruction fetch arbiter 1
408, 602: instruction cache
410, 624: constant cache
412: L2 cache access unit
413: instruction broadcast bus
414, 628: vertex attribute cache
416, 632: common register file 0
417, 610, 612: even threads
418, 634: common register file 1
419, 614, 616: odd threads
420, 636: execution unit data path
422, 638: arithmetic logic unit 0
424, 640: arithmetic logic unit 1
426, 642: interpolator
428, 644: output buffer
500: constant cache
502: header table
504: lookup table
506: cache memory
508, 626, 630: first-in first-out buffer
618: constant cache arbiter
620: common register file arbiter
622: vertex attribute cache arbiter
940: access the corresponding memory address

Claims (1)

VII. Claims:

1. A graphics processing unit, comprising:
an execution unit configured to process programmable shader operations and to process the operations of a plurality of threads simultaneously;
a first memory unit forming a register file that accommodates register operations of the plurality of threads, the memory unit comprising a plurality of memory banks, wherein a plurality of first memory banks are allocated to a plurality of first threads and a plurality of second memory banks are allocated to a plurality of second threads;
a second memory unit forming a constant cache that accommodates constant operations of a plurality of shader operations corresponding to the plurality of threads of the execution unit, the constant cache storing a plurality of contexts corresponding to the plurality of shader operations and storing a plurality of versions of a plurality of constants of the plurality of contexts; and
a third memory unit forming a vertex attribute cache that accommodates vertex attribute operations of the plurality of shader operations corresponding to the plurality of threads of the execution unit.

2. The graphics processing unit of claim 1, further comprising:
a register file arbiter configured to arbitrate register file access requests of instructions executed by the execution unit.

3. The graphics processing unit of claim 1, further comprising:
a constant cache arbiter configured to arbitrate constant cache access requests of instructions executed by the execution unit.

4. The graphics processing unit of claim 1, further comprising:
a vertex attribute cache arbiter configured to arbitrate vertex attribute cache access requests of instructions executed by the execution unit.

5. The graphics processing unit of claim 1, wherein the constant cache is configured to maintain, according to a header table, a plurality of constants of a plurality of contexts of geometry shaders, vertex shaders, and pixel shaders.

6. The graphics processing unit of claim 5, wherein the constant cache is further configured to maintain, according to a lookup table, the plurality of versions of the plurality of constants.

7. The graphics processing unit of claim 3, further comprising:
a constant cache first-in first-out buffer configured to store the plurality of constants that have been fetched and to make the plurality of constants accessible to the plurality of threads of the execution unit.

8. The graphics processing unit of claim 4, further comprising:
a vertex attribute cache first-in first-out buffer configured to store a plurality of vertex attributes that have been fetched and to make the plurality of vertex attributes accessible to the plurality of threads of the execution unit.

9. A graphics processing unit, comprising:
an execution unit configured to perform multithreaded operations, the execution unit comprising a thread controller, the thread controller comprising a first instruction fetch arbiter and a second instruction fetch arbiter;
wherein the first instruction fetch arbiter is configured to fetch instructions on behalf of a plurality of first threads of the execution unit; and
the second instruction fetch arbiter is configured to fetch instructions on behalf of a plurality of second threads of the execution unit.

10. The graphics processing unit of claim 9, wherein the execution unit can process at least eight active threads simultaneously, wherein a first portion of the active threads is allocated to the first instruction fetch arbiter and the remainder of the active threads is allocated to the second instruction fetch arbiter, the first portion comprising at least four of the active threads.

11. The graphics processing unit of claim 10, further comprising:
an instruction cache configured to deliver instructions to the at least eight active threads;
wherein the first instruction fetch arbiter is configured to fetch instructions from the instruction cache on behalf of the first portion of the active threads, and the second instruction fetch arbiter is configured to fetch instructions from the instruction cache on behalf of the remainder of the active threads; and
wherein the first instruction fetch arbiter and the second instruction fetch arbiter are configured to broadcast the fetched instructions to the at least eight active threads.

12. The graphics processing unit of claim 11, wherein each of the at least eight active threads further comprises:
an instruction queue configured to hold a first instruction delivered by the instruction cache; and
an instruction predecoder configured to determine a data access request type of a second instruction in the instruction queue.

13. The graphics processing unit of claim 12, wherein the data access request type of the second instruction comprises at least one of the following: a constant cache request, a vertex attribute cache request, a common register file request, and a request that the execution unit data path can execute directly.

14. The graphics processing unit of claim 12, wherein the instruction predecoder is further configured to send the second instruction to at least one of the following: the constant cache arbiter, the vertex attribute cache arbiter, the common register file arbiter, and the execution unit data path.

15. The graphics processing unit of claim [...], wherein the execution unit can [...]

16. The graphics processing unit of claim 12, wherein the thread controller is configured to issue a third instruction to an execution unit data path.
An instruction processing method for an execution unit, comprising: a first valid thread corresponding to a plurality of valid threads: extracting a first instruction of an execution unit from the instruction cache; 7 An instruction to the plurality of valid threads; maintaining the first instruction in a sequence of instructions corresponding to at least one of the plurality of valid threads; deciphering a second instruction in the array of instructions; and issuing The data access request of the second instruction is given to at least one of the following: a constant 0608-A43050TWF jj 201137786 = Qualification: ._ cache memory, a common temporary storage _ and 18. The instruction processing method described above, the step of taking the instruction and the step of transmitting the second instruction can be performed simultaneously. Said that the method of processing instructions as described in Item 17 of the Patent Fan Park, further includes: Transport - request to the constant cache (4), the towel is often maintained - er, lose the tender fiber (four) _ text = b- group often _ Entity base, which corresponds to the vertex shader, _color, and constants maintained within the pixel _::. Wang Wei, two internal texts 20. According to the instruction processing method described in claim 19, the digital cache memory further includes a comparison table, and the comparison table uses the number of hangs that are often changed by the hall and the changed The address of the constant. Hid 21. The instruction processing method as described in the third paragraph of the patent scope, =, the vertex attribute cache memory, the common temporary storage two: often, the data path can simultaneously process the instruction. 
Tian Shui and = the instruction processing method described in item n of the patent scope, although the information required for the execution of the instruction is ready to be included: the metadata path; and the &amp; Dalle order to rotate the data for the execution of the single unit data path Through the data rollout buffer from the execution 0608-A43050TWF 34
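
The fetch, broadcast, queue, and predecode steps recited in claims 9 to 14 and 17 can be pictured with a small software model. The following is only an illustrative sketch and not the patented hardware: the class names, the round-robin policy inside each arbiter, the rule that a thread keeps a broadcast instruction only when it was fetched on its behalf, and the operand-prefix encoding (`c`, `v`, `r`) are all assumptions introduced here for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class Target(Enum):
    """Destinations for a data access request, as listed in claims 13 and 14."""
    CONSTANT_CACHE = "constant cache"
    VERTEX_ATTRIBUTE_CACHE = "vertex attribute cache"
    COMMON_REGISTER_FILE = "common register file"
    EU_DATA_PATH = "execution unit data path"


# Hypothetical encoding: the first letter of the last operand picks the target.
PREFIX_TO_TARGET = {
    "c": Target.CONSTANT_CACHE,
    "v": Target.VERTEX_ATTRIBUTE_CACHE,
    "r": Target.COMMON_REGISTER_FILE,
}


@dataclass
class Thread:
    tid: int
    pc: int = 0                                   # next instruction to fetch
    queue: deque = field(default_factory=deque)   # per-thread instruction queue

    def accept(self, tid, instr):
        """Keep a broadcast instruction only if it was fetched for this thread
        (an assumed filtering rule)."""
        if tid == self.tid:
            self.queue.append(instr)

    def predecode(self):
        """Pop the oldest queued instruction and classify its data access."""
        instr = self.queue.popleft()
        operand = instr.split()[-1]
        # Anything without a known prefix goes straight to the data path.
        return instr, PREFIX_TO_TARGET.get(operand[0], Target.EU_DATA_PATH)


class FetchArbiter:
    """Fetches instructions on behalf of one group of threads."""

    def __init__(self, group):
        self.group = group
        self._next = 0  # round-robin pointer (assumed arbitration policy)

    def step(self, icache, all_threads):
        requester = self.group[self._next]
        self._next = (self._next + 1) % len(self.group)
        if requester.pc >= len(icache):
            return None  # this thread has no more instructions to fetch
        instr = icache[requester.pc]
        requester.pc += 1
        for thread in all_threads:  # broadcast to every active thread
            thread.accept(requester.tid, instr)
        return instr


def build_execution_unit(num_threads=8):
    """Split the active threads evenly between two fetch arbiters."""
    threads = [Thread(tid) for tid in range(num_threads)]
    half = num_threads // 2
    return threads, FetchArbiter(threads[:half]), FetchArbiter(threads[half:])
```

In this model both arbiters would be stepped once per cycle, so two threads receive a new instruction each cycle while other threads predecode instructions that are already queued. That mirrors, loosely, the overlap of fetching and issuing described in claim 18.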
TW100110084A 2010-04-21 2011-03-24 System and method for improving throughput of a graphics processing unit TWI474280B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US12/764,256 US8564604B2 (en) 2007-06-12 2010-04-21 Systems and methods for improving throughput of a graphics processing unit

Publications (2)

Publication Number Publication Date
TW201137786A true TW201137786A (en) 2011-11-01
TWI474280B TWI474280B (en) 2015-02-21

Family

ID=44310115

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100110084A TWI474280B (en) 2010-04-21 2011-03-24 System and method for improving throughput of a graphics processing unit

Country Status (2)

Country Link
CN (2) CN102136128B (en)
TW (1) TWI474280B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9495721B2 (en) 2012-12-21 2016-11-15 Nvidia Corporation Efficient super-sampling with per-pixel shader threads
US9741154B2 (en) 2012-11-21 2017-08-22 Intel Corporation Recording the results of visibility tests at the input geometry object granularity
US10134101B2 (en) 2012-02-27 2018-11-20 Intel Corporation Using cost estimation to improve performance of tile rendering for image processing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118089B (en) * 2015-08-19 2018-03-20 上海兆芯集成电路有限公司 Programmable pixel placement method in 3-D graphic pipeline and use its device

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5958028A (en) * 1997-07-22 1999-09-28 National Instruments Corporation GPIB system and method which allows multiple thread access to global variables
CN100359506C (en) * 2001-12-20 2008-01-02 杉桥技术公司 Multithreaded processor with efficient processing for convergence device applications
US6895497B2 (en) * 2002-03-06 2005-05-17 Hewlett-Packard Development Company, L.P. Multidispatch CPU integrated circuit having virtualized and modular resources and adjustable dispatch priority
US7310722B2 (en) * 2003-12-18 2007-12-18 Nvidia Corporation Across-thread out of order instruction dispatch in a multithreaded graphics processor
US7454599B2 (en) * 2005-09-19 2008-11-18 Via Technologies, Inc. Selecting multiple threads for substantially concurrent processing
CN1928918B (en) * 2005-10-14 2012-10-10 威盛电子股份有限公司 Graphics processing apparatus and method for performing shading operations therein
CN101145239A (en) * 2006-06-20 2008-03-19 威盛电子股份有限公司 Graphics processing unit and method for border color handling
US20080198166A1 (en) * 2007-02-16 2008-08-21 Via Technologies, Inc. Multi-threads vertex shader, graphics processing unit, and flow control method
US8086825B2 (en) * 2007-12-31 2011-12-27 Advanced Micro Devices, Inc. Processing pipeline having stage-specific thread selection and method thereof
US20090189896A1 (en) * 2008-01-25 2009-07-30 Via Technologies, Inc. Graphics Processor having Unified Shader Unit
US9214007B2 (en) * 2008-01-25 2015-12-15 Via Technologies, Inc. Graphics processor having unified cache system
GB2457265B (en) * 2008-02-07 2010-06-09 Imagination Tech Ltd Prioritising of instruction fetching in microprocessor systems
US20090289947A1 (en) * 2008-05-20 2009-11-26 Himax Technologies Limited System and method for processing data sent from a graphic engine

Also Published As

Publication number Publication date
TWI474280B (en) 2015-02-21
CN102136128A (en) 2011-07-27
CN102982503A (en) 2013-03-20
CN102982503B (en) 2015-10-21
CN102136128B (en) 2014-05-21

Similar Documents

Publication Publication Date Title
US8730249B2 (en) Parallel array architecture for a graphics processor
US8074224B1 (en) Managing state information for a multi-threaded processor
JP5624733B2 (en) Graphics processing system
US7594095B1 (en) Multithreaded SIMD parallel processor with launching of groups of threads
US7447873B1 (en) Multithreaded SIMD parallel processor with loading of groups of threads
TWI275039B (en) Method and apparatus for generating a shadow effect using shadow volumes
US7477260B1 (en) On-the-fly reordering of multi-cycle data transfers
TW201020965A (en) Graphics processing units, execution units and task-managing methods
TWI437507B (en) System and method for memory access of multi-thread execution units in a graphics processing apparatus
US8564604B2 (en) Systems and methods for improving throughput of a graphics processing unit
JP2007525768A (en) Register-based queuing for texture requests
US7747842B1 (en) Configurable output buffer ganging for a parallel processor
US10600232B2 (en) Creating a ray differential by accessing a G-buffer
CN107392836B (en) Stereoscopic multi-projection using a graphics processing pipeline
CN110675480B (en) Method and apparatus for acquiring sampling position of texture operation
US10198789B2 (en) Out-of-order cache returns
US7484076B1 (en) Executing an SIMD instruction requiring P operations on an execution unit that performs Q operations at a time (Q<P)
US10430989B2 (en) Multi-pass rendering in a screen space pipeline
US20080313434A1 (en) Rendering Processing Apparatus, Parallel Processing Apparatus, and Exclusive Control Method
US10417813B2 (en) System and method for generating temporally stable hashed values
TW201137786A (en) System and method for improving throughput of a graphics processing unit
US20190236166A1 (en) Performing a texture level-of-detail approximation
US10212406B2 (en) Image generation of a three-dimensional scene using multiple focal lengths
TWI359388B (en) Triangle setup and attribute setup integration wit
JP2024510626A (en) Synchronous free cross-path binning with subpath interleaving