CN107004280A - Method and apparatus for efficient texture compression

Info

Publication number
CN107004280A
Authority
CN
China
Prior art keywords
texture
texture block
texel
processor
rbf
Legal status
Pending
Application number
CN201480079739.4A
Other languages
Chinese (zh)
Inventor
T·T·马克西姆克扎克
T·波涅茨
Current Assignee
Intel Corp
Original Assignee
Intel Corp
Application filed by Intel Corp
Publication of CN107004280A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/20 - General purpose image data processing; processor architectures, processor configuration, e.g. pipelining
    • G06T 7/40 - Image analysis; analysis of texture
    • G06T 7/49 - Analysis of texture based on structural texture description, e.g. using primitives or placement rules
    • G06T 9/00 - Image coding
    • G06T 9/004 - Image coding; predictors, e.g. intraframe, interframe coding
    • G06T 11/001 - 2D [Two Dimensional] image generation; texturing, colouring, generation of texture or colour
    • G06T 15/04 - 3D [Three Dimensional] image rendering; texture mapping

Abstract

An apparatus and method for texture compression are described. For example, one embodiment of a method comprises: determining a distance between each texture block texel of a plurality of texture block texels and each point of a plurality of points; determining a set of texel color values sampled for the texture block; and generating a set of approximation coefficients for compressing the texture block using the distances between each texture block texel and each of the plurality of points and the set of texel color values sampled for the texture block.
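For illustration only, the flow summarized in the abstract can be sketched numerically. The sketch below is not the claimed embodiment: it assumes a Gaussian radial basis function (RBF), a 4x4 texel block, four center points, and an ordinary least-squares fit, whereas the basis function, center placement, and block sizes actually used are defined later with respect to Figures 11-14. It shows the three steps of the abstract (texel-to-point distances, sampled texel colors, approximation coefficients) and the matrix-times-coefficient-vector decompression of Figure 13.

```python
import numpy as np

def rbf_matrix(texels, centers, sigma=2.0):
    # Distance between every texel of the block and every center point,
    # passed through a Gaussian RBF (the RBF choice is an assumption here).
    d = np.linalg.norm(texels[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))        # shape (texels, centers)

def fit_coefficients(phi, sampled_colors):
    # Approximation coefficients: least-squares fit so that phi @ coeffs
    # reproduces the sampled texel color values as closely as possible.
    coeffs, *_ = np.linalg.lstsq(phi, sampled_colors, rcond=None)
    return coeffs                                          # compressed representation

def decompress(phi, coeffs):
    # Decompression is a single matrix x coefficient-vector product per channel.
    return phi @ coeffs

# 4x4 block, four center points, RGB texels (all sizes are illustrative).
n = 4
ys, xs = np.mgrid[0:n, 0:n]
texels = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)   # (16, 2)
centers = np.array([[0.5, 0.5], [2.5, 0.5], [0.5, 2.5], [2.5, 2.5]])
colors = np.random.rand(n * n, 3)        # set of sampled texel color values

phi = rbf_matrix(texels, centers)
coeffs = fit_coefficients(phi, colors)   # 4 coefficients per channel instead of 16 texels
approx = decompress(phi, coeffs)         # (16, 3) reconstructed block
```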

Description

Method and apparatus for efficient texture compression
Background
Technical field
This disclosure relates generally to the field of computer processors. More particularly, it relates to an apparatus and method for efficient texture compression.
Description of Related Art
Texture mapping is a well-known technique implemented in graphics pipelines for applying a texture to the surface of a shape or polygon. Texture data is typically stored in an N x N matrix of "texels" (also referred to as "texture elements" or "texture pixels"). Thus, a texture is represented by an array of texels in the same way that an image is represented by an array of pixels. When texture mapping is performed, logic in the graphics processing unit (GPU) maps texels to the appropriate pixels in the output image.
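As a software illustration of the texel-array representation just described (not of any claimed hardware), the following sketch maps a normalized (u, v) coordinate to the nearest texel of an N x N texel matrix; real GPU samplers additionally handle filtering, wrap modes, and mip levels, which are omitted here.

```python
import numpy as np

def sample_nearest(texture, u, v):
    """Return the texel of an N x N texel matrix nearest to normalized (u, v)."""
    n = texture.shape[0]
    tx = min(int(u * n), n - 1)   # texel column
    ty = min(int(v * n), n - 1)   # texel row
    return texture[ty, tx]

# The output-image pixel mapped to (u, v) = (0.7, 0.2) receives the color
# of texel (row 0, column 2) of this 4 x 4 RGB texture.
texture = np.random.rand(4, 4, 3)
color = sample_nearest(texture, 0.7, 0.2)
```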
Texel compression techniques are implemented to reduce the amount of memory consumed by texture data. Current texture compression methods do not treat the texture image as a function of the texture image coordinates that could be approximated numerically. Instead, these techniques use algorithmic methods to identify dominant colors and to decode color gradients between the texels of a block. Compression is typically algorithm-intensive and is applied out of band and/or offline. Decompression involves multiple steps and frequently involves temporal and/or spatial dependencies between adjacent texels. These factors drive increased memory/cache size and bandwidth requirements, and limit the applicability of such methods to massively parallel implementations.
Brief description of the drawings
A better understanding of the present invention can be obtained from the following detailed description when read in conjunction with the following drawings, in which:
Fig. 1 is a block diagram of an embodiment of a computer system with a processor having one or more processor cores and graphics processors;
Fig. 2 is a block diagram of one embodiment of a processor having one or more processor cores, an integrated memory controller, and an integrated graphics processor;
Fig. 3 is a block diagram of one embodiment of a graphics processor, which may be a discrete graphics processing unit or a graphics processor integrated with a plurality of processing cores;
Fig. 4 is a block diagram of an embodiment of a graphics processing engine for a graphics processor;
Fig. 5 is a block diagram of another embodiment of a graphics processor;
Fig. 6 is a block diagram of thread execution logic including an array of processing elements;
Fig. 7 illustrates a graphics processor execution unit instruction format according to an embodiment;
Fig. 8 is a block diagram of another embodiment of a graphics processor, which includes a graphics pipeline, a media pipeline, a display engine, thread execution logic, and a render output pipeline;
Fig. 9A is a block diagram illustrating a graphics processor command format according to an embodiment;
Fig. 9B is a block diagram illustrating a graphics processor command sequence according to an embodiment;
Fig. 10 illustrates an exemplary graphics software architecture for a data processing system according to an embodiment;
Fig. 11 illustrates one embodiment of an architecture for texture compression and decompression;
Figs. 12A-B illustrate exemplary center point placements used to perform compression in one embodiment;
Fig. 12C illustrates the relationship between block size, center points, and compression ratio in one embodiment;
Fig. 13 illustrates operations performed in one embodiment using a matrix, a coefficient vector, and a decompressed texel block; and
Fig. 14 illustrates a method in accordance with one embodiment of the invention.
Detailed Description
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention described below. It will be apparent, however, to one skilled in the art that the embodiments of the invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form to avoid obscuring the underlying principles of the embodiments of the invention.
Exemplary Graphics Processor Architectures and Data Types
Overview --- Figs. 1-3
Fig. 1 is a block diagram of a data processing system 100, according to an embodiment. The data processing system 100 includes one or more processors 102 and one or more graphics processors 108, and may be a single-processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 102 or processor cores 107. In one embodiment, the data processing system 100 is a system-on-a-chip (SOC) integrated circuit for use in mobile, handheld, or embedded devices.
An embodiment of the data processing system 100 can include or be incorporated within a server-based gaming platform or a game console, including a game and media console, a mobile gaming console, a handheld game console, or an online game console. In one embodiment, the data processing system 100 is a mobile phone, smart phone, tablet computing device, or mobile Internet device. The data processing system 100 can also include, couple with, or be integrated within a wearable device, such as a smart watch wearable device, smart eyewear device, augmented reality device, or virtual reality device. In one embodiment, the data processing system 100 is a television or set-top box device having one or more processors 102 and a graphical interface generated by one or more graphics processors 108.
The one or more processors 102 each include one or more processor cores 107 to process instructions which, when executed, perform operations for system and user software. In one embodiment, each of the one or more processor cores 107 is configured to process a specific instruction set 109. The instruction set 109 may facilitate complex instruction set computing (CISC), reduced instruction set computing (RISC), or computing via a very long instruction word (VLIW). Multiple processor cores 107 may each process a different instruction set 109, which may include instructions to facilitate the emulation of other instruction sets. A processor core 107 may also include other processing devices, such as a digital signal processor (DSP).
In one embodiment, the processor 102 includes cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. In one embodiment, the cache memory is shared among various components of the processor 102. In one embodiment, the processor 102 also uses an external cache (e.g., a Level-3 (L3) cache or last level cache (LLC)) (not shown), which may be shared among the processor cores 107 using cache coherency techniques. A register file 106 is additionally included in the processor 102 and may include different types of registers for storing different types of data (e.g., integer registers, floating point registers, status registers, and an instruction pointer register). Some registers may be general-purpose registers, while other registers may be specific to the design of the processor 102.
The processor 102 is coupled to a processor bus 110 to transmit data signals between the processor 102 and other components in the system 100. The system 100 uses an exemplary "hub" system architecture, including a memory controller hub 116 and an input/output (I/O) controller hub 130. The memory controller hub 116 facilitates communication between a memory device and other components of the system 100, while the I/O controller hub (ICH) 130 provides connections to I/O devices via a local I/O bus.
The memory device 120 can be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or some other memory device having suitable performance to serve as process memory. The memory 120 can store data 122 and instructions 121 for use when the processor 102 executes a process. The memory controller hub 116 also couples with an optional external graphics processor 112, which may communicate with the one or more graphics processors 108 in the processor 102 to perform graphics and media operations.
The ICH 130 enables peripherals to connect to the memory 120 and the processor 102 via a high-speed I/O bus. The I/O peripherals include an audio controller 146, a firmware interface 128, a wireless transceiver 126 (e.g., Wi-Fi, Bluetooth), a data storage device 124 (e.g., hard disk drive, flash memory, etc.), and a legacy I/O controller for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. One or more Universal Serial Bus (USB) controllers 142 connect input devices, such as keyboard and mouse 144 combinations. A network controller 134 may also couple to the ICH 130. In one embodiment, a high-performance network controller (not shown) couples to the processor bus 110.
Fig. 2 is a block diagram of an embodiment of a processor 200 having one or more processor cores 202A-N, an integrated memory controller 214, and an integrated graphics processor 208. The processor 200 can include additional cores up to and including additional core 202N, represented by the dashed boxes. Each of the cores 202A-N includes one or more internal cache units 204A-N. In one embodiment, each core also has access to one or more shared cache units 206.
The internal cache units 204A-N and the shared cache units 206 represent a cache memory hierarchy within the processor 200. The cache memory hierarchy may include at least one level of instruction and data cache within each core and one or more levels of shared mid-level cache, such as a Level 2 (L2), Level 3 (L3), Level 4 (L4), or other level of cache, where the highest level of cache before external memory is classified as the last level cache (LLC). In one embodiment, cache coherency logic maintains coherency between the various cache units 206 and 204A-N.
The processor 200 may also include a set of one or more bus controller units 126 and a system agent 210. The one or more bus controller units manage a set of peripheral buses, such as one or more Peripheral Component Interconnect buses (e.g., PCI, PCI Express). The system agent 210 provides management functionality for the various processor components. In one embodiment, the system agent 210 includes one or more integrated memory controllers 214 to manage access to various external memory devices (not shown).
In some embodiments, one or more of the cores 202A-N include support for simultaneous multi-threading. In such an embodiment, the system agent 210 includes components for coordinating and operating the cores 202A-N during multi-threaded processing. The system agent 210 may additionally include a power control unit (PCU), which includes logic and components to regulate the power state of the cores 202A-N and the graphics core 208.
The processor 200 additionally includes a graphics processor 208 to execute graphics processing operations. In one embodiment, the graphics processor 208 couples with the set of shared cache units 206 and with the system agent unit 210, which includes the one or more integrated memory controllers 214. In one embodiment, a display controller 211 is coupled with the graphics processor 208 to drive graphics processor output to one or more coupled displays. The display controller 211 may be a separate module coupled with the graphics processor via at least one interconnect, or may be integrated within the graphics processor 208 or the system agent 210.
In one embodiment, a ring-based interconnect unit 212 is used to couple the internal components of the processor 200; however, an alternative interconnect unit may be used, such as a point-to-point interconnect, a switched interconnect, or other techniques, including techniques well known in the art. In one embodiment, the graphics processor 208 couples with the ring interconnect 212 via an I/O link 213.
The exemplary I/O link 213 represents at least one of multiple varieties of I/O interconnect, including an on-package I/O interconnect that facilitates communication between various processor components and a high-performance embedded memory module 218, such as an eDRAM module. In one embodiment, each of the cores 202A-N and the graphics processor 208 use the embedded memory module 218 as a shared last level cache.
In one embodiment, the cores 202A-N are homogeneous cores executing the same instruction set architecture. In another embodiment, the cores 202A-N are heterogeneous in terms of instruction set architecture (ISA), where one or more of the cores 202A-N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set.
The processor 200 can be a part of one or more substrates, or can be implemented on one or more substrates using any of a number of process technologies, for example, complementary metal-oxide-semiconductor (CMOS), bipolar junction/complementary metal-oxide-semiconductor (BiCMOS), or N-type metal-oxide-semiconductor logic (NMOS). Additionally, the processor 200 can be implemented on one or more chips, or as a system-on-a-chip (SOC) integrated circuit having the illustrated components in addition to other components.
Fig. 3 is a block diagram of one embodiment of a graphics processor 300, which may be a discrete graphics processing unit or a graphics processor integrated with a plurality of processing cores. In one embodiment, the graphics processor is communicated with via a memory-mapped I/O interface to registers on the graphics processor and via commands placed into processor memory. The graphics processor 300 includes a memory interface 314 to access memory. The memory interface 314 can be an interface to local memory, one or more internal caches, one or more shared external caches, and/or to system memory.
The graphics processor 300 also includes a display controller 302 to drive display output data to a display device 320. The display controller 302 includes hardware for one or more overlay planes for the display and for the composition of multiple layers of video or user interface elements. In one embodiment, the graphics processor 300 includes a video codec engine 306 to encode, decode, or transcode media to, from, or between one or more media encoding formats, including, but not limited to, Moving Picture Experts Group (MPEG) formats such as MPEG-2, Advanced Video Coding (AVC) formats such as H.264/MPEG-4 AVC, the Society of Motion Picture and Television Engineers (SMPTE) 421M/VC-1, and Joint Photographic Experts Group (JPEG) formats such as JPEG and Motion JPEG (MJPEG).
In one embodiment, the graphics processor 300 includes a block image transfer (BLIT) engine 304 to perform two-dimensional (2D) rasterizer operations including, for example, bit-boundary block transfers. However, in one embodiment, 2D graphics operations are performed using one or more components of a graphics processing engine (GPE) 310. The graphics processing engine 310 is a compute engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.
The GPE 310 includes a 3D pipeline 312 for performing 3D operations, such as rendering three-dimensional images and scenes using processing functions that act upon 3D primitive shapes (e.g., rectangles, triangles, etc.). The 3D pipeline 312 includes programmable and fixed function elements that perform various tasks within the element and/or spawn execution threads to a 3D/media subsystem 315. While the 3D pipeline 312 can be used to perform media operations, an embodiment of the GPE 310 also includes a media pipeline 316 to perform media operations, such as video post-processing and image enhancement.
In one embodiment, the media pipeline 316 includes fixed function or programmable logic units to perform one or more specialized media operations, such as video decode acceleration, video de-interlacing, and video encode acceleration, in place of or on behalf of the video codec engine 306. In one embodiment, the media pipeline 316 additionally includes a thread spawning unit to spawn threads for execution on the 3D/media subsystem 315. The spawned threads perform the computations for the media operations on one or more graphics execution units included in the 3D/media subsystem.
The 3D/media subsystem 315 includes logic for executing threads spawned by the 3D pipeline 312 and the media pipeline 316. In one embodiment, the pipelines send thread execution requests to the 3D/media subsystem 315, which includes thread dispatch logic for arbitrating the various requests and dispatching them to available thread execution resources. The execution resources include an array of graphics execution units to process the 3D and media threads. In one embodiment, the 3D/media subsystem 315 includes one or more internal caches for thread instructions and data. In one embodiment, the subsystem also includes shared memory, including registers and addressable memory, to share data between threads and to store output data.
3D/Media Processing --- Fig. 4
Fig. 4 is a block diagram of an embodiment of a graphics processing engine 410 for a graphics processor. In one embodiment, the graphics processing engine (GPE) 410 is a version of the GPE 310 shown in Fig. 3. The GPE 410 includes a 3D pipeline 412 and a media pipeline 416, each of which can be either different from or similar to the implementations of the 3D pipeline 312 and the media pipeline 316 of Fig. 3.
In one embodiment, the GPE 410 couples with a command streamer 403, which provides a command stream to the GPE 3D and media pipelines 412, 416. The command streamer 403 is coupled to memory, which can be system memory or one or more of internal cache memory and shared cache memory. The command streamer 403 receives commands from the memory and sends them to the 3D pipeline 412 and/or the media pipeline 416. The 3D and media pipelines process the commands either by performing operations via logic within the respective pipeline or by dispatching one or more execution threads to an execution unit array 414. In one embodiment, the execution unit array 414 is scalable, such that the array includes a variable number of execution units based on the target power and performance level of the GPE 410.
A sampling engine 430 couples with memory (e.g., cache memory or system memory) and with the execution unit array 414. In one embodiment, the sampling engine 430 provides a memory access mechanism for the scalable execution unit array 414 that allows the execution array 414 to read graphics and media data from memory. In one embodiment, the sampling engine 430 includes logic to perform specialized image sampling operations for media.
The specialized media sampling logic in the sampling engine 430 includes a de-noise/de-interlace module 432, a motion estimation module 434, and an image scaling and filtering module 436. The de-noise/de-interlace module 432 includes logic to perform one or more of de-noising or de-interlacing on decoded video data. The de-interlace logic combines alternating fields of interlaced video content into a single frame of video. The de-noise logic reduces or removes data noise from video and image data. In one embodiment, the de-noise and de-interlace logic are motion adaptive and use spatial or temporal filtering based on the amount of motion detected in the video data. In one embodiment, the de-noise/de-interlace module 432 includes dedicated motion detection logic (e.g., within the motion estimation engine 434).
The motion estimation engine 434 provides hardware acceleration for video operations by performing video acceleration functions, such as motion vector estimation and prediction, on video data. The motion estimation engine determines motion vectors that describe the transformation of image data between successive video frames. In one embodiment, a graphics processor media codec uses the video motion estimation engine to perform operations on video at the macroblock level that might otherwise be computationally intensive to perform using a general-purpose processor. In one embodiment, the motion estimation engine 434 is generally available to graphics processor components to assist with video decode and processing functions that are sensitive or adaptive to the direction or magnitude of the motion within video data.
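As a software analogue of the macroblock-level motion search (illustrative only; this is not the hardware algorithm of motion estimation engine 434, and the 16x16 block size and +/-4 texel search radius are assumptions), a minimal sum-of-absolute-differences block matcher looks like this:

```python
import numpy as np

def estimate_motion_vector(prev_frame, cur_frame, bx, by, block=16, radius=4):
    """Find the (dy, dx) offset into prev_frame whose block best matches the
    cur_frame macroblock with top-left corner (by, bx)."""
    target = cur_frame[by:by + block, bx:bx + block].astype(np.int32)
    best, best_sad = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + block > prev_frame.shape[0] or x + block > prev_frame.shape[1]:
                continue
            candidate = prev_frame[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(candidate - target).sum()   # sum of absolute differences
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best

prev = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(prev, (1, 2), axis=(0, 1))        # content moves down 1, right 2
mv = estimate_motion_vector(prev, cur, 16, 16)  # best match at offset (-1, -2) in prev
```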
The image scaling and filtering module 436 performs image-processing operations to enhance the visual quality of generated images and video. In one embodiment, the scaling and filtering module 436 processes image and video data during the sampling operation, before the data is provided to the execution unit array 414.
In one embodiment, the graphics processing engine 410 includes a data port 444, which provides an additional mechanism for graphics subsystems to access memory. The data port 444 facilitates memory access for operations including render target writes, constant buffer reads, scratch memory space reads/writes, and media surface accesses. In one embodiment, the data port 444 includes cache memory space to cache accesses to memory. The cache memory can be a single data cache or can be separated into multiple caches for the multiple subsystems that access memory via the data port (e.g., a render buffer cache, a constant buffer cache, etc.). In one embodiment, threads executing on an execution unit in the execution unit array 414 communicate with the data port by exchanging messages via a data distribution interconnect that couples each of the subsystems of the graphics processing engine 410.
Execution Units --- Figs. 5-7
Fig. 5 is a block diagram of another embodiment of a graphics processor. In one embodiment, the graphics processor includes a ring interconnect 502, a pipeline front end 504, a media engine 537, and graphics cores 580A-N. The ring interconnect 502 couples the graphics processor to other processing units, including other graphics processors or one or more general-purpose processor cores. In one embodiment, the graphics processor is one of many processors integrated within a multi-core processing system.
The graphics processor receives batches of commands via the ring interconnect 502. The incoming commands are interpreted by a command streamer 503 in the pipeline front end 504. The graphics processor includes scalable execution logic to perform 3D geometry processing and media processing via the graphics core(s) 580A-N. For 3D geometry processing commands, the command streamer 503 supplies the commands to a geometry pipeline 536. For at least some media processing commands, the command streamer 503 supplies the commands to a video front end 534, which couples with the media engine 537. The media engine 537 includes a video quality engine (VQE) 530 for video and image post-processing and a multi-format encode/decode (MFX) engine 533 to provide hardware-accelerated encoding and decoding of media data. The geometry pipeline 536 and the media engine 537 each generate execution threads for the thread execution resources provided by at least one graphics core 580A.
The graphics processor includes scalable thread execution resources featuring modular cores 580A-N (sometimes referred to as core slices), each having multiple sub-cores 550A-N, 560A-N (sometimes referred to as core sub-slices). The graphics processor can have any number of graphics cores 580A through 580N. In one embodiment, the graphics processor includes a graphics core 580A having at least a first sub-core 550A and a second sub-core 560A. In another embodiment, the graphics processor is a low-power processor with a single sub-core (e.g., 550A). In one embodiment, the graphics processor includes multiple graphics cores 580A-N, each including a set of first sub-cores 550A-N and a set of second sub-cores 560A-N. Each sub-core in the set of first sub-cores 550A-N includes at least a first set of execution units 552A-N and media/texture samplers 554A-N. Each sub-core in the set of second sub-cores 560A-N includes at least a second set of execution units 562A-N and samplers 564A-N. In one embodiment, each sub-core 550A-N, 560A-N shares a set of shared resources 570A-N. In one embodiment, the shared resources include shared cache memory and pixel operation logic. Other shared resources may also be included in the various embodiments of the graphics processor.
Fig. 6 illustrates thread execution logic 600 including an array of processing elements employed in one embodiment of a graphics processing engine. In one embodiment, the thread execution logic 600 includes a pixel shader 602, a thread dispatcher 604, an instruction cache 606, a scalable execution unit array including a plurality of execution units 608A-N, a sampler 610, a data cache 612, and a data port 614. In one embodiment, the included components are interconnected via an interconnect fabric that links to each of the components. The thread execution logic 600 includes one or more connections to memory (such as system memory or cache memory) through one or more of the instruction cache 606, the data port 614, the sampler 610, and the execution unit array 608A-N. In one embodiment, each execution unit (e.g., 608A) is an individual vector processor capable of executing multiple simultaneous threads and processing multiple data elements in parallel for each thread. The execution unit array 608A-N includes any number of individual execution units.
In one embodiment, the execution unit array 608A-N is primarily used to execute "shader" programs. In one embodiment, the execution units in the array 608A-N execute an instruction set that includes native support for many standard 3D graphics shader instructions, such that shader programs from graphics libraries (e.g., Direct3D and OpenGL) are executed with minimal translation. The execution units support vertex and geometry processing (e.g., vertex programs, geometry programs, vertex shaders), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders).
Each execution unit in the execution unit array 608A-N operates on arrays of data elements. The number of data elements is the "execution size," or the number of channels for the instruction. An execution channel is a logical unit of execution for data element access, masking, and flow control within instructions. The number of channels may be independent of the number of physical ALUs or FPUs for a particular graphics processor. The execution units 608A-N support integer and floating-point data types.
The execution unit instruction set includes single instruction multiple data (SIMD) instructions. The various data elements can be stored as a packed data type in a register, and the execution unit processes the various elements based on the data size of the elements. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register and the execution unit operates on the vector as four separate 64-bit packed data elements (Quad-Word (QW) size data elements), eight separate 32-bit packed data elements (Double Word (DW) size data elements), sixteen separate 16-bit packed data elements (Word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). However, different vector widths and register sizes are possible.
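The packed-data interpretations above can be mirrored in software: the same 256 bits can be viewed as quad-words, double-words, words, or bytes, and a SIMD operation then acts on every channel at once. A rough sketch (the execution unit does this on register contents, not on memory arrays):

```python
import numpy as np

reg = np.arange(32, dtype=np.uint8)   # 256 bits of packed register contents

qw = reg.view(np.uint64)   #  4 x 64-bit quad-word (QW) elements
dw = reg.view(np.uint32)   #  8 x 32-bit double-word (DW) elements
w  = reg.view(np.uint16)   # 16 x 16-bit word (W) elements
b  = reg.view(np.uint8)    # 32 x  8-bit byte (B) elements

# A SIMD add executes across all channels of the chosen element size:
result = dw + np.ones(8, dtype=np.uint32)   # eight 32-bit additions at once
```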
One or more internal instruction caches (e.g., 606) are included in the thread execution logic 600 to cache thread instructions for the execution units. In one embodiment, one or more data caches (e.g., 612) are included to cache thread data during thread execution. A sampler 610 is included to provide texture sampling for 3D operations and media sampling for media operations. In one embodiment, the sampler 610 includes specialized texture or media sampling functionality to process texture or media data during the sampling process, before the sampled data is provided to an execution unit.
During execution, the graphics and media pipelines send thread initiation requests to the thread execution logic 600 via thread spawning and dispatch logic. The thread execution logic 600 includes a local thread dispatcher 604 that arbitrates thread initiation requests from the graphics and media pipelines and instantiates the requested threads on one or more execution units 608A-N. For example, the geometry pipeline (e.g., 536 of Fig. 5) dispatches vertex processing, tessellation, or geometry processing threads to the thread execution logic 600. The thread dispatcher 604 can also process runtime thread spawning requests from executing shader programs.
Once a group of geometric objects has been processed and rasterized into pixel data, the pixel shader 602 is invoked to further compute output information and cause results to be written to output surfaces (e.g., color buffers, depth buffers, stencil buffers, etc.). In one embodiment, the pixel shader 602 calculates the values of the various vertex attributes that are to be interpolated across the rasterized object. The pixel shader 602 then executes an API-supplied pixel shader program. To execute the pixel shader program, the pixel shader 602 dispatches threads to an execution unit (e.g., 608A) via the thread dispatcher 604. The pixel shader 602 uses texture sampling logic in the sampler 610 to access texture data in texture maps stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.
In one embodiment, the data port 614 provides a memory access mechanism for the thread execution logic 600 to output processed data to memory for processing on a graphics processor output pipeline. In one embodiment, the data port 614 includes or couples to one or more cache memories (e.g., data cache 612) to cache data for memory access via the data port.
Fig. 7 is a block diagram illustrating a graphics processor execution unit instruction format according to an embodiment. In one embodiment, the graphics processor execution units support an instruction set having instructions in multiple formats. The solid boxes illustrate the components that are generally included in an execution unit instruction, while the dashed boxes include components that are optional or that are only included in a subset of the instructions. The instructions described and illustrated are macro-instructions, in that they are instructions supplied to the execution unit, as opposed to the micro-operations resulting from instruction decode once the instruction is processed.
In one embodiment, the graphics processor execution units natively support instructions in a 128-bit format 710. A 64-bit compacted instruction format 730 is available for some instructions based on the selected instruction, the instruction options, and the number of operands. The native 128-bit format 710 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 730. The native instructions available in the 64-bit format 730 vary by embodiment. In one embodiment, the instruction is compacted in part using a set of index values in an index field 713. The execution unit hardware references a set of compaction tables based on the index values and uses the compaction table outputs to reconstruct a native instruction in the 128-bit format 710.
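The index-based compaction can be pictured as a table lookup: the 64-bit form carries small indices, and the hardware expands them through compaction tables back into fields of the 128-bit native form. The sketch below is purely hypothetical; the field widths, table contents, and bit layout are invented for illustration and are not taken from this document.

```python
# Hypothetical compaction tables: a 2-bit index stands in for a full 32-bit
# control-field or source-field pattern (values are made up for illustration).
CONTROL_TABLE = [0x00000000, 0x00010001, 0x80000000, 0xC0000001]
SOURCE_TABLE  = [0x00000000, 0x0000FFFF, 0xFFFF0000, 0xFFFFFFFF]

def expand_compact_instruction(opcode, ctrl_index, src_index):
    """Rebuild a toy 128-bit native instruction from a compacted encoding by
    looking each index up in its compaction table and repacking the fields."""
    control_bits = CONTROL_TABLE[ctrl_index]
    source_bits = SOURCE_TABLE[src_index]
    return (opcode << 96) | (control_bits << 64) | source_bits

native = expand_compact_instruction(opcode=0x40, ctrl_index=1, src_index=2)
```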
For each format, an instruction opcode 712 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit executes each instruction across all data channels of the operands. An instruction control field 714 enables control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For 128-bit instructions 710, an exec-size field 716 limits the number of data channels that will execute in parallel. The exec-size field 716 is not available for use in the 64-bit compact instruction format 730.
Some execution unit instructions have up to three operands, including two source operands, src0 720 and src1 722, and one destination 718. In one embodiment, the execution units support dual-destination instructions, where one of the destinations is implied. Data manipulation instructions can have a third source operand (e.g., SRC2 724), where the instruction opcode 712 determines the number of source operands. An instruction's last source operand can be an immediate (e.g., hard-coded) value passed with the instruction.
In one embodiment, instructions are grouped based on opcode bit-fields to simplify opcode decode 740. For an 8-bit opcode, bits 4, 5, and 6 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is exemplary. In one embodiment, a move and logic opcode group 742 includes data movement and logic instructions (e.g., mov, cmp). The move and logic group 742 shares the five most significant bits (MSBs), where move instructions are in the form of 0000xxxxb (e.g., 0x0x) and logic instructions are in the form of 0001xxxxb (e.g., 0x01). A flow control instruction group 744 (e.g., call, jmp) includes instructions in the form of 0010xxxxb (e.g., 0x20). A miscellaneous instruction group 746 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). A parallel math instruction group 748 includes component-wise arithmetic instructions (e.g., add, mul) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 748 performs arithmetic operations in parallel across data channels. A vector math group 750 includes arithmetic instructions (e.g., dp4) in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic, such as dot product calculations, on vector operands.
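A decoder that follows the exemplary grouping above needs only the high-order opcode bits; the sketch below simply restates the listed bit patterns and is not a definitive decoder for any particular hardware.

```python
def opcode_group(opcode):
    """Classify an 8-bit opcode by its four most significant bits,
    following the exemplary grouping above."""
    high = (opcode >> 4) & 0xF
    return {
        0b0000: "move (e.g., mov)",
        0b0001: "logic (e.g., cmp)",
        0b0010: "flow control (e.g., call, jmp)",
        0b0011: "miscellaneous / synchronization (e.g., wait, send)",
        0b0100: "parallel math (e.g., add, mul)",
        0b0101: "vector math (e.g., dp4)",
    }.get(high, "unknown")

assert opcode_group(0x20) == "flow control (e.g., call, jmp)"
assert opcode_group(0x40) == "parallel math (e.g., add, mul)"
```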
Graphics pipeline --- Fig. 8
Fig. 8 is a block diagram of another embodiment of a graphics processor, which includes a graphics pipeline 820, a media pipeline 830, a display engine 840, thread execution logic 850, and a render output pipeline 870. In one embodiment, the graphics processor is a graphics processor within a multi-core processing system that includes one or more general-purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown) or via commands issued to the graphics processor over a ring interconnect 802. The ring interconnect 802 couples the graphics processor to other processing components, such as other graphics processors or general-purpose processors. Commands from the ring interconnect are interpreted by a command streamer 803, which supplies instructions to individual components of the graphics pipeline 820 or the media pipeline 830.
The command streamer 803 directs the operation of a vertex fetcher 805 component that reads vertex data from memory and executes vertex-processing commands provided by the command streamer 803. The vertex fetcher 805 provides vertex data to a vertex shader 807, which performs coordinate space transformation and lighting operations on each vertex. The vertex fetcher 805 and the vertex shader 807 execute vertex-processing instructions by dispatching execution threads to execution units 852A, 852B via a thread dispatcher 831.
In one embodiment, the execution units 852A, 852B are an array of vector processors having an instruction set for performing graphics and media operations. The execution units 852A, 852B have an attached L1 cache 851 that is dedicated to each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions.
In one embodiment, the graphics pipeline 820 includes tessellation components to perform hardware-accelerated tessellation of 3D objects. A programmable hull shader 811 configures the tessellation operations. A programmable domain shader 817 provides back-end evaluation of the tessellation output. A tessellator 813 operates at the direction of the hull shader 811 and contains special purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to the graphics pipeline 820. If tessellation is not used, the tessellation components 811, 813, 817 can be bypassed.
Complete geometric objects can be processed by a geometry shader 819 via one or more threads dispatched to the execution units 852A, 852B, or can proceed directly to a clipper 829. The geometry shader operates on entire geometric objects, rather than on vertices or patches of vertices as in previous stages of the graphics pipeline. If tessellation is disabled, the geometry shader 819 receives input from the vertex shader 807. The geometry shader 819 is programmable by a geometry shader program to perform geometry tessellation when the tessellation units are disabled.
Prior to rasterization, vertex data is processed by the clipper 829, which is either a fixed-function clipper or a programmable clipper having clipping and geometry shader functions. In one embodiment, a rasterizer 873 in the render output pipeline 870 dispatches pixel shaders to convert the geometric objects into their per-pixel representations. In one embodiment, pixel shader logic is included in the thread execution logic 850.
The graphics engine has an interconnect bus, interconnect fabric, or some other interconnect mechanism that allows data and messages to pass among the major components of the graphics engine. In one embodiment, the execution units 852A, 852B and the associated cache(s) 851, the texture and media sampler 854, and the texture/sampler cache 858 interconnect via a data port 856 to perform memory access and to communicate with the render output pipeline components of the graphics engine. In one embodiment, the sampler 854, the caches 851, 858, and the execution units 852A, 852B each have separate memory access paths.
In one embodiment, the render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into their associated pixel-based representations. In one embodiment, the rasterizer logic includes a windower/masker unit to perform fixed-function triangle and line rasterization. In one embodiment, associated render and depth buffer caches 878, 879 are also available. A pixel operations component 877 performs pixel-based operations on the data, though in some instances, pixel operations associated with 2D operations (e.g., bit block image transfers with blending) are performed by the 2D engine 841, or are substituted at display time by the display controller 843 using overlay display planes. In one embodiment, a shared L3 cache 875 is available to all graphics components, allowing data to be shared without the use of main system memory.
The graphics processor media pipeline 830 includes a media engine 837 and a video front end 834. In one embodiment, the video front end 834 receives pipeline commands from the command streamer 803. However, in one embodiment, the media pipeline 830 includes a separate command streamer. The video front end 834 processes media commands before sending them to the media engine 837. In one embodiment, the media engine includes thread spawning functionality to spawn threads for dispatch to the thread execution logic 850 via the thread dispatcher 831.
In one embodiment, the graphics engine includes a display engine 840. In one embodiment, the display engine 840 is external to the graphics processor and couples with the graphics processor via the ring interconnect 802, or via some other interconnect bus or fabric. The display engine 840 includes a 2D engine 841 and a display controller 843. The display engine 840 contains special purpose logic capable of operating independently of the 3D pipeline. The display controller 843 couples with a display device (not shown), which may be a system-integrated display device, as in a laptop computer, or an external display device attached via a display device connector.
The graphics pipeline 820 and the media pipeline 830 are configurable to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). In one embodiment, driver software for the graphics processor translates API calls that are specific to a particular graphics or media library into commands that can be processed by the graphics processor. In various embodiments, support is provided for the Open Graphics Library (OpenGL) and the Open Computing Language (OpenCL) supported by the Khronos Group, for the Direct3D library from Microsoft, or, in one embodiment, for both OpenGL and D3D. Support may also be provided for the Open Source Computer Vision Library (OpenCV). A future API with a compatible 3D pipeline would also be supported if a mapping can be made from the pipeline of that future API to the pipeline of the graphics processor.
Graphics Pipeline Programming --- Figs. 9A-9B
Fig. 9A is a block diagram illustrating a graphics processor command format according to an embodiment, and Fig. 9B is a block diagram illustrating a graphics processor command sequence according to an embodiment. The solid boxes in Fig. 9A illustrate the components that are generally included in a graphics command, while the dashed lines include components that are optional or that are only included in a subset of the graphics commands. The exemplary graphics processor command format 900 of Fig. 9A includes data fields to identify a target client 902 of the command, a command operation code (opcode) 904, and the relevant data 906 for the command. A sub-opcode 905 and a command size 908 are also included in some commands.
The client 902 specifies the client unit of the graphics device that processes the command data. In one embodiment, a graphics processor command parser examines the client field of each command to condition the further processing of the command and to route the command data to the appropriate client unit. In one embodiment, the graphics processor client units include a memory interface unit, a render unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline that processes the commands. Once a command is received by the client unit, the client unit reads the opcode 904 and, if present, the sub-opcode 905 to determine the operation to perform. The client unit performs the command using the information in the data field 906 of the command. For some commands, an explicit command size 908 is expected to specify the size of the command. In one embodiment, the command parser automatically determines the size of at least some of the commands based on the command opcode. In one embodiment, commands are aligned via multiples of a double word.
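A software model of the parsing just described might unpack the header fields as follows. The bit positions are hypothetical, chosen only for illustration, since the text defines the fields 902-908 of Fig. 9A but not their offsets.

```python
from dataclasses import dataclass

@dataclass
class GraphicsCommand:
    client: int      # target client unit (e.g., render, 2D, 3D, media)
    opcode: int      # command operation code 904
    sub_opcode: int  # sub-opcode 905, if present
    size: int        # command size 908, in double words

def parse_command_header(header: int) -> GraphicsCommand:
    """Unpack a 32-bit command header. Bit positions are illustrative only."""
    return GraphicsCommand(
        client=(header >> 29) & 0x7,
        opcode=(header >> 23) & 0x3F,
        sub_opcode=(header >> 16) & 0x7F,
        size=header & 0xFF,            # aligned to multiples of a double word
    )
```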
Flow in Fig. 9 B illustrates sample command sequence 910.In one embodiment, with the embodiment of graphics processor The software or firmware for the data handling system being characterized are set up using some version of shown command sequence, performed and terminated The set of graphic operation.Show for exemplary purposes and describe sample command sequence, however, embodiment is not limited to these lives Order, is also not necessarily limited to this command sequence.In addition, order can be published with the batch order in command sequence so that graphics processor Command sequence will be handled at least partially concurrent mode.
Sampling command sequence 910 can be flushed (flush) order 912 with streamline and be started, so that any movable stream Waterline completes the current order undetermined for streamline.In one embodiment, 3D streamlines 922 and media pipeline 924 Do not operate concomitantly.Execution pipeline is flushed so that the graphics pipeline of activity completes any order undetermined.In response to Streamline is flushed, and the command analysis device for graphics processor handles pause command, until the drawing engine of activity is complete Into operation undetermined and related read untill cache is deactivated.Optionally, it is marked as in rendering cache " dirty (dirty) any data " can be flushed to memory.Streamline flushes order 912 and can be used for pipeline synchronization, or It can be used before graphics processor is placed in into low power state.
When command sequence requires that graphics processor explicitly switches between multiple streamlines, ordered using flowing water line options Make 913.Unless context will be two pipeline issues orders, otherwise before issue pipeline command, context is being performed Interior, streamline select command 913 only needs once.In one embodiment, and then carried out via streamline select command 913 Streamline switching before, it is necessary to streamline flush order 912.
Pipeline control order 914 is configured to the graphics pipeline of operation, and for 3D streamlines 922 and media Streamline 924 is programmed.Pipeline control order 914 is movable pipeline configuration pipeline state.In one embodiment, flow Waterline control command 914 is used for pipeline synchronization, and for before processing batch is ordered, removing the streamline from activity The data of interior one or more cache memories.
Return buffer status command 916 is used to configure the return buffer for corresponding streamline for writing data Set.Some pile line operations need the distribution to one or more return buffers, selection or configured, during processing, behaviour Intermediate data is written in one or more of return buffers by work.Graphics processor is also delayed using one or more return Device is rushed to store output data and perform cross-thread communication.Return buffer state 916, which includes selection, is used for pile line operation collection The size and number of the return buffer of conjunction.
Remaining order in command sequence is different based on the movable streamline for operation.It is true based on streamline Fixed 920, command sequence is customized to the 3D streamlines 922 started with 3D pipeline states 930 or with media pipeline state 940 The media pipeline 924 of beginning.
Order for 3D pipeline states 930 includes the 3D state setting commands for following state:Vertex buffer State, summit elementary state, color constancy state, depth buffer state and before processing 3D cell commands by configuration Other state variables.The specific 3D API in use are based at least partially on to determine the value of these orders.3D flowing water wire If state 930 is ordered also optionally can disable or bypass those elements when without using some pipeline elements.
3D primitives 932 are ordered for submitting by the 3D primitives of 3D pipeline processes.Order and passed via 3D primitives 932 Take out function in the summit that the order and associated parameter for being handed to graphics processor are forwarded in graphics pipeline.Take out on summit Function generates vertex data structure using the order data of 3D primitives 932.Vertex data structure is stored in one or more return Return in buffer.3D primitives 932 are ordered for performing vertex operations to 3D primitives via vertex shader.To handle summit Tinter execution thread is assigned to graphics processor execution unit by color device, 3D streamlines 922.
3D streamlines 922 are triggered via 934 orders or event is performed.In one embodiment, register write-in triggering Order is performed.In one embodiment, order to trigger via " go " (" walking ") or " kick " (" kick and remove ") in command sequence Perform.In one embodiment, touched using pipeline synchronization order with flushing command sequence by graphics pipeline Say the word execution.3D streamlines will perform geometric manipulations for 3D primitives.Once operation is completed, obtained geometric object is by grating Change, and pixel engine is painted to obtained pixel.For controlling pixel shader and the additional command of pixel back-end operations also may be used It is included for those operations.
When performing media manipulation, sample command sequence 910 follows the path of media pipeline 924.In general, for matchmaker The specific of the programming of body streamline 924 depends on the media of execution using with mode or calculates operation.During media are decoded, Specific media decoding operate is detachable to be loaded onto media pipeline.Also media pipeline is can bypass, and be can be used by one or many The resource that individual general procedure core is provided completely or partially performs media decoding.In one embodiment, media pipeline Including the element operated for graphics processing unit unit (GPGPU), wherein, graphics processor is used to use and graphic primitive Render that not to be explicit related calculating coloration program perform SIMD vector operations.
The media pipeline 924 is configured in a manner similar to the 3D pipeline 922. A set of media pipeline state commands 940 is dispatched or placed into the command sequence before the media object commands 942. The media pipeline state commands 940 include data to configure the media pipeline elements that will be used to process the media objects. This includes data to configure the video decode and video encode logic within the media pipeline, such as an encode or decode format. The media pipeline state commands 940 also support the use of one or more pointers to "indirect" state elements that contain a batch of state settings.
The media object commands 942 supply pointers to media objects for processing by the media pipeline. The media objects include memory buffers containing the video data to be processed. In one embodiment, all media pipeline states must be valid before issuing a media object command 942. Once the pipeline state is configured and the media object commands 942 are queued, the media pipeline 924 is triggered via an execute 934 command or an equivalent execute event (e.g., a register write). Output from the media pipeline 924 may then be post-processed by operations provided by the 3D pipeline 922 or the media pipeline 924. In one embodiment, GPGPU operations are configured and executed in a manner similar to media operations.
Graphics software framework --- Figure 10
Figure 10 illustrates an exemplary graphics software architecture for a data processing system according to an embodiment. The software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. The processor 1030 includes a graphics processor 1032 and one or more general-purpose processor cores 1034. The graphics application 1010 and the operating system 1020 each execute in the system memory 1050 of the data processing system.
In one embodiment, the 3D graphics application 1010 contains one or more shader programs including shader instructions 1012. The shader language instructions may be in a high-level shader language, such as the High-Level Shader Language (HLSL) or the OpenGL Shader Language (GLSL). The application also includes executable instructions 1014 in a machine language suitable for execution by the general-purpose processor cores 1034, as well as graphics objects 1016 defined by vertex data.
The operating system 1020 may be a Microsoft Windows operating system from the Microsoft Corporation, a proprietary UNIX-like operating system, or an open-source UNIX-like operating system using a variant of the Linux kernel. When the Direct3D API is in use, the operating system 1020 uses a front-end shader compiler 1024 to compile any shader instructions 1012 in HLSL into a lower-level shader language. The compilation may be a just-in-time compilation, or the application may perform shader precompilation. In one embodiment, high-level shaders are compiled into low-level shaders during the compilation of the 3D graphics application 1010.
The user mode graphics driver 1026 may contain a back-end shader compiler 1027 to convert the shader instructions 1012 into a hardware-specific representation. When the OpenGL API is in use, shader instructions 1012 in the GLSL high-level language are passed to the user mode graphics driver 1026 for compilation. The user mode graphics driver uses operating system kernel mode functions 1028 to communicate with a kernel mode graphics driver 1029. The kernel mode graphics driver 1029 communicates with the graphics processor 1032 to dispatch commands and instructions.
To the extent that various operations or functions are described herein, they can be described or defined as hardware circuitry, software code, instructions, configuration, and/or data. The content can be embodied in hardware logic, or as directly executable software ("object" or "executable" form), source code, high-level shader code designed for execution on a graphics engine, or low-level assembly language code in an instruction set for a specific processor or graphics core. The software content of the embodiments described herein can be provided via an article of manufacture with the content stored thereon, or via a method of operating a communication interface to send data via the communication interface.
A non-transitory machine-readable storage medium can cause a machine to perform the functions or operations described, and includes any mechanism that stores information in a form accessible by a machine (e.g., a computing device, an electronic system, etc.), such as recordable/non-recordable media (e.g., read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). A communication interface includes any mechanism that interfaces to a hardwired, wireless, optical, or other medium to communicate with another device, such as a memory bus interface, a processor bus interface, an Internet connection, a disk controller, etc. The communication interface is configured by providing configuration parameters or sending signals to prepare the communication interface to provide a data signal describing the software content. The communication interface can be accessed via one or more commands or signals sent to the communication interface.
The various components described can be a means for performing the operations or functions described. Each component described herein includes software, hardware, or a combination of software and hardware. The components can be implemented as software modules, hardware modules, special-purpose hardware (e.g., application-specific hardware, application-specific integrated circuits (ASICs), digital signal processors (DSPs), etc.), embedded controllers, hardwired circuitry, and so forth. Besides what is described herein, various modifications can be made to the disclosed embodiments and implementations of the invention without departing from their scope. Therefore, the illustrations and examples herein should be construed in an illustrative, and not a restrictive, sense. The scope of the invention should be measured solely by reference to the claims that follow.
Apparatus and method for efficient texture compression
The embodiments of the disclosure described below enable high-performance, scalable, lossy graphics texture compression while allowing flexible selection of image quality, compression ratio, and block size. In one embodiment, compression and decompression both have the same per-texel cost and use only multiply-add instructions. In addition, texel decompression requires no spatial or temporal dependencies between neighboring texels. As a result, the techniques described herein are particularly suitable for massively parallel implementations and hardware acceleration.
To achieve the foregoing results, one embodiment of the invention treats a block of texture image data as a multivariate function of two coordinates. In the compression stage, the multivariate function is sampled at each texel and approximated numerically on a loosely spaced grid of central points to obtain the vector of approximation coefficients that constitutes the compressed representation of the texture block. In the decompression stage, the approximation coefficients are then used to evaluate the multivariate function at each texel coordinate to re-create an approximation of the original texels. The color channels and the alpha channel may be treated separately.
Specifically, in one embodiment, a radial basis function (RBF) approximation is used to numerically approximate the color function within a texture block. This allows flexible selection of the block size, for example from 4x4 to 16x16. Rectangular (non-square) blocks are also supported.
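For reference, and not reproduced from the patent figures, the standard shape of an RBF approximation of this kind expresses the color function of a block as a weighted sum of radially symmetric basis functions centered at the N central points:

    f(x) ≈ Σ_{i=1..N} a_i · φ(‖x − c_i‖)

where x is a texel coordinate, the c_i are the central points, φ is the chosen RBF, and the weights a_i are the approximation coefficients collected in the vector [A] described below.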
As illustrated in Figure 11, in one embodiment of the invention, the compression and decompression operations are performed within the texture sampler 1100 of a graphics processing unit (see, e.g., the texture samplers 554A-N in Figure 5, 610 in Figure 6, and 854 in Figure 8). Specifically, symmetric texture compression logic 1150 compresses the uncompressed texture data 1101 using the techniques described below (e.g., using an RBF approximation). The resulting compressed texture data 1107 may then be stored in a texture storage location 1110 (e.g., a texture cache, main memory, a mass storage device, etc.) for subsequent use during texture mapping operations.
Symmetric texture decompression logic 1120 uses the decompression techniques described below (e.g., determining the vector of texel color values [T] for a texture block) to decompress the compressed texture data 1107 and generate decompressed texture data 1130. The resulting decompressed texture data 1130 may then be used by the pixel shader 1105 and/or other stages in the graphics pipeline to perform texture mapping operations.
As mentioned above, in one embodiment, the texture data is compressed using an RBF approximation. The accuracy of the approximation, and therefore the image quality and the compression ratio, is controlled by the count of central points used in the RBF approximation (labeled N below) relative to the number of texels in the block (labeled B below).
Approximation methods, including RBF approximation, suffer from an increased error rate at the domain boundary (i.e., in this case, the block boundary). To limit this effect, one embodiment of the invention places the RBF central points on the texture block edges, as shown in Figure 12A (an implementation with four central points 1250) and Figure 12B (an implementation with eight central points 1251). To preserve the symmetry of the central point placement, the four centers placed in the block corners, as shown in Figure 12A, serve as the baseline implementation (i.e., the minimum number of central points). Subsequent configurations are then generated by adding four new centers placed equidistantly on the edges (one added per edge), as shown in Figure 12B. Additional sets of 4 central points can be added, subdividing the edges into segments of equal length (e.g., 12 central points produce 3 segments per edge, 16 central points produce 4 segments per edge, etc.).
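A minimal sketch of this center placement, assuming block coordinates normalized to the unit square with corners at (0,0), (1,0), (1,1), and (0,1) (the patent does not fix a specific coordinate convention, and the function name is illustrative):

    # Hypothetical helper: place N RBF central points on the block boundary,
    # starting from the four corners and subdividing each edge into N/4
    # equal-length segments, as described above.
    def rbf_centers(n):
        assert n >= 4 and n % 4 == 0
        segs = n // 4                    # segments per edge (1 for the 4-corner baseline)
        pts = set()
        for i in range(segs):
            t = i / segs                 # equidistant positions along each edge
            pts.add((t, 0.0))            # bottom edge
            pts.add((1.0, t))            # right edge
            pts.add((1.0 - t, 1.0))      # top edge
            pts.add((0.0, 1.0 - t))      # left edge
        return sorted(pts)

For example, rbf_centers(4) returns the four corners and rbf_centers(8) adds one midpoint per edge, matching Figures 12A and 12B.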
Figure 12C provides exemplary compression ratios for different block sizes (e.g., 4x4, 5x5, etc.) using different numbers of central points. As an example, a 4x4 texture block with 4 central points yields a compression ratio of 0.25, while the same texture block with 12 central points yields a compression ratio of 0.75.
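Because the B texel colors of a block are replaced by N approximation coefficients stored at the same precision (as noted later in this section), these ratios follow directly as

    compression ratio = N / B,   e.g. 4 / (4×4) = 0.25 and 12 / (4×4) = 0.75.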
In one embodiment, the RBF approximation calculates the approximation coefficients based on a radial basis function (RBF), which is evaluated on the norm of the vector of distances between the approximation central points and the approximated data positions (e.g., texels). A wide class of RBFs can be used while still complying with the embodiments of the invention. In one particular embodiment, a Gaussian RBF and a multiquadric RBF, labeled GAUS and MQ respectively, are used:
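The formulas themselves are not reproduced in this text; under one common convention, with r denoting the distance and ep the shape parameter introduced below, these two functions take the forms

    GAUS:  φ(r) = exp(−(ep · r)²)
    MQ:    φ(r) = sqrt(1 + (ep · r)²)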
The RBF approximation is controlled with an additional parameter, referred to as the shape parameter and labeled ep below, which governs the shape of the RBF curve. The tuple <B, N, grid, RBF, ep> forms the controlling set of the compression/decompression method (where B is the number of texels in the block and N is the count of RBF central points). The details of RBF approximation are known to those skilled in the art and are therefore not described here, to avoid obscuring the underlying principles of the invention. The operations used in the approximation (compression) and evaluation (decompression) stages for a texture block are described below.
In one embodiment, compression is performed using the following operations.
In one embodiment, the compression matrices are determined by first calculating the matrix [DM] of distances between the texture data position points and the approximation central points. As an example, for a 4x4 texture block with 16 texels and 4 central points, the distance matrix contains 64 elements (i.e., one element for the distance between each of the 4 central points and each of the 16 texels). Then, using the configured RBF type and the value of the shape parameter ep, the matrix [RDM] of member-wise RBF values of the distance matrix [DM] is calculated.
In one embodiment, a vector [T] is built containing the B texel color values sampled from the block. As an example, for a 4x4 texture block, 16 texel color values are sampled from the block (i.e., B=16).
One embodiment of the invention then solves the system of linear equations [RDM]*[A]=[T] to obtain the vector of approximation coefficients [A]. This can be accomplished by numerically calculating the pseudoinverse [iRDM] of the [RDM] matrix and then calculating the matrix product [iRDM]*[T] to determine the vector [A] containing the compressed texture data.
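A minimal NumPy sketch of this compression stage, assuming the unit-square coordinate convention and Gaussian RBF used above (function and variable names are illustrative, not taken from the patent):

    import numpy as np

    def gaus_rbf(r, ep):
        # Gaussian RBF evaluated member-wise on a distance matrix.
        return np.exp(-(ep * r) ** 2)

    def compress_block(texels, centers, ep, rbf=gaus_rbf):
        # texels:  (H, W, C) array of texel colors of the block
        # centers: (N, 2) array of RBF central point coordinates in [0, 1]^2
        # Returns the (N, C) array of approximation coefficients [A].
        h, w, c = texels.shape
        ys, xs = np.mgrid[0:h, 0:w]
        pos = np.column_stack([xs.ravel() / (w - 1), ys.ravel() / (h - 1)])   # texel positions
        dm = np.linalg.norm(pos[:, None, :] - centers[None, :, :], axis=2)    # [DM], (B, N)
        rdm = rbf(dm, ep)                                                     # [RDM], (B, N)
        irdm = np.linalg.pinv(rdm)                # Moore-Penrose pseudoinverse [iRDM], (N, B)
        t = texels.reshape(-1, c)                 # [T], one column per color channel
        return irdm @ t                           # [A] = [iRDM] * [T]

Each color channel occupies one column of [T], so the channels are fitted independently, consistent with the separate treatment of color and alpha channels noted earlier.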
In one embodiment, for a fixed combination of <B, N, grid, RBF, ep>, the operations of calculating the distance matrix [DM] and calculating the decompression matrix [RDM] can be precomputed, and the results can be provided as constants. Furthermore, because the [RDM] matrix may not be square, and the system of linear equations it defines is overdetermined, an inverse matrix in the classical sense does not exist. Instead, one embodiment of the invention uses the Moore-Penrose pseudoinverse matrix to obtain a best-fit solution of the system of equations.
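The formula is not reproduced in this text; for an [RDM] with full column rank, the standard Moore-Penrose pseudoinverse and the resulting least-squares solution are

    [iRDM] = ([RDM]ᵀ [RDM])⁻¹ [RDM]ᵀ,   [A] = [iRDM] * [T].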
Finally, for a fixed combination of <B, N, grid, RBF, ep>, the [iRDM] matrix used in the product [iRDM]*[T] can be precomputed and provided as a constant. As a result, as implied above, the compression stage reduces to a single matrix-vector multiply-accumulate operation per texture block, where the matrix is provided as a constant.
In one embodiment, decompression is performed using the following operation.
To obtain the vector of texel color values [T], the product of the decompression matrix [RDM] and the approximation coefficient vector [A] is determined (i.e., [RDM]*[A]=[T] is calculated). As a result, the decompression stage reduces to a single matrix-vector product calculation per texture block, where the matrix is supplied to the algorithm as a constant. In one embodiment, when stored in compressed form, the approximation coefficients [A] are represented with the same precision as the input color data.
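Continuing the NumPy sketch above, and again only as an illustration, the matching decompression step reuses [RDM] as a per-configuration constant:

    def decompress_block(coeffs, rdm, h, w):
        # coeffs: (N, C) approximation coefficients [A] read back from storage
        # rdm:    (B, N) precomputed decompression matrix [RDM] for this configuration
        # Returns the reconstructed (h, w, C) texel block approximating [T].
        t = rdm @ coeffs                 # single matrix-vector product per channel
        return t.reshape(h, w, -1)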
Figure 13 provides an exemplary representation of the foregoing computation for a 4x4 texture block with 4 central points. In the example shown, the [RDM] matrix 1301 is reshaped to generate a 4x16 decompression matrix 1302, which is then multiplied by the approximation coefficient matrix [A] to obtain a 1x16 version of the texel block 1304. The texel block 1304 is then reshaped again to obtain the final [T] matrix 1305.
Figure 14 illustrates a method for performing compression. At 1401, the distance matrix [DM] is calculated using the distance between each data position point of the texture block and each central point. For a 4x4 texture block with 4 central points, this produces 64 values. At 1402, the RBF matrix [RDM] of member-wise RBF values of the distance matrix [DM] is calculated using the configured RBF type and the value of the shape parameter ep. At 1403, the vector [T] of the B texel color values sampled from the block is built. Then, at 1404, the system of linear equations [RDM]*[A]=[T] is solved to obtain the vector of approximation coefficients [A].
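Putting the sketches above together for the 4x4-block, 4-center example used throughout this section (again purely illustrative, reusing the hypothetical helpers defined earlier):

    block = np.random.rand(4, 4, 3)             # a 4x4 RGB texture block, B = 16
    centers = np.array(rbf_centers(4))          # 4 corner central points, N = 4
    a = compress_block(block, centers, ep=1.0)  # [A]: 4 coefficients per channel (0.25 ratio)
    # Rebuild [RDM] once per configuration, then reconstruct the block.
    ys, xs = np.mgrid[0:4, 0:4]
    pos = np.column_stack([xs.ravel() / 3.0, ys.ravel() / 3.0])
    rdm = gaus_rbf(np.linalg.norm(pos[:, None, :] - centers[None, :, :], axis=2), ep=1.0)
    approx = decompress_block(a, rdm, 4, 4)     # lossy approximation of the original block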
The techniques presented above can be implemented using only multiply-add operations in both the compression and decompression stages (e.g., to perform the matrix multiplications described). In addition, depending on the block size and the selected compression ratio, compression and decompression have equal per-texel cost. Because these techniques use a color approximation, they provide the additional benefit of being suitable for sub-sampling and over-sampling schemes. Moreover, the low computational complexity and cost make these techniques suitable for hardware-accelerated and/or real-time applications, while limiting memory bandwidth and power consumption.
Embodiments of the invention may include the various steps described above. The steps may be embodied in machine-executable instructions that can be used to cause a general-purpose or special-purpose processor to perform the steps. Alternatively, the steps may be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.
As described herein, instructions may refer to specific configurations of hardware, such as application-specific integrated circuits (ASICs) configured to perform certain operations or having a predetermined functionality, or to software instructions stored in a memory embodied in a non-transitory computer-readable medium. Thus, the techniques shown in the figures can be implemented using code and data stored and executed on one or more electronic devices (e.g., an end station, a network element, etc.). Such electronic devices store and communicate (internally and/or with other electronic devices over a network) code and data using computer machine-readable media, such as non-transitory computer machine-readable storage media (e.g., magnetic disks; optical disks; random access memory; read-only memory; flash memory devices; phase-change memory) and transitory computer machine-readable communication media (e.g., electrical, optical, acoustical, or other forms of propagated signals, such as carrier waves, infrared signals, digital signals, etc.). In addition, such electronic devices typically include a set of one or more processors coupled to one or more other components, such as one or more storage devices (non-transitory machine-readable storage media), user input/output devices (e.g., a keyboard, a touchscreen, and/or a display), and network connections. The coupling of the set of processors and other components is typically through one or more buses and bridges (also termed bus controllers). The storage devices and the signals carrying the network traffic respectively represent one or more machine-readable storage media and machine-readable communication media. Thus, the storage device of a given electronic device typically stores code and/or data for execution on the set of one or more processors of that electronic device. Of course, one or more parts of an embodiment of the invention may be implemented using different combinations of software, firmware, and/or hardware. Throughout this detailed description, numerous specific details have been set forth for the purposes of explanation in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without some of these specific details. In certain instances, well-known structures and functions were not described in elaborate detail in order to avoid obscuring the subject matter of the present invention. Accordingly, the scope and spirit of the invention should be judged in terms of the claims that follow.

Claims (25)

1. A method, comprising:
determining a distance between each texture block texel of a plurality of texture block texels and each point of a plurality of points;
determining a set of texel color values sampled from the texture block; and
generating a set of approximation coefficients usable for compressing the texture block, using the distances between each texture block texel of the plurality of texture block texels and each point of the plurality of points and the set of texel color values sampled from the texture block.
2. The method of claim 1, wherein the plurality of points comprise radial basis function (RBF) central points.
3. The method of claim 2, further comprising:
determining a distance matrix [DM] using the distance between each texture block texel of the plurality of texture block texels and each point of the plurality of points; and
determining an RBF matrix [RDM] of member-wise RBF values of the distance matrix [DM] using a specified type of RBF and a specified value of the shape parameter ep.
4. The method of claim 3, wherein the specified type of RBF is selected from the group consisting of a Gaussian radial basis function and a multiquadric radial basis function.
5. The method of claim 3, wherein determining the set of texel color values comprises building a vector [T] of B texel color values sampled from the texture block, wherein B is the number of texels in the texture block.
6. The method of claim 5, wherein generating the set of approximation coefficients comprises:
determining a vector [A] containing the approximation coefficients using the equation [RDM]*[A]=[T].
7. The method of claim 6, further comprising:
determining a pseudoinverse matrix [iRDM] of the RBF matrix [RDM]; and
multiplying [iRDM] by [T] to determine [A].
8. The method of claim 7, wherein the texture block comprises a 4x4 texture block, a 5x5 texture block, a 6x6 texture block, a 7x7 texture block, or an 8x8 texture block.
9. The method of claim 8, wherein the approximation central points are selected from the group consisting of 4, 8, 12, 16, and 20 central points.
10. The method of claim 7, wherein the texture block comprises a rectangular texture block.
11. The method of claim 6, further comprising:
decompressing the texture block by determining the vector [T] using the equation [RDM]*[A]=[T].
12. A processor, comprising:
texture compression logic to:
determine a distance between each texture block texel of a plurality of texture block texels and each point of a plurality of points;
determine a set of texel color values sampled from the texture block; and
generate a set of approximation coefficients usable for compressing the texture block, using the distances between each texture block texel of the plurality of texture block texels and each point of the plurality of points and the set of texel color values sampled from the texture block.
13. The processor of claim 12, wherein the plurality of points comprise radial basis function (RBF) central points.
14. The processor of claim 13, wherein the texture compression logic is to: determine a distance matrix [DM] using the distance between each texture block texel of the plurality of texture block texels and each point of the plurality of points; and determine an RBF matrix [RDM] of member-wise RBF values of the distance matrix [DM] using a specified type of RBF and a specified value of the shape parameter ep.
15. The processor of claim 14, wherein the specified type of RBF is selected from the group consisting of a Gaussian radial basis function and a multiquadric radial basis function.
16. The processor of claim 14, wherein determining the set of texel color values comprises building a vector [T] of B texel color values sampled from the texture block, wherein B is the number of texels in the texture block.
17. The processor of claim 16, wherein the texture compression logic is to generate the set of approximation coefficients by determining a vector [A] containing the approximation coefficients using the equation [RDM]*[A]=[T].
18. The processor of claim 17, wherein the texture compression logic is to: determine a pseudoinverse matrix [iRDM] of the RBF matrix [RDM]; and multiply [iRDM] by [T] to determine [A].
19. The processor of claim 18, wherein the texture block comprises a 4x4 texture block, a 5x5 texture block, a 6x6 texture block, a 7x7 texture block, or an 8x8 texture block.
20. The processor of claim 19, wherein the approximation central points are selected from the group consisting of 4, 8, 12, 16, and 20 central points.
21. The processor of claim 18, wherein the texture block comprises a rectangular texture block.
22. The processor of claim 17, further comprising:
texture decompression logic to decompress the texture block by determining the vector [T] using the equation [RDM]*[A]=[T].
23. A machine-readable medium having program code stored thereon which, when executed by a machine, causes the machine to perform the operations of:
determining a distance between each texture block texel of a plurality of texture block texels and each point of a plurality of points;
determining a set of texel color values sampled from the texture block; and
generating a set of approximation coefficients usable for compressing the texture block, using the distances between each texture block texel of the plurality of texture block texels and each point of the plurality of points and the set of texel color values sampled from the texture block.
24. The machine-readable medium of claim 23, wherein the plurality of points comprise radial basis function (RBF) central points.
25. The machine-readable medium of claim 24, further comprising additional program code which causes the machine to perform the operations of:
determining a distance matrix [DM] using the distance between each texture block texel of the plurality of texture block texels and each point of the plurality of points; and
determining an RBF matrix [RDM] of member-wise RBF values of the distance matrix [DM] using a specified type of RBF and a specified value of the shape parameter ep.
CN201480079739.4A 2014-07-10 2014-07-10 Method and apparatus for efficient texture compression Pending CN107004280A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/PL2014/000077 WO2016007028A1 (en) 2014-07-10 2014-07-10 Method and apparatus for efficient texture compression

Publications (1)

Publication Number Publication Date
CN107004280A true CN107004280A (en) 2017-08-01

Family

ID=51300805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480079739.4A Pending CN107004280A (en) 2014-07-10 2014-07-10 Method and apparatus for efficient texture compression

Country Status (8)

Country Link
US (1) US10140732B2 (en)
EP (1) EP3167433A1 (en)
JP (1) JP6379225B2 (en)
KR (1) KR102071766B1 (en)
CN (1) CN107004280A (en)
SG (1) SG11201610362RA (en)
TW (1) TWI590198B (en)
WO (1) WO2016007028A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016048176A1 (en) * 2014-09-26 2016-03-31 Intel Corporation Method and apparatus for filtering compressed textures
CN111726639B (en) 2016-11-18 2023-05-30 上海兆芯集成电路有限公司 Texture brick compression and decompression method and device using same
EP3973747A1 (en) 2019-05-20 2022-03-30 Lutron Technology Company LLC Communicating with and controlling load control systems
CA3144460A1 (en) * 2019-07-26 2021-02-04 Lutron Technology Company Llc Configuring color control for lighting devices
KR20220134848A (en) * 2021-03-26 2022-10-06 삼성전자주식회사 Graphics processing unit and operating method thereof
US11908064B2 (en) 2021-05-14 2024-02-20 Nvidia Corporation Accelerated processing via a physically based rendering engine
US11853764B2 (en) 2021-05-14 2023-12-26 Nvidia Corporation Accelerated processing via a physically based rendering engine
US11704860B2 (en) 2021-05-14 2023-07-18 Nvidia Corporation Accelerated processing via a physically based rendering engine
US11875444B2 (en) * 2021-05-14 2024-01-16 Nvidia Corporation Accelerated processing via a physically based rendering engine
US11830123B2 (en) 2021-05-14 2023-11-28 Nvidia Corporation Accelerated processing via a physically based rendering engine

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070076971A1 (en) * 2005-09-30 2007-04-05 Nokia Corporation Compression of images for computer graphics
CN101536526A (en) * 2006-08-31 2009-09-16 Ati科技公司 Texture compression techniques
CN102138158A (en) * 2008-06-26 2011-07-27 微软公司 Unified texture compression framework

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7925087B2 (en) * 2006-11-14 2011-04-12 Siemens Aktiengesellschaft Method and system for image segmentation by evolving radial basis functions
CN102132495B (en) * 2008-05-15 2015-04-29 皇家飞利浦电子股份有限公司 Method and apparatus for compression and decompression of an image dataset
US8452111B2 (en) * 2008-06-05 2013-05-28 Microsoft Corporation Real-time compression and decompression of wavelet-compressed images
CN102934428B (en) * 2010-12-16 2015-08-05 北京航空航天大学 The wavelet coefficient quantization method of human vision model is utilized in a kind of image compression
KR101418096B1 (en) * 2012-01-20 2014-07-16 에스케이 텔레콤주식회사 Video Coding Method and Apparatus Using Weighted Prediction
GB2503691B (en) * 2012-07-04 2019-08-14 Advanced Risc Mach Ltd Methods of and apparatus for encoding and decoding data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070076971A1 (en) * 2005-09-30 2007-04-05 Nokia Corporation Compression of images for computer graphics
CN101536526A (en) * 2006-08-31 2009-09-16 Ati科技公司 Texture compression techniques
CN102138158A (en) * 2008-06-26 2011-07-27 微软公司 Unified texture compression framework

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GREGORY E. FASSHAUER ET AL.: "On choosing "optimal" shape parameters for RBF approximation", 《NUMERICAL ALGORITHMS》 *
HWANG-SOO KIM ET AL.: "Image coding by fitting RBF-surfaces to subimages", 《PATTERN RECOGNITION LETTERS》 *
HU KEYUN ET AL.: "Data Mining Theory and Applications", 30 April 2008, Tsinghua University Press *

Also Published As

Publication number Publication date
TW201618037A (en) 2016-05-16
US10140732B2 (en) 2018-11-27
TWI590198B (en) 2017-07-01
WO2016007028A1 (en) 2016-01-14
SG11201610362RA (en) 2017-01-27
EP3167433A1 (en) 2017-05-17
KR20170007373A (en) 2017-01-18
JP6379225B2 (en) 2018-08-22
US20170154443A1 (en) 2017-06-01
KR102071766B1 (en) 2020-03-02
JP2017523507A (en) 2017-08-17

Similar Documents

Publication Publication Date Title
US11887001B2 (en) Method and apparatus for reducing the parameter density of a deep neural network (DNN)
CN107004280A (en) Method and apparatus for efficient texture compression
WO2017049496A1 (en) Apparatus and method for local quantization for convolutional neural networks (cnns)
CN109643443A (en) Cache and compression interoperability in graphics processor assembly line
US20190087680A1 (en) Edge-Based Coverage Mask Compression
WO2017074608A1 (en) Variable precision shading
CN107077828A (en) Size to color lookup table is compressed
US9705526B1 (en) Entropy encoding and decoding of media applications
US9412195B2 (en) Constant buffer size multi-sampled anti-aliasing depth compression
US20160283549A1 (en) Value sorter
US10410081B2 (en) Method and apparatus for a high throughput rasterizer
WO2016081097A1 (en) Apparatus and method for efficient frame-to-frame coherency exploitation for sort-last architectures
CN106575443A (en) Hierarchical index bits for multi-sampling anti-aliasing
WO2016126400A1 (en) Method and apparatus for direct and interactive ray tracing of a subdivision surface
CN106687924A (en) Method and apparatus for updating a shader program based on current state
US9600926B2 (en) Apparatus and method decoupling visibility bins and render tile dimensions for tiled rendering
US10198850B2 (en) Method and apparatus for filtering compressed textures
CN107004252A (en) Apparatus and method for realizing power saving technique when handling floating point values
EP3198564A1 (en) Efficient tessellation cache
WO2017116779A1 (en) A method of color transformation using at least two hierarchical lookup tables (lut)
CN109219832B (en) Method and apparatus for frame buffer compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170801
