CA2337530C

CA2337530C - Apparatus and method for real-time volume processing and universal 3d rendering

Info

Publication number: CA2337530C
Application number: CA002337530A
Authority: CA
Inventors: Arie E. Kaufman; Ingmar Bitter; Baoquan Chen; Frank Dachille; Kevin Kreeger
Original assignee: Research Foundation of State University of New York
Current assignee: Research Foundation of State University of New York
Priority date: 1998-07-16
Filing date: 1999-07-16
Publication date: 2007-11-20
Anticipated expiration: 2019-07-16
Also published as: WO2000004505A1; CA2337530A1; AU4998299A; JP2002520748A; EP1114400A4; AU757621B2; EP1114400A1

Abstract

An apparatus and method for real-time vol-ume processing and universal three-dimensional rendering (10). The apparatus includes a plurality of three-dimensional memory units; at least one pixel bus for providing global horizontal communication; a plurality of rendering pipelines; at least one geometry bus; and a control unit. The plural-ity of rendering pipelines each preferably include hardware for interpolation, shading, FIFO
buffering, communication and lookup tables. The appa-ratus of the present invention may be coupled to a geometry pipeline (18) for mixing surfaces, images and volumes together in a single image. A
method for performing volumetric ray casting of a 3D volume includes the steps of calculating a distance along a major projection axis from a pre-defined viewpoint; dividing the volume into a plu-rality of consecutive regions having exponentially increasing bounds; casting a plurality of rays from the viewpoint through the volume; either merging two or more rays or splitting one or more rays at the region boundaries; and repeating the ray casting and merging/splitting steps until the entire volume has been processed. The apparatus and methods of the present invention achieve the real-time performance for high-resolution volume rendering, mixing surfaces and volumes in a single image operations, including texture mapping and image-based rendering.

Description

APPARATUS AND METHOD FOR REAL-TIME VOLUME

BACKGROUND OF THE INVENTION

Field of the Invention The present invention relates generally to three-dimensional (3D) graphics and volume visualization, and more particularly relates to an apparatus and method for real time volume processing and universal three-dimensional rendering.

Description of the Prior ArtComputer rendering is the process of transforming complex information into a format which is comprehensible to human thought, while maintaining the integrity and accuracy of the information. Volumetric data, which consists of information relating to three-dimensional phenomena, is one species of complex information that can benefit from improved image rendering techniques. The process of presenting volumetric data, from a given viewpoint, is commonly referred to as volume rendering.

WO 00/04505 PC7'/t1S99/16038 Volume visualization is a vital technology in the interpretation of the great amounts of volurneb ic data generated by acquisition devices (e.g., biomedical scanners), by supercomputer simulations, or by synthesizing geometric models using volume graphics teclmiques. Of particular importance for manipulation and display of volumetric objects are the interactive change of projection and rendering parameters, real-time display rate,s, and in many cases, the possibility to view changes of a dynamic dataset over time, called four-dimensional (4D) visualization (i.e., spatial-temporal), as in the emerging integrated acquisition visualization systems.

A vollunetric dataset is commonly represented as a 3D grid of volume elements (voxels), often stored as a full 31) raster (i.e., volume buffer) ofvoxels.
Volume rendering is one of the most conunon techniques for visualizing the 3D
scalar field of a corYtiiiuous object or phenomenon represented by voxels at the grid points of the volume dataset, &id can be accomplished using hvo primary methods: object-order methods and image-order methods. Using an object-order approach, the contribution of each voxel to the screen pixels is calculated, and the combined contribution yields the final image. Using an image-order approach, sight rays are cast from screen pixels through the volume dataset, and contributions of voxels along these sight rays are used to evaluate the corresponding pixel values.

Over the past three decades graphics systems have evolved duofold: from primarily two-dimensional (2D) to 3D and 4D (space and time), and from vector graphics to raster graphics, where the vector has been replaced by the polygon as the basic graphics primitive. This has led to the proliferation of polygon-based geometry engines, optimized to display millions of triangles per second. In such systems, however, triangle facets only approximate i:he shape of bjects. Still, the 3D
polygon-based graphics market continues to boom, and has become one of the hottest arenas of the personal computer (PC) industry.

In response to emerging demands pliaced on traditional graphics systems, various techniques have been devised to hairidle and display discrete imagery in order WO 00/04505 PC'T/[JS99/16038 to enhance visual realism of the geometric model, as well as enhance or replace object shape and structure. Yunong these techniques include 21y texture and photo mapping, enviro ent rraappiug, range images for image-based reridering, 2D mip-mapping, video streams, 3D volrjrnes, 3D mip-mapping, 4D light fields and lumigraphs, and five-dimensional (5D) plenoptic fumetiorls. All these teciiniques require some sort of dimensionality-based interpolation (bilinear, trilinear, quadrilinear, etc.) between discrete pixels, texels, voxels, or n-oxels.

Special purpose computer architectures and methods for volume visualization are known in the art. Traditional methods of volume visualization typically operate by sc ing through a volume dataset in a sequential manner in order to provide an accurate represeritatiori of an object. For example, Ciibe-4, an architecture developed by Dr. Arie Ka.ufanau, Ingmar Bitter and Dr. Hanspeter Pfister, some of whom are also named inventors in the present application, is a special purpose scalable volume rendering architecture'based on slice-parallel ray-casting. Cube-4 is capable of delivering true real-tirr.ie ray-casting of high. resolution datasets (e.g., 10241 16-bit voxels at 30 Hertz franie rate). However, Cube-4 cannot deliver such real-time performance for perspective projections. Presently, in known prior art rendering systems, the use of perspective projections either increases the rendering time or decreases the projected, irnage quality. Additionally, prior architectures do not provide the ability to combine volumes and geometries into a si-ilgle image.
Referring now to Figure 1, a conventional volume visualization system 1 is shown. As illustrated in Figure 1, the volun,rie data is stored on a disk 2 and loaded into memory 4 before renderirZg. A Central Processing Ullit (CPU) 6 then computes the volume rendered iniage from the data residing in rne.m.ory 4. The final image is written to a frame buffer 8, which is typically embedded on a graphics card, for displaying on a monitor 9 or similar display device.

The present invention, therefore, is intended to provide a method and apparatus which significantly enhances the capabilities of known methods and WO 00/04505 PCT'/LJS99/16038 apparatus to the extent that it can be considered a new generation of imaging data processing.

Other and ftu-dier objects will be made lcnown to the artisan as a result of the present disclosure, and it is intended to include all such objects which are realized as a result of the disclosed invetltion.

SU ARY OF 7.'HE I EN7rI N

The present iraiienti n is tantamount to a depar"cure from the prior art because of the all-enc mpassing new characteristics,. An apparatus, in accordance with the present invention, for i eal-tirne volume processing and universal three-dimensional (3I3) renderirig includes one or more three-dimensional (3D) memory units; at least a first pixel bus; one or ln re reudering pipelines; one or more geometry busses; and a control unit. The apparatus is responsive tc, viewing and processing parameters which define a viewpoint, aud the apparatus generates a 3D volume projection image from the viewpoint. The projected image includes a plurality of pixels.

The 3D memoiy units store a plurality of discrete voxels, each of the voxels having a location and ir xel data associated therewith. The voxels together f rsn a volume dataset, and th-. viewing and processing parameters define at least one face of the vulume dataset as the base plane of the volume dataset as well as first and last processing slices of the volume dataset. The control unit initially designates the frst processing slice as a cizrrerat slice of sample points, and controls sweeping through subsequent slices of the volume dataset as current slices until the last processing slice is reached.

Each of the plurality frerldering pipelines is vertically coupled to both a corresponding one of the plurality of 3D memory units and the at least first pixel bus, and each of the rendering pipelines has global horizontal communication preferably W 00/04505 PC't'/IJS99/16038 ~
with at most its two nearest neighbors. The rendering pipelines receive voxel data from the corresponding 3D memory units and generate a two-dimensional (2D) base plane image aligned with a face of the volume dataset. The geometry I/ bus provides global horizontal co unication 1'cietween the plurality of rendering pipelines and a geometry engine, and the geometry I/C) bus enables the rendering of geometric and volumetric objects together in a single image.

The apparatus and methods of the present invention surpass existing 3D
volume visualization architectures and nletllods, not only in terms of enhanced perforrnance, image rendering quality, flexibility and simplicity, but in terms of the ability to combine both volumes and surfaces (particularly translucezit) in a single image. The present invention provides flexible, high quality, true real-time volume rendering from ar"bitrary viewing directions, control of reiadering and projection parameters, and mechanisms for visualizing internal and surface structures of high-resolution datasets. lt : er supports a variety of volume rendering enhancements, including accurate perspective projection, niulti-resolution volumes, multiple overlapping volumes, clipping, improved gradient calculation, depth cuing, haze, super-sampling, anisotropic datasets and rendering of large volumes.

The present invention is more than a mere volume rendering machine; it is a high-perforinance interpolation engine, and as such, it provides hardware support for high-resolution volume rendering and acceleration of discrete imagery operations that are highly dependent on interpolation, including 2D and 3D texture mapping (with mip-mapping) and image-based rendering. p urtherrnore, the apparatus and methods of the present invention, coupled with a geometry engine, combine volumetric and geometric approaches, allowing users to efficiently model and render complex scenes containing traditional geometric primitives (e.g., polygonal facets), images and volumes together in a single image (defined as universal 3D rendering).

The apparatus cof the present invention additionally provides enhanced system flexibility by including various global and local feedback connections, which adds the 5 PCT'fUS99/16038 ability to reconfigure the pipeline stages to perforrn advanced imagery operations, such as imaging vrarping; and multi-resolution volume processing.
.pnrthermore, the present invention accomplishes these objectives in a cost-e:laective manner.

It is an advantage of the present invention to provide a method and apparatus dvhich, when coupled to a geometry engine, enables the mixing of geometries and volumes together in a single image.

It is another advantage of the present invention 'to provide a method and apparatus for efficientlmr rendering multiple overlappin,g volumetric objects.

It is also an adz~=antage of the present invention to provide an apparatus having the ability to reconfigure selective pipeline stages to perforn advanced volume rendering functionalities.
It is a further advantage of the present invention to provide a method and apparatus which supports enhanced imaging operations that are highly dependent on interpolation.

It is still afurther advantage of the present invention to provide a method and apparatus which accelerate the processing of large volurne da.tasets, It is another advantage of the present invention to provide either higher-quality or faster perspective projections than existing special p-urpose harchvare volume 2$'s visuahzat2on systen7.s.

It is still another advantage of the present invention to provide amet.hod and apparatus which are readily capable of irnproving the quality of images with enhancements to gradient estimation and interpolation units.

It is yet another advantage of the present invention to provide an apparatus which is capable of implenzenting image vvarping in hardware.

It is yet a further advantage f the present invention to provide a method and apparatus which verccirne the inherent disadvantages of known volume visualization systems.

These and ethe- features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

B EF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of a conventional volume visualization system.
Figure 2 is a conceptual block diagram illustrating a universal three-dimensional rendering system formed in accordance vrith one embodiment of the present invention.

Figure 3 is a simplified block diagram of the Cube-5 unit of Figure 2 illustrating a preferrecl implementation of the present invention.

Figure 4 is a fia.ncti nal block diagram depicting an vervieiYv of the universal three-dimensional rer.LderYng architecture foraned in accordance with one embodiment of the present invention.

Figure 5 is a 9:uncti nal block diag3ralm illustrating a unit rendering pipeline f mied in accordance with one embodiment of the present invention.

WO 00/04505 PC'r/tJS99/16038 Figure 6A is a graphical representation showing how 32 bits of texel data is stored for 2x2 neighbo:rhood in a miniblock of 16-bit voxels, in accordance with a preferred method of the present invention.

Figure 6B depicts a tabular coznparison of voxel storage and texel storage for the example of Figure tiA.

Figure 7 illustrates special parallel preserving scanlines in the source and target images in accordance with a preferred forward irnage warping method of the present invention.

Figure 8 is a graphical representation illustrating a method for deterrriining the parallel preserving scasiline direction in accordance with the preferred forward image warping method of the present invention.

Figure 9 is two=-dirnensional graphical representati on of an exatnple illustrating pixel read templates in the source image for performing scanline processing, in accordance with the preferred forward image warping method of the present invention.
Figure 10 is two-dimensional graphical representadtion of the example of Figure 9 illustrating a bilinear interpolation of samples for performing scanline processing, in accordasice with the preferred forward image warping method of the present invention.

Figure 11 is tw -d1F'I1er1s1onal graphical representation of a linear interpolation on samples to obtain pixel values for performing target pixel correction, in accordance with a preferred method of the present inverityon.

WO 00/04505 1FC"T/IJS99/16038 Figure 12 is a gr6Lphical represeritatior illustrating the calculation of an anisotropic filter footpriiat for perforta.ing antialiasing in accordance with a preferred forward image warping rnethod of the present inverition.

Figure 13 is a graphical representation illustrating the splatting of so-arce pixels onto the target samples, in accordance with the preferred forward image warping method of the present invention.

Figure 14 depicts an example of a,y-slice shear for perfor.rnirig three-dimensional rotation by, two-dimensional slice shear decomposition, in accordance with one method of the present invention.

Figure 15 depicts an ex ple of an x-beam shear fbr perforrning three-dimensional rotation by two-dimensional beain shear decomposition, in accordance with another method of the present invention.

Figure 16 depicts an example of an x-slice-y-bea.rri shear for perforrraing three-dimensional rotation by two-dimensional slice-beam shee:ly decomposition, in accordance with still another method of the present irrverition.

Figure 17 depicts an example of a three-dimensional x-beam shear for performing three-dim,,-nsional rotation by three-dimensional beam shear decomposition, in accordance with yet another method of the present invention.

Figure 1SA iLustrates a conventional undersampling method for performing perspective projections.

Figure 18B illustrates a conventiojaal oversampling method for performing perspective projections.

Figure 19 illustrELtes an adaptive perspective ray-casting method for performing perspective volumetric projections, in accordance with. a preferred forrn of the present invention, w'.aerein a view fr'zstoni is divided into regions based on exponentially increasing distance from a vievvpoint.

Figure 20A is a graphical representation illustrating the splitting/merging of rays at exponential regic-n boundaries, in accordance with the preferred adaptive perspective ray-casting tnethod of the present invention.

10 Figure 20B is a graphical representation illustrating the effective filter weights for ray segments A, B aad C of the adaptive perspective ray-casting method example of Figure 20A.

Figure 21 illustrates an exaimple of the weights for a two-dimensional filter of 115 size +2 samples, in accordance with a preferred fonn of the present invention.

Figure 22 is a graphicai representation illustrating an ex ple of a of the adaptive perspective ray-casting method of the present invention, wherein a 73 volume is three voxel units frorn the viewpoint.

Figure 23 is a pseudo-code representation of a preferred method for perfornning Eacponen.tia:l-It.egions Perspective back-to-front projection of a volume, in accordance with one fo rrra of the present invention.

Figure 24 illust ates an example of the Exponen.tia:l-Regions Perspective ray casting method of the feresent invention across two regions.

Figure 25 depicts an example of the preferred weights for perforrning a 33 symmetric approxinlati.on of the x-component of a Sobel gradient filter, in accordance with one embodiment of the present invention.

WO 00/04505 PC'Y'/1JS99/16038 Figure 26 is a graphical representation illustrating a method for mixing geognetric objects and volumes in a single irriage in accordance with one form of the present invention.

Figure 27 is a g.raphical representation of a method for clipping triangles to thin slab boundaries in accordance with one form of the present invention.

Figure 28 is a graphical representation of a method for bucket sorting translucent polygons iri accordance with a preferred forrn of the present invention.
Figure 29 is a graphical representation of a method, in accordance with one forrn of the present invention, for creating slzeared viewing geometry by pre-warping the polygon footprints, Figure 30 is a graphical representation of a Cube-5 pipeline, forrned in accordance with one fornn of the present inveo.tion, illustrating an SRAM
composite buffer included tlrereb.i.

Figure 31 is a graphical representation of a conventional graphics accelerator, conceptczally illustrating the interfaces betweerb the textuxe memory, frame buffer and geometry pipelirie.

Figure 32 is a graphical representation illustrating one embodiment of the present invention errA.ploying a dual-use DRAM frame buffer connecting a geometry pipeline with the Cube-5 volume rendering pipeline of the present invention.

Figure 33 is ~L block diagrarri illustrating memory interfaces for each Cube-5 pipeline including a coxel FIFO queue, in. accordance with one forin of the present invention.

WO 00/04505 PCT/t(JS99/16038 Figure 34 is agTaplrical representation of a RCs13U coxel layout onto eight DRAM chips, formed in accordance with a preferred embodiment of the present invention.

Figure 35 is a partial block diagram representation of an embedded DRAM chip implementation of run-length encodin;~ (RLE) frame buffer hardware, forr~n.ed in:-accordance with one farrn of the present invention.

Figure 36 is a;pseardo-code representation showirig processing occurring in the RLE hardware of F'igiare 35, in accordance with one form of the present invention.
Figure 37 is a graphical representation of a preferred embodiment of the present invention illu9trating a RLE frame buffer conl-iectin.g a geometry pipeline to an SRAM compositing buffer included in the Cube-5 pipelnne.

Figure 38 illustrates a density profile of an oriented box filter taken along a line from the center of a solid primitive outward, perpendicular to the surface, in accordance with one forrn of the present invention, Figure 39 illustrates a density profile of an oriented box filter taken along a line perpendicular to a triangle surface primitive, in accordance with another fornn of the present invention.

Figure 40 depicts a two-dimensional illustration of seven voxelization regions for a triangle primitive, in accordance with a preferred embodiment of the present invention.

Figure 41 is a pseudo-code representation of a naethod for computing the distance from a plane, in accordance witYc one form of the present invention.
34) Figure 42 is a block diagram representation illustrating an overview of a hardware voxelization pipeline, formed in accordance with one embodiment of the present invention.

Figure 43 is a block diagram depicting a distance unit which incrementally computes the distance of a current voxel from a desired plane, in accordance with one form of the present invention.

Figure 44 is a top view graphical representation illustrating a preferred method for performing image-based rendering in accordance with one form of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The apparatus and methods of the present invention are capable of processing data and supporting real-time visualization of high resolution voxel-based data sets.
The present invention is a universal three-dimensional (3D) rendering system delivering enhanced volume rendering in addition to the integration of imagery (e.g., volumes, textures and images) with geometry (e.g., polygons). The apparatus and methods are designed for use as a voxel-based system as described in the issued patents and pending applications of Dr. Arie Kaufinan, a named inventor in this application, including "Method of Converting Continuous Three-Dimensional Geometrical Representations Into Discrete Three-Dimensional Voxel-Based Representations Within a Three-Dimensional Voxel-Based System", which issued on August 6, 1991, as U.S. Patent No. 5,038,302; "Method of Converting Continuous Three-Dimensional Geometrical Representations of Polygonal Objects Into Discrete Three-Dimensional Voxel-Based Representations Thereof Within a Three-Dimensional Voxel-Based System", which issued on January 22, 1991, as U.S.
Patent No. 4,987,554; "Method and Apparatus for Storing, Accessing, and Processing Voxel-Based Data", which issued on January 15, 1991, as U.S. Patent No.
4,985,856;

"Method and Apparatus for Generating Arbitrary Projections of Three-Dimensional Voxel-Based Data", which issued on March 31, 1992 as U.S. Patent No.
5,101,475;
"Method and Apparatus for Real-Time Volume Rendering From An Arbitrary Viewing Direction", which issued on August 6, 1996, as U.S. Patent No.
5,544,283;
"Method and Apparatus For Generating Realistic tmages Using a Discrete Representation", which issued on August 15, 1995, as U.S. Patent No.
5,442,733; and "Apparatus and Method for Parallel and Perspective Real-Time Volume Visualization", which issued on December 8, 1998, as U.S. Patent No.
5,847,711.

Figure 2 illustrates a conceptual view of a universal 3D rendering system 10 formed in accordance with one embodiment of the present invention.
Applications 12 which display.collections of renderable objects are preferably split by an Applications Program Interface (API) 14 into appropriate imagery and geometry representations.
These representations are subsequently processed by an imagery unit 16 and a geometry unit 18, respectively, which are illustrated generaIly as functional blocks.
The imagery unit 16 preferably includes a plurality of imagery pipelines and the geometry unit 18preferably includes a plurality of geometry pipelines (not shown) for rendering the imagery and geometry representations, respectively. The rendered outputs of the imagery unit 16 and the geometry unit 18 are subsequently combined in a blending unit 20 to generate a single baseplane image. This baseplane image may preferably be transformed by a warp unit 22 to a final projection plane for display.
Figure 3 illustrates one implementation of the Cube-5 volume visualization system of the present invention. As shown in Figure 3, the system preferably includes one or more three-dimensional memory units 24, with each 3D memory unit 24 vertically coupled to an input bus 26 and a corresponding Cube-5 chip 28. A
plurality of Cube-5 chips 28 are shown connected to a frame buffer pixel bus 34.
Furthermore W 00/04505 PC'I'/US99/1603$
the system 10 of the present invention preferably interfaces to at least one conventional geometry engine 30 and a host computer 32, both operatively coupled between the input bus 25 and the frame buffer pixel bus 34 for communicating with the Cube-5 apparatus of the present invention.

Referring now tD Figure 4, the apparatus of the present invention 10 includes a.-plurality of 3D memory units 24 which are preferably connected to an imagery input bus 26, providing global horizontal communication between the 3D memory units 24.
The volume dataset is commonly represented as a regular grid of volume elements, or 10 voxels, often stored as a full 3D raster (i.e., volume buffer). This volume dataset is preferably distributed across the 3D memory units 24. With a skewed distribution, the present invention allo-vrs conflict-free access to complete beams (i.e., rows) of voxels parallel to any of the niajor axes, thereby reducing the memory-processor bandwidth bottleneck. As illustraled in Figure 4, for s treaming video or four-dimensional (4D) 15 volume data through the system 10, each 3D memory unit 24 is preferably connected to a dedicated real-tirr e input 36. By providing a dedicated connection to a real-time input source, the na.err.ory-processor bandwidth bottleneck is further reduced.

The universal 3D rendering system. 10 of the present invention further includes a plurality ofrenderirig pipelines, shown as functional blocks of Cube-5 units 38 in Figure 4. Each rendering pipeline 38 is connected to a corresponding 3D memory unit 24 and preferably has horizontal communication with at least preferably its two nearest neighbors. The Cube-5 units 38 read from their dedicated 3D memory units 24 and produce a two-dimensional (2D) baseplane irnage. This baseplane image, which contains a plurality of composited pixels generated by the Cube-5 units 38, is preferably distributed across a plurality o:E'two-dimensional (2D) memory units 40.
Each of the plurality of 2D memory units 40 is preferably connected to both a corresponding Cube-5 pipeline unit 38 arid a baseplane pixel bus 42 which provides global Yao ntal communication between 2D rnemory units 40.

Preferably, the present invention includes a plurality of warp units 44 connected to the baseplane pixel bus 42. The warp units 44 assemble and transforrn W 00/04505 PC'1'/YJS99/16038 (i.e., warp) the baseplane image stored in the plurality of 2D memory units 40 onto a user-defined image plane. Although the present invention contemplates using a single warp unit 44 (e.g., in order to reduce the costs or overhead of the hardware), the use of a plurality of warp units 44 is desirable to accelerate image transforrnati ns.

The output of each of the warp units 44 is preferably connected to a frame buffer pixel bus 34 which provides global horizontal corramunication between wa-rp units 44. Reading the source pixels over the baseplane pixel bus 42 and writing the final image pixels over the frame buffer pixel bus 34 preferably happens concurrently in order to allow greater system throughput. Although not a preferred architecture, the present invention also contemplates sequential reading and writing'by the warp units 44. In this manner, only one pixel bus may be required, assuming the one pixel bus offers sufficient bandwidth for real-time image transfer for a full screen image.

With continued reference to Figure 4, the present invention preferably includes ageorn.etry input bus 46 and a geometry output bus 48, although it is contemplated to combine the two busses into a single geometry input/output bus of su.fficient bandwidth for real-ti;,cne imaging. The geometry input and output busses 46 and 48 are preferably connected to the inputs and outputs of the Cube-5 units 38 respectively and provide for the unique coupling of at least one geometry pipeline or engine (not shown) to the present system 10. The architecture of the present invention, coupled with a geometry engine via the geometry busses 46 and 48, supports the integration of irnagery, such as volumes and textures,,%ith geometries, such as polygons and surfaces. This rnixirig of geometric data with volumetric objects is a powerful feature which is unique to the present invention.

Referring now to Figure 5, there :is illustrated a block diagram depicting the functional stages of'one of the plurality of Cube-5 rendering pipelines (reference number 38 in Figure 4), form.ed in accordance with one ernbodiment of the present invention. As shovm in Figure 5, each rendering pipeline 52 preferably includes four types of processing units: a trilinear interpolation unit (TriLin) 54, a gradient estimation unit (Gradient) 56, a shading unit (Shadee) 58 and a compositing unit W 00/04505 PC II'/US99/16038 (Cornpos) 60. Each of these rendering pipeline stages is described in detail in the prior issued patents and pending applications of Azie Kaufinan relating to prior Cube volume visualization architectures (listed above) and are therefore only briefly discussed herein below.

As discussed above in reference to the 3D memory units 24, the volume dataset is stored as a regu.lar grid of voxels distributed across the 3D
irnernory units 24 in a skewed fashion, with each Cube-5 unit:38 connnected to a corresponding 3D
memory unit 24 (see Figure 4). Voxels of the same skewed beam are preferably fetched and processed ; n parallel, distributed across all Cube-5 units 38.
Consecutive slices of the volume dataset parallel to a predefined baseplane (i.e., parallel to a face of the volume dataset ivhich is most perpendicular to a predefined view direction) are preferably traversed in scanline order. Referring agaira to Figure 5, an address generation and control unit 62 preferably generates the addresses for access into the 3D memory unit 24. 7'he address generation and control unit 62 additionally designates a first processing slice as the current processing slice and controls sweeping through subsequent slices of the volume dataset until the final slice has been processed.

The trilinear interpolatiori unit 54 computes a new slice of interpolated sample values between two processing slices. It is contemplated by the present invention that the trilinear interpolation function may alternatively be performed as a sequence of linear or bilinear interpolations.

The gradient (vsti tion unit 56 preferably cornputes central difference gradients using voluYne data from multiple slices of the volume dataset.
Utilizing the central difference gradients generated by the gradient estimation unit 56, sample points of the current processing slice are subsequentiy shaded by the shading unit 58.
The shading unit 58 preferably uses the samples and gradients as indices into one or more look-up tables (LUTs), preferably residing in each shading unit 58, which store material color and ir-tensity information. The material color table is dataset-type dependent, while the color intensity table is based on a local illumination model, as WO 00/04505 PC'I"/[7s99/16038 known by those skilled in the art. In simple terms, the multiplication of color and intensity yields a pixel color for each sample which is used in the conapositing unit 60 to composite such c loy, with the previously accumulated pixels along each sight ray.

With reference again to Figure 4, data for computing the next sample along a continuous sight ray may reside on a neighboring Cube-5 unit 38. In this case, the nearest-neighbor connections between Cube-5 units 38 are preferably used to transfer the necessary data to the appropriate Cube-5 unit 38, which will continue to process that particular sight ray. When compositing has been completed, the c mposited pixels (i.e., baseplane pixels) are preferably stored in the corresponding 2D
memory unit 40 connected to the Cube-5 unit pipeline 38. The baseplane pixels, which form the baseplane image, a: e subsequently read from the 2D memory units 40, via the baseplane pixel bus 42, and assembled by the warp units 44. The warp units 44 additionally transform the baseplane image to the final projection plane image.

Referring to Figure 5, the delay of data required for the trilinear interpolation unit 54 and gradient estimation unit 56 is preferably achieved by inserting one or more first-in-first-out (FIFO) units 64 into the pipeline data pa:th prior to being processed by the trilinear interpolation 54 and the gradient estirnation 56 units. The FIFO
unit(s) 64 may be implernented as, for exatnple, random access memory (RAM), preferably embedded on the Cubv-5 chip. T'he introduction of a predeterrnined delay may be particularly important when simultaneously processing beams of multiple slices of the volume dataset, thereby requiring more computation time between slices.

A compositing buffer (Compos Bu}fea ) 74 operatively coupled to a bilinear interpolation unit (Bii;irc) 72 essentially provides a one slice FIFO. The bilinear interpolation unit 72 preferably interpolates to obtain values between voxels as needed for texture mapping. For volume rendering, BiLin 72 preferably uses only weights of 0.0 or 1.0 which selects one of the corner voxels of the volume dataset (determined by Select x and Select y). It just moves the ray data, if the ray crosses pipelines. Just a mux [71 for x and yviould be enough for volume rendering, but bilinear interpolation is preferred because of texture mapping.

W 00/04505 PC'r/US99/16038 The Cube-5 architecture preferably supports re-ordering of the pipeline stages and a number of multipass rendering and processing operations, which require feedback connections between various stages of the Cube-5 rendering pipelines and the 3D memory units 24. For exarnple, correct rendering of overlapping volumetric objects pref:erably requires at least two passes through the Cube-5 pipeline 52, where the first pass re-sarnples the volunietric objects to align them with each other and the second pass renders the volurr:ietric objects in interleaved order. As shown in Figure 5, a multiple volumes feedlback path 66 is preferably provided, operatively connecting the output of the compositing unit 60 to the corresponding 3D

memory unit 24, whicYt allows the re-sampled volumes to be written back into the 3D
memory unit 24 after re-sampling, classification and shading. The final rendering pass works on I2G~a volumes.

Similarly, each Cube-5 reridering pipeline 52 preferably includes an image-based rendering feedba,ck path 68 connectecl between the waip unit 44 and the memory unit 24. The image-based rendering feedback line 68 preferably provides a feedback path for writing the interrnediate warped images to the 3D memory unit 24.
This may be particularly useful for accelerating certain image-based rendering operations requiring multiple warp passes. The architecture of the present invention further contemplates feedback connections lbetween the 3D memory unit 24 and various other Cube-5 r ndering pipeline stages, or between the individual pipeline stages thexnselves. Image rendering speed may be substantially increased by including feedback paths which provide direct and immediate access to the computational results of individual pipeline stages, without having to wait for the results to traverse throizgh the entire Cube-5 rendering pipeline 52.

In a preferred embodiment of the present invention, the Cube-5 system includes connections vrhich bypass selective stages of the rendering pipeline, that, for example, may not be required for certain imaging operations. By bypassing these unused pipeline stages, such imaging operations can be accelerated. As illustrated in Figure 5, a texture rnap bypass 70 is preferably included in each Cube-5 rendering pipeline 52. This textarre map bypass connection 70 substantially speeds up mip-w 00/04505 1PC7'/QJS99116030 mapping, for instance, vrhich consists of storing multiple levels-of-detail (LOD) of the image to be processed, by bypassing the shading utllt 58 a-ad compositing unit 60 and directly presenting the results from the trilinear interpolation unit 54 and gradient estimation unit 56 to th+;, bilinear interpolation unit 72. In this way, the architecture of 5 the present invention caLn preferably be consiidered not only as an array of pipelines for performing volume rendering, but as a collection of hardware resources which can be selectively configured 1:o perfonn a variety of imaging operations. For example, when the Cube-5 system of tlae present invention is perforixirig volume rendering, essentially all of the ha.rdware resources are required, while texture mapping generally 10 requires only memory, some buffering and the interpolation units.

Another unique and amportant aspect of the present invention which will now be discussed is the ability of the Cube-5 architecture to preferably interface with at least one convergtiona:i geometry engine 76 to support mixing of geometric data and 15 volumetric objects in a single image. s is preferably accomplished by providing at least one geometry btis, as discussed above, to interface with the geometry engine 76.
Preferably, thv Cube-5 architecture of the present invention is adapted to re-use pipeline components (e.g., interpolation unit, etc.), wherever possible, to 20 accelerate a variety o f renderirAg algoritlnis using multiple configurations, in particular, rendering scenes of multiple volumetric and polygonal objects, texture mapping, and image-based rendering. Arriong other important advantages, reusnng pipeline components reduces hardware costs. The Cube-5 architecture also supports various unique methods and algorithms for enhancirig volume rendering and acceleration of other imaging operations. Some of these methods and algorithms will be discussed individually in greater detail below.

In a preferred embo ierlt of the Cube-5 system, fo ed in accordance with the present invention, volume datasets are stored in blocks, thereby taking advantage ~b0 of spatial locality. Instead of linear blocking (e.g., Voxelator API), hierarchical blocks are used which are preferably stored in a distributed arrangement, skewed across multiple 3D xneanory units. For example, using current Mitsubishi Electric 16-WO 00/04505 PC't'/t1S99/16038 bit, 125 megahertz syrtchronous dynamic random access memory (SDRAM) to implement the 3D memory, each block can contain 83 16-bit voxels requiring bytes or two SDRAM pages.

Each block is preferably organized as a collection of 23-voxel miniblocks residing in the same 3LO memory unit. The banks inside the SDRAM can preferably be accessed in a pipelir.ted fashion such that the current burst transfer essentially completely hides the setup of the subsequent burst transfer. If the view-dependent processing order of the voxels in a miniblock does not coincide with their storage order, then the eight miniblock vexels are preferably reordered on the Cube-5 chip.
Hence, a single copy of the volume dataset on the SDRAM is suff icient.
Therefore, hierarchical blocking i-llows random access to miniblocks at essentially full burst mode speed, essentially full (100%) bandwidth utilization, view-independent data storage and balanced workload.

Blocking not only optimizes the memory interface, but has an additional advantage of reducing the inter-chip co aunication bar-dwidth (i.e., between Cube-5 hardware units), since only the voxels on the block perirneters need to be exchanged between neighboring chips processing neighboring blocks. While processing a b3-voxel block in (P) time, only the (b') voxels on the block boundary need to be communicated between chips processing neighboring blocks, where b is the size of a block edge and each block has bxbxb (i.e., b') voxels. Therefore, inter-chip communication needs (1/b)less bandwidth than with a non-blocking solution.
The size of the block edge b can be in the range of about 4 ti b s 64, although a block edge size of eight (8) is preferred.

Block look-up tables (LUT) are preferably utilized to store the pointers to all blocks comprising tJae current volume. I'his approach provides an easy method to restrict the active voliiime while zooming into a selected region of interest of a large volume. It also alle-ws rendering of arbitrarily shaped sub-volumes (at block-sized ge-anularity). Additionally, scenes containing many srriall volumes can be rendered very efficiently, as all volumes can reside anywhere aknong the 3D memory units, and WO 00/04505 PC'II /1JS99/16038 only the look-up tables rraust be reloaded for each volume, rather than the 3D
memory units.

One method of perforgning perspective projection and/or Level-of-Detail (LOD) relies on two-fold super-sanapling in the x and y directions.
Accordingly, a four-times (4x) replication of the interpolation units for trilinear interpolation, as well as the gradient estimation units for gradient computation, is preferably employed. As a result, the datapath between the SDRAM and the Cube-5 pipelines is essentially unchanged. However, 4he bandwidth between Cube-5 pipelines is quadrupled, as is the on-chip throughput and buffers, primarily because each sample of the nor.rnal mode is replaced by up to four samples (i.e., 2x in the x direction and 2x in they direction).

Handling anisutropic datasets and super-sainpling preferably require a modification of opacylbr a. The combined fimction is a' = 1-(1 -CZ)dlk, with super-sa.tnpling factor ;~ representing the number of samples per voxel cell, and d representing the distance which a sight ray travels (i.e., the length of the path of the sight ray) through eacli voxel cell. preferably, a loolc-lap table (LUT) is employed, for fast look-up of a' during rendering.

With continue(l reference to Figure 5, the perspective rendering of volumetric data with close to unif:ortn s pling of the underlying volume dataset requires re-scaling of the compositing buffer 74 with filtering between levels: Level-of-detail (LOD) perspective rendering requires re-ali ent of the compositing buffer 74 between levels. Both of these processes, which incorporate global communication not available in the pipelines 52, are preferably perforrned by the warp unit(s) 44.
Although the compositing buffer 74 is already accessible to the warps units 44, it is preferred that a feedback line 43 be used to write the filtered values back into the compositing buffer 74.

JO

A hardware viarp unit is generally necessary to obtain final full screen images in real time (i.e., a 301-Iertz franie rate). As shown in Figure 5, the baseplane image, WO 00/04505 PC'T/[JS99/16038 generated by the compositing units 60 of the Cube-5 rendering pipelines 52, is preferably buffered in the 2D memory units 40. To lower the memory bandwidth from the 2D memory units 40 to the warp unit 44, each pixel fthe baseplane image is preferably accessed only once. To perform a linear interpolation between samples of the current and the pre ii us scanline, another FIFC) unit, sized to hold at least one scanline, is required to store the previous scanline s ples. The interpolation rweights:-f r each grid pixel are preferably pre-calculated on a host machine.

In order to perf:crm the accurate mixing of volumes and geometry, for opaque geometric objects, the Z-buffer image is preferably written to the compositing buffer 60. The compositing unit 60 must perforrn a z-comparison prior to blending each new s ple. Additionally, for translucent geometric slices, the geometry engine 76 preferably utilizes the geometry input bus (reference nurnber 46 in Figure 4) of the present invention to insert each slab of R.G~a values into the data strearn so that each slab is interleaved with the volumetric data slices.

For texture mapping, Figazre 6 shows, by way of example, how 32 bits of texel data are preferably stored for a 2x2 neighborhood in a miniblock of 16-bit vaxels in the 3D memory unit, in accordance with the present invention. Therefore, a four-texel neighborhood of 32-bit texels is preferably read during each niern ry burst read.
Without data duplication, the Cube-5 syst.-n preferably performs, on average, 2.25 data burst reads to ac-cess the appropriate 7exel neighb rh cd, since some texture coordinates may lie between stored miniblocks.

2-- With reference again to Figure 5, in accordance with one forrn of the present invention, one way to implement image-based renderirlLg in hardware is to utilize the memory c ntrcl uni't 78, preferably included in each Cube-5 pipeline 52, to read the appropriate source fsixels based on the contributing region for each pipeline.
The interpolation units (e.g., 54 and 72) in that pipeline 52 will then preferably perform the four-dimensional (4D) interpolations needed for light field rendering or lumigraph. As an alterr?ative implementation, the warp unit 44 may be utilized to perform this function. The source pixels contributing t the current view are read and W 00/04505 PC'r/1JS99/16038 assembled into the 2D memory units 40, preferably thro-agh a connection line 41, followed by the warp transforrnation. preferably, four assembled source images are processed in four consecutive warp passes. The final combination of the four intermediate warped images is performed in the Cube-5 pipeline 52. As described previously above, the iniage-based rendering feedback line 68 provides feedback for writing the interrnediate, warped images to the 3D memory 24. For either approach, the 3D memory units 2za provide local storage for a large database of images.

It is to be appreciated that the apparatus of the present invention described herein above (and referred to as Cube-5) may considerably accelerate conventional volume processing methods, beyond the universal rendering already described.
Additionally, the Cube-5 apparatus of the present invention may be used in conjunction with a nun-iber of unique algorithms adapted for enhancing the performance of and0or providing enhanced features for real-time volume processing, therefore making the overall Cube-5 system superior to existing volume rendering architectures, such as Cube-4. Some of these unique algorithms, including those for perfor8raing image waiping, three=-dimensional transf :rrriations, perspective projections, handling large volumes, high qnality rendering, clipping, depth cueing, super-sampling and aizisotropic datasets, are discussed in detail below.

In accordance with one forrri of the present invention, a method for perforrning image warping is presented which, among other advantages, speeds perspective warping and provides improved image quality. Image warping is preferably the final stage of the Cube-5 volume rendering pipeline. In simple terrns, irnage warping primarily relates to the geometric transformation between two images, namely, a source image and a target image. The geometric transfonnation defines the relationship betweeri. source pixels and target pixels. Efficiency and high quality are equally critical issues in such applications. In the apparatus of the present invention, the warp unit prefer-ably perfornis the image transformation function.
Consequently, applications employing a wunit benefit from the image warping method of the present invention.

WO 00/04505 PC1"/[JS99/16038 Distinguished by the data flow of the transformation, image warping methods are generally classified as either forward wapping or backward warping. In forward warping, the source pixels are processed in scanline order and the results are projected onto the target image. I:n backward warping, the target pixels in raster order are 5 inversely mapped to the source image and sampled accordingly. Most known prior art warping algorithms ernploy backward warping.

Compared with affine transfornnations (i.e., translation, rotation, scaling, shearing, etc.), a perspective transfornaation is considered to be more expensive and 10 challenging. For perspective projection, an expensive division is needed when calculating the s ple l cation in the baseplane image for a pixel in tlae projection plane. Conventional perspective warping is typically at least three-fold slower than parallel warping, when implemented by a CFIL7. Accordingly, some prior art approaches have decomposed the perspective transforrnation into several simpler :15 transformations requiring multiple passes. One prirnary problem inherent in multi-pass transformation algorithms, however, is that the coYnbination of two one-dimensional (1D) filtering operations is not as flexible as txue two-dimensional (2D) filtering. Furthermore, conventional multi-pass approaches introduce additional filtering operations which degrade image quality.

:2n The present invention preferably employs a unique single-pass forward warping method which can be iinplemented with substantially the s e efficiency as affine transformations. Costly divisions, wl:iich were traditionally pe:rfonned for every pixel, are reduced to only twice per scanline according to the present invention. Thus, 25 by reducing the number of division operations, the present invention provides an alternative perspective warping method which is superior to known prior art methods, at least, for example, ir- terrns of speed and the efficient hardware implementation. A

preferred method for perspective warping, in accordance with the present invention, will now be discussed.

Preferably, the present invention uses a scanline approach to perforin perspective warping. Rather than scanning in normal raster scanline order, however, w 00/04505 PCT/Y.JS99/16038 the algorithm of the present invention is processed in a special scanline direction in the source image. As illustrated in Figures 7 and 8, this special scanline direction 92 (Figure 8) preferably has the property that parallel scanlines 84 in the source image 80 appear as parallel scanlines 86 in the target image 82, and that equi-distant sarnple points 88 along a source scanline 84 remain as equi-distant sample points 90 in the target scanline 86. Soxne advantages of this unique approach include a reduced complexity of perspective-correct image wcxping; (i.e., by eliminating the division per pixel and replacing it -ivith two divisions per scanline), accurate antialiasing by incorporating anisotropic filtering, correction of flaws in Gouraud shading caused by bilinear interpolation and optimization of the memory bandwidth by reading each source pixel exactly o,ace.

The intuition of the special scanline direction is derived from projection geometry, as shown in Figure 8. Referring; to Figure 8, the source image 80 is preferably placed on a three-dimensional (3D) surface and the target image 82 is placed on a screen. As in typical texture mapping, to obtain the pixel on screen, a sight ray (or rays) 94 is cast from a viewpoint (or eye point) 96 to 3D space and intersected with the screen 82 and 3D surface 80. The intersectiorl points are the sample points 98. When the scan direction 92 in screen space is parallel to the 3I?

planar surface, the scanlines in both images are parallel to each other, and equi-distant sample points 98 along the scanline remain equi-distant in the 3D surface plane. This parallel-preserving (PP) scanline direction exists and is unique for a given perspective transfortnation. It is to be appreciated that for parallel projections, any scan direction preserves this parallelism on both images, and thus a raster scanline direction may be preferably used due to its simplicity.

Referring ag;ain to Figure 7, parallel-preserving (PP) scanlines 84 and 86 are shown in both the source 80 and target 82 images respectively. Once the parallelisrn property is achieved, pixel access becomes regular, and spatial coherency can be ?>0 utilized in both images. Additionally, the PP scanline enables the application of a pure incremental algorithm without division to each scanline for calculating the projection of source sa,niples 88. One division is still needed, however, for the two endpoints of every scanline due to the non-linear projection.

With continued reference to Figure 7, as the source image 80 is scanned in the PP scanline direction rather than the raster direction, sample points 90 on the target scanline 86 may not necessarily coincide with the target pixels 91. However, the sample points 90 can be aligned on the x grid lines 89 of the target image 82, thus the sample points 90 are only off they grid lines 87 (they are equi-distant along the scanline). For a more vfficient but lower quality itnplemeratation, placing the sample value in the nearest-neighbor target pixel is a reasonable approximation, as a half pixel is the maximum error. However, when higher quality is preferred, the present invention may perforna pixel correction and effective aratia.iiasing, to be described herein below.

In general, a reduction in the number of divisions from (nZ) to 0(n) is obtained by the algorithm of the present invention (where n is the linear resolution).
For the present algorithm, preferably oitly two additions are needed to calculate each sample point, while conventional raster scanline algorithms generally require three additions, one division and two multiplications per pixel. A preferred method for perforrning forward image warping, in accordance with the present invention, is described in detail herein below.

The forward warping algorithm of the present invention is preferably perforrned in two steLges: (1) calculating the special parallel-preserving (PP) sc e direction, and (2) fo,,-ward mapping the source image to the target image along the special PP scanlines, incrementally within each scaxilirne.

As discussed briefly above, the parallel-presenring (PP) scanline is the intersection line between the tYiree-dimensional (3D) planar surface and the screen (i.e., target image). However, in a two-dimensional (2D) problem, the PP
scanline must be calculated based on a 2D matrix. Generally, a perspective transforxnation can be presented as WO 00/04505 PC''/t3S99/16038 u x 6, d.. g x - -v=M y= b e h y 1 z cf 1 1 where (u, v) is the coordinate of the source pixel, (x, y) is the coordinate of the target pixel, and M is the perspective transfcrrrnation matrix. The (u, v) coordinate can be expressed in terms f (x, y) as ax + dy --~
(u,v) -F(x9y) -c bx+ey+h where C=- 1 (cx +.~y + 1) A line in the target image can be expressed as y = kx + B, where slope k denotes a line direction and B denotes a line intercept. To calculate slope k for the PP
scanline, two parallel, lines are preferably defined having identical slope k and intercepts B of 0 and 1, represented by point pairs of (0, 0), (1, k) and (0, 1), (1, k + 1), 'LC respectively. The coordinates of these points in the source image are then calculated.
Since perspective transforrn.ati n preserves straight lines, these two lines will remain as straight lines in the source image and their slopes can be calculated from two point pairs. Assuming that the slopes of the two rnapped lines are essentially equal, an equation in k is pref-,rably obtained. Solving this equation for k results in c k=--f The corresponding slope k in the source image is t;hen k/= bf-ec af - dc PC'T/tJS99/9603$
~' 00/04505 As can be noted frorn tYie above equation, when k' _-c , the denominator of the - f homogenous coordinates becomes a constant value of Bf -~ 1, where B is the intercept in y=kx+B.

The second stage of the preferred forward warping method of the present invention involves scai.iline processing and is illustrated in Figures 9 and 10 by way of-ex ple. Referring now to Figure 9, the preferred algoriT sweeps the scanlines (e.g., scanlines SI -.S4) through the source -image 80. As discussed above, the scanlines 84 have tbe slope k'. '.rlo.e samples 88 along each scanline 84 are preferably incrementally calculated. First, for each scanline 84, the projection of the endpoints from the target image onto the source image is calculated. Then, based on the number of sample points on tYLe scanline, increments are calculated in both the x and the y directions.

Considering a, traditional bilinear iriterpolation of sarnples in the source image, every sample essentially requires the contribution of foLir surrounding source pixels.
If pixels are read every time for every sainple, each pixel ought to be read four times.
This leads to a memory bandwidth of four times the target image size. However, since all scanlines are in parallel, samples on neighboring scanlines usually share contributing source pixels. Consequently, in accordance with the method of the present invention, pixels that have been previously read are preferably buffered so that corxirnon pixels are read from the buffer rather dun from the source image itself With referer.Lce to Figure 7, pixels are preferably read in a fixed patterrz, called the pixel read template 100, calculated based on the Breserham line algorithm (as appreciated by those skilled in the art). The binary digits shown at the bottom of Figure 7 represent one way of encoding the read ternplate 100. The present invention, however, contemplates other encoding schemes, as appreciated by those skilled in the art. As illustrated in Figure 7, this code indicates the increase in the positive v :30 direction; a"0" represents no increase and a"189 denotes an increase by one unit, while u is always increased by one unit. For the example of Figure 7, the u axis may preferably be referred to as the primary processing axis. It is preferred that the W 00/04505 PCT/[JS99/16038 template 100 always start from the left-rraost '~ixel and moves in the vertical direction (i.e., increasing v direction) so that all pixels are read and placed into the buffer for subsequent use in the sampling. It can be seen frorn Figure 7 that in order to provide pixels for sarnpling on any scanline between the two dotted lines, four pixel templates 5 are preferably required, even though for a specific scarnline, only three pixel templates might seem sufficient (e.g., only templates 2, 3 and 4 are $necessary to process the current scanline aS2). Therefore, the buffer size is preferably four scanlines.

Referring now to Figure 10A, there is illustrated the addressing of sarnnples in 10 the buffer. Whenever the template code value is 1, the sample decreases by one unit in the v direction. The thick zigzag line 104 represents the output scanline in the buffer. When the sarnnple falls within the shaded region 106, in which the pixels in the buffer are sheared, care should be taken to read the correct pixels for sampling.

Figure 10B illustrates a preferred procedure for bilinearly interpolating one of the 15 samples, s, in this region.

The contents of the buffer are preferably updated based on the scanline position. For ex ple, referring to Figure 9, templates 1, 2, 3 and 4 are preferably in the buffer when processing scanline S1. For scanline 5'2, the buffer preferably 20 remains the s e. For scanline S3, template 5 is preferably read into the buffer and template 1 is discarded. For scanline S4, template 6 preferably replaces template 2, and so on.

As mentioned above, one of the features of the unique forward image warping 2G s method of the presern.t invention is the correction of flaws in Gouraud shading.
Gouraud shading is a popular intensity interpolation algorithm used to shade the surfaces of geometric objects. Given color only at the vertices, Gouraud shading bilinearly interpolates the intensities for -the entire rasterization of ageornetry in a raster scanline orde:r. The flaws of the Gouraud shading approach are known in the art 30 and have been the subject of such articles as, for example, DiLyital Imave Warging, by G. Wolberg, IEEE Computer Society Press, 1990.

wo 00/04505 PC'I'/tJS99l16038 One of the prob'leyns associated with the Gouraud approach is that diagonal lines (as an example) ai-e not linearly mapped for perspective projections.
When a diagonal line is perspectively projected onto a screen in 3D screen space, Gouraud shading converts this diagonal line into a curve, which violates the property of preserving lines in perspective transfornnation.

The irnage warping method of the piresent invention corrects the perspective distortion in Gouraud shading. The perspective distorlion is present because the linear interpolation along a raster in screen space is generally non-linear when transfonned into geometrical coordinates. Using the special scan direction of the present invention, however, li:riearity is preserved by the mapping. Thus, interpolation is linear in both image arYd geometrical space, thereby fixing the distortion of Gouraud sha g. It is to be appreciated that interpolation along the edges is still non-linear, and therefore the scan.line endpoints must be transformed into geometrical space for correct interpolation.

The forward rnapping algorithm of'the present invention, with nearest-neighbor approximation, preferably generates a target irnage that is essentially indistinguishable froin an image generated using traditional methods. However, when a higher image quality is desired, the method of the present invention can preferably calculate the pixel value at exact grid points. A simple target pixel correction scheme may preferably be introduced to perforrn this correction.

With reference now to Figure 11, assuming the sample points 90 in the target 2:5 image 82 are alflgned on integer x coordnlates, in order to obtain the pixel value at the exact pixel grid locations 91, a linear interpolation of the two samples immediately above and below each pixel is preferably performed. Perfo ing this linear interpolation simply as a second pass may increase the cost, since the samples must be read over again. Instead, as each sample is generated, a preferred method of the present invention spreads the contribution of each saanple to the corresponding upper and lower pixels with no intermediate buffering.

WO 00f04505 pC'I'/IJS99/16038 As illustrated by the example of Figure 11, samples 112 located on the thicker inclined scanline 108 contribute to the shaded pixels neighboring them (lighter shading above the scan:line, darker shading below the scarilirze). The arrows indicate that each s ple 112 pi: eferably contributes to two pixels. It is preferred that a pixel not be written out until both contributions are collected. Thus, a ue scanline buffer is preferably included for storing the intertnediate pixel values.

To write out pixels correctly and efficiently, a pixel write pattem, called a pixel write template 11; 0, is preferably pre-calculated. Unlike the pixel read template (e.g., reference number 100 in Figure 9), the pixel write template 110 is preferably calculated by trancatirig the y coordinate vdue of samples along a scanline.
The template 110 is preferably encoded as a series of integer y steps and fractional distances dy from the true scanline 86. The weights used for the final linear interpolation are dy aiid 1 - dy for the upper and lower pixels, respectively.
Since all scanlines are preferably one unit apart in the vertical direction (i.e., y direction), the template is calculated only once per projection.

The forward image warping method of the present iuventioii can further improve on image quality by antialiasing. Using the parallel-preserving (PP) scanline, a higher quality, less expensive method of antialiasing may be achieved.

Referring again to Figure 7, the sai:nple points on the upper scaailines of the source image are sparser than on the lower scanlines, resulting in a transition from under-sasnpling to normal sarnpling. Thus, an appropr~iate resampling filter may preferably be used t+) avoid aliasing on the upper scanlines. Isotropic filtering results in clearly incorrect and blurry images. The need for anisotropic filters has been addressed in such articles as Soey f Texture Marr~in , by P. S. Heckbert, IEEE
Computer Graphics- and Applications, 6(11):56-67, November 1986, and more recently in Texram; Smart r/Iemcrv f r ']Cexturiug by .A. Schilling, et al., IEEE
C mputer Graphics and Applications, 16(3)>32-41, May 1996.

WO 00/04505 PC II'/IJS99/16038 1t is known by those skilled in the art, that each filter is defined by its footprint and profile. Taking a target sample as a circle, its projection in the source image is its footprint. As illustrated in Figure 12, this footprint 114 should generally be neither circular (i.e., isotropic) nor square-shaped (i.e., as in mip-mapping), but conic in shape. The profile of the filter decides the weights of the contributing pixels within the footprint. Although a sine filter is optirnal, agaussiari filter is easier to implement-and is preferred becau'e of its finite footprint and good low-pass characteristics. The perspective warping algorithm of the preseiat invention offers more accuracy in calculating the anisotropic footprint, producing higher age quality at a lower cost.
Using conventional methods for calculating the anisotropic footprint, the main axes of the ellipse must be calculated for every pixel. Although approximations have been proposed, this remains an expensive computation, and no known incremental method is available. To obtain the major axes of the ellipse using these prior art methods, the Jacobian must be calculated. Using the image warping method of the present invention, however, calculation of the Jacobian may be eliminated.

Iri order to gLLin insight into a preferred method for calculating the anisotropic footprint ir, accordarice with the present irivention, the properties of the Jacobian will first be analyzed. The generalized backward mapping -from an xy target image into a uv source image was previously defin.ed above as u' lax+c~y+~
= ~ (x.Y,) - C v bx+ey+h where C=. 1 (cx +fy + 1) The Jacobian J for the generalized transformation is a non-linear fimction of x arid y, J = C 2 y(af - cd) + a - gc x(af- cd) - d +g1 25.-y(bf-ce)+b-hc x(bf-ce)-e+v In conventional antialiasing approaches, the Jacobian is used to determine the footprint of each pixel in the source image and is necessary for anisotropic filtering.
The differences between screen pixels in xy raaster space are projected into the source image by computing the directional derivatives in the [1, 0] and [0, 1]
directions.
These derivatives in source image space are called ri and r 2, and are defined as 1 y(a~'- cc~) +ca - gc .J J=C, 0 3'(bf- ce) + b - hc and 6 x(a,f-ca~) -d +g~
r ZT~ ,~x l. x(b.{- ce) - e +hf These vectors, r, and r2, define the bounding box of an ellipse that approximates the footprint 114. Typically, these vectors 116 and 118 are calculated for every pixel, when needed, for conventional methods of anisotropic filtering (e.g., elliptical weighted average (EWA), footprint assembly). This requires one more division per pixel for,malctalating C. In accordance with the present invention, a more accurate method for determining the footprint is presented, as described herein below.
Because the Jacobian is a linear approximation of the non-linear mapping, it is more accurate, and therefore preferable, to compute the footprint by taking the distances to neighbos-ing samples in source image space. Since the projections of neighboring samples are already computed, this method of the present invention requires no additional division.

The parallel-preserving (PP) scan direction provides for greater coherency and no division to compute the Jacobian. Forr each pixel iri the PP scanning order, the footprint is preferably defined by r,' a.nd rZ. The directional derivative r,' in direction [1, k] along the PP scanline is WO 00/04505 PCT/iJ599/16038 r t ~~- ~~r ~P [1rtl F'=.1 k I=c z bf -ce and since y =1cx YB, C is constant for every PP scanline, and thus a ,' is (Bf+ t) constant for every PP scanline. The method of f.ae present invention exploits this fact in order to preferably increment the source image coordinates along a scanline, with no divisions. The value of the directional derivative r2'in they direction [0, 1] is r r2 =V[ol]F=YZ

5 It is to be appreciated that r2' varies linearly along the scanline since it is a function of x, and thus it can be iticremented along the scanline. The special scan direction makes it possible to compute the source image coordinates and pixel footprints simply and efficiently.

10 After efficiently computing all the footprint and source pixel coordinate inforrnation, correct anisotropic filtering can be perforrned using a standard method known by those skilled in the art, such as, for example, Greene and 1-Ieclcbert's elliptical weighted average (EWA) or Shilling et al."s footprint assembly.
These conventional algorithms are described, for example, in the text Creating Raster 15 Omnirna&lmages fYOrri MuM ltinle Perspective Views fJsing th.e Elliptical Wei ngted A.vera e Filter, by N. Greene and P. S. Heckbert, IEEF. Computer Graphics and Applications, 6(6):21-27, June 1986. However, these conventional filtering approaches are not preferred since, as pointed out previously, even the elliptical footprint approximation is inaccurate. F'urtherrnore, such prior art methods result in 20 redundant sampling (i.e., accessing each source pixel multiple times). For instance, for a circular filter region with a footprint radius of 1.0 source pixel, each source pixel is sampled an aveT'age of -A times. By using the forward mapping technique of the present invention, redundant memory access can be essentially eliminated, thus lowering the memory bandwidth by a factor of n. Preferably, the present invention 25 -- provides a forward mapping technique in which all source pixels are read once in pixel read template order and subsequently splatted onto the targq~ image with a filter kernel.

As illustrated in Figure 13, each source pixel 124 has a Ax 120 and a y 122 relative to each of its nearest-neighbor target samples 126. The Ax can be preferably computed incrernentelly since all samples along a scanline are equi-distant.
The special scan direction essentially guarantees that the Ay is constant along each scanline. Although the raster grid locations deviate from the true scanline 128, the actual distances can be estimated preferably by adding a small correction which may be stored in the pixel read template 130 and is preferably unifor.rn among scanlines.
The filter kernel is preferably premcomputed once and stored in a lookup table (LUT).
Subsequently, the contribution of each source pixel 124 is preferably indexed by its & and Ay into the lookup table (LUT) for the four (or more) nearest-neighbor target samples 126. The riumber of target sam:ples 126 depends upon the footprint of the filter used, and it rnay preferably vary from four to 16 samples. Using this method, each source pixel 124 is preferably read exactly once from memory, then four (or more) times modulated by a loolcup table entry and accumulated in the target pixel. In this manner, the final pixel value is the weighted average of the nearby source pixels 124. This weighted average requires a division by the sum of the filter weights to normalize each firial pixel intensity.

In addition to image warping, vahich can be broadly defined as a geometric transformation between two images (e.g.9 a source LCnage and a target image), three-dimensional (3D) volume trarrsforrnation plays a key role in volume rendering, volume rza.odeling, and registration of inultiple voluznes. Among all affine transforn3ati-onS, rotation generally consumes the most computation time and is considered the rnost complicated. Accordingly, in providing a universal 3D
renderinl;;
architecture in accordance with the present invention, several unique methods for perfor.nling arbitrary 3D volume rotation are presented, as described in detail herein below. Although the universal 3D rendering hardware of the present invention may be used without the 3D volume rotation methods described herein, these methods, or algorithans, are preferably irnplernented in conjunction with the apparatus of the present invention to provide enhanced speed and features and are adapted to most efficiently utilize the apparatus of the present invention.

Prior to describing the unique methods for performing 3D volume rotation, it is important to first provide some basic definitions of the terms used. As appreciated by those skilled in the art, relative to the rows and columns of an image, a beam in a volume may be defined as a row of voxels along one major coordinate axis (e.g., an x-beam is a row of voxels in the x direction). A slice of a volume is a plane of voxels which is perpendicular to a major axis (e.g., an x-slice is defined as a plane perpendicular to the x axis).

Prior art methods for performing volume transformations typically utilize multiple-pass algorithms, which are usually direct extensions of the multiple-pass algorithms used for image transformations. Various methods for performing 3D
rotation have been proposed, generally involving a decomposition of the 3D
transformation into multiple two-dimensional (2D) or one-dimensional (1D) transformations. These prior art methods have been the subject of articles, including Volume Rendering, by R. A. Drebin et al., Computer Graphics (SIGGRAPH '88 Proceedings), Vol. 22, pp 65-74, August 1988, Three-Pass Affine Transformations for Volume Rendering, by P. Hanrahan, Computer Graphics (San Diego Workshop on Volume Visualization), Vol. 24, pp 71-78, November 1990 and Fast Rotation of Volume Data on Parallel Architectures, by P. Schroder and J. B. Salem, Visualization '91, pp. 50-57, 1991. However, these known 3D transformation methods typically result in a lower quality rotation and/or slower processing speed.
One of the properties which make three-dimensional (3D) rotation so difficult is that 3D rotations inherently require global communication and could cause memory contention while writing data back to the distributed memory modules. However, as shear transformation capitalizes on nearest neighbor connections, it lends itself to an extremely feasible multi-pipelined hardware implementation, as provided by the unique architecture of the present invention. The present invention further provides WO 00/04505 PC'r/1JS99/16038 novel methods for perl'orpning arbitrary 3D rotation, essentially by decomposing the 3D rotations into sequ+rnces of different types of shear transformations.

Using a conventional decomposition. approach, sir.ice a 2D rotation can be decomposed into three one-dimensional (lI)) shears, a direct extension to 3D
rotation would require nine lD shears. However, in accordance with the present invention, four preferred methods of shear decomposition of an arbitrary 3D volume rotation are presented, as described in detail herein below. These methods include a four-pass 2D
slice shear, a four-pass 2D beam shear, a three-pass bearri-slice shear and a two-pass 3D beam shear decornposition. By not introducing a scale operation, the algorithms of the present invention avoid complications in sampling, filtering and the associated image degradations.

It is to be appreciated by one skilled in the art that a 3D rotation matrix can be expressed as the conccltenation of three major axis rotations, .RX(~), Ry(.Rz(~), where dtx = 0 cos c~ sin c~
0 -siin (~ c s (~
c s 0 0 -sin 6 .R = 0 0 0 Y
sinij 0 c s0 eosa sin oc 0 IZ = -sin 0; cosa 0 z The order in which this concatenatioil is perforrned results in different 3D
rotation matrices. There are six perrnutations of the 3D rotation matrix in total. By way of illustration, the under:lying 3D rotation matrix was chosen as R3D
=Rx*Ry(H)RZ(a), _.2 Q- where WO 00r04505 PC'r/13S99J16038 co;s Cosu cos sina R 3D = si1l(~sYnOC sa-CC3s4)"sina siYl4)sin6sina+Cs3s(~Cosa siTB.~C(Ds CiBs(~ sin Cos$x +sil14) sii'@OC cCas(~siffiQsin (, -uifll(~cfls G cos(~CosO

The above 3D rotation matrix (R3D) is used for all the decompositions which follow. One of the primary differences between the unique methods of the present invention and other cor entional approaches is that in the present invention, the decomposition is applied directly to a 3D rotation matrix, rather than to multiple 2D
rotation sequences, to cibtain shear sequences. It is to be appreciated that, for any of the shear operations pe:rforrned in accordance with the present invention, barrel shifters may be used as a preferred hardware implementation, although other means, such as logarithmic shifters or the like, are similarly contemplated.

As shown in Fi gBare 14, a method for perforrrzing two-dimensional (2D) slice shear rotation, in accordance with one exnbo ' ent of the present invention, preferably involves a decomposition of the ;3D rotation into a sequence of 2D
slice shears. in a 2D slice shear, avolurrie slice (i.e., a plane of voxels along a major projection axis and parallel to any two axes) is merely shifted wit its plane. A slice may be arbitrarily taken along any major projection axis. For example, Figure illustrates a y-slice shear. A 2D y-slice shear is preferably expressed as:

x=x +w y z=z +b y A 2D y-slice shear may preferably be written as S(xz, y, (a, b)), interpreted as a shear along the y axis by an amount a 132 in the x-direction and an amount b 134 in the z-direction. Although both a and b are preferably constants, it is further contemplated that a asid b can represent functions as well. A 2D x-slice shear, S(yz, x, (c, d)), and a 2D z-slice shear, Sfty, z, (e, fi), are similarly defined. With reference to W 00/04505 PC'1'/iJS99/16038 Figure 14, the volume represented by the solid lines 136 is the shear result of the volume defined by the dotted lines 138.

Intuitively, consecutive shears along the same axis produce a conforming 5 shear. For example:

5(xz,.Y9 (a9 b)) S(xz,y, (a', b')) = a 1 b a' 1 b' = a+a' 1 b+b' In order to build the general 3D matrix from 2D shear matrices, shear products may be restricted to products of different shears: S(yz, -, (c, d)), S(xz, y, (a, b)) and S(xy, z, (e, fl). However, the product matrix of these tkiree shear matrices will still not 10 be in the general form due to a constant 1 in. the present naatrix.
Accordingly, another shear matrix is preferably concatenated, where this final shear is the same slice shear as the first one. This results in the following six perrn'xtations of shear sequences:

S(xz,y, (a, b)) S(xy, z9 (e,j)) S(3'z, x, (c, d)) ,5(xz,Y9 (g9 h)) S(xz, y, (a4 b)) 'S(yz9x9 (c, d)) S(xy9z, (e9j)) S(xz,3', (g, h)) 17 (xJ 9 z, (e9j)) 67 (.:6z9,J 9 (a, b)) Ig{y~yz9 x, (~ . 9 Ld)) S(xy, z, (i 7d )) "'' (x/. ,'7i y (e9j)) SCyz, x, (c7 d)) S(xz9 J 9 (a9 ' )) - S(xY, z9 (=

9/ )/
S(yz, x, (c, d)) S(xz,y, (a, b)) Ay(xJ %z9 \e%.1 i) S(yz, x, (m, F3)) S(yz, x, (c, ~)), S(xy, z, (e'9j)) S(xz,y, (a, b)) ' ' (yz,sg'9 (m, n)) For each of the; shear sequences, the product matrix of the consecutive shear matrices is preferably computed and set equal to the underlying 3D irotation matrix.
15 For example, for the first shear sequence given above (i.e., S(xz, y, (a, b)) ,S(xy, z, (e, j)) S(yz, x, (c, d)) S(xz, y, (g, h))) :

WO 00,04505 PC"T/[.JS99/16038 R 3D = R,x(4 )R y( ) RZ(U) _ S(xz,y, (a, .. )'), .'" \xi , z, ( e9J 1 )e L9('z9'.~D(c,."").) - LV(.az9.Y9( 9h)/

The above rnatrix equation implies nine trigonometric equations with eight variables, namely, a, b, c, d, e, f, g and h. In solving these nine equations for the eight variables, a - h, the fol:lowilig results are obtained:

a sin sina(cosO -cos(~) + sint~(cosoc -cosO) (cosO)2sincXsin4) cos~cosa - b b_-sinc~cosO
c = c s sincx d_ -siI2~ s1P1O Cosa ~ Cos(~ silla -cas6 sflffia slni~6;osO
sin~ Cos + eos(~ siYl. sitl G - sgn(~ cosa e cosO sina f_ -sirl(pC s _ coso Cosa - I
9 C s0sina cosec6)sa - 1 h _ coSO sinCt' In a similar rnarner, the shear matrices for the remaining five slice shear sequences given abovs, may be obtained. In fact, the slice shear sequence with the solution given above has the simplest expression and is preferably ternned the dominant sequence.

Referring now to Figure 15, a method for perforrrging three-dimensional (3D) rotation by a two-dimensional (2D) beam shear decomposition will now be described.
First, a beam shear may be defined as a beam that is merely shifted in its major WO 00/04505 PC'r/tJ399/16038 direction without any change of the other tvro coordinates. For example, a 2D
x-beain shear is preferably expressed as:

x = x +a ,y +b z A 2D x-beam shear may preferably be written as S(x, yz, (c, d)), interpreted as a shear along the x axis by an amount a in the x-direction and an amount b in the z-direction. A 2D .y-beaa.n shear, S(y, xz, (a, b)), and a 2D z-beam shear, S(z, xy, (e, )9), are similarly defined. Figure 15 illustrates an x-beam shear, whereira the volume represented by the dotted lines 146 is sheared to the volurrrae position represented by the solid lines 144.

A two-dimensional (2D) beam shear is advantageous over a 2D slice shear, and is therefore prefen ed, since a beam is shifted without changing the other two coordinates. Thus, the; resampling for each pass of the 2D beam shear approach is simpler, as only a linear interpolation is required. In contrast, a 2D slice shear approach requires a bi:liriear interpolation which is more complex.

Similar to the 2D slice shear decomposition, ir order to build the general 3D
matrix from 2D beam shear matrix decompositions, shear products may be restricted to products of different shears: S(x, yz, (c, 69), S(y, xz, (a, b)), S(z, xy, (e.j)). However, the product matrix of these three shear matrices will still not be in the general forn due to a constant 1 in the matrix. A.ccordinLgly, as in the slice shear method, another shear matrix is preferably concatenated, where this final shear is the same bearri shear as the first one. This results in the following six per-nu.ta.tions of shear sequences:

S(y,xz, (a, b)) "" (z,xJ ) (e9j)) S(x,yz, (c, d)) S(y,xz, (6 9' )) S(y,a.z,(afb)) S(Y9yZ, (c, d)) S(z9xy9(ej)) s(y,xz,(g9h)) 47 \zSi'J S \e9J // 9(J 9xz9 (a, b)) " (x9 J"z9 (c, G' ) L~(z9'%y S (g9J )) S(z,a.,y9 (e9j)) S(x,yz, (c, d)) S(y,xz9 (a, b)) ,3(x4xyf (i9J)) ,S(xf 1'z9 (c, d)) S(y, xz, (a, b)) S(z, xy, (e,j)) S(x,yz, (m, n)) S(x,yz, (c, d)) S(z, xy, (eej))) S(y, xz, (a, b)) S(x,yz, (m, n)) WO 00/04505 PC'fiYCiS99/36038 For each of the above shear sequences, the product matrix of the consecutive shear matrices is preferably eomputed and set equal to the underlying 3D
rotation matrix. For example, for the first shear sequence given above (i.e., S(y, xz, (a, b)) S(z, xy. (e, fl) S(x) yz, (c, d),, S(y, xz, (g, h))) R3D 1?x(~)R y (O)R~(a) _ "'(/ 9xz9(a9 b)) "'(z9xy9(e9y)) - S(X9.YZ (C9 d)) ' S(y) xz9 (g9 h)) The above rnnatrix equation implies rdne trigonometric equations with eight variables, namely, a, b, c, d, e, f, g and h. In solving these nine equations for the eight variables, a - h, the fol:lowing results are obtained:

a_sinO sina (cos(~ -c sO) + sin4) (cos(3 -cosa) sin~sin. G (C sO)2 c sc~c sEl - 1 b = sanc~cos c = -CG95 sfl84qx CosO silYa 1- sYn~siriO C sy - c17s(~94P.na d sin~cos0 Cos(~ sin() sinOG - sin4) Cos(x + sin42COSO
C s sina f= ~IIIn~cosO
c seeosa - 1 CosO sfllla h = sfln~sinO (cos - cosa) + sanoc(c J4) - cosO) sin~ sina (CosO)z In a similar manner, the shear matrices for the rer,raainirag five beam shear sequences given above may be obtained. The beaYn shear sequence with the solution given above is preferably termed the dominant sequence.

WO 00/04505 PC'r/iJS99/1603$

With reference iicsw to Figure 16, a method for performing three-dimensional (3I3) rotation by two-dimensional (2D) beam-slice shear decomposition in accordance with the present invention will be described. A 2D beam-slice shear may preferabfly be defined as a bearn that is shifted within a plane. For example, a 2D x-beam-y-slice shear is preferably expressed as:

x=x +a y +g z z = z + b'Y

A 2D x-bearn y..slice shear may preferably be written as S'((x, yz, (a, g)), (z, y, b)), interpreted as a shear along the x axis by an amount a in the y-direction and an amount g in the z-directien, combined with a shear along the z axis by an amount b in the y-directi n, where a, g and b are preferably constants. In essence, a beam-slice shear is a combination of a beam shear and a slice shear. Figure 16 illustrates an x-bearn y-slice shear, S((e, yz, (a, g)), (z, y, b)), wherein the volume represented by the dotted lines 156 is shecred to the volume position represented by the solid lines 154.

To build the ge:aeral 3D matrix from. a 2D shear matrix decomposition, shear products may be restricted to products of different shears: y-beam-x-slice shear S((y, xz, (c, h))l (z, /4s d)), x-CoeaCYBy-.Jldce shear U(ftJ yz, (a, g)), (z, y, b)), '~'- d y-beaYYE shear S(y, xz, (I, j)). As in the case of the slice shear and beam shear approaches, it is to be appreciated that there cire also six permutations of be -slice shear sequences.

For each of the shear sequences, the product matrix of the consecutive shear matrices is preferably computed and set eqtkal to the underlying 3D rotation matrix.
For example, for the first beam-slice shear sequence given above (i.e., S((y, xz, (c, h)), (z, x, d)) S((x, / z9 (a, CJ/)- (z, / 9 b)) S(y, xz, ( J fift R 3D = Rx(cp)R y(f))Rz(sx) =

S((y, xz, (c, h)), (z, x, d)) S(x,yz, (a, g)), (z,y, )/
A1~ty,xT.9~ffifJ'!~

WO 00/04505 PCT'/Lls99/16035 The above nnatrix equation ianplies nine trigonometric equations with eight variables, namely, a, b, c, d, f g, h and I. In solving these nine equations for the eight variables, the following results are obtained:, a = siYA(Dsin Cosa - CosC~'9siI1a b = :,in~cas6 c = ~~in~i(cos - eosa) + sin6sia~a (cos(~ - cosO) -Sg11~(C()SO)2 siffiot d _ sill~CosQ; -- cos4~siYbO sina - si81(9Gos c1Ds S1Lfl3CG

.dneJ)sgn (cos - cos6x) + sina(cosa~ - cos0) P =
sbnca(cOsO)2 sina CosQ sin G + siriq9 Si21E) CoSa - CosB~ siYlce 9 _ s4n(~CosO
_.Cos(~Co" - I
~
sil3(~cosO
CGO Cosa - 1 z =
cos0si83a It is to be appreciated that the shear matrices for the remaining five shear 5 sequences may be obtained in a similar marnier.

Figure 17 illus9xates a fourth method for performing an arbitrary three-dimensional (3D) rotation using 3D beam shear decompositions, according to the present invention. By further examination of the product matrix of the consecutive 10 shear matrices used in the beam-slice shear decomposition inethod described above (i.e., S((y, xz, (c, h)), (z, x, d)) ' S(ft= yz, (a, g)), (z,1'1 b)) S(y, xz, (1,9)), the first pair and the last pair of 2D shears can be merged since there is a common beam in each pair. For example, x beam is a common beam of the y-slice and z-slice shears of the first pair. Therefore, the number of shears can be reduced to two by introducing a 15 new definition of a 3D beam shear.

WO 00/04505 PCT/tJS99/16038 Figure 17 illustr ates a 3D x-beam shear, which is equal to the concatenation of two consecutive 2D slice shears S'(xz, y, (a, b)) S'(xy, z, (e,)9). It is to be appreciated that there are two other 3D beam shears, narnely, a 3D z-bearn shear, represented as S(yz, x, (c, d)) S(xz, y, (a, b)), and a 3D y-beam shear, represerflted as S(yz, x, (c, d)) S(xy, z, (e, fi). Every 3 beam shear preferably involves only one major bearn. With reference to Figure 17, the marked x beam 158 (dark shaded beam) is preferably translated to a new 3D location following the arrows. f'he lighter shaded bearn 158"
indicates the intermediate position if the shear decomposition is interpreted as two consecutive 2D slice shea.rs.
The three 3D beam shears may preferably be denoted as SI~3 p,~'I~3~, and x y SH3D .Now, using thÃs method fthe present invention described herein, an arbitrary II
3D rotation can be decomposed into only tmrcs consecutive 3D beam shears. The dominant decomposition sequence may be obtained directly from the 2D slice shear sequence as:

R3D = SH3Dx" SHMa where SH3D - S(xz, y, (a, b)) * S(xy, z, (ej)) x _ a +be 1 +b,f b e f 1 6.Dd13DII 9(y1,x,(c, ")) - S(xzyJ9(g7 h)) 1 +cg c d+ch 9 1 h Using the 3D beam shear decomposition approach of the present invention described herein, an aibitrary 3D rotation preferably involves only two major beam WO 00/04505 PC"r/t3s99/16038 transformations, whereas conventional decomposition approaches require three (e.g., Hanrahan's decomposition). In accordance with the 3D beam shear method of the present invention, the first pass involves orily x beams and the second pass involves only z beams. By the end of the iEirst shear pass, all voxels of a bearn preferably have the same offsets. As there are N' beams for an N' volurrre, there are only NZ
different offset values. Accordingly, the offset values for N' beams can be stored at the end of the first pass, while storing the voxels to their nearest neighbor integral positions.

When gnnltiple pass algorithms are used, the resampling techniques chosen are key to achieving high quality. Intuitively, res pling is necessary for each pass because a continuous shear transformation may move voxels off the grid points.
One problem inherent with. multiple resampling, however, is -the quick degradation of the volume quality if consecutive rotations are applied to a volume. It is therefore desirable to sample the volume only once.
Accordingly, a preferred method of'the present invention achieves one pass resampling of a volunie. In essence, the method of the present invention involves precomputing a sampled volume and then fiising only zero-order (i.e., nearest neighbor) interpolation in each shear pass, thereby distinguishing frorn known prior art methods which require global communication (e.g., NIVittenbrink and Somani's permutation warping).

Given an original vol e(soia.rce volume) and the desired rotated volume (target volume), the rxiethod of the present invention preferably first builds up a one-to-one correspondenoe between a source voxel and a target voxel. This one-to-one mapping is guaranteesi by the multi-pass shear decomposition of the present invention because each shear is a one-to-one transformation using zero-order interpolation. The concatenation of a seqnence of one-to-one mapping remains one-to-one. Once this one-to-one correspondence is built up, the method of the present invention preferably calculates for each so arce voxel its corresponding target voxel and stores it in the source voxel position, During this procediire, no global communication is required;
the resampling is perforrned by interpolation on the local voxels. The sampling WO 00/04505 PC'r/iJS99116038 position of each target -voxel is preferably computed using a backward transformation of rotation.

After obtaining the values for all target voxels, the method of the present invention preferably sb.uffles therni to their destiiiations in target volume.
Intuitively, this would involve global communication. However, global co uiiicatiorb is expensive to perforrra ~:)r parallel irrYplemeroi:ation. Therefore, the method according to present invention preferably uses multiple shears with a nearest neighbor placement scheme to achieve this voxel shuffling. Since shear is a regular, non-conflict trarlsformation, each pass can be perforrned more efficiently than if global communication was utilized. Using the 3D beam shear decomposition method of the present invention desc7-ibed herein, only antiraimurn of two passes of regular local co unication are necessary to achieve virtually the same effect as global communication.

It is to be appreciated that care should be taken to avoid the overlapping of beams in 3I? beam shear. Consider, for exainple, the 3D x beam shear equation given above. raile each x bearn is preserved (i.e., an x bea.ra remains rigid after a 3D x bearn shear), several x beams may overlap with each other. To maintain the required one-to-one mapping, recall that a 3D beam shear is the concatenation of two 2D
slice shears, as discussed above. A 2D slice shear maintains one-to-one mapping when using zero-order interpolation. Therefore, as a solution, the method of the present invention preferably calculates the destination coordinates using the same order as that of two consecutive 2D slice shears, but communication is preferably only perfo ed once. For a 3D x beam shear, while the x coordinate is calculated directly using the 3D shear rna.trix (described above;), the.y and z coordinates of each beam are preferably calculated as z rounci(z + b Y) yrouaad(Y +f z) WO 00/04505 PCT/iJS99/16038 where round(w) is a function of rounding w to the nearest integer. Coordinates (y', z') deterrnine the integral coordinates of the whole beam for the nearest neighbor storage.
In this manner, no overlap occurs.

In accordance with another form of the present invention, several unique methods for performing enhanced volume processing will be discussed in detail -herein below.

Perspective projections present inherent challenges, particularly when perforrning ray casting. For parallel projections, sight rays that are cast through a volume dataset maintain a constant s pling rate on the underlying volume data.
It is straightforward to set this sampling rate to create an output image of the required quality. For perspective projections, however, the rays do not maintain such a continuous and uniforrn sampling rate. Instead, the rays diverge as they traverse the volume from front to back. This creates an uneven sampling of the underlying volume, as shown in Figures 18A and 1$B.

Referring now to Figures 18A and 18B, conventional ray casting algorithms generally handle ray divergence from perspective projections by one of two methods.
The first method is undersampling (Figure 18A), in which rays 160 are cast, frorr:.~ a predefined viewpoint 162, so that the sampling rate at the front of the volume 164 is appropriate for the de:sired irnage quality. However, because of the perspective ray divergence, the n.nderlying volume dataset is undersampled. This may result in severe aliasing by creating "holes" in the rear of the volume 166 where regions of voxels .

remain unsampled. The second method is oversampling (Figure 1813), in which rays 160 are cast from a predefined viewpoint 162 so that the sampling rate at the rear of the volume dataset 166 is appropriate for the desired image quality. This approach avoids the aliasing of the first method; hoNvever, the volume may be radically oversampled in the front 164. The inefficient oversampling in the front of the volume 164 dramatically increases the runtime of this method. The rays 160 can be cast with a sampling rate between undersampling and oversampling. This results in a tradeoff between the image qtiality of oversampling and the rendering speed of undersampling.

WO 00/04505 PC"r/[JS99/16038 Many prior imaging architectures do not even attempt to perfonn perspective projections. Other architectures have dealt with perspective projections by casting diverging sight rays from a predefined viewpoint, which produce images with temporal aliasing and either do not achieve true real-time frame rates (i.e., 301-Iertz) 5 or are much more complex than the slice-order method of the present invention.

A ray-splitting method applies the concept of adaptive super-sampling in order to maintain a unifonn :ray density. In this approach, a ray is split into two child rays when neighboring rays diverge beyond somie predeterrniried threshold.
Recently, a 10 method was proposed which divides the viewing frustum into regions based on distance, from the viewpoint, such that the ray density in each region is near the underlying volume resolntion. Afterwards, such method projects each region onto sub-images and composites them into the frame buffer using texture mapping hardware. In effect, the technique casts coritinuous rays through a region, then at 15 specified boundaries, splits them into a nevEr set of continuous rays.
T'his, however, creates a potential undesired discontinuity between regions.

A method for performing perspective projections of izniforrnn regular datasets, termed ER-Per spective (exponential regions perspective), in accordance with one 20 fonn of the present inirention, preferably adaptively sarnples the underlying volume, whereby the above-described problems, inherent in conventional volume rendering systems and methods, are essentially eliminated. The ER-Perspective algorithm combines the desirable properties of both undersampling and oversain.pling, providing extremely good anti-a[iasing properties associated with oversampling methods, while 25 providing runtimes on. the order of undersampling methods. Furthermore, this algorithm preferably creates at least one sample for every visible voxel in the volume dataset. ER-Perspective gains a runtime aclvantage over previous work by utilizing slice-order voxel access, while maintaining equal or better image quality in comparison to known perspective projection methods.

Figure 19 is a 2D top view illustrat:ion of the ER-Perspective algorithm, in accordance with the present invention. As shown in Figure 19, the ER-Perspective WO 00f04505 PC'r/LJS99116038 algorithm preferably works by dividing a view frustum 168 into a plixrality of regions based on exponentially increasing distances along a major projection axis (e.g., z-axis) from a predefined viewpoint 172. Preferably, continuous sight rays 174 are cast from the viewpoint 172 froni back-to-front (or front-to-back) through the volume dataset and the rays 174 are merged (or split) once they become too close (or too far) from each other. Since the operation of the ER-Perspective algorithm is similar for baclC-to-front compared with front-to-back ray casting, the remaining discussion of the ER-Perspective algori ,will be limited to the imore intuitive case ofback-to-fi ont ray casting with merging. The differences are pointed out where they are significant.

The ER-Perspective algorithm preferably uses region boundaries 170, which define the exponential regions, to mark the locations where the sight rays 174 are preferably merged. B,r defining the regions and merging all rays 174 at the boundaries 170, the algorithm provides a regular pattern of ray merging that is dependent on the global geometry rather than local neighborhood conditions.
Figure 20A more clearly illustrates the merging of sight rays at region boundaries 170 for contribution to baseplauie pixel B, in particiclar. With reference to Figure 20A, an odd number of rays 174 are preferably merged such that the resulting ray 174 is essentially an exact continuation of the previous center ray, thus eliminating potential discontinuities present at the region boundaries 170. This is one important advantage of the method of the present invention over known prior approaches. F errnore, this algorithm can be (lualif:ied by characterizing the filterfing achieved when adaptively sampling the volume.

An example of:' a preferred filtering scheme is shown in Figure 20B. Refe ' g to Figure 20B, aBartl(dtt window (i.e., linear interpolation, triangle filter) is preferably employed. Cascading efficient local Bartlett windows at each region boundary 170 is essentially the equivalent of resampling the rays 174 with a single large Bartlett filter for each baseplane ph~:el (see Figure 20A). A graphical representation of the preferred filter weights 175 is shovvn for contributiors to the baseplane pixels (e.g., pixels A, B, C).

The base sampling rate of the algorithm can be set to a predefined value according to a desired image quality. The base sampling rate is the minimum ray density compared to the underlying volume resolution. Although the ER-Perspective method of the present invention supports vii:-tually any sampling rate, a sampling rate of at least one ray per voxel is preferred. The algorithm has the advantage of keeping the ray density between one to two tirnes the base sarnpling rate. This guarantees that-no voxels are missed iyn the rear of the volurne dataset and places an upper bound on the total amount of wo:rk perfonned at two times (2x) supersampling.

Since the present invention utilizes slice-order processing, the volume dataset is projected onto a baseplane of the volume which is most perpendicular to the view direction. The baseplane image is then warped onto the final image plane in a conventional manner (e.g., in the same manner as in shear-warp or the prior Cube-4 architecture).
The ER-Perspective method of the present invention is ideally suited for implementation on the Cube-5 architecture described above. Specifically, this algorithm preferably only requires nearest rteighbor comriiunication between processing elements. While processing a row of voxels on a one-dimensional array of processing elements, the algorithm only requires processing elements to communicate with their immediate left and right neighbors. The Caibe-5 rendering pipelines similarly support nearest neighbor communication.

The ER-Perspective algorithm of the present invention preferably employs slice-order processing along one of the three major axes. Consequently, the regions in the ER-perspective algorithm are defined as slabs of slices along a ynajor projection axis. In a preferred ernbodirnent of the EIZ=perspective method according to the present invention, the volume dataset is projected along slices perpendicular to the z-axis. So as not to lirnit the methods of the present invention to projections along the z-axis only, it is to be appreciated that the coordinate sys-tern may be flipped and the geometry rotated. The algorithm proceeds, as illustrated in Figure 7, by measuring the distance along the z-axis, from the viewpoint 96 to the front of the volume dataset 80, WO 00/04505 PC"I'/tJS99/16038 is determined (eZ.). Subsequently, a first region 92 is created to consist of as many z-slices as this distance. Each successive region after the first region 92 is preferably twice as deep as the one before it.

When combined with high quality supersaynpling, the first region is exactly as large as needed to have one ray per voxel at the end of the region when shooting one ray per pixel of the -fin'Ll image. Thus, supersampling higher than 2x might be needed in the front of the volugne to render high quality close up views.

As illustrated irL the example of Figu.re 19, if the viewpoint 172 is three voxel units from the front of the volume (i.e., the ?= 3 region boundary), for example, then the first region 176 is preferably three voxel units thick, the next region is six voxel units thick, and so on. In general, the I-th region is preferably eZ 2' slices thick, where eZ is the distance; from the viewpoint 1.72 to the froilt of the volume (see Figure 19). Forcing the regioais to be thus defined produces the desired effect that any two perspective rays 174 cast through any of the regions are twice as far apart at the rear boundary (i.e., the z = 24 boundary) as they are at the fTolit boundary (i.e., the z= 3 boundary). This is shown in Figure 19 as the distance between the two rays 174 grows from one unit to two units across the first region 176, then to four units, and finally to eight units at the rear of the last region. Additionally, since the region boundaries 170 are dependent on the global geometry, the efficiency of the ray casting algorithm is maximized by providing amechanisrn for keeping the ray density between one and two times the underlying volume resolution in each dflYTleR3slEDTI. It also creates a regular topology so that the filltering of the data can be controlled as perspective rays are cast.

Having regions with boundaries at exponential distances produces a ray density twice as high at the front as at the back of the region. Therefore, a rnecharlism must preferably be provided to adjust the ray density when crossing a region bounradary. Since each ray preferably starts on a voxel coordinate at the rear of a region, at the front of -the region every secoild ray in each dimension will preferably coincide directly with a voxel coordinate. 1: he remaining rays preferably intersect the W 00/04505 PC"t'/ils99/16038 region boundary halfway between two voxel positions. To down-sarnple the ray density with this detem-inistic ray pattern, a two-dimensional (2I)) Bartlett filter (also known as tent or triangle filter) is preferably employed, with an extent of 1 voxel unit in each dimension. Because the ray density at the front of each region is twice the voxel density, this ? X3 voxel neighborhood is intersected by 5x5 rays.
Referring now to Figure 21, since the edges 178 each have a weight of zero, only the 3x neighboring rays 180 are used for applying the filter to down-sample the ray density.
This effectively merges neighboring rays. .A, Bartlett filter is preferred over a simple box filter for the added quality it produces in the final image. For the case of front-to-back processing, rays are split instead of merged. Here a bilinear interpolation of the rays is perforrned to generate the new rays which begin between the other rays. It should be mentioned that a Bartlett filter of size 1 is the inverse of a bilinear interpolation operation.

Figure 22 shows a 2D example of how sight rays 186 travel through a 73 volume 192 when the viewpoint 1961s three voxel units An front of the volume (i.e., from the baseplane 192). Notice that the sampling rate remains between 7 and 14 per slice, and that it increases as the rays 186 travel through tYie regions from back to front. The number of ray density resampling stages for an M volume is limited by log2N, since that is the maximum number of regions in an M vol e; The last re-sampling step shown on the baseplane 198 is preferably perforxned when the final image warp takes place.

As illustrated in Figure 22, the rear of the volume dataset 182 does not necessarily always coincide with a region boundary 184. However, since it is preferred that the rays 186 be on exact voxel coordinates 188 at all of the region boundaries 184, the rays 186 preferably originate on the grid coordinates 190 at the rear of the last region enclosing the volume dataset 192 (shaded area).
Therefore, the voxel coordinates and the ray sample locations 194 may riot be congruent at the rear of the volume 182. TYiis not only provides the mentioned boundary conditions, but aids with temporal anti-aliasing when the viewpoint 196 is moved iri smaller than voxel unit distances, because the rays 186 will continue to originate from the same positions relative to the voxels.

Figure 23 depicts a preferred rnethod. for perforrning ER.-Perspective 5 back-to-front proj ectioii of a volurkie, in accordance with one forrn of the present invention, although other embodiments of the ER-Perspective method are contemplated. As described above, first, the distance fror"I the eye or viewpoint to the baseplane is preferably detergniried (in voxeli units). Using this viewpoint position, exponential region boundaries are created. lqext, enough regions are preferably 10 established to completely encompass the volume dataset. To perforrn the volume rendering, the algorithr.a loops through each region from the back to the front, computing norgnal ray casting, but in a slice-order fashion, and stores the partially computed rays in a conipositirlg buffer. Between regions (i.e., at the region boundaries), ray density re-sampling of the compositing buffer is preferably 15 prefortned, as described previously. The baseplane image is then warped onto the final image plane for di.splay.

With adaptive ray density perspective methods knovvn in the prior art, it is generally difficult to determine the filtering functior$ achieved when rays are merged 20 using irregular patterns. However, since the; ER-Perspective method of the present invention preferably u...es regular boundaries for the filtering operations and exact ray placement within the boundaries, it is easier to compute the effective filter achieved by the cascading of local Bartlett filters. This is an irnportarit adva-ntage of the ER-Perspective algorithm of the present invention. Additionally, the boundaries and 25 filter of the present invention have preferab'.ly been chosera to overcome the poor image quality usually associated with conventional successive filtering of discrete data.

Consider, for example, the case of a perspective projection of a volume seven 30 slices deep with the viewpoint two voxel units in front of'the volume, as depicted in Figure 24. Using the 1.R-Perspective method of the present invention, the rays that are cast through a region are one voxel unit apart at the rear of the region.

WO 00/04505 PC'TIL.TS99/16033 However, when the rays reach a region boundary 202 they are preferably filtered using local Bartlett filters. The Bartlett filters (simplified to 1-diYnension) contain the following weights f r a kernel of size 2n+l, normalized so that the output has the same scalar range as the input:

t 2 sa - f n n-1 2 1 9 ~9 9 ... , , , , .o. , y y 0 n2 nZ n2 T2Z ~b~ nZ nZ

For two-dimensional i3nages at region boundaries, the present invention preferably employs a two-dimensional Bartlett filter bv convolvir4g two one-dimensional Bartlett filters in the two principal directions. The ER-Perspective algorithm preferably always resamples the rays to have half of the original density. Using a filter of size 2 rays (n=2) creates a filter kemel of 5x5, or just the following five weights for one dimension:

1 2 t 0s_.9-_'_.90 By way of exwnple, as illblstrated in Figure 24, consider the contribution of samples a, b,'c, d and e to the partially composited ray which changes from region 2 to region 1 at location o, o = I b + 2 c + 1 ~

Likewise, the partial rays at locations p and. q are cornputed as:

p = I d + 2 e + 1 ~

WO 00/04505 PC'TftJS99/16038 q = I f + 2 g + 1 h The equations for partial rays n and r have been omitted since they have a weight of zero in the final filter for pixel A. Corltin.uiiig the ER-Perspective algorithm, the resarnpled partial rays ;i, o, p, q and r are preferably cast through region 1 where they are again filtered by a local Bartlett filter. T'he norinalized contribution of n, o, p9 q and r to pixel A will be.

A = 1 + 2 p + 1 q Substituting in the vallies for o, p and q results in:

A = I b + 2 c + 3 d + 4 e + 3 f + 2 g + I h It is to be appreciated that this formula contains the same weights (i.e., coefficients) as a Bartlett filter with a l:ernel size of nine values (n = 4). This can be repeated for pixel 14 B with the sarrle filter weights. For front-to-ba.ck processing, a similar analysis can be used to demonstrate the performance of the algorithm and the result of successive applications of the bilinear interpolation.

Each sample oi.'a slice preferably contributes the same ainount to the final image as any other sample in the same region (assuming all other operations on samples, such as color mapping and compositing, are equal). For ex ple, the value that sample e contributes to pixel A with ari effective weight of 1/4 after the cascading of the local Bartlett filters. Likewise, sample I contributes to pixel B with an effective weight of 1/4. Samplef contributes to pixel A with a weight of 3/16 and to pixel B

with a weight of 1/16 for a total of 1/4. This can be repeated for samples g and h.
The samples to the left of sample e and to the right of sample I partially contribute to pixels left of pixel A and right of pixel B, respectively, such that the sum of their contributions to the final image is also 1/4. In fact, every sample that is in this region WO 00/04505 PC7 /[JS99/16038 has the same weight. The weight is 1/4 because this regior.t is the second region in the volume. For the first region in the volume, every sample preferably has a weight of %2. This is qualifiable by realizing that there are two rays per final irnage pixel in this region. There are four rays per final image pixel in the second region, etc.

Consequently, the weight which determines -the contribution of each sarnple towards the final image is the ratio image pixels samples in this slice Since the ER-Perspective method of the present invention performs a slice-order processing, ',:he total arnount of computation may be analyzed by calculating the amount of work performed on each slice. Assuming that the work done on each sample is the same, the count of the number of samples processed can be used as a comparison oI the workloads. For example, in the oversanapling method (see Figure 18B), the nlunnber of samples on the rear slice of a volewhich ends exactly on a region boundary is NI. On the firont slice, the sample count depends on the geometry of the viewpoint. In particuzlar, using similar triangles and defining eZ as the distance from the viewpoint to the front of the volume, the number of sarnples taken is N2 + N e z e z This can be generalized for any slice s throtiLgh the voliarne dataset to N2 +.fV e 2 z e -s- s z Thus, the total count of samples processed lby the oversarzipling method is N (N2+N.e)2 S-o e, + s Similarly, the undersampling method (see Figure 18A) can be shown to perform the following amount of work:

N 7r e 2 dY Z

3=0 LZ-+.S

For the ER-Perspective algorithm of the present 'nventio , the analysis is more N + e complicated. Depending on the viewing geometry, log - z-1 regions are FZ
created. It has been shown previously that each of these regi ns preferably contains eZ
- 2' slices. Again, using; the geometric principle of similar triangles, the ER-Perspective algorithrn of the present invention processes the following number of samples:

ve 1og~ $ -1 e~ e -2 @g ~ ~ ~ ,~: (e*2reg _ e~ s) 2 a z reg=0 s= e& *2reg - eL

This formula has an upper bound of (2N)2 s=0 and a lower bou.nd of ENZ
S=o Examining the equation for the total. count of samples processed by the oversampling method (given herein above);, it can be seen that the oversampling approach could perforln O(N) work on the front slice when the viewpoint is very close to the volurrn.e. The oversarnpliiig run. tirries grow rapidly as the viewpoint is moved closer to the front of the volume. Examining the undersarnpling equation above, it can be seen that as the viewpoint approaches the front of the volume, the numerator approaches zero. The amount of vrork perforrned on the rear slice also WO 00/04505 PCT/tJS99/16038 approaches zero. The ran times of the undersampling method decrease as the viewpoint becomes closer to the volume.

Regardless of the viewpoint geometry, the amount of work perforrned by the 5 ER-Perspective algoritlLm of the present invention is bounded by o(M) and w(W) per slice. Some advantages of this approach are that the upper bound on the run time of the algorithm is linear with the number of voxels and is independent of the view position, and a lower bound on the image quality achieved is also independent of the view position. Thus, a user can set the base sampling rate for the desired image 10 quality and be sure that the sampling rate is sufficient throughout the volume for that desired image quality.

In contrast, a conventional oversampling approach provides a lower bound on the image quality yet the runtime of the algorithm may become much greater than that 15 of the ER-Perspective rnethod of the present. invention. A conventional undersampling method provides an upper bound on the runtime for rendering, but the image quality may become much worse than the ER-Perspective approach.

Referring again to Figure 23, a preferred back-to-front ER-Perspective ray-20 casting algorithm, in accordance with the present invention, is illustrated. The algorithm of Figure 23 is shown as a pseudo-code representation and assumes a Z-major axis projection. The ER-Perspective algorithm of the present invention does not suffer from the traditional pitfalls when perforrning perspective projections on uniforni regular grids. This unique approach runs faster than oversampling methods 25 and produces better quality images than undersampling methods. Employing a Bartlett filter for ray merging provides an irnage quality improvement over a conventional box filter. The ER-Perspective algorithm is qualified by characterizing the effective filtering on the input data.

30 In accordance with another form of the present invention, a method is presented for renderinl; a large volume, wherein tl'e voluirne dataset exceeds the physical single-pass czLpacity of the Cube-5 apparatus of the present invention. The preferred method subdivides the volume dataset into a plurality of cuboid bricks.
Traversing the bricks i:~~~a a predefined order preferably enables initialization of the corzflpositing buffer of the Cube-5 apparatus with a baseplane image of a previous brick before rendering it, whereby ray path and compositing are logically extended throughout the entire volume. Inforrnation :regardirlg the boundary between bricks is preferably re-read to insure correct saznpling. Using this approach, the maximum volume size is limited only by the available intermediate baseplane storage.

In areas of the flataset where, during perspective projection, multiple voxels contribute to the same image pixel, images of equivalent quality may preferably be rendered using a level-of-detail (LOD) tree, which may be generated, for example, by combining voxels of increasing neighborhood size in a pre-processing step.
While perspectively rendering a single large volun-ie utilizing LOD, preferably only a small portion of the volume, substantially close to the viewpoint, must be read in its highest detail. The more distant portions of the volume, with respect to the viewpoint, may then be rendered from lower resolution versions of the data. Thus the frame rate and/or dataset size is preferably increased. Since each region in the perspective algorithm of the preseiit invention (previously described) will now be at a different LOD, there is no longer need to filter the rays betweeri regions, but merely to redistribute them. Preferably, only one regiion of each LOD tree level is processed;
thus, only those regioxcs must be paged into memory.

The level-of-detail (LOD) method of the present invention may also be used for rendering scenes comprised of multiple objects at differing distances from the viewpoint. For such cases, a starting LOI3 is preferably selected that delivers a baseplane image of about the same size as the screen space image, tliereby relating rendering time to image resolution and not to object size (i.e., scale independence).
Although back-to-front rendering is similarly contemplated by and within the scope of the present invention, the unique LOD method will be described herein in a front-to-back rendering context. Rendering front-to-back, it is preferable to start with a slab of the most detailed representation of the volurne to be rendered. In a preferred WO 00/04505 PC't /tJS99/16038 method of the present invention, the thickness of the volume slab is chosen so that projected voxel distances in front and back of the slab differ by a factor of two, similar to perspective projections according to the present invention, as previously described herein. After renderinl; a slab, the current compositing buffer image is preferably scaled by a factor of 0.5 in the warp unit. This initializes the compositing buffer for the rendering of the next slab of half the resolution. Preferably, only one slab of each:-LOi3 actually flows th:rough the rendering pipelines; thus, for large volumes, only those slabs must be pa,~ed into the on-board 3D memory.

It is to be appreciated that the apparatus of the present invention can also be employed to speed Up eff line computations, such as generation of level-of-detail (LOD) and filtering of datasets. To generate LODs, the trilinear interpolation unit (TriLin) of the present invention preferably sets all its weights to 0.5. Once new samples become available, they are preferably subsarnpled and compacted into a new volume, which is the next coarser LOD. To filter a dataset, the trilinear interpolation unit again uses only 0.5 weights; this time, however, data is fed back to the beginning of the rendering pipeline without compaction. Each additional pass creates a new filtered volume with a filter lrernel having one more voxel extent in every major axis direction.

For higher quality image rendering, the apparatus and methods of the present invention preferably provide the flexibility to utilize a full hardware implementation, multi-pass algorithms, and/or a combination of the two, depending on the desired tradeoffs. The full harduaare implementations and multi-pass methods preferably provide more accurate computations in two primary functional areas: filtering and interpolation.

The Cube-4 architecture, a predecessor of the present invention. (Cube-5), utilizes a central difference gradient filter with only two sample points to estimate each of the x, y and z gradients at a particular location. A larger 3D filter can deliver a more accurate gradient estimate, such as a Sobel filter (which is a 33 filter with weights derived from the inverse of the Manhattan distarice from the center point). A

w 00/04505 PCT/[IS99/16038 straightforward hardware implementation of a 33 filter, however, requires 27 multipliers and 26 adders.

The apparatus of the present invention presents an alternative to this expensive prior art approach by using symmetric convolution filters. The convolution filters can be efficiently irnplernezited with only three naultipliers and six adders, at a significant cost savings. Replication of hardware per gradient component can preferably be avoided by applying a three-pass algorithm instead. As arz example, Figure 25 illustrates a symmetric approximation of the x-component of the Sobel gradient filter.
Within each stage, the weights are preferably applied to the nearest neighbors before summation. With reference to Figure 25, if each stage operates on the output of a previous stage instead of on the raw data, the weights presented in Figure 25 will effectively produce the 33 symmetric approximation of the Sobel gradient filter (right side of Figure 25). Ch;mging the x-weights to { 1 w 1} will produce an approximation of a Gaussian filter instead.

The present invention contemplates higher quality rendering modes in which no additional hardware; is needed, but in which the frame rate is lowered. One such example is to achieve larger neighborhood contributions to the gradient estimation by utilizing level-of-detail (LOD) infortriation. If the central[ difference gradient is computed on data of tYle next coarser LOD, it is effectively the equivalent of employing a 6x4x2 filter, with 6 being the extent in the direction of the current gradient component. Since the apparatus o:f the present iiivention (i.e., Cube-architecture) is able to hold mip-mapped LOD representations of the data, this filter is preferably achieved with essentially no increase in hardware, beyond the simple central difference solution.

Another highe;,~ quality rnulti-pass rendering mode provided by the preseni invention, for which no additional hardware is required, is an approximation of tri-31) cubic interpolation, w:hich has beneficial applications in the medical field as well as other fields. This rnode enables more accurate resaYnplirig and iso-position calculation. For this, the present invention preferably decomposes a piecewise WO 00/04505 PC'r/iJS99/16038 voxel filter into a series of linear interpolations and extrapolations,which is symmetric in every dimension, thereby allowing efficient reuse of interrnediate results.

In perforrning higher quality renderirag, it is to be aippreciated that there are certain tradeoffs between using additional hardware for providing more accurate and flexible gradient estimation within the Cube=-5 pipeline, as opposed to employing multiple pass algorit .s. Generally, using ai multiple pass algorithm requires changes in the Address Generati.on and Control unit (see Figure 5) of the present invention to momentarily stall the pipeline for computational purposes, while the hardware approach requires additional application specific integrated circuit (ASIC) logic and additional connections to support larger neighborhoods.

With respect to enhanced volume rendering capabilities, a preferred embodiment of the present invention supports clipping by arbitrary planes. The distance from each plane may preferably be incrementally computed using only registers and one adder per plane. In addition to conventional clipping planes which define only the positive direction as visible, the apparatus of the present invention preferably supports extracting an arbitrarily thick slice from the dataset for oblique multi-planar reforrnatting (MPR) by invalidating all samples lying outside a predetermined offset.

Axis-aligned ctitting planes are preferably irnplern.ented by restricting the volume traversal to the cuboid of interest. Alternatively, -the present invention contemplates restricting this traversal to exclude a simple cuboid from the volume (e.g., visualizing all but one octant of avolume).

In addition to clipping, the present invention further contemplates depth cueing, which modulates the color of objects to simulate, for example, atmospheric attenuation of light through a translucent medium. This phenomenon, as appreciated by those skilled in the art, is terflned fog or liaze when the medium also contributes some color (e.g., white or gray). To implenient this feature in accordance with the present invention, nonnally clear regions are preferably replaced with a semi-WO 00/04505 PC'r/[JS99/16038 transparent color (e.g., black for depth cueing, white for fog) by modifying the transfer function. Each final pixel is preferalbly further attenuated to account for the distance from the viewpoint to the surface oi.'the volume, preferably implemented as a part of warping.

The apparatus of the present invention additionally supports rendering of super-sampled images with a preferred defaiilt super-sampling rate of two in the x and y directions, although other sampling rates are contemplated. To improve image quality further, the sampling rate along each ray can also be increased.
Neither 10 approach requires re-reading voxels from the 3D memory. The apparatus of the present invention preferably changes the volume traversal order so that voxels already residing in the buffers ivill be read out repeatedly. Eac'd time they are reused, new weights are preferably iatilized in the trilinear interpolation units (TriLin) of the present invention to reflect the new resampling position.

In a preferred einbodirnent for supersampling iri the present invention, central difference gradients are computed between neighbors one distance unit apart to ensure sufficient precision. These gradients are preferably computed by taking the difference first and interpolating afterwards or, alternatively, by interpolating first and then taking the difference between neighbors k positions apart (assuming k times oversampling), and preferably not immediate neighbors. A classification stage must consider the new inters ample distances when computing a new a' value.
Therefore, during super-sampling, the volume will preferably be traversed in an interleaved pattern within each slice. This essentially ensures that a translucent material (gel) keeps its accumulated opacity CaBa value) independent of the sa.nnpling rate.
Thus, for example, for an oversampling factor of k in the z-direction, modified a' values are preferably used, where: a'= 1 - (]. - a) 'lk.

Anisotropic dai:asets have different distances between samples along different axes. Thus, the gradie;at computation and the final two-dimensional (2D) image warp preferably require axis-dependent scaling factors. In addition, the direction in which the sight rays are being cast through the volume dataset pireferably require adjustment WO 00/04505 PC"I'/tJS99/16038 to account for the implicit volume scaling, vvhich occurs when storing anisotropic data in an isotropic grid. The a value is preferalbly adjusted according to the direction-dependent distance d which a sight ray travels through a voxel cell.
The corrected a' is a = 1-(1 - a)d , with the direction-dependent distance d preferably being in the range [ 1, ~/3]

In addition to the methods for enhancing volume rendering capabilities described herein above, the preser.it invention further provides several unique methods for universal three-dimensional (3I3) rendering, including mixing polygons and volumes, voxelization of polygons, rendering multiple overlapping volumes, performing texture mapping and accelerating image-based rendering. These methods are described in greate;-r detail herein below.

An important aspect of the present iiivention is its unique ability to correctly mix geometric objects (i.e., polygons) and volumes in a single image. The apparatus of the present invention (i.e., Cube-5) preferably leverages conventional geometry hardware to render opaque and translucent polygons togetlier with the Cube-5 volume rendering pipeline.

In a preferred raethod according to the present invention, to render a scene containing volumes and opaque polygons, all opaque polygons are first projected onto a Z-buffer coincident with a predefined baseplane and having sufficient resolution to match the volume sample distance. Using the Z-buffer, a determination is preferably made as to which slices of the volume are in front of the polygons for each pixel of the baseplane irgaage. 'The compositing bufiEer is then preferably pre-loaded (i.e., initialized) with this pirojected IZC'gBaZ (i.e., Z-buffer) image, representing the color and depth image of the polygons. Subsequently, the voli,r.rne is rendered with z-comparisori enabled in the cornpositing buffer. The depth values of the opaque polygons are checked to keep volume samples which are hidden by opaque polygons from contributing to the final image. CJltirriately9 the opaque polygons occlude the volume behind, and the volume in front correctly composites over the polygons.

In other words, the compositing buffer is pre-loaded with the z-buffer image {Cv ZZ}, in accordance with the preferred method of the present invention, where CZ
represents the value of the geometry sample and ZZ represents the depth of the geometry sample from a predetermined viewpoint. During back-to-front compositing, the resulting output pixel in the compo siting buffer, Cou~, will preferably be equal to the geometry sample value, CZ, when the volume sample is behind the geometry (i.e., when the depth of the sample, Z, is greater than the geometry depth, ZZ).
Similarly, during front-to-back compositing, the samples are preferably composited using the Porter-Duff over operator, as appreciated by those skilled in the art. A more detailed discussion of the Porter-Duff a compositing rules are described, for example, in the text Compositing Dip-ital Images, by T. Porter and T. Duff, Computer Graphics (SIGGRAPH84), vol. 18, no. 3, pp. 253-259, July 1984. Therefore, the resulting output pixel in the compositing buffer, Cout, will preferably be equal to the volume sample value, CS, over the geometry sample value, CZ, when the volume sample is in front of the geometry (i.e., when the depth of the volume sample, ZS, is less than the geometry depth, ZZ).

Translucent polygons pose a more complicated situation, since all fragments (both translucent polygon pixels and volume samples) must be drawn in topologically depth-sorted order. This is required because compositing translucent fragments with the over operator is not commutative. Therefore, polygons must be re-depth-sorted whenever the scene or viewing geometry changes. Additionally, the sorting must be topologically correct, including the handling of depth cycles.

Although there are proposed architectures which use an A-buffer to provide some hardware sorting support, implementing an A-buffer in hardware allows only limited depth complexity (i.e., number of overlapping polygons per pixel) in a single pass and is costly. A discussion of a conventional A-buffer algorithm may be found, for example, in the text The A-Buffer, an Antialiased Hidden Surface Method, by L.
Carpenter, Computer Graphics (SIGGRAPH 84), vol. 18, no. 3, pages 103-108, July 1984.

W 00/04505 PC'r/Us99116038 In a preferred ndethod, the present invention adapts polygon rendering to slice order ray casting, and synchronizes the overall rendering process on a volume slice-by-slice basis, rather th.an a polygon-by-polygon or pixel-by-pixel basis. The Cube-5 apparatus preferably utilizes the geometry pipeline and conventional graphics hardware to render geometric objects in thin slabs that are interleaved or dove-tailed between slices of voluine samples 212, as illustrated ira Figure 26.

With reference now to Figure 26, each slice of the volume is preferably sampled in planes perpendicular to the volume storage axes. The planes are drawn in depth order (e.g., usinl; near and far clipping planes) froni farthest from the eye or viewpoint 214 to nearest to the eye. 'I'herefisre, to mix translucent polygons with volumetric data, thin slabs of the polygons 210 are preferably rendered and composited in between the slices of volume saYnples 212. It is to be appreciated that the slabs 210 represent; all of the translucent objects which lay between two consecutive slices of the volume sarnple planes. The boundaries of the slabs are preferably defined such that the union of all rendered slabs 210 neither misses nor duplicates any region i'e.g., (), (], ..., ( ], as shown in Figure 26). The data from the volume slices and the translucent polygonal slabs 210 are dove-tailed together in an alternating fashion. Iri this manner, the correct depth ordering of all contributing entities is preserved and use of the over operator to composite them creates correct colors in the final image pixels.

In accordance with a preferred method of the present invention, the opaque polygons are drawn first with Z-buffering. Before drawing any volume slices, the translucent polygons which lie behind the volume extent are preferably drawn over the opaque polygons using any conventional translucent polygon rendering algorithm (e.g., painters). I,ikeNvise, translucent polygons which lie in front of the volume are preferably drawn after the mixing portion of the algorithrn. Polygons which lie depth-wise within the volunle boundary, but to the top/bottom,4side of the volume, are preferably drawn in slice order as if the volume slices were planes that extend to infinity cutting the translucent polygons.

OpenGL may be used to directly render the thin slabs of translucent polygonal objects. The polygons are preferably shaded using the Gouraud shading model included in OpenGL . A naive approach would be to render the complete set of translucent polygons for every slab and set the hither and yon clipping planes to cut the current thin slab of data. However, for an n3 volume, there could be up to n thin slabs that must be rendered. Since a typical scene contains very few polygons which span all of the thin slabs, the present invention contemplates an alternative approach which would involve clipping the polygons to the slab boundaries and only rendering the portions of the polygons within each slab. This would substantially reduce the processing load on the polygon pipeline. However, it would require the application to clip every polygon against the two planes of each thin slab which contains that polygon.

As illustrated in Figure 27, it is contemplated that the present invention may take advantage of the fact that the two clipping planes 216, 218 are parallel to keep only the portions of the polygons which lie between the planes. While this creates fewer polygons than clipping against each plane separately, it still can increase the triangle count dramatically. The first case occurs when a triangle 220 intersects the thin slab, but no vertices are within the slab boundaries 216, 218. When this occurs, one vertex must be on one side of the slab and the other two vertices on the other side of the slab, thus creating a trapezoid which is decomposed into two triangles.
Next, consider when one vertex of a triangle is within the slab. In one situation, a triangle 222 intersects the slab such that the remaining two vertices lay on the same side of the current slab, creating only one triangle. In a second situation, a triangle 224 intersects the slab such that the remaining two vertices lay on opposite sides of the current slab.
This is a worst case situation, since it produces a pentagon, or three triangles. The final case occurs when a triangle 226 intersects the slab such that two vertices lie within the same slab and, once again, a trapezoid is created resulting in two triangles.

In a preferred embodiment of the present invention, a bucket sorting method is applied to the translucent polygons. Whenever the viewing geometry changes, the placement of volume sample planes change their relative positions to the geometry.

WO 00/04505 PCT'/US99/16038 Therefore, the present invention preferably creates a bucket for each thin slab between two volume sarnple planes. All of the translucent polygons in a scene are preferably traversed and each of the polygons is placed in a bucket for each of the slabs it intersects. For example, as shown in Figure 28, triangle T1 is placed in all six buckets 5 since it spans all six slabs S1-S6. Triangle T2 is placed in buckets corresponding to slabs S2 and S3, and likewise for the remaining triangles. For the example shown in -Figure 28, bucketing the four triangles 'T'1 -- T4 would result in twelve triangles being sent to the graphics pipeline. As a comparison, if the triangles were being clipped to the slab boundaries, tdventy triangles would be sent to the graphics pipeline.

An alternative to bucketing is to create an active triangle list similar to the active edge list utilized in scan converting polygons. The triangles may be placed in the active list at the furst slice they intersect and removed from the list when they no longer intersect any slices. A data structure is preferably pre coxnputed which indicates which slice cach triangle first encountered. This preprocessing is essentially the sarne as for bucketing, with the exception that bucketing does not have to check for triangle removal f:)r each slice.

One advantage of the method of the present invention is that for applications which choose to trade; off image quality in order to maintain a predeterrnined frame rate, the number of polygons drawn decreases as the number of slices drawn for the volume decreases. Tlais occurs because the interslice size increases as the number of volume slices decreases. The rendering rate achieved is substantially proportional to the number of polygeins drawn and the number of vol e samples drawn (which is proportional to the niunber of volume slices drawn). The image quality degradation resulting from this tradeoff affects only the volume data, similar to taking fewer sarnples in any volunie rendering algorithnn.

When anixing; translucent geometries and volumes, there exist at least three options for handling two or more translucent polygons being drawn to the same pixel within one thin slab. In the first option, the polygons could be drawn in regular processing order witli the over operator. 'While this method may produce the incorrect WO 00/04505 PC'r/iJS99/16038 color, the amount of color error is limited because the polygons are still sorted by bucketing them into thin slabs.

Another rnethod for handling two o:r more translucent polygons is to draw thin slabs of translucent polygons between two volume sample slices as on-the-fly voxelization. In conventional voxelization methods, when a surface is 3D scan converted into a 3D volume grid, the resolution of the grid is commonly chosen such that the size of a singlv voxel represents the smallest area that can be discerned by the human eye when it is :rendered. In the %and Y dirnensions, the polygons are drawn to screen resolution. In t:he Z dimension, it is assumed that the volume is being rendered with enough slices such that each volume sample also represents the smallest area that can be discerned by tlr.e human eye. Therefore, each pixel bounded by two volume slices in the Z dirnens:on also represents this small area.

In view of the foregoing, a method, perforrned in accordance with one embodiment of the present invention, may be viewed as computing on-the-fly voxelization by utilizing 3D graphics hardivare. 'Ioxelization methods combine polygons into a single voxel by using one of two preferred methods. The first method is to take the max of each color channel. The second method is to take the weighted-rnax as C = (pdL) PI 4. c p2~p2 ~
V (L) p1 + Pp21 where C'p1 is the color of a first polygon (polygon 1), DPI is the density of polygon 1, and Cq, is the color assigned to the voxel. Many OpenGL implementations, for example, allow max blending with glBlenaFquationext(gl_max~_ext). Assuming that the density is equal to the alpha value (e.g., linear ramp transfer fimction for volume rendering), then the colors may preferably be weighted by their alpha values before blending by using aglBlenclFaarac (gl src_alpha, gl_ nc). However, OpenGL is i'lot able to compute the complete previous equation since it cannot divide by the sum of the alpha values after accumulating them.

WO 00/04505 PCT/tJS99/16038 The third method of drawing two or more translucent polygons to the same pixel within one thin slab may also be considered the ;rnost accurate approach. By utilizing one of the previously described methods of the present invention to perform depth sorting, such as BSP tree, proper ordering of all translucent polygons within each slab is maintained. Depth cycles are preferably handled by the BSP
algorithm by splitting polygons wtuch span a plane used in the partitioning, and eventually one of the polygons in the cycle is used as the partitioning plane.

As previousl,~ discussed, an important feature of the present Cu.be-5 invention is the unique ability to couple at least one geometry pipeline or engine to the Cube-5 system. In accordarice with the present invention, two preferred methods of connecting one or rriore geometry pipelines to the claimed Cube-5 system on PC-class machines is provided, as described herein below. Both methods allow the unique mixing of opaque and/or translucent polygons with volumetric data.
It is to be appreciated that the opaque polygons are preferably rendered such tliat, after projecticon through the volume dataset, warping creates the correct footprint on the final irnage. Furtherrnore, the Z--depth values are preferably aligned along the processing axis, so that a volume slice index may be used for the Z-depth checlt.
In accordance with one embodiment of the present invention, a preferred method begins by determining a major viewing axis for the current viewing direction.
As illustrated in lii e 29, a trarisfoi -nnation is preferably applied to the geometry 228 so that the major viewing axis 230 is along, for exazmple9 the Z-axis. Next, the view or eye point 232 is moved to be along this direction, preferably by rotating the vector between the look-at point 234 and the eye point 232 by a predefined angle a around the X-axis and an angle 0 around the Y-axis. Preferably, a and ~ are always in a range between --45 and +45 degrees, otherwise a different baseplane would be chosen.
A Z-slice shear transformation along X and Y (also known as a "X and ~.' according to Z" shear) is preferably subsequently applied to the viewing matrix as follows:

WO 00/04505 PC"r/QJS99/56038 1 0 tasncc 0 -0 1 ta.n R 0 With this geometry, when the opaque polygons are drawn, the polygon footprints are "prewarped" so that the warping operation at the end of Cube-5 rendering creates correct polygons in the final image. Additionally, the Z-depths computed are preferably proportional to the distances alorig the processing axis. It is possible (e.g., if all opaque geometry fits within the volume extents) to set the hither and yon clipping planes to the edges of the volume and, if the precision of the depth buffer is the same, the depths computed are exactly the volume slice indexes for depth checking. Otherwise, a simple scaling must be applied vrhen the computed depths are utilized by the volume rendering system. Light positions should be considered when using this method, however, as the shearing may not move the lights to the correct location.

The thin slices of translucent polygons preferably align geometrically with their 3D positions in space. Preferably, the eye point is first aligned as previously described. Next, in order to keep the objects from projecting all the way to the final image plane, the geometry is preferably trar!slated such that the center of the current thin slab is at the Z=0 plane prior to shearing. Clipping planes allow only the current thin slab to be rendered arad the projection plane is set to be within the two volume slices which border that region with, for example, glOrtho (glFrustum for Perspective).

Important to comprehending the present invention is to understand the organization of frame buffer design and cornposting buffer design. As illustrated in Figure 30, the Cube-5 volume rendering pipeline 236 of the present invention preferably utilizes a tightly coupled on-chip SRAM buffer 238, termed a composting buffer, to hold the partially composited rays as a volume is processed in slice order.

WO 00/04505 PC't'/I.1S99/36038 This architecture exploits the regular processing sequence inherent in slice order rendering. Specifically, each slice of the volume 240 is preferably processed in the sarne order as the previous, left-most voxel to right-most voxel of each row, and bottom-most row to top-most row of each slice (possibly with some skewing). In this way, the SRAM composting buffer 238 becomes a simple FIF queue having a length equal to the size of a slice. The SRAM queue is preferably 32 bits wide to hold-8-bit fixed point IZGEa values (called coxels). Each pipeline 236 preferably reads a coxel from the front of the queue and writes a coxel to tle rear of the queue for each clock cycle.

In contrast, dvith reference now to Figure 31, conventional PC-class geometry pipelines 242 utilize an exterrial DRAM frarne buffer 244, which stores the IgGBa color values and Z-dep th values for each pixel. This frame buffer 244 must support random access, since polygon rendering does not enjoy the regular access ordering inherent in slice-order volume rendering. Nortnal polygon rendering produces triangles on a screen of average between 10 and 50 pixels. Therefore, the DRAM
memory is organized to maximize access to areas of the screen of this size.

As shown in Figure 31, when the 3D texture mapping method of the present invention is implemeni;ed on geometry pipelines 242, volume slices 246 perpendicular to the screen are texture mapped through the volume. The per-vertex geometry calculations for the volume slices 246 are easily achievable with any level graphics hardware. However, the requirement to support random access to both the texture memory 248 and frame buffer 244 lirnits the perforrnance of this approach to the fill rate achievable with DRAM frarne buffer.

Very high end surface graphics systems typically utilize massive parallellisrn in the fragm.ent processir.Lg sectior1250 of the polygon pipeline. This, coupled with a highly distributed frame buffer, allow increased fill rate performance.

In Figure 32 tl:,ere is shown one embodiment for connecting a geometry pipeline 242 to the Cube-5 volurrie rendering system 252, according to the present WO 00/04505 PCT/iJS99/16038 invention. As illustrated in Figure 32, the SRAM composting buffer is preferably removed from inside the Cube-5 pipeline 252 and replaced with an external DRAM
frame buffer 254. R.ather than organizing the DRAM fr e buffer 254 as in conventional polygon engines, the memory in the frarne buffer of the present 5 invention is preferably organized so that it is specifically optimized for volume rendering. The frame buffer 254 is also preferably accessible from a 3D
graphics pipeline 242 to allow rsrixing of polygonal dlata 256 with volumes.

With continued reference to Figure 32, the dual use fraine buffer 254 10 preferably connects the; two pipelines 242, 252. In a preferred method, to render a scene with both opaque and translucent polygons and also volume data, the geometry pipeline 242 first rende;rs all opaque polygons with Z-depth. The volume slices, stored in volume memory 258, and thin slabs of translucent polygons are then rendered in an alternating (e.g., dovetailing) fashion - volume slices by the Cube-5 15 pipeline 252 and translucent polygons by the graphics pipeline 242 (opaque polygons may also be rendered with the sarne dovetailing algorithm, but with increased demand on the graphics pipeline).

Z-depth checking is preferably utilized to insure correct hidden object removal 20 and blending is set in b th pipelines to correctly composite the samples and fragments. The geometry engine 242 preferably perferrns the final baseplane warp required by the Cube-5' system of the present inventi ri.

The design of the DRAM buffer 254 is critical to achieve, for example, the 25 503 Million samples per second required for 30Hz rendering of 2563 volume datasets.
Therefore, it is helpful to first create a DRAM buffer for the Cube-5 rendering pipeline itself, before discussing connecting the rendering pipeline to a graphics pipeline. The volume rendering system of the present invention is preferably comprised of multiple Cube-5 pipelines. In each rendering pipeline, at every clock 30 cycle, a coxel (composting buffer element consisting of RGBa) is read from the SRAM composite bufi:er FIFO and blended. with an appropriate composting equation.
The new coxel is then placed at the rear of the FIFO. In a preferred embodiment, the WO 00/04505 PC1'/9.1S99/16038 structure of a coxel is changed to contain 32 bits of color, 8 for each IZCpBix and 32 bits of Z-depth information, 24 + 8-bit stencil. This configuration is required to handle Z-depth checking in the composting stage. Assuming that opaque polygon rendering is completed before any volume rendering begins, the 32 bits of Z-depth/stencil information is read, but not re-written. Therefore, for every clock cycle, each Cube-5 pipeline needs to read 8 bytes of coxel data and write back 4 bytes.
Preferably, the rendering pipeline of'the present invention utilizes memory chips vvitli a word size of 16 bits. Using this configuration, four words must be read by each pipeline every cycle and two words must be written. To do this would require six 16-bit memory interfaces per pipeline. Am emerging technology in synchronous DRAM (SDRAM) chips, which the present invention may avail itself, is known as double data rate (DdaR.), which reads/writes data at both the rising and falling edges of the clock. Using DDR. SDRAMs, the present invention can utilize hvo 16-bit memory interfaces for reading 64 bits of data per clock and one 16-bit memory interface for writing 32 bits per clock, for a total of three 16-bit memory interfaces per pipeline.
With reference now to Figure 33, sirice a read and write must be perforined every clock cycle in order to keep the pipeline full, the present invention preferably reads from one set of ft aane buffer chips (e.g., set A) 260 and writes to another set (e.g., set B) 262. The Cube-5 system conteniplates reading from set A 260 and writing to set B 262 for a complete slice of the volume, and then swapping for the next slice. With this approach, however, each frame buffer chip set would have to be large enough to hold tl:ie complete frame buffer. 1" errnore, the polygon engine would have to be instnicted as to which set is the current set. Therefore, in a preferred embodiment, the present invention alternates reading and writing between set A

and set B 262 within a frame and buffers the processed coxels from the read set until it becomes the write set. Since every memory access must be a burst, each burst actually lasts four clock cycles and reads/writes four coxels (i.e., eight words) with 16-bit DDR DRAM chips. The Cube-5 system preferably cycles through al14 banks to keep the memory bandwidth saturated before writing the new Gcc values back.

WO 00/04505 PC'r/C)S99/16035 For this reason, there is preferably a 16-coxel FIFO queue 264 (four coxels for each of four banlcs) that the nemrly composited CrcY, portions of the coxels are stored in.
There are many different possible configurations for the number of pipelines etc. in the Cube-5 volurne rendering system of the present invention. An exarnple for a case of four parallel pipelines creating 12 total memory interfaces will now be discussed with reference to Figure 33. As shown in Figure 33, each pipeline contains one read interface 266 t the Z-depth/stencil portion 268 of the frame buffer and two read/wri.te interfaces 270 and 272 to set A 260 and set B 262, respectively, of the "i 0 RGBu portion of the frame buffer. To render a 2563 volurne at 30Hz, each of the four pipelines process 125 rriiilion voxels per second. Therefore, a 133 M.HZ clock is utilized for the chip and. the SDRAM. The mapping of the fr e buffer pixels onto the memory chips is critical to the perfonnarice. It must match exactly the processing order of the Cube-5 pipelines and the parallel access by four pipelines substantially simultaneously. It is assumed that the skewed memory access of the Cabe-5 architecture is "un-skevied" so that the volume samples are in order from left to right across each scanline in groups of four, since it is easier to follow in the explanations.
The design can be extended to skewed memory, although the geornetry pipeline and screen refresh system niust be aware of the additional skewing.

Figure 34 shows a preferred layout of the RGBa portion of the coxels in the frame buffer. For a given scanline 274, there is a group oi"pixels which reside in set A 276 followed by a group of pixels which reside in set B 278, repeated across the entire scanline 274. The length of each set is 64 pixels due to the fact that each set must contain pixels which are read from four different bariks inside each chip, and each bank consists of four RC'sBoc values from four parallel chips/pipelines.
Thus the pixel data in the frame 'buffer is interleaved across eight chips; In fine detail, it is really interleaved across only four chips. This provides an interface which reads 4 pipeliiies x(1 I2.GBoc chip + 1 depth chip) x 16 bits x 133MHz x 2 data rate = 34 Gbits = 4.15C-jbytes of data per second. This surpasses the required 2563 x 30Hz x 8 bytes = 3.75Gbytes per second where eight bytes are organized as four bytes RGBa + four bytes Z-depth/stencil.
Additionally, the frarrie buffer sub-system is capable of writing 4 pipelines x 1RGBa chip x 16 bits x 133MHz x 2 data rate = 17Clbits = 2.1 Gbytes once again handling the 2563 x 21011z x 4 bytes = 1.8Clbytes per second required for real time 30Hz rendering of 2563 volumes.

This extra bandwidth is not siting idle. The screen must be refreshed from the data in the frame buffer. If we assume a 1280 x 1024 screen resolution with 60Hz refresh rate and that all four RCsBcx bytes are read from the frame buffer (since our burst mode access retrieves them anyway), then 1280 x 1024 x 60Hz x 4 bytes = 30fl ytes are read from the frainsr buffer per second. Only the RGBa portion of the frame buffer is required for refresh. Therefore, the refresh data is read from eight chips. It is sufficient to perform ten data burst reads/writes (depending on set A or set B) to each chip followed by one read of data for refresh. This distribution of memory accesses provides the refresh hardware with a consistent (althoo.gh. bursty) stream of data. The Cube-5 pipelines also contain a small percentage of excess cycles, and thus will not lose the ability to achieve 30Hz 2563 rendering when the memory sub-system is temporarily stalled for refresh.

An alternative approach to connecting a graphics pipeline to the Cube-5 volume rendering pipeline, in accordance with a preferred embodiment of the present invention, will now be described. This preferred connection approach keeps both the graphics pipeline and the volume rendering pipeline working at all times and merges the data in the SRAM compositing buffer inside the Cube-5 chip. At any given time, the volume rendering pipeline is composting the current volume slice with the previous thin slab of polygon data over the compositing buffer and the graphics pipeline is rendering the next thin slab of translucent polygons.

The method described herein still utilizes the unique approach of dovetailing volume slices and thin slabs of translucent polygonal data, as previously described herein above. In a firsi, step, all opaque polygons are projected onto a Z-buffer coincident with the baseplane (e.g., the volume face most parallel to the screen).
Next, the projected RCBuZ image is loaded into the coxnposting buffer of the volume rendering pipeline. Subsequently, the volurne is rendered with a Z-cornparison enabled in the composting stage. The thin slabs of translucent polygons are preferably rendered by the geornetry pipeline, and their correspon.ding RGBa data is sent to the volume pipeline of the present invention to be blended in}to the SRAM
composting buffer within the volurrie pipeline.

Preferably, the composting stage of the volume rendering accelerator is modified to composite two layers (one volume and one translucent polygon) per step, thus not delaying the volume rendering process.. This requires the addition of some extra logic. The straightforward f ula for perfortning a double composition of a volume sample v over a translucent pixel fra entp over the old coxel c would require four additions and four multiplies ir.i five stages:

C'7 = C'a' + [C p a p + Cc(1 - CP)] (t - C&v) However, employing simple math allows the double composition to be calculated with four additions anci two multiples in six stages with the following forrnula (some of the calculations are re-used):

30.
C, _(CC+(C p -C')a p ) + CCd - (C' + (C p -C')M p )] av WO 00/04505 PC'T/IJS99/16038 As appreciated by one skilled in the art, the hardware designer would choose the option more desirable i'or a given implementation (i.e., less logic and more stages, or fewer stages and more logic).

5 Consider the anYount of data transferxed for a 2563 volume. There are preferably 255 slabs plus one buffer in front of the volume and one buffer behind the volume. Each of these 257 slabs contains 256 (2562 pixels of R~'xBa) of data.
This equates to 64MB 'being read from the firame buffer arad transferred between the two sub-systems each ftanne. To achieve a 30Hz frarne rate would require a 10 bandwidth of 1.9GB pc;r second. While this much data could be transferred with sufficiently wide chanraels, it must also be read from the frame buffer. It would be virtually impossible to read this much data without changing the organization of the current DRAM frame buffers. Additionally, the frame buffer must be cleared 257 times per frame.

To solve this bEdndwidth challenge, the present invention preferably uses run-length encoding (RLE) of the blank pixels. 'With this method, each scanline is encoded separately an& a"run-of=zeros" is encoded as four zeros (RGBa) followed by the length of the run. Since typically only a, small percentage of the polygons in a scene are translucent, the translucent polygon slabs will be relatively sparse. Run-length-encoding just the blank pixels in these thin slabs results in over 99 /
reduction in the required bandvaidth. Preferably, the rnethod of the present invention utilizes E on 2D images of sparse translucent polygons to save on bandwidth.

Using this preferred method requires adding hardware to the Cube-5 system of the present invention. Specifically, additional hardware may be included in the volume rendering pipeline that can decode the RLE input stream and create RGBa fragments. However, since these fragments are utilized by the volume pipeline in a regular order, it is preferable to decode the input strearn using a double buffer to synchronize the two pipelines. Every clock cycle, a value is output from the decoding hardware. If the volo.nie rendering machine has multiple pipelines (as most current WO 00/04505 PC"r/qJS99/16038 designs do) the decoding hardware is preferably replicated for each pipeline so that they can keep up with pixel demand.

Likewise, RLE hardware at the originating end connected to the geometry pipeline may encode the data in real-time before sending it to the volume pipeline.
However, 1.9GB per second access to the frame buffer would still be required to read:-all the thin slabs of traiislucent polygons and clear the ftame buffer 257 times per frarne. Therefore, a separate frame buffer is preferably ernployed which stores the data directly in RLE f rmat. Since the thin slabs of translucent data are very sparse, more time is spent clearing and reading than is spent rasterizing. An RLE
buffer, while generally not opt;imal for rasterization, is well suited for both clearing and reading the data. For example, to clear an :F:L,E frame buffer requires merely storing a single run of zeros (in five bytes) for each scanline, instead of writing an entire 256' frame buffer.
To minimize the impact on the current geometry pipelines, the RLE frame buffer is preferably implemented using the emerging technology of embedded DRAM
and connecting it in parallel to the nortnal frame buffer. This differs from conventional encoding algorithms which typically assume that the data was given in physical order. Triangle rasterization, however, does not guarantee any ordering of the fragments. 'fherefore, the apparatus of the present invention must be able to randomly insert an 1tGBa value into an RLE scanline of data.

Figure 35 illustrates a diagram of an RLE insert, forrned in accordance with the present invention. For each fragment, the encoded scanline is copied from one buffer to another, insei-ting the new RGBavalue. Every clock cycle, a single flit (i.e., either an RGBa pixel or run-of-zeros) is processed. The entire scanline is preferably processed flit by flit. 'With reference to Figure 35, an input buffer ("in Buffer") 280 holds the current enc ded scanline and an output buffer ("out BBSffer99) 282 holds the newly encoded scanlirie with the nevv 1tGBoc fragment inserted. The choice of what to insert at each cycle is preferably perf rrned by a 5-byte rnultiplexar 284.
The apparatus of the present invention preferably includes pointers, narnely 661nSt1" 286 WO 00/04505 PC'r/[JS99/16038 and "outPtr" 288, which point to the current flit of both the in buffer 280 and out buffer 282, respectively. The logic on the right side of Figure 35 calculates how much has been processed ("T0tal") 290 and two of the control points (66ctrl 1' and "ctrl_3 The other mux control I3oint ("ctrl 2' ) is calculated by 'OR'-ing together all of the RCaBc& values (the flag for run-of-zeros). "xPos" is deillned as the x position of the fragment. Preferably, a, lookup table is implemented of where the current buffer is located in memory for each y value. Thus, the buffer can 'be moved while inserting new pixels and the table is simply updated. This preferred method is illustrated in the )/ AddFragment pseudo-code routine of Figure 36. With continued reference to Figure 36, the 1tL,E AddPixel'ToScanline function demonstrates the processing that occurs in the hardware embodiment of the present invention shown in Figure 35.

By utilizing an,vmbedded 13RAM the present invention takes advantage of the extremely high bandwidth available when processing occ'irrs on the memory chip.

The processing is simple enough to be implemented in the DRAM manufacturing process. For example, for a 1280 x 1024 fraarae buffer, the maximum arnount of memory required is 50] its. This fits onto eDRAM dies with room for over 3 million gates for the encoding hardware.

Figure 37 is a preferred block diagram illustrating how a polygon pipeline 242 and volume pipeline 252 are connected through the RLE frarne buffer 292, which is preferably double-buff+rred to allow rendering during transmission of data.
The auxiliary frarne buffer i.s preferably connected at the same place as the existing one by sirnply duplicating the fragments, thus not affecting the remainder of the geometry pipeline 242. The vola:me pipeline 252 also preferably utilizes double buffering to allow receiving of data while blending the previous slab. It is to be appreciated that, using the system of the. present inirention, volume rendering does not conflict with polygon rendering. Since the voltame pipeline 252 always accesses its memory in a repeatable ordered fashion, it achieves the sample fill rate into the frame buffer at a sufficient rate to achielie 30Hz volume rendering. The system of the present invention utilizes the graphics pipeline 242 to render the opaque polygons before rendering the volume, stored in volucne rnemory 258. This can norrnalliy be accomplished WO 00/04505 1'C'F/[7S99/16038 concurrently with the rendering of the volume for the previous frame. Even if the polygon engine must render translucent polygons mixed in with the volume, there is usually enough time to :render the opaque polygons before the volume finishes due to the small number of trailslucent polygons in normal scenes.

In accordance with a preferred ernbodirnent of the present invention, a method is provided to incrementally voxelize triangles into a volumetric dataset with pre-filtering, thereby generating an accurate multivalued voxe:lization.
Multivalued voxelization allows direct volume rendering wvitl internsixed geometry, accurate multiresolution represeritations, and efficient antialiasing. Prior voxelization methods either computed only a binary voxelization or inefficiently computed a multivalued voxelization. The method, in accordance with the present invention, preferably develops incremental equations to quickly determine which filter function to compute for each voxel value. This preferred method, which is described in greater detail herein below, requires eight additions per voxel of the triangle bounding box.

To avoid image aliasing the present invention preferably employs pre-filtering, in which scalar-valued 'voxels are used to represent the percentage of spatial occupancy of a voxel, an extension of the two-dimensional line anti-aliasing method conventionally known (Filtering Ed es for C rayscale Dsnlavs, by S. Gupta and R. F.
Sproull, Computer Grauhics (SIGG PH 8.1), vol. 15, no. 3, pp. 1-5, Aug. 198 1). It has also been shown that the optimal volume sampling filt:er for central difference gradient estimation is a one-dimensional oriented box filter perpendicular to the surface. The method olEthe present invention preferably utilizes this filter which is a simple linear fiznction of the distance from the triangle.

Conventional graphics hardware only rasterizes points, lines, and triangles, with higher order primitives expressed as combinations of these basic primitives.
Therefore, it is preferable to voxelize only triangles because all other primitives can be expressed in terms of triangles. Polygon meshes, spline surfaces, spheres, cylinders, and others can be subdivided into triangles for voxelization.
Points and lines are special cases of triangles ahd can similarly be voxelized by the present WO 00/04505 PCT/l3s99/16038 algorithm. To voxelize solid objects, the boundary of the object is preferably voxelized as a set of triangles. The interior of the object is then filled using a volumetric filing procedure.

As appreciated l:)y those skilled in the art, are linear expressions that maintain a distance from an edge by efficient incremental arithmetic. The methods of the present invention extend this concept into three dimensions and apply antialiasing during the scan conversion of volumetric triarigles.

In essence, the general idea of the triangle voxelization method of the present invention is to voxelize a triangle by scanning a bounding box of the triangle in raster order. For each voxel in the bounding box, a filter equation is preferably evaluated and the result is stored in memory. The value of the equation is a linear function of the distance from the triangle. The result is preferably stored using a fuzzy algebraic union operator, namely, the max o;perator.

With reference now to Figiare 38, there is shown a density profile of an oriented box filter alon;g a line 294 from the center of a solid primitive 296 outward, perpendicular to the sui face 298. 'The width. of the filter is defined as W.
The inclusion of a voxel in the fuzzy set varies between zero and one, inclusive, deterYnined by tl?e value of the oriented box filter. The surface 298 of the primitive 296 is assumed to lie on the 0.5 density isosurface. Therefore, when voxelizing a solid primitive 296, as in Figure 38, the density profile varies from one inside the prixnitive to zero outsicle the primitive, and varies srnooth:ty at the edge.
For a surface primitive, such as the Hangle 300 shown in Figure 39, the density is preferably one on the surface and drops off linearly to zero at distance W from the surface.
Although the present invention similarly contemplates the voxelization of solids, the voxelization of surfaces will be described herein.

With continued reference to Figure 39, it has been deterrnined that the optimum value for filter width W is 2J voxel units (see e.g., Object Voxelization by Filtering, by M. Srdmelc and A. Kwaftnan, 1998 Volume Visualizatim,n Symposium, pp.

WO 00/04505 PC't /1JS99/16038 111-11$, IEEE, Oct. 15,98). For shading, the norrnal is preferably estimated by computing the central ciifference gradient at the 0.5 isosurface. Because the overall width of the central dif~erence filter is at most 2r3 units, a correct gradient is found on the 0.5 density isosurface. The thickness of the triangle 300 may be defined as T.

5 Norrnally, T can be zeri), unless thick surfaces are desired. By thresholding at 0.5 density, a 6-tunnel-free set of voxels is generated when Wz 1. This property is useful for volumetric filling (e.g., in order to generate solid objects).

All voxels with non-zero values for a triangle are preferably within a bounding 10 box which is S W+TI2 voxel units larger in all directions than a tight bounding box.
Therefore, the first step of the present method preferably determines a tight bound for the triangle 300, then xTiflates it in all directions by S'voxel units and rounds outward to the nearest voxels.

15 As illustrated art Figure 40, the area surrounding a triangle defined by vertices C,, CZ and C'3 Ynay be divided into seven regions (e.g., R1 through R7) which must be treated separately. In a preferred xnethod of the presern' in.ventiorl, each candidate voxel is tested for incla:ision within the seven regions, thexi filtered with a different equation for each region. In the interior region R1 of the triangle, the value of the 20 oriented box filter is sitnply proportional to the distance from the plane of the triangle.
In regions along the edges of the triangle, R2, R3, R4, the value of the filter is preferably proportional to the distance from the edge of the triangle. In regions at the corners of the triangle, R5,1t6, IZ7, the value of the filter is preferably proportional to the distance from the ciemer of the triangle.
With continued reference to Figure 40, the regions R1 -127 are preferably distinguished by their alistarace from seven planes. The fi:rst plane a is preferably coplanar with the triangle and its'lorrnal vector a points outward from the page. The next three planes b, c, md d prefeirably have norrnal vectors b, c, and d respectively and pass through the comer vertices C, , C2, and C3 of the triangle, respectively. The final three planes e, f, and g are prefera.bly perpendicular to the triangle and parallel to the edges; their respective norrnal vectors, e, f, and g, lie in the plane of the triangle w 00/04505 PC'I'/t7s99/16038 and point inward so that a positive distance from all three planes defines region Iti.
All of the plane coefficients are normalized so that the lerigth of the normal is one, except for norrraal vectors b, c, and dwhich are normalized so that their length is equal to the inverse of their respective edge lengths. Iri that manner, the computed distance from the plane varies from zero to one along the valid length of the edge.
For any pl ar >urface, the distance of any point fi orn the surface can be computed using the plane equation coefficients:

Dast = Ax + .t3y + C'z + .LD
d4z + Ba + C2 which simplifies to Dist = Ax + By + Cz + D

when the coefficients are pre-norrnalized. This computation can be made incremental so that when stepping <<long any vector, the distance ordy changes by a constant. For example, if the distance from a plane is Dist at position [x, y, z], then stepping one unit distance in the X direct:ion changes the distarice to I)ist A (x + 1) + By + Cz + D
_ Ax + By + Cz + D + A
= Dist + A

In general, stepping along any vector r =[rxõ rY, r"], the distance from the plane changes by Drst Dist + r 0 [A, B, C]

where o indicates the dot product. While scanning the bounding box of the triangle, the distance from the plane of the triangle can be computed incrementally with just a WO 00/04505 PC'I'/tJS99/16038 single addition per voxel. This method, performed in accordance with the present invention, for computing the distance from a plane is illustrated by the preferred pseudo-code routine shown in Figure 41.

The Y-step is more complicated than. the X-step because it not only steps one unit in the Y direction, but it also steps back multiple units in the X
direction.
Consider, as an analogy, the operation of a typewriter which glides back to the left margin of the paper and advances the line with one push of the return key.
Similarly, the Z-step combines stepping back in both the X and Y directions and stepping forward one unit in the Z direction. This simple pre-processing step ensures efficient stepping throughout th,d entire volume. If nrumerflcal approximation issues arise, then it is possible to store the distance value at the start of each inner loop and restore it at the end, thereby minimizing numerical creep due to roundoff error in the inner loops.

For multivalued voxeliza,tion, seven plane distances are required. Therefore, seven additions are required per voxel to compute the plane distances. Other computations per voxe:l may include incrementing the loop index, comparisons to deterynine the appropriate region and, if necessary, computations to determine the density.

Referring again to Figure 40, in region R1 the density value of a voxel is preferably computed with the box filter oriented perpendicular to plane ta.
Given a distance DistA from plaine a, the density valliae V is cornpLLted using:

13istA I - 7'12 w In region R2, the density is preferably computed using the distance from planes a and b:

V= 1 - Dist14 2+DYStB 2- T 12 WO 00/04505 PC'T/iJS99/16038 Similarly, region R3 uses planes a and c, and region RA. uses planes a and d.
Region R5 uses the Pythagoreaji distance f~om the corner point Ct:

(cr$ x+ (C1 - Y)2 + (C1z - z)Z - 7 /2 T~=t - ---w Similarly, regions R6 and R7 use corner points C'z and C3, respectively.
~
At the shared edge of adjacent trianglles, it is preferable to avoid discontinuities or cracks. Fortunately, the oiiented box filter guarantees accurate filtering of the edges fo;r any polyhedra, provided the union of the voxelized surfaces is correctly computed. The union operator can be defined over multivalued density values V(x) with V,,s = anax(VR(x)9 VB(x)). ther Boolean operators are available.
However, the max operator preserves the correct oriented box filter value at shared edges, and is therefore preferred.

The implication of using max in the inethod of the present invention is that the current voxel value must be read from memory, then possibly modified and written back into memory. Therefore, a maximum of two rneanory cycles are required per voxel.

The efficiency of the algorithm of the present invention may be fafther increased by limiting the amount of unnecessary computation because the bounding box often contains a higher percentage of voxels unaffected by the triangle than affected by it. The bounding box can be made tighter by recursively subdividing the triangle when edge lengths exceed a predetermined consta.nt.

To visualize inte=ixed polygons and volumes, the polygons are preferably voxelized into the target volume and rendered in a single lpass. If the polygons move with respect to the volt me, then voxelization should occur into a copy of the original volume so as not to cozrixpt the da1ta. The m,altivalued voxelized polygon voxels may WO 00/04505 PC'T/tJS99/16038 be tagged to distinguish them from volume data. In this manner, polygons can be colored and shaded separately from other data.

The preferred triangle voxelization algorithm described above is efficiently implemented in the distributed pipelines of the Cube-5 volume rendering system of the present invention. 'Chis algorithm adds j ust a small anaount of hardware to the existing pipelines and performs accurate multivalued voxelization at interactive rates.
One important advantage of the claimed Cube-5 volume rendering algorithm is that the volume data is accessed coherently in a dete inistic order. This feature allows orderly scanning of a bounding box for this algorithm.

In Figure 42, a preferred embodiment of the overall voxelization pipeline is shown, in accordance mrith the present invention. If on-the-fly voxelization is important, the system of the present invention may preferably include separate pipelines for volume rendering and voxelization. If voxelization can occur in a separate pass, then these volume rendering and voxelization pipelines may be combined, with the voxelization pipeline re-9Asflng most of the hardware from the volume rendering pipeline. The setup for each triangle preferably occurs on the host system, in a similar rnarmer as setup is perforrned on the host system for 2D
rasterization.

With reference to Figure 42, in the first hardware stage 302 of the pipeline, the distances from the seven planes are preferably computed. - Seven simple distance units are allocated with four registers for each of the seven planes.
Preferably, one :25 register holds the currelit distance from the plane and t&ie other three registers hold the increments for the X-, Y-, and Z-steps. Figure 43 shows a distance computation unit 310 for one of the seven planes, forrned in accordance with a preferred embodiment of the present invention. 'Chis distance computation unit 310 may be included as part of the distance calculation stage 302 of the pipeline (see Figure 42). The other six units can be essentially identical in design, but hold different values. During each clock cycle of voxelization, the pipeline preferabl}, steps in either the X, Y, or Z
direction (i.e., perforxns an X-Step 312, Y-Step 314, or Z-Step 316), thereby updating the WO 00/04505 t'C II'/[JS99/Y6038 current distance according to the direction of'rnovement. 'rhe hardware for looping through the volume is already present in the 'volurne rendcri.ng pipeline and is therefore re-used here to scan the bounding box of the tria'ygle.

5 After the seven plane distances are calculated, the resulting values preferably flow down the pipeline. As shown in Figure 42, the next pipeline stage 304 then preferably determines in which region the current voxel resides. In a preferred embodiment of the region selection stage 304, only seven comparators are needed to determine the outcome of the truth table, due to the mutual exclusion of some cases.
1.0 For instance, in Figure 40, from the negative (lower) side of plane b, it is not necessary to test the distances from. planef or g, depending on the value of the distance from plane e.

With continued reference to Figure 42, after the region has been determined, 15 the next pipeline stage 3,06 computes the filter function. The filter calculation stage 306 of the pipeline is preferably only activated if the eurrelat voxel is within S voxel units of the triangle. Otherwise, the current voxel is essentially unaffected by the triangle and different regions require different calculations, ranging from a simple linear expression to a complex Pythagorean clistaxlce evaluation. Since hardware 20 ideally must handle all cases equally well, it is preferred that such hardware be able to perforrn a square root approximation by mearas of a limited resolution look up table (LUT). However, the range of inputs and outputs is small, and therefore the size of the required LUT will be srnall. Furtherrnore, the Cube-5 hardware of the present invention has several I,IJTs available for voliairne rendering which can be re-used for voxelization. Instead oi providing three separate units to compute the expression V = 1 - (Dfst - 7'/2)IW, it is rriore efficient to roll all the calculations into one LUT. In this case, the i;aput is Dist', defined over [0,12], and the output is the density value V in the range [0,1].

Due to the mutual exclusion of the seven regions, it is sufficient to provide hardware for only the rnost complex filter calculation. With reference to Figure 40, the most complex calculation is the corner distance computation of regions R5, R6, WO 00/04505 PC'r/tJS99/16038 and R7 which, in a prefi;rred embodiment, requires five adders and three multipliers, in addition to the square root LUT previously mentioned. The line distance computations in regions R2, R3, and R4 are simtsler, requiring only one adder, two multipliers and the square root LUT. Regior.L .IZ1 requires a single multiply to obtain the distance squared, which is the required input to the LUT.

Referring again to Figure 42, the final stage 308 of'the pipeline preferably computes the max operation using the current voxel value and the coniputed density estimate. In a preferred embodiment of the present invention, the max operator is 1.fl simply a comparator attached to a multiplexor such that the greater of the two values is written back to memory. Since most voxels in the bounding box are not close enough to the triangle to be affected by it, memory bandwidth will be saved by only reading the necessary voxels. Furtlaer bandwidth savings may be achieved by only writing back to memory those voxels that change the current voxel value. Since there 1_5 is some latency-betweer. requesting and receiving word frorn memory, the voxel is preferably fetched as soon as possible in the pipeline and the results queued until the memory is received. The final stage 308 is write-back to imernory, which can be buffered without worry of dependencies.

20 The present invention thus far has been described outside the context of skewing, which complicates the traversal. However, the present invention contemplates building skewing into the Y- and Z-step distance update values.
Skewing also adds more, complexities to the Cube-5 hardware of the present invention. Specifically, when a lef1:-rnost voxel moves one unit in the Y
direction, placing it outside of the bounding box, the pipeline actually takes p - 1 steps in the X
direction to keep the voxel within the bounding box. Similarly, wherl the left-most voxel moves one step in the Z direction, it also moves one step in the negative X
direction, which is handled in the same way as before. Therefore, the apparatus of the present invention is preferably adapted to perforrn skewing by adding fourteen (14) more registers and corresponding logic to determine when the pipeline is currently processing the left-most voxel.

WO 00/04505 PCT/[JS99/16039 Pre-filtering, wr.ich may be performed in combination with thevoxelization methods of the present invention, can be used to optimally generate a series of volumes of different resolutions. This technique is useful for rendering images of different sizes; the size,ef the volume is preferably chosen to correspond to the size of the final image. In this manner, aliasing is avoided at all image resolutions and no unnecessary work is perfonned rendering parts of a scene not visible at the image scale.

Pre-filtering can additionally be used to model motion blur. For example, as an object sweeps past a carnera, it sweeps out a complex volume during the time the shutter is open, causing motion blur. To accajxately render motion blur, conventional rendering techniques render multiple images and blend them into a single image.
However, this approach is very slow. With pre-filtering, the present i:rivention perforrns the sweeping operation once, during voxelization, so that motion blur can be rendered in the same tinie as regular volume rendering. This method works well, particularly for certain cases where the motion is constant (e.g., the same direction and/or rotation). For example, consider a helicopter blade which spins at a constant speed during flight. For example, to voxelize the blade spinning at the rate of 5Hz for an animation frame rate of 30Hz, the blade sweeps out an arc of 5 (21t) each frame.
20 Thus, at the outer edge of the blade, the density value is much lower and the blade appears more transparerit than in the center, where it sweeps out a smaller volume and appears more solid. The volume rendering transfer function may be set so that the lower density values appear less opaque and higher density values appear more opaque.

When multiple volumetric objects overlap, the projected image of the volumes becomes quite cornplex, Consider, for example, a scene where smoke rises up through a cloud. Clearly, the two volumetric objects cannot be rendered separately with the images combined in the final frame. Therefore, in a preferred method, perforrned in accordance with one forrn of the present invention, multiple objects are combined into one obj eot for a final rendering pass to create the resulting image.

WO 00/04505 PC"r/[1s99/16038 Vy'hen two or more objects occupy the same space, the colors from each object are preferably modulated together at each sarnple location along a projected sight ray.
Therefore, it is preferre(i that each object be classified and shaded prior to being combined, followed by color r.nodulation. If9 alternatively, voxel data were combined first, a new transfer function would be required for each possible combination. This latter approach is there# re not preferred.

In accordance with one form of the present invention, a preferred method for mixing multiple overlapping volumes resamples all but the first object in the z-dimension of the first object so that slices of each object become interlaced. This includes a classification., a shading and a transformation which aligns all objects.
Object transformations include translation and scaling, preferably performed by the apparatus of the present invention using nearest neighbor connections, and rotation, which is preferably perform.ed using the rotation methods of the present invention previously described herein above.

For scenes containing objects vvhich -will not change position or orientation with respect to each other, the present invention contemplates optimizations such as high-level scene graph compilation that can preferably be employed. For instance, static objects are preferably combined once and stored for subsequent rendering, while non-static objects are re-combined each time; they are moved with respect to the other obj ects.

Texture mapping is a widely used technique to sianulate high-quality image :25 effects, such as surface details, and even lighting and shadows. In general tergns, texture mapping involves mapping a two-dirnensional (21)) image onto a three-dimensional (3D) surface. Texture mapping occurs while geometric objects are rasterized onto the screen. The (x, y) pixel coordinates are preferably mapped into (u, v) texture coordinates and an RG13a value is returned as the color value to use for that pixel on the screen.

WO 00l04505 PCT/US99/16038 There are basically two processes involved in texture mapping: a mapping from (x, y) coordinates to (u, v) coordinates, and a look-up into the image of what IZGBa value corresponds to a given (u, v) coordinate. The mapping from (x, y) to (u, v) coordinates preferab'dy involves simple matrix multiplication, as appreciated by those skilled in the art. However, the look-up into the image of the (u, v) coordinate to return an RGBa valtie is complex. The very large scale integration (VLSI) hardware requirernents for the texture lookup commonly consume large portions of todayps graphics boards, at a significant cost,. This is primarily due to the fact that (u, v) coordinates rarely map directly to a discrete image coordinate, called a texel.
Therefore, the neighbor ing RGBa values are preferably li:ciearly interpolated to produce the RGBa value at the exact (u, v) coordinate.

Two-dimensional (2D) interpolations are generally sufficient if the pixel does not cover more than one texel. However, if the mapping produces pixel coverages greater than one texel, ,irtifacts are introduced into the image using the 2D
interpolation method. 'ro avoid costly texel combining operations, a technique ternied Mip-Mapping may be utilized by conventional graphics pipelines. Mip-Mapping basically consists of storing multiple levels-of-detail (LOD) of an image.
Then, when an (x, y) pixel is mapped to a (u, v) texel, the appropriate Mip-Map level texels are chosen so that the pixel is smaller than the texels. A more accurate method is to look-up the four neighborhood texels from both the higher level and lower level of detail texel images and then perform a trilinear interpolation on all eight texels to compute the appropriate RGBa,value for the pixel.

Texture mappin.g hardware from conventional graphics pipelines has been used to accelerate voliu:ne rendering and has been the subject of such texts as IZealitvEneine Graphics, by K. Akeley, Computer Graphics (SIGCZ PH 93), 27:109-116, Aug. 1993, and Accelerated Volume ltenderinand rorno aphic Reconstruction IJsiny Cexture Mapping Hardare, by B. Cabral, N. Cam and J.

Foran, Symposium on t'olume Visicalizatioaaõ pp. 91-98, Oct. 1994. This conventional approach, however, neither achieves the cost-perforrnance nor supports the various functionalities (e.g., shading) of the present invention. Furtherrnore, using knnown WO 00/04505 PC'r/tJ599/16038 prior art methods, texture mapping is unscalable without data replication, often employs two-dirnensional (2D) rather than three-dirnensional (3D) interpolation, downloads datasets slo'vly, and/or does not support real-time four-dimensional (4D) input.

In accordance xvith a preferred form of the present invention, described previously hereir. above, the Cube-5 apparatas is combined with a conventional geometry engine via the geometry input/output bus 46, 48 (see Figure 4).
Preferably, the rendering pipeline(s) of the present irflverition are utilized to perform the texture 10 look-up function, while the geometry engine is used for mapping (x, y) pixel coordinates to (u, v) texture coordinates. In simple terms, once combined with the Cube-5 apparatus, the rasponsibility of the geometry engine is essentially to rasterize triangles, while the apparatus of the present invention preferably provides the high perforxrflance interpolation engine for texture mapping. To perform texture look-ups 15 on the apparatus of the present invention, texel data is preferably loaded into 3D
memory included within the Cube-5 unit(s). Figures 6A and 6B illustrate an example of how 32 bits of texel data for a 2x2 neighborhood are preferably arranged in a 23 subcube of 16-bit voxel.s.

:20 Another important advanta,ge of the present invention is the ability to enhance image-based rendering. In general, image-based rendering methods render complex scenes from arbitrary viewpoints based on a:finite set of images of that scene. Two similar image-based rendering methods, known by those skilled in the art, which use four-dimensional (4D) interpolation without requiring the depth information of the 25 source images are light field rendering and T.,urnigraph. The high-performance interpolation engine of the present invention may be used to accelerate these two techniques.

Figure 44 shows that in light field rendering, the scene is modeled by uv 322 30 and st 320 planes. Every uv grid point preferably defines a viewpoint and has an associated st image. For every pixel of the projection plane 324, a sight ray 326 is preferably cast into the uv plane 322. The four st images corresponding to the uv grid WO 00/04505 PC'1'/[3S49/16038 points surrounding the intersection of the sight ray with the uv plane contribute to that ray. The contributions are preferably calculated by casting a sight ray into each st image through its uv grid point. These rays hit between st irnage pixels and, therefore, a bi-linear interpolatior must be performed for each st image. One final bi-linear interpolation between the four rays yields the final projection plane pixel color.
Obtaining every pixel of the projection plane 324, therefore, requires four bi-linear interpolations in st plar. es 320 and one bilinear interpolation in the uv plane 322, resulting in a total of five bi-linear interpolations. These five bi-linear interpolations are substantially equivcLlent to one 4D interpolation, or 15 linear interpolations.

I'erforming lookups for each projection plane ray usually causes random access into the st images. Therefore, in accordance with a preferred method of the present invention, st images are accessed in object order, which is more appropriately adapted for use with thc apparatus of the present invention since the Cube-5 apparatus allows reading of each :st image pixel only once. With continued reference to Figure 44, for each quadrilateral 328 in the uv plane (e.g., abed), its projections on the four st planes (e.g., corresponding to abcai7 preferably determine which four tile regions 330 contribute to the final image. All st tile regions 330 are then preferably assembled into four images and are perspectively projected onto the projection plane 324. The final image is subsequently formed by bilinear interpolation among the four projected images. Interpolation weights are preferably deterrnined by the intersection between the original ray and the uv plane 322.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying dra-vvings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the present invention.

Claims

WHAT IS CLAIMED IS:

1. An apparatus for real-time volume processing and universal three-dimensional (3D) rendering, said apparatus being responsive to viewing and processing parameters which define a viewpoint, said apparatus generating a 3D volume projection image or new modified volume from said viewpoint, said image including a plurality of pixels, said apparatus comprising:

a plurality of three-dimensional (3D) memory units which store a plurality of discrete voxels, each of said voxels having a location and voxel data associated therewith, said voxels together forming a volume dataset, said viewing and processing parameters defining at least one base plane of said volume dataset and first and last processing slices of said volume dataset;
at least a first pixel bus for providing global horizontal communication;

a plurality of rendering pipelines, each of said plurality of rendering pipelines being vertically coupled to both a corresponding one of said plurality of 3D
memory units and said at least first pixel bus, each of said plurality of rendering pipeline having horizontal communication with neighbors, said rendering pipelines receiving said voxel data from said corresponding memory units and generating a two-dimensional (2D) base plane image aligned with a face of said volume dataset;

at least one geometry bus for providing global horizontal communication between said plurality of rendering pipelines and a geometry engine, said at least one geometry bus enabling the rendering of geometric and volumetric objects together in a single image; and a control unit which initially designates said first processing slice as a current slice of sample points, and which controls sweeping through subsequent slices of said volume dataset as current slices until said last processing slice is reached.

2. The apparatus of Claim 1, further comprising a feedback bus operatively connected to said plurality of 3D memory units and said plurality of rendering pipelines, said feedback bus providing communication between an output of said rendering pipelines and at least one of:
an input of said plurality of rendering pipelines; and said plurality of corresponding 3D memory units.

3. The apparatus of Claim 2, further coinprising an imagery input bus operatively coupled to said plurality of 3D memory units and providing global horizontal communication between said 3D memory units.

4. The apparatus of Claim 3, further comprising a plurality of two-dimensional (2D) memory units for storing said two-dimensional (2D) base plane image, each of said 2D memory units being operatively coupled to both a corresponding one of said plurality of rendering pipelines and said at least first pixel bus.

5. The apparatus of Claim 4, further comprising at least one warp unit operatively coupled to said at least first pixel bus, said at least one warp unit assembling and transforming said base plane image onto a predefined image plane.

6. The apparatus of Claim 1, wherein each of said plurality of rendering pipelines comprises:

at least a first slice unit having an input which is operatively coupled to said corresponding one of said plurality of 3D memory units and having an output, said slice unit including said current slice of sample points, said slice unit receiving voxel values from said 3D memory units;

a compositing unit having an input which is operatively coupled to said output of said at least first slice unit and having an output which is operatively coupled to said at least first pixel bus; and a compositing buffer having an input which is operatively coupled to said compositing unit and having an output, said compositing buffer storing a plurality of pixels, said pixels having at least one of a color and an opacity associated therewith.

7. The apparatus of Claim 6, further comprising:

at least a second and a third slice unit, said at least first, second and third slice units buffering said voxels read from said plurality of 3D memory units, wherein subsequent re-reading of said voxels is performed from said slice units.

8. The apparatus of Claim 7, wherein said control unit organizes a traversal of said processing slices in a block oriented, interleaved fashion, whereby sample points adjacent in time differ by one unit along a major projection axis.

9. The apparatus of Claim 6, wherein each of said plurality of rendering pipelines further comprises:
a bilinear interpolation unit operatively coupled to said output of said compositing buffer and said compositing unit;
a trilinear interpolation unit having an input which is operatively coupled to said output of said at least first slice unit and having an output;
a gradient computation unit having a first input operatively coupled to said output of said at least first slice unit and a second input operatively coupled to said output of said trilinear interpolation unit, and having an output; and a shading unit having a first input operatively coupled to said output of said gradient computation unit and a second input operatively coupled to said output of said trilinear interpolation unit, and having an output operatively coupled to said input of said compositing unit.

10. The apparatus of Claim 9, wherein each of said plurality of rendering pipelines further comprises a texture map bypass bus which operatively couples said trilinear interpolation unit and said gradient computation unit to said bilinear interpolation unit.

11. The apparatus of Claim 9, wherein each of said plurality of rendering pipelines further comprises at least one feedback connection between said at least first pixel bus and said corresponding one of said plurality of 3D memory units, said feedback connection being configured for feedback from said at least first pixel bus to said 3D
memory unit, and subsequently to any intermediate stage in said plurality of rendering pipelines.

12. The apparatus of Claim 9, wherein said gradient computation unit includes:

a multiplier; and an adder;
wherein along each major projection axis a weighted average of three consecutive sample points is computed and written back through a feedback connection operatively coupled to said rendering pipeline and said corresponding 3D
memory unit.

13. The apparatus of Claim 1, wherein said control unit divides a view frustum of said volume dataset into a plurality of regions separated by region boundaries, said regions including a first region situated closest to said viewpoint, each successive region beginning with said first region having a depth along a major projection axis which is twice that of an adjacent region preceding it.

14. The apparatus of Claim 13, wherein:
a plurality of continuous sight rays are cast through said volume dataset from said viewpoint, said plurality of sight rays diverging with respect to each other as they pass through said volume dataset; and said control unit enables one of:
(I) merging of at least two of said sight rays; and (II) splitting of at least one of said sight rays.