This invention relates generally to ray tracing, and more particularly to coherent ray tracing.
Systems for visualization need to deal with many graphical components to accurately represent complex scenes. The scene may need to be segmented to allow the viewer to focus on areas of interest. Programmable shading and texture maps are required for complex surfaces, and realistic lighting is needed to model realistic illumination. A number of prior art techniques have been developed to reduce the amount of time it takes to render quality complex scenes. These techniques include culling, lazy evaluation, reordering and caching.
Depending on the specific visualization task at hand, these techniques may use hardware or software solutions. Software solutions are tractable, but do not lend themselves to real-time visualization tasks. Designing an efficient hardware architecture for programmable volume visualization tasks is extremely difficult because of the complexities involved. Therefore, most hardware solutions are application specific.
For example, ray tracing has been widely used for global illumination techniques to generate realistic images in the computer graphics field. In ray tracing, rays are generated from a single point of view. The rays are traced through the scene. As the rays encounter scene components, the rays are realistically reflected and refracted. Reflected and refracted rays can further be reflected and refracted, and so on. Needless to say, even in simple scenes, the number of rays to be processed increases exponentially. For this reason, ray tracing has been confined to scenes defined only by geometry, e.g., polygons and parametric patches. Ray tracing in volumetric data has universally been recognized as a difficult problem.
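The exponential growth noted above can be illustrated with a small count. This is only a sketch under a simplifying assumption that does not come from the prior art itself: every ray that hits a surface spawns exactly one reflected and one refracted ray, down to a fixed recursion depth. The function name and the branching parameter are hypothetical.

```python
def rays_per_primary(depth: int, branching: int = 2) -> int:
    """Total rays in a full ray tree; the primary ray sits at depth 0."""
    # Level k of the tree holds branching**k rays, so the total is a geometric sum.
    return sum(branching ** level for level in range(depth + 1))


# Ray counts for recursion depths 0 through 5, per primary ray.
counts = [rays_per_primary(d) for d in range(6)]
print(counts)  # [1, 3, 7, 15, 31, 63]
```

Under this assumption, the work per pixel roughly doubles with each additional level of recursion.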
For volume visualization, simpler ray casting is generally used. Ray casting is ray tracing without reflected or refracted rays. In ray casting, the effects of reflected and refracted rays are ignored, and attempts to provide realistic illumination are handled by other techniques. Yet, even relatively simple ray casting is still computationally expensive for visualizing volume data. For this reason, prior art solutions have generally proposed special-purpose volume rendering architectures.
Recently, hardware acceleration of ray tracing geometric models has been proposed, see ART at "www.artrender.com/technology/ar250.html." The ART design included parallel ray tracing engines which trace bundles of rays all the way to completion before moving on to the next bundle of rays. The input scene data were stored in the host main memory and broadcast to the processor elements. While the shading sub-system included a programmable co-processor, the ray tracing engines were ASIC implementations.
Gunther et al. in "VIRIM: A Massively Parallel Processor for Real-Time Volume Visualization in Medicine," Proceedings of the 9th Eurographics Workshop on Graphics Hardware, pp. 103-108, 1994, described parallel hardware. Their VIRIM architecture was a hardware realization of the Heidelberg Ray Casting algorithm. The volume data were replicated in each module. The VIRIM system could achieve 10 Hz for a 256×256×128 volume with four modules. However, each module used three boards for a total of twelve boards.
Doggett et al. in "A Low-Cost Memory Architecture for PCI-based Interactive Volume Rendering," Proceedings of the SIGGRAPH-Eurographics Workshop on Graphics Hardware, pp. 7-14, 1999, described an architecture which implemented image order volume rendering. The volume was stored in DIMMs on the rendering board. Each sample re-read the voxel neighborhood required for that sample. No buffering of data occurred. While the system included a programmable DSP for ray generation, the rest of the pipeline was FPGA or ASIC.
Pfister et al., in "The VolumePro Real-Time Ray-Casting System," in Proceedings of SIGGRAPH 99, pp. 251-260, described a pipelined rendering system that achieved real time volume rendering using ASIC pipelines which processed samples along rays which were cast through the volume. Cube-4 utilized a novel memory skewing scheme to provide contention free access to neighboring voxels. The volume data were buffered on the chip in FIFO queues for later reuse.
All these designs utilized ASIC pipelines to process the great number of volume samples required to render at high frame rates. The cost-performance of these systems surpassed state-of-the-art volume rendering on supercomputers, special-purpose graphics systems, and general-purpose graphics workstations.
A different visualization problem deals with segmentation. In a medical application, each slice of data was hand segmented and then reconstructed into a 3D model of the object. Current commercial software provides tools and interfaces to segment slices, but still only in 2D. Examining 3D results requires a model building step which currently takes a few minutes to complete. Clearly, this is not useful for real-time rendering. To reduce this time, the segmentation and rendering should be performed right on the volume data utilizing direct 3D segmentation functions and direct volume rendering (DVR), and not by hand.
However, 3D segmentation is still too complex and dynamic to be fully automated and, thus, requires some amount of user input. The idea would be to utilize the computer for the computationally expensive task of segmentation processing and rendering, while tapping the natural and complex cognitive skills of the human by allowing the user to steer the segmentation to ultimately extract the desired objects.
Some prior art segmentation techniques use complex object recognition procedures, while others provide low-level 3D morphological functions that can be concatenated into a sequence to achieve the desired segmentation. This sequence of low-level functions is called a segmentation "process." These low-level functions commonly include morphological operations such as threshold, erode, dilate, and flood-fill. For the typical users of medical segmentation systems, this method has been shown to be intuitive and simple to use. The user is given a sense of confidence about the result since the user has control over the process.
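A segmentation "process" of the kind described above can be sketched as a list of low-level functions applied one after another. The following is a minimal illustration only, not any particular prior art system: the sparse dictionary representation of the volume, the 6-connected neighborhood, and all function names are assumptions made for the example.

```python
from collections import deque

# 6-connected neighbourhood in 3D (an assumption; 18- or 26-connectivity is also common).
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def neighbors(v):
    x, y, z = v
    return [(x + dx, y + dy, z + dz) for dx, dy, dz in OFFSETS]


def threshold(volume, lo):
    """Return the set of voxel coordinates whose sample value is >= lo."""
    return {v for v, value in volume.items() if value >= lo}


def dilate(mask):
    """Grow the mask by one voxel in each axis direction (no clipping to the volume)."""
    return mask | {n for v in mask for n in neighbors(v)}


def erode(mask):
    """Keep only voxels whose six neighbours all lie inside the mask."""
    return {v for v in mask if all(n in mask for n in neighbors(v))}


def flood_fill(mask, seed):
    """Return the connected component of the mask that contains the seed."""
    if seed not in mask:
        return set()
    seen, todo = {seed}, deque([seed])
    while todo:
        for n in neighbors(todo.popleft()):
            if n in mask and n not in seen:
                seen.add(n)
                todo.append(n)
    return seen


def run_process(volume, steps):
    """Apply the first step to the raw volume and each later step to the running mask."""
    first, *rest = steps
    mask = first(volume)
    for step in rest:
        mask = step(mask)
    return mask


# Toy 5x5x5 volume: a bright 3x3x3 cube plus one isolated bright voxel of noise.
volume = {(x, y, z): 0 for x in range(5) for y in range(5) for z in range(5)}
for x in range(1, 4):
    for y in range(1, 4):
        for z in range(1, 4):
            volume[(x, y, z)] = 200
volume[(0, 0, 0)] = 200

process = [
    lambda vol: threshold(vol, 100),
    erode,                                   # removes the isolated voxel
    dilate,
    lambda mask: flood_fill(mask, (2, 2, 2)),
]
result = run_process(volume, process)
print(len(result))  # 7
```

Changing a parameter in the middle of the list, such as the threshold value, means re-running every later step, which is why the cost of each low-level function matters for interactive feedback.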
In another system, the user is provided with interactive feedback while segmenting. After low-level functions were applied, the resulting segmented volume was displayed to the user, and the user was allowed to choose which function to perform next. The results of one operation assisted the user in choosing the next function. Therefore, the interactivity was limited to one low-level function at a time. If the user had created a long sequence of steps to perform a certain segmentation problem and wanted to see the effect of changing a parameter to one of the low-level functions in the middle of the sequence, then the feedback would not be 3D interactive. Instead the user was forced to step through each stage in the process repeatedly, and each time change the parameter. Additionally, the time required to perform the functions was between 5 and 90 seconds, plus up to 10 seconds to render the results, due to the use of general purpose processors.
In an alternative system, segmentation could only be performed on the three orthogonal slices of the volume which were currently displayed. Since the segmentation was limited to three 2D slices, the entire segmentation "process" could be performed from the start each time. This way the user could achieve interactive feedback while sliding controls to adjust parameters for functions in the middle of the process. Unfortunately, generating a 3D projection of the volume could take up to a few minutes. Additionally, there was no analogous approach to perform 2D connected component processing, since regions could grow in the third dimension and return to the original slice. Therefore, connected component processing was limited to slow feedback.
Recently, a distributed processing environment for performing sequences of the same low-level functions has been proposed. This solution recognized the high degree of data-parallelism in volume segmentation and exploited this by utilizing a DECmpp 12000 massively parallel processor. The DECmpp is an implementation of the MasPar SIMD mesh of PEs. The performance with this processor was measured for a very small 96³ volume of sample data. Extrapolating to a 256³ volume, and to the faster clock rates of today's technology according to Moore's Law, the same processing would require 1.14 seconds.
More recently, it has been proposed to perform low-level segmentation operations on a CM-200 SIMD massively parallel processor. By utilizing a mesh of 128² PEs, it was possible to perform each low-level operation in between 0.6 and 1.3 seconds on a 256³ volume. Even with today's faster technology, the resulting 0.3 to 0.65 seconds per operation could quickly add up to non-interactive frame rates for even a small number of low-level tasks.
Exploiting data coherence by caching is another well known technique to increase efficiency in computer graphics, see Sutherland et al. in "A characterization of ten hidden surface algorithms," Computing Surveys, 6(1), pp. 1-55, March 1974. Increasing the coherence of a computation can reduce the amount of memory used, the time it requires, or both. In systems that use ray tracing, the coherence of rays traveling through a scene can be increased by traversing ray trees to process rays into coherent bundles.
Similarly, rays with common origins can be gathered into frustums. This reduces the time to find intersecting objects. Rays can be reordered using space filling curves over the image plane to improve the coherence of spawned rays in a depth-first ray tracer. Monte Carlo ray tracing systems have also been designed to improve coherence across all levels of the memory hierarchy, from processor caches to disk storage.
Pharr et al. in "Rendering complex scenes with Memory-Coherent Ray Tracing," Proceedings of SIGGRAPH 97, pp. 101-108, described a cached ray tracing system. There, texture tiles, scene geometry, queued rays, and image samples were stored on disk. Camera generated rays were partitioned into groups. Groups of rays were scheduled for processing depending on which parts of the scene were stored in main memory, and the degree to which processing the rays would advance the rendering. Scheduled rays were stored in queues in main memory. Scene geometry was added to main memory as needed. Any new rays that were generated during the ray tracing were added to the queues of waiting rays. Essentially, this system can be characterized as a memory hierarchy with two levels of cache, disk and main memory, and a single processor. This is basically a software solution to a caching problem. Also, Pharr et al. deal with only a single image at a time, and their coherency algorithm is concerned only with spatial locality.
To gain certain advantages, the system was designed to process only a single type of geometric primitive. "A distinguishing feature of our ray tracer is that we cache a single type of geometric primitive: triangles. This has a number of advantages. Ray intersection tests can be optimized for a single case, and memory management for the geometry cache is easier, since there is less variation in the amount of space needed to store different types of primitives. It is also possible to optimize many other parts of the renderer when only one type of primitive is supported. The REYES algorithm similarly uses a single internal primitive--micropolygons--to make shading and sampling more efficient. Unlike REYES, we optimize the system for handling large databases of triangles; this allows our system to efficiently handle a wide variety of common sources of geometry, including scanned data, scientific data, and tessellated patches. A potential drawback of this single representation is that other types of primitives, such as spheres, require more space to store after they are tessellated. We have found that the advantages of a single representation outweigh this disadvantage." Id. at p. 102.
Their geometry cache was organized in what they called "voxel" or geometry grids to enclose triangles. Note that in ray tracing, the term "voxel" has a totally different meaning than in volume rendering. In volume rendering, a voxel is a single sample in a three-dimensional (volume) data set. To distinguish these totally different meanings, in the description below, the term "voxel" always means a volume sample, and the term "block" refers to the granularity of the cache. Pharr et al. cached triangles in block sized quantities. A few thousand triangles per block yielded a good level of granularity for caching. However, they also used an acceleration grid holding a few hundred triangles for finer granularity.
For the purpose of scheduling blocks to be processed, they associated a cost value and a benefit value with each block. The cost was based on the computational complexity of processing the block, and the benefit estimated how much progress toward the completion of the computation would be made. Their scheduler used these values to choose blocks to work on by selecting the block with the highest ratio of benefit to cost.
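The selection rule described above, picking the block with the highest ratio of benefit to cost, can be sketched as follows. This is an illustration of the rule only, not the scheduler of Pharr et al.: the `Block` class, its fields, and the numeric values are hypothetical, and the cost and benefit estimates are treated as fixed rather than recomputed as rendering progresses.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    cost: float     # estimated work to process the block (hypothetical units, must be > 0)
    benefit: float  # estimated progress toward completing the computation


def next_block(pending):
    """Return the pending block with the highest benefit-to-cost ratio."""
    return max(pending, key=lambda b: b.benefit / b.cost)


def schedule(blocks):
    """Greedy processing order produced by repeatedly applying next_block."""
    pending = list(blocks)
    order = []
    while pending:
        best = next_block(pending)
        pending.remove(best)
        order.append(best.name)
    return order


blocks = [
    Block("A", cost=4.0, benefit=8.0),    # ratio 2.0
    Block("B", cost=1.0, benefit=5.0),    # ratio 5.0
    Block("C", cost=10.0, benefit=10.0),  # ratio 1.0
]
print(schedule(blocks))  # ['B', 'A', 'C']
```

A rule of this kind is greedy: it looks only at the current estimates for each block and does not account for ordering constraints between blocks.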
It is desired to render scenes with ray tracing that are expressed in more than a single graphical primitive, such as triangles. Furthermore, it is desired to gain additional performance improvements by using a software and hardware cache. In addition, it is desired to improve block scheduling beyond a simple cost-benefit algorithm. It is also desired to render a sequence of images or frames, and to provide temporal coherence in addition to spatial coherence. Furthermore, it is desired to provide a programmable hardware architecture to perform complex visualization tasks.
It is an object of the invention to provide an improved ray tracing architecture for both sampled data and geometry data. The sampled data can be 2D, 3D, or sampled data in higher dimensions. The geometry data can be polygons, parametric patches, or analytically defined data. It is another object to provide a hierarchical memory with embedded-DRAM technology to achieve real-time rendering rates. It is a further object to improve performance by an order of magnitude using multiple levels of memory coherency. It is also an object to provide a programmable visualization engine that supports segmentation, ray tracing rendering, and other graphical processes.
More particularly, a method traces rays through graphical data. The graphical data includes sampled and geometry data. The method partitions the graphical data into a plurality of blocks according to a scheduling grid. For each block, a ray queue is generated. Each entry in the ray queue represents a ray to be traced through the block. The ray queues are ordered spatially and temporally using a dependency graph. The rays are traced through the blocks according to the ordered list.
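The steps summarized above can be sketched in a few lines. This is a loose illustration under several assumptions and not the claimed implementation: the block size, the fixed-step marching used to find which blocks a ray visits, and all names are hypothetical; only spatial ordering within a single frame is shown, not temporal ordering across frames; and the dependency graph is assumed to be acyclic, which holds for the forward-pointing rays of the example but not in general once reflected rays are present.

```python
from collections import defaultdict, deque
from graphlib import TopologicalSorter  # standard library, Python 3.9+
import math

BLOCK = 4.0  # hypothetical edge length of one scheduling-grid block


def block_of(point):
    """Map a 3D point to the integer index of the block that contains it."""
    return tuple(math.floor(c / BLOCK) for c in point)


def blocks_along(origin, direction, t_max, dt=0.25):
    """Ordered blocks a ray visits, found by coarse fixed-step marching."""
    path, t = [], 0.0
    while t <= t_max:
        b = block_of([o + t * d for o, d in zip(origin, direction)])
        if not path or path[-1] != b:
            path.append(b)
        t += dt
    return path


def trace_all(rays, t_max):
    paths = [blocks_along(o, d, t_max) for o, d in rays]

    # Dependency graph: a block waits for every block that feeds rays into it.
    deps = defaultdict(set)
    for path in paths:
        deps[path[0]]  # touching the key registers the entry block as a node
        for before, after in zip(path, path[1:]):
            deps[after].add(before)

    # One ray queue per block; each entry is (ray id, position along its path).
    queues = defaultdict(deque)
    for ray_id, path in enumerate(paths):
        queues[path[0]].append((ray_id, 0))

    visits = []
    for block in TopologicalSorter(deps).static_order():
        while queues[block]:
            ray_id, step = queues[block].popleft()
            visits.append((block, ray_id))  # stand-in for sampling or intersection work
            if step + 1 < len(paths[ray_id]):
                # Hand the ray over to the queue of the next block on its path.
                queues[paths[ray_id][step + 1]].append((ray_id, step + 1))
    return visits


rays = [
    ((0.5, 0.5, 0.5), (1.0, 0.0, 0.0)),
    ((0.5, 0.5, 0.5), (1.0, 1.0, 0.0)),
]
visits = trace_all(rays, t_max=8.0)
for block, ray_id in visits:
    print(block, ray_id)
```

Because a block is processed only after every block that feeds it, each block's data needs to be resident only once per pass in this sketch, which is the kind of coherence the ordering is meant to provide.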